I have data points that can briefly be described with the following fields:

- timestamp
- name
- value
The value field can be 1) a scalar: an integer (of any size), a float or double, or a string, or 2) an array (list) of any of the scalar types in 1). A named data point keeps a single, fixed value data type for its whole lifetime. You can imagine this data coming from a large set of IoT devices. I cannot modify the data generation in order to change or consolidate the data types. The number of data points varies over time and is unbounded.

I noticed that Parquet does not support unions, which would be a perfect match here (but see the footnote). I was thinking of working around this by using a struct with one field per data type, plus a tag field to specify the actual data type. While this seems viable on the writer side, I feel it is not going to play nicely on the reader side: a user querying the data would need to know which struct field to reference in an SQL query, for example. Essentially, the user would need to know the data type in advance, which is neither desired nor practically possible.

Then I thought about having different schemas, each supporting a specific data type in the value field, and generating separate files. I'm not sure how they could coexist in the same 'data pool' and how readers would be able to access them transparently (i.e. if an SQL query references the value field in the WHERE clause, will that work across all the different data types?). Another issue with this approach is that I would end up with very small files for certain data types, as some data points generate data at a low rate.

Another idea was to use the binary data type for the value field and serialize all the data with msgpack or similar. Needless to say, this introduces storage and processing overhead, and does not allow me to query the value field from SQL.

Any ideas on how to approach this? I hope there is a way to handle this use case. TIA!

//hinxx

[FWIW, it seems that the processing engines are not really interested in supporting unions at all. I've realized that the union data type, as seen in the ORC file format, is actually not supported by (py)arrow for file I/O. I'm not sure how Spark handles them. Nevertheless, I'm leaning towards using a Python-based processing framework that revolves around Arrow/pandas, like DuckDB or Dask.]