I have data points that can briefly be described with the following fields:
 - timestamp
 - name
 - value

The value field can be
 1) a scalar: an integer (any size), a float or double, or a string, or
 2) an array (list) of any of the data types in 1).

A named data point has a single, fixed value data type for its lifetime. You 
can imagine this data coming from a large set of IoT devices. I cannot modify 
the data generation in order to change or consolidate the data types. The 
number of distinct data points can vary over time and is unbounded.
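
For concreteness, a few example points as I picture them (the names and 
values here are made up):

points = [
    {"timestamp": "2024-01-01T12:00:00Z", "name": "pump1.temperature", "value": 21.5},
    {"timestamp": "2024-01-01T12:00:00Z", "name": "pump1.state",       "value": "running"},
    {"timestamp": "2024-01-01T12:00:01Z", "name": "adc3.samples",      "value": [1023, 1019, 1021]},
]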

I noticed that Parquet does not support a union type, which would be a perfect 
match here (but see the footnote).

I was thinking of working around the issue with a struct that has a field for 
each data type plus a tag field specifying the actual type. While this seems 
viable on the writer side, I don't think it will play nicely on the reader 
side; the user querying the data would need to know which struct field to 
reference in an SQL query, for example. Essentially the user would need to 
know the data type in advance, which is neither desirable nor practically 
possible.
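
This is roughly what I had in mind, sketched with pyarrow (the tag values and 
member names like float_val are just my own naming, nothing prescribed by 
Parquet):

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# One member field per possible value type, plus a tag that tells the
# reader which member is actually populated; all other members stay null.
value_type = pa.struct([
    ("tag", pa.string()),
    ("int_val", pa.int64()),
    ("float_val", pa.float64()),
    ("str_val", pa.string()),
    ("int_list", pa.list_(pa.int64())),
    ("float_list", pa.list_(pa.float64())),
    ("str_list", pa.list_(pa.string())),
])

schema = pa.schema([
    ("timestamp", pa.timestamp("us")),
    ("name", pa.string()),
    ("value", value_type),
])

rows = [
    {"timestamp": datetime(2024, 1, 1, 12, 0, 0), "name": "pump1.temperature",
     "value": {"tag": "float", "float_val": 21.5}},
    {"timestamp": datetime(2024, 1, 1, 12, 0, 0), "name": "adc3.samples",
     "value": {"tag": "int_list", "int_list": [1023, 1019, 1021]}},
]

pq.write_table(pa.Table.from_pylist(rows, schema=schema), "tagged_struct.parquet")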

Then I was thinking of having different schemas, each supporting a specific 
data type in the value field, and generating separate files. I'm not sure how 
they could coexist in the same 'data pool' and how readers would be able to 
access them transparently (i.e. if SQL references the value field in the WHERE 
clause, will that work across all the different data types?). Another issue 
with this approach is that I would end up with very small files for certain 
data types, as some data points generate data at a low rate.
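
To illustrate the reader-side friction I am worried about, here is a rough 
DuckDB query over such a pool (the directory layout pool/<type>/*.parquet is 
a made-up assumption, and note the explicit cast needed to make the value 
columns line up):

import duckdb

# Hypothetical layout: one sub-directory (and schema) per value type.
duckdb.sql("""
    SELECT timestamp, name, value
    FROM read_parquet('pool/double/*.parquet')
    WHERE value > 100
    UNION ALL
    SELECT timestamp, name, CAST(value AS DOUBLE) AS value
    FROM read_parquet('pool/int64/*.parquet')
    WHERE value > 100
""").show()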

Another idea was to use the binary data type for the value field and serialize 
all values with msgpack or similar. Needless to say, this introduces storage 
and processing overhead and does not allow me to query the value field from 
SQL.
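
Something along these lines is what I mean (msgpack is just one option for 
the encoding):

import msgpack
import pyarrow as pa
import pyarrow.parquet as pq

# Mixed value types, each packed into an opaque byte string.
values = [21.5, [1023, 1019, 1021], "running"]
packed = [msgpack.packb(v) for v in values]

table = pa.table({
    "timestamp": pa.array([1, 2, 3], pa.timestamp("us")),
    "name": ["pump1.temperature", "adc3.samples", "pump1.state"],
    "value": pa.array(packed, pa.binary()),
})
pq.write_table(table, "opaque_binary.parquet")

# The engine only sees bytes, so every value has to be decoded in Python
# before it can be filtered -- no predicate pushdown, no SQL on the value.
decoded = [msgpack.unpackb(b)
           for b in pq.read_table("opaque_binary.parquet")["value"].to_pylist()]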

Any ideas on how to approach this? I hope there is a way to handle this use 
case.



TIA!
//hinxx

[FWIW, it seems that the processing engines are not really interested in 
supporting unions at all. I've realized that the union data type as seen in 
the ORC file format is actually not supported by (py)arrow for file I/O. I'm 
not sure how Spark handles them. Nevertheless, I'm leaning towards using some 
Python-based framework for processing that revolves around Arrow/pandas, such 
as DuckDB or Dask.]
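
For what it's worth, DuckDB does seem able to reach into the tagged struct 
from the first sketch above via struct field access, but only if the reader 
already knows which member (and hence which type) to pick, which is exactly 
the problem:

import duckdb

# Field names refer to the hypothetical tagged-struct schema above.
duckdb.sql("""
    SELECT timestamp, name, value.float_val AS value
    FROM read_parquet('tagged_struct.parquet')
    WHERE value.tag = 'float' AND value.float_val > 20.0
""").show()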
