Hi Hinko,

Yes, the Parquet spec does not support a union type natively. The workarounds you mentioned all require additional work on the reader side.
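To make that reader-side work concrete, here is a minimal PyArrow sketch of the tagged-struct workaround you describe below (the field names and the tag encoding are just assumptions for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One struct field per possible value type, plus a tag naming the one
# that is actually set (field names here are hypothetical).
value_type = pa.struct([
    ("tag", pa.string()),            # e.g. "int64", "double", "string"
    ("int_value", pa.int64()),
    ("double_value", pa.float64()),
    ("string_value", pa.string()),
])
schema = pa.schema([
    ("timestamp", pa.timestamp("us")),
    ("name", pa.string()),
    ("value", value_type),
])

table = pa.table({
    "timestamp": [0, 1],
    "name": ["temperature", "status"],
    "value": [
        {"tag": "double", "double_value": 21.5},   # unset fields become null
        {"tag": "string", "string_value": "ok"},
    ],
}, schema=schema)
pq.write_table(table, "points.parquet")

# Reader side: every query has to dispatch on the tag, e.g. in DuckDB:
#   SELECT name, value.double_value FROM 'points.parquet'
#   WHERE value.tag = 'double' AND value.double_value > 20.0
```

As you suspected, the schema itself carries no hint about which struct field is meaningful, so that knowledge has to live in the query or in some application-level convention.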
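Your second idea, one schema (and one set of files) per value type, is also easy on the write side; the pain shows up when a reader tries to treat the files as one dataset. A sketch, with hypothetical file names:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Hypothetical layout: one file (or directory) per value type.
pq.write_table(
    pa.table({"timestamp": pa.array([0], pa.timestamp("us")),
              "name": ["count"],
              "value": pa.array([42], pa.int64())}),
    "points_int64.parquet")
pq.write_table(
    pa.table({"timestamp": pa.array([0], pa.timestamp("us")),
              "name": ["status"],
              "value": pa.array(["ok"], pa.string())}),
    "points_string.parquet")

# These cannot be read back transparently as a single dataset: the two
# value columns have conflicting types, so the schemas do not unify and
# the reader has to open each type's files separately.
int_points = ds.dataset("points_int64.parquet").to_table()
str_points = ds.dataset("points_string.parquet").to_table()
```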
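And the binary/msgpack idea would look roughly like this (assuming the msgpack-python package); as you note, the value becomes an opaque blob that no SQL engine can see into:

```python
import msgpack  # assumes the msgpack-python package is installed
import pyarrow as pa
import pyarrow.parquet as pq

# Serialize each value (scalar or list, any type) into an opaque blob.
values = [21.5, "ok", [1, 2, 3]]
table = pa.table({
    "timestamp": pa.array([0, 1, 2], pa.timestamp("us")),
    "name": ["temperature", "status", "samples"],
    "value": pa.array([msgpack.packb(v) for v in values], pa.binary()),
})
pq.write_table(table, "points_binary.parquet")

# Reader side: every row must come back into Python before any predicate
# can be applied; there is no pushdown and no way to reference the value
# in an SQL query.
read_back = pq.read_table("points_binary.parquet")
decoded = [msgpack.unpackb(b.as_py()) for b in read_back["value"]]
```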
I am working on this issue already: https://github.com/apache/arrow/issues/34262 . Not sure if this is what you are looking for.

Best,
Gang

On Thu, Mar 2, 2023 at 5:00 PM Hinko Kocevar <hinko.koce...@ess.eu.invalid> wrote:

> I have data points that can briefly be described with the following fields:
> - timestamp
> - name
> - value
>
> The value field can be
> 1) a scalar integer (of any size), a float or double, or a string, or
> 2) an array (list) of any of the data types in 1).
>
> A named data point has a single, fixed value data type for its lifetime.
> You can imagine this data coming from a large set of IoT devices. I cannot
> modify the data generation in order to change or consolidate the data
> types. The number of data points can vary over time and is unbounded.
>
> I noticed that Parquet does not support unions, which would be a perfect
> match here (but see the footnote).
>
> I was thinking of working around the issue by having a struct with fields
> for all the data types, plus a tag field to specify the actual data type.
> While this seems viable on the writer side, I feel it is not going to play
> nice on the reader side; a user querying the data would need to know which
> struct field to use in an SQL query, for example. Essentially the user
> would need to know the data type in advance, which is neither desired nor
> practically possible.
>
> Then I was thinking of having different schemas, each supporting a
> specific data type in the value field, and generating different files. I'm
> not sure how they could coexist in the same 'data pool' and how readers
> would transparently be able to access them (i.e., if an SQL query uses the
> value field in a WHERE clause, will it work across all the different data
> types?). Another issue with this approach is that I would have very small
> files for certain data types, as some data points generate data at a low
> rate.
>
> Another idea was to use the binary data type for the value field and
> serialize all the data with msgpack or similar. Needless to say, this
> introduces storage and processing overhead, and does not allow me to query
> the value field from SQL.
>
> Any ideas on how to approach this? I hope there is a way to handle this
> use case.
>
> TIA!
> //hinxx
>
> [FWIW, it seems that the processing engines are not quite interested in
> supporting unions at all. I've realized that the union data type as seen
> in the ORC file format is actually not supported by (py)arrow for file
> I/O. Not sure how Spark works with them. Nevertheless, I'm leaning towards
> a Python-based processing framework that revolves around Arrow/pandas,
> like DuckDB or Dask.]