Hi Hinko,

Yes, the Parquet spec does not support a union type natively. The
workarounds you have mentioned require additional work on the reader side.

I am working on this issue already:
https://github.com/apache/arrow/issues/34262 . Not sure if this is what you
are looking for.

Best,
Gang

On Thu, Mar 2, 2023 at 5:00 PM Hinko Kocevar <hinko.koce...@ess.eu.invalid>
wrote:

> I have data points that can briefly be described with the following fields:
>  - timestamp
>  - name
>  - value
>
> The value field can be
>  1) a scalar integer (of any size), a float or double, or a string, or
>  2) an array (list) of any of the data types in 1).
>
> A named data point keeps a single, fixed value data type for its
> lifetime. You can imagine this data coming from a large set of IoT
> devices. I cannot modify the data generation in order to change /
> consolidate the data types. The number of data points can vary over time
> and is unbounded in terms of how many there can be.
>
> I noticed that parquet does not support union, which would be a perfect
> match here (but see the footnote).
>
> I was thinking of working around the issue by having a struct with one
> field per data type, plus a tag field to specify the actual data type
> (see the sketch below). While this seems viable on the writer side, I
> feel it is not going to play nicely on the reader side; the user querying
> the data would need to know which struct field to reference in an SQL
> query, for example. Essentially the user would need to know the data type
> in advance, which is neither desired nor practically possible.
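>
> A minimal sketch of that struct-plus-tag layout, assuming pyarrow; the
> field names (tag, int_value, double_value, ...) are made up here just to
> illustrate the idea, not taken from any existing schema:
>
>     from datetime import datetime
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     # one member field per supported type; only the field named by the
>     # tag is populated, the others stay null for a given row
>     value_struct = pa.struct([
>         ("tag", pa.string()),
>         ("int_value", pa.int64()),
>         ("double_value", pa.float64()),
>         ("string_value", pa.string()),
>         ("int_list_value", pa.list_(pa.int64())),
>     ])
>
>     schema = pa.schema([
>         ("timestamp", pa.timestamp("ns")),
>         ("name", pa.string()),
>         ("value", value_struct),
>     ])
>
>     table = pa.table({
>         "timestamp": [datetime(2023, 3, 2, 17, 0),
>                       datetime(2023, 3, 2, 17, 1)],
>         "name": ["sensor_a", "sensor_b"],
>         "value": [
>             {"tag": "int", "int_value": 42},
>             {"tag": "double", "double_value": 3.14},
>         ],
>     }, schema=schema)
>
>     pq.write_table(table, "points.parquet")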
>
> Then I was thinking of having different schemas, each supporting a
> specific data type in the value field, and generating different files. I'm
> not sure how they could coexist in the same 'data pool' and how the readers
> would transparently be able to access them (i.e. if SQL references the
> value field in a WHERE clause, will that work across all the different data
> types?). Another issue with this approach is that I would end up with very
> small files for certain data types, as some data points generate data at a
> low rate.
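>
> A minimal sketch of that per-type layout, again assuming pyarrow; the
> type names and directory layout below are only an illustration of how
> each value type could get its own schema and its own set of files:
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     schemas = {
>         "int": pa.schema([("timestamp", pa.timestamp("ns")),
>                           ("name", pa.string()),
>                           ("value", pa.int64())]),
>         "double": pa.schema([("timestamp", pa.timestamp("ns")),
>                              ("name", pa.string()),
>                              ("value", pa.float64())]),
>         "double_list": pa.schema([("timestamp", pa.timestamp("ns")),
>                                   ("name", pa.string()),
>                                   ("value", pa.list_(pa.float64()))]),
>     }
>
>     def write_batch(columns, value_type):
>         # columns: dict of column name -> list, all rows of one value type
>         table = pa.table(columns, schema=schemas[value_type])
>         # one directory per value type, e.g. points/int/, points/double/
>         pq.write_to_dataset(table, root_path=f"points/{value_type}")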
>
> Another idea was to use the binary data type for the value field and
> serialize all the data with msgpack or similar. Needless to say, this
> introduces storage and processing overhead, and does not allow me to query
> the value field from SQL.
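>
> A minimal sketch of the binary-column idea, assuming pyarrow and the
> msgpack Python package; the value bytes stay opaque to SQL engines and
> have to be unpacked in application code:
>
>     import msgpack
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     values = [42, 3.14, "on", [1.0, 2.0, 3.0]]   # mixed types per point
>     packed = [msgpack.packb(v) for v in values]  # serialize each value
>
>     table = pa.table({
>         "timestamp": list(range(len(values))),
>         "name": ["a", "b", "c", "d"],
>         "value": pa.array(packed, pa.binary()),
>     })
>     pq.write_table(table, "points_binary.parquet")
>
>     # reading back means unpacking in Python, not in the query engine
>     restored = [msgpack.unpackb(b) for b in table["value"].to_pylist()]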
>
> Any ideas on how to approach this? I hope there is a way to handle this
> use case.
>
>
>
> TIA!
> //hinxx
>
> [FWIW, it seems that the processing engines are not quite interested in
> supporting unions at all. I've realized that the union data type as seen in
> the ORC file format is actually not supported by (py)arrow for file I/O.
> Not sure how Spark handles them. Nevertheless, I'm leaning towards using
> some Python-based framework for processing that revolves around
> arrow/pandas, like DuckDB or Dask.]
