Round-trip of categorical data with Arrow and Parquet

Hatem Helal Thu, 24 Jan 2019 08:06:39 -0800

Hi everyone,

I wanted to gauge interest and feasibility for adding support for natively 
reading an arrow::DictionaryArray from a parquet file.  Currently, an 
arrow::DictionaryArray written to a parquet file is read back as the native 
index type [0].  I came across a prior discussion of this problem in the 
context of pandas [1], but I think this would be useful for other arrow 
clients (C++ or otherwise).


The solution I had in mind would be to add arrow type information as column 
metadata.  This metadata would then be used when reading back the parquet file 
to determine which arrow type to create for the column data.
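
To make the idea concrete, here is a rough sketch using a toy in-memory model. 
It does not use the real Arrow or Parquet APIs: the classes, the functions, and 
the "arrow_type" metadata key are all hypothetical names chosen for 
illustration.

```python
# Toy model only: plain Python stand-ins, not the Arrow or Parquet APIs.
from dataclasses import dataclass, field


@dataclass
class DictionaryColumn:
    """Hypothetical stand-in for a dictionary-encoded (categorical) column."""
    indices: list
    dictionary: list

    def to_dense(self):
        # Expand the indices into the values they refer to.
        return [self.dictionary[i] for i in self.indices]


@dataclass
class StoredColumn:
    """Hypothetical stand-in for a column as it sits in the file."""
    values: list
    metadata: dict = field(default_factory=dict)


def write_column(column):
    # Store dense values and record the original type as column metadata.
    # The "arrow_type" key is an assumption made for this sketch.
    if isinstance(column, DictionaryColumn):
        return StoredColumn(column.to_dense(), {"arrow_type": "dictionary"})
    return StoredColumn(list(column), {})


def read_column(stored):
    # Use the metadata, when present, to decide which type to rebuild.
    if stored.metadata.get("arrow_type") == "dictionary":
        # Unique values in first-seen order become the dictionary.
        dictionary = list(dict.fromkeys(stored.values))
        position = {value: i for i, value in enumerate(dictionary)}
        indices = [position[value] for value in stored.values]
        return DictionaryColumn(indices, dictionary)
    return list(stored.values)
```

In this sketch a column without the metadata entry is read back exactly as 
before, so files written by older writers would be unaffected.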

I’m willing to contribute this feature but first wanted to get some feedback on 
whether this would be generally useful and if the high-level proposed solution 
would make sense.

Thanks!

Hatem


[0] This test demonstrates this behavior
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
[1] https://github.com/apache/arrow/issues/1688
