hi Hatem -- I was planning to look at the round-trip question here early this week. Since I have all the code fresh in my mind, let me have a look and I'll report back.
Accurately preserving the original dictionary in a round trip is tricky
because the low-level column writer doesn't expose any detail about what
it's doing with each batch of data. At minimum, if we automatically set
"read_dictionary" based on what the original schema was, then that gets us
part of the way there (a rough sketch of that manual workaround is below
the quoted thread).

- Wes

On Mon, Aug 5, 2019 at 5:34 AM Hatem Helal <[email protected]> wrote:
>
> Thanks for sharing this very illustrative benchmark. Really nice to see
> the huge benefit for languages that have a type for modelling
> categorical data.
>
> I'm interested in whether we can make the parquet/arrow integration
> automatically handle the round trip for Arrow DictionaryArrays. We've
> had this requested from users of the MATLAB-Parquet integration. We've
> suggested workarounds for those users, but as your benchmark shows, you
> need enough memory to store the "dense" representation. I think this
> could be solved by writing metadata with the Arrow data type. An added
> benefit of doing this at the Arrow level is that any language that uses
> the C++ parquet/arrow integration could round-trip DictionaryArrays.
>
> I'm not currently sure how all the pieces would fit together, but let me
> know if there is interest and I'm happy to flesh this out as a PR.
>
>
> On 8/2/19, 4:55 PM, "Wes McKinney" <[email protected]> wrote:
>
> I've been working (with Hatem Helal's assistance!) over the last few
> months to put the pieces in place to enable reading BYTE_ARRAY columns
> in Parquet files directly to Arrow DictionaryArray. As context, it's
> not uncommon for a Parquet file to occupy ~100x less space on disk (or
> an even greater compression factor) than the fully-decoded data in
> memory when there are a lot of common strings. Users sometimes get
> frustrated when they read a "small" Parquet file and run into memory
> problems.
>
> I made a benchmark to exhibit an example "worst case scenario":
>
> https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7
>
> In this example, we have a table with a single column containing 10
> million values drawn from a dictionary of 1000 values that's about 50
> kilobytes in size. Written to Parquet, the file is a little over 1
> megabyte due to Parquet's layers of compression. But read naively to an
> Arrow BinaryArray, about 500MB of memory is taken up (10M values * 54
> bytes per value). With the new decoding machinery, we can skip the
> dense decoding of the binary data and append the Parquet file's
> internal dictionary indices directly into an arrow::DictionaryBuilder,
> yielding a DictionaryArray at the end. The end result uses less than
> 10% as much memory (about 40MB compared with 500MB) and is almost 20x
> faster to decode.
>
> The PR making this available finally in Python is here:
> https://github.com/apache/arrow/pull/4999
>
> Complex, multi-layered projects like this can be a little inscrutable
> when discussed strictly at a code/technical level, but I hope this
> helps show that employing dictionary encoding can have a lot of user
> impact, both in memory use and performance.
>
> - Wes
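
For anyone who wants to experiment, here is a rough pyarrow sketch of the
manual round trip we'd like to automate. It assumes the "read_dictionary"
option exposed by the PR linked above; the column name and file path are
invented for illustration, and exactly what the naive read returns depends
on the pyarrow version and whether any stored schema metadata is consulted.
The part where we derive the column list from the original schema is the
piece we'd want to happen automatically:

  import pyarrow as pa
  import pyarrow.parquet as pq

  # Build a table with a dictionary-encoded (categorical-like) string column.
  values = pa.array(['apple', 'banana', 'apple', 'cherry'] * 1000)
  table = pa.table({'strings': values.dictionary_encode()})
  print(table.schema.field('strings').type)  # dictionary<values=string, indices=int32, ...>

  # The Parquet file itself only records a BYTE_ARRAY column
  # (dictionary-encoded at the page level), not the Arrow dictionary type.
  pq.write_table(table, 'strings.parquet')

  # Without read_dictionary, the reader may hand back a dense string column,
  # which is the memory blow-up described in the benchmark above.
  dense = pq.read_table('strings.parquet')
  print(dense.schema.field('strings').type)

  # Passing read_dictionary, driven here by the original schema, recovers a
  # DictionaryArray without materializing the dense values.
  dict_cols = [f.name for f in table.schema if pa.types.is_dictionary(f.type)]
  roundtrip = pq.read_table('strings.parquet', read_dictionary=dict_cols)
  print(roundtrip.schema.field('strings').type)  # dictionary<values=string, ...>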
