I've been working (with Hatem Helal's assistance!) over the last few
months to put the pieces in place to enable reading BYTE_ARRAY columns
in Parquet files directly to Arrow DictionaryArray. As context, it's
not uncommon for a Parquet file to occupy ~100x less space on disk
(sometimes an even greater compression factor) than the fully-decoded
data in memory when there are a lot of repeated strings. Users
sometimes get frustrated when they read a "small" Parquet file and run
into memory problems.

I made a benchmark to exhibit an example "worst case scenario":

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

In this example, we have a table with a single column containing 10
million values drawn from a dictionary of 1000 values that's about 50
kilobytes in size. Written to Parquet, the file is a little over 1
megabyte due to Parquet's layers of compression. But read naively into
an Arrow BinaryArray, it takes up about 500MB of memory (10M values *
~54 bytes per value). With the new decoding machinery, we can skip the
dense decoding of the binary data and append the Parquet file's
internal dictionary indices directly into an arrow::DictionaryBuilder,
yielding a DictionaryArray at the end. The result uses less than 10%
as much memory (about 40MB compared with 500MB) and is almost 20x
faster to decode.
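
To make the scenario concrete, here is a rough Python sketch along the
lines of the benchmark above (the column name 'f0' and the exact data
generation are my own and may differ from the linked gist):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # 1000 distinct strings of roughly 50 bytes each (~50KB of
    # dictionary data in total)
    dictionary = np.array(['x' * 50 + '%03d' % i for i in range(1000)],
                          dtype=object)

    # 10 million values drawn from that dictionary
    indices = np.random.randint(0, len(dictionary), size=10_000_000)
    table = pa.table({'f0': pa.array(dictionary[indices])})

    # Parquet dictionary-encodes and compresses the column, so the
    # file on disk is only on the order of a megabyte
    pq.write_table(table, 'strings.parquet')

    # Naive read: every value is densely materialized, on the order of
    # 10M values * ~54 bytes each, i.e. hundreds of MB in memory
    dense = pq.read_table('strings.parquet')
    print(dense['f0'].type)   # string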

The PR finally making this available in Python is here:
https://github.com/apache/arrow/pull/4999
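
Continuing the sketch above, the dictionary read path looks roughly
like this from Python (the read_dictionary option name follows the PR;
treat it as an assumption if your pyarrow version differs):

    import pyarrow.parquet as pq

    # Ask the reader to keep column 'f0' dictionary-encoded instead of
    # densely decoding it
    table = pq.read_table('strings.parquet', read_dictionary=['f0'])

    print(table['f0'].type)
    # -> dictionary<values=string, indices=int32, ordered=0>

    # The Parquet file's internal dictionary indices are appended to an
    # arrow::DictionaryBuilder under the hood, so memory use is roughly
    # the 10M int32 indices (~40MB) plus one copy of the ~50KB
    # dictionary, rather than ~500MB of densely decoded strings.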

Complex, multi-layered projects like this can be a little bit
inscrutable when discussed strictly at a code/technical level, but I
hope this helps show that employing dictionary encoding can have a lot
of user impact in both memory use and performance.

- Wes
