hi Hatem -- I was planning to look at the round-trip question here early this week. Since I have all the code fresh in my mind, let me have a look and I'll report back.
Accurately preserving the original dictionary in a round trip is tricky
because the low-level column writer doesn't expose any detail about what
it's doing with each batch of data. At minimum, if we automatically set
"read_dictionary" based on what the original schema was, then that gets us
part of the way there (a rough sketch of that manual workaround is below
the quoted thread).

- Wes

On Mon, Aug 5, 2019 at 5:34 AM Hatem Helal <[email protected]> wrote:
>
> Thanks for sharing this very illustrative benchmark. Really nice to see
> the huge benefit for languages that have a type for modelling
> categorical data.
>
> I'm interested in whether we can make the parquet/arrow integration
> automatically handle the round trip for Arrow DictionaryArrays. We've
> had this requested from users of the MATLAB-Parquet integration. We've
> suggested workarounds for those users, but as your benchmark shows, you
> need enough memory to store the "dense" representation. I think this
> could be solved by writing metadata with the Arrow data type. An added
> benefit of doing this at the Arrow level is that any language that uses
> the C++ parquet/arrow integration could round-trip DictionaryArrays.
>
> I'm not currently sure how all the pieces would fit together, but let me
> know if there is interest and I'm happy to flesh this out as a PR.
>
>
> On 8/2/19, 4:55 PM, "Wes McKinney" <[email protected]> wrote:
>
> I've been working (with Hatem Helal's assistance!) over the last few
> months to put the pieces in place to enable reading BYTE_ARRAY columns
> in Parquet files directly to Arrow DictionaryArray. As context, it's
> not uncommon for a Parquet file to occupy ~100x less space on disk (or
> an even greater compression factor) than the fully-decoded data in
> memory when there are a lot of common strings. Users sometimes get
> frustrated when they read a "small" Parquet file and run into memory
> problems.
>
> I made a benchmark to exhibit an example "worst case scenario":
>
> https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7
>
> In this example, we have a table with a single column containing 10
> million values drawn from a dictionary of 1000 values that's about 50
> kilobytes in size. Written to Parquet, the file is a little over 1
> megabyte due to Parquet's layers of compression. But read naively to an
> Arrow BinaryArray, about 500MB of memory is taken up (10M values * 54
> bytes per value). With the new decoding machinery, we can skip the
> dense decoding of the binary data and append the Parquet file's
> internal dictionary indices directly into an arrow::DictionaryBuilder,
> yielding a DictionaryArray at the end. The end result uses less than
> 10% as much memory (about 40MB compared with 500MB) and is almost 20x
> faster to decode.
>
> The PR making this available finally in Python is here:
> https://github.com/apache/arrow/pull/4999
>
> Complex, multi-layered projects like this can be a little inscrutable
> when discussed strictly at a code/technical level, but I hope this
> helps show that employing dictionary encoding can have a lot of user
> impact, both in memory use and performance.
>
> - Wes
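
For anyone who wants to experiment, here is a rough pyarrow sketch of the
manual round trip we'd like to automate. It assumes the "read_dictionary"
option exposed by the PR linked above; the column name and file path are
invented for illustration, and exactly what the naive read returns depends
on the pyarrow version and whether any stored schema metadata is consulted.
The part where we derive the column list from the original schema is the
piece we'd want to happen automatically:

  import pyarrow as pa
  import pyarrow.parquet as pq

  # Build a table with a dictionary-encoded (categorical-like) string column.
  values = pa.array(['apple', 'banana', 'apple', 'cherry'] * 1000)
  table = pa.table({'strings': values.dictionary_encode()})
  print(table.schema.field('strings').type)  # dictionary<values=string, indices=int32, ...>

  # The Parquet file itself only records a BYTE_ARRAY column
  # (dictionary-encoded at the page level), not the Arrow dictionary type.
  pq.write_table(table, 'strings.parquet')

  # Without read_dictionary, the reader may hand back a dense string column,
  # which is the memory blow-up described in the benchmark above.
  dense = pq.read_table('strings.parquet')
  print(dense.schema.field('strings').type)

  # Passing read_dictionary, driven here by the original schema, recovers a
  # DictionaryArray without materializing the dense values.
  dict_cols = [f.name for f in table.schema if pa.types.is_dictionary(f.type)]
  roundtrip = pq.read_table('strings.parquet', read_dictionary=dict_cols)
  print(roundtrip.schema.field('strings').type)  # dictionary<values=string, ...>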
