pyarrow: write table where columns share the same dictionary

Joris Peeters Thu, 25 Feb 2021 08:36:48 -0800

Hello,

I have a pandas DataFrame with many string columns (>30,000), all of which
draw from the same low-cardinality set of values (e.g. size 100). I'd like
to convert this to an Arrow table of dictionary-encoded columns (let's say
int16 for the index columns), but with just one shared dictionary of strings.
This is to avoid ending up with >30,000 tiny dictionaries on the wire,
which doesn't even load in e.g. Java (due to a stack overflow error).


Despite my efforts, I haven't been able to achieve this with the public
APIs I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.

For a mickey mouse example, I'm looking at e.g.

import pandas as pd
df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})

and would like a Table with dictionary-encoded columns a and b, both
nullable, that both refer to the same dictionary with id=0 (or whatever id)
containing ['foo', 'bar', 'quux'].

Thanks,
-Joris.
