> However I cannot figure out an easy way to get the mapping
> used to create the dictionary array (vals) from the table. Can
> you please let me know the easiest way?

A dictionary is associated with an array, not a table, so you first
need to get the array from the table.  Tables are made of columns,
each column is made of chunks, and each chunk is an array.  Each chunk
could have a different mapping, so that is something you may need to
deal with at some point, depending on your goal.

The table you are creating in your example has one column and that
column has one chunk so we can get to the mapping with:

    tab.column(0).chunks[0].dictionary

And we can get to the indices with:

    tab.column(0).chunks[0].indices

> Also since this is effectively a string array which is dictionary
> encoded, is there any way to use string compute kernels like
> starts_with etc. Right now I am aware of two methods and they are not
> straightforward.

Regrettably, I don't think we have kernels in place for string
functions on dictionary arrays.  At least, that is my reading of [1].
So the two workarounds you have may be the best there are at the
moment.

[1] https://issues.apache.org/jira/browse/ARROW-14068

On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>
> Hi .. I created a pyarrow table from a dictionary array as shown
> below. However I cannot figure out an easy way to get the mapping
> used to create the dictionary array (vals) from the table. Can
> you please let me know the easiest way? Other than the ones which
> involve pyarrow.compute/conversion to pandas, as they are expensive
> operations for large datasets.
>
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.dataset as ds
> import numpy as np
>
> vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
> int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
> x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
> tab = pa.Table.from_arrays([x], names=['x'])
>
> Also since this is effectively a string array which is dictionary
> encoded, is there any way to use string compute kernels like
> starts_with etc. Right now I am aware of two methods and they are not
> straightforward.
>
> approach 1:
> Cast to string and then run string kernel
> expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
> columns={'x': pc.field('x')}, filter=expr).to_table()
>
> approach 2:
> Filter using the corresponding indices, assuming we have access to the
> dictionary
> filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
> pc.is_in(x.indices, filter_)
>
> Approach 2 is better/faster .. but I am not able to figure out how to
> get the dictionary/indices assuming we start from a table read from
> parquet/feather.
>
> Thanks
