Thank you very much for the response. I was looking directly at
tab['x']; I didn't realize that the dictionary is present at the chunk
level.

On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:

> > However I cannot figure out any easy way to get the mapping
> > used to create the dictionary array (vals) easily from the table. Can
> > you please let me know the easiest way?
>
> A dictionary is associated with an array, not with a table, so you
> first need to get the array from the table.  Tables are made of
> columns, each column is made of chunks, and each chunk is an array.
> Each chunk could have a different mapping, so that is something you
> may need to deal with at some point, depending on your goal.
>
> The table you are creating in your example has one column and that
> column has one chunk so we can get to the mapping with:
>
>     tab.column(0).chunks[0].dictionary
>
> And we can get to the indices with:
>
>     tab.column(0).chunks[0].indices
>
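> If the table has multiple chunks, each chunk can carry its own mapping.
> One way to sidestep that (a minimal sketch, assuming a pyarrow version
> that provides Table.unify_dictionaries) is to rewrite the chunks so
> they share a single dictionary first:
>
>     # Rewrite all chunks of each dictionary column to share one
>     # dictionary, so a single mapping is valid for the whole column.
>     tab = tab.unify_dictionaries()
>     dictionary = tab.column(0).chunks[0].dictionary
>     indices = tab.column(0).chunks[0].indices
>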
> > Also since this is effectively a string array which is dictionary
> > encoded, is there any way to use string compute kernels like
> > starts_with etc. Right now I am aware of two methods and they are not
> > straightforward.
>
> Regrettably, I don't think we have kernels in place for string
> functions on dictionary arrays.  At least, that is my reading of [1].
> So the two workarounds you have may be the best there is at the
> moment.
>
> [1] https://issues.apache.org/jira/browse/ARROW-14068
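>
> As a stopgap, your approach 2 can be generalized to a table read from
> parquet/feather by combining it with the chunk access above. A rough
> sketch (starts_with_mask is just a hypothetical helper name; the cast
> guards against index types other than int64):
>
>     import numpy as np
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     def starts_with_mask(table, name, prefix):
>         # Run the string kernel on the (small) dictionary only, then
>         # map the matching codes back onto the (large) indices.
>         masks = []
>         for chunk in table.column(name).chunks:
>             matches = np.where(pc.starts_with(chunk.dictionary, prefix))[0]
>             value_set = pa.array(matches).cast(chunk.indices.type)
>             masks.append(pc.is_in(chunk.indices, value_set=value_set))
>         return pa.chunked_array(masks)
>
>     filtered = tab.filter(starts_with_mask(tab, 'x', 'a'))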
>
> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
> >
> > Hi .. I created a pyarrow table from a dictionary array as shown
> > below. However I cannot figure out any easy way to get the mapping
> > used to create the dictionary array (vals) easily from the table. Can
> > you please let me know the easiest way? Other than the ones which
> > involve pyarrow.compute/conversion to pandas as they are expensive
> > operations for large datasets.
> >
> > import pyarrow as pa
> > import pyarrow.compute as pc
> > import pyarrow.dataset as ds  # needed for ds.Scanner in approach 1
> > import numpy as np
> >
> > # Dictionary-encoded column: 'vals' is the mapping, 'int_vals' the codes
> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
> > tab = pa.Table.from_arrays([x], names=['x'])
> >
> > Also since this is effectively a string array which is dictionary
> > encoded, is there any way to use string compute kernels like
> > starts_with etc. Right now I am aware of two methods and they are not
> > straightforward.
> >
> > Approach 1:
> > Cast to string, then run the string kernel:
> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
> > columns={'x': pc.field('x')}, filter=expr).to_table()
> >
> > Approach 2:
> > Filter using the corresponding indices, assuming we have access to the
> > dictionary:
> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
> > pc.is_in(x.indices, filter_)
> >
> > Approach 2 is better/faster, but I am not able to figure out how to
> > get the dictionary/indices, assuming we start from a table read from
> > parquet/feather.
> >
> > Thanks
>
