I could be wrong, but in my experience the dictionary is available at the
chunk level, because that is where you know it is a DictionaryArray (or at
least, an Array). At the column level, you only know it's a ChunkedArray,
which seems to roughly be an alias to a vector<Array> (list[Array]) at
least type-wise.

Also, I think each chunk references the same dictionary, so I think you can
access any chunk's dictionary and get the same one.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:

> Thank you very much for the response. I was looking directly at tab['x'].
> Didn't realize that the dictionary is present at chunk level.
>
> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:
>
>> > However I cannot figure out any easy way to get the mapping
>> > used to create the dictionary array (vals) easily from the table. Can
>> > you please let me know the easiest way?
>>
>> A dictionary is going to be associated with an array and not a table.
>> So you first need to get the array from the table.  Tables are made of
>> columns and each column is made of chunks and each chunk is an array.
>> Each chunk could have a different mapping, so that is something you
>> may need to deal with at some point depending on your goal.
>>
>> The table you are creating in your example has one column and that
>> column has one chunk so we can get to the mapping with:
>>
>>     tab.column(0).chunks[0].dictionary
>>
>> And we can get to the indices with:
>>
>>     tab.column(0).chunks[0].indices
>>
>> > Also since this is effectively a string array which is dictionary
>> > encoded, is there any way to use string compute kernels like
>> > starts_with etc. Right now I am aware of two methods and they are not
>> > straightforward.
>>
>> Regrettably, I don't think we have kernels in place for string
>> functions on dictionary arrays.  At least, that is my reading of [1].
>> So the two workarounds you have may be the best there is at the
>> moment.
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-14068
>>
>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>> >
>> > Hi .. I created a pyarrow table from a dictionary array as shown
>> > below. However I cannot figure out any easy way to get the mapping
>> > used to create the dictionary array (vals) easily from the table. Can
>> > you please let me know the easiest way? Other than the ones which
>> > involve pyarrow.compute/conversion to pandas as they are expensive
>> > operations for large datasets.
>> >
>> > import pyarrow as pa
>> > import pyarrow.compute as pc
>> > import pyarrow.dataset as ds
>> > import numpy as np
>> >
>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
>> > tab = pa.Table.from_arrays([x], names=['x'])
>> >
>> > Also since this is effectively a string array which is dictionary
>> > encoded, is there any way to use string compute kernels like
>> > starts_with etc. Right now I am aware of two methods and they are not
>> > straightforward.
>> >
>> > approach 1:
>> > Cast to string and then run string kernel
>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>> > columns={'x': pc.field('x')}, filter=expr).to_table()
>> >
>> > approach 2:
>> > filter using the corresponding indices assuming we have access to the
>> > dictionary
>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
>> > pc.is_in(x.indices, filter_)
>> >
>> > Approach 2 is better/faster .. but I am not able to figure out how to
>> > get the dictionary/indices assuming we start from a table read from
>> > parquet/feather.
>> >
>> > Thanks
>>
>
