I could be wrong, but in my experience the dictionary is available at the chunk level, because that is where you know it is a DictionaryArray (or at least, an Array). At the column level, you only know it's a ChunkedArray, which seems to be roughly an alias for a vector<Array> (list[Array]), at least type-wise.
Also, I think each chunk usually references the same dictionary, so in that case you can access any chunk's dictionary and get the same one. As Weston notes below, though, chunks could have different mappings, so that is not guaranteed.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:

> Thank you very much for the response. I was looking directly at tab['x'].
> Didn't realize that the dictionary is present at chunk level.
>
> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:
>
>> > However I cannot figure out any easy way to get the mapping
>> > used to create the dictionary array (vals) easily from the table. Can
>> > you please let me know the easiest way?
>>
>> A dictionary is going to be associated with an array and not a table.
>> So you first need to get the array from the table. Tables are made of
>> columns and each column is made of chunks and each chunk is an array.
>> Each chunk could have a different mapping, so that is something you
>> may need to deal with at some point depending on your goal.
>>
>> The table you are creating in your example has one column and that
>> column has one chunk so we can get to the mapping with:
>>
>>     tab.column(0).chunks[0].dictionary
>>
>> And we can get to the indices with:
>>
>>     tab.column(0).chunks[0].indices
>>
>> > Also since this is effectively a string array which is dictionary
>> > encoded, is there any way to use string compute kernels like
>> > starts_with etc. Right now I am aware of two methods and they are not
>> > straightforward.
>>
>> Regrettably, I don't think we have kernels in place for string
>> functions on dictionary arrays. At least, that is my reading of [1].
>> So the two workarounds you have may be the best there is at the
>> moment.
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-14068
>>
>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>> >
>> > Hi .. I created a pyarrow table from a dictionary array as shown
>> > below. However I cannot figure out any easy way to get the mapping
>> > used to create the dictionary array (vals) easily from the table. Can
>> > you please let me know the easiest way? Other than the ones which
>> > involve pyarrow.compute/conversion to pandas as they are expensive
>> > operations for large datasets.
>> >
>> > import pyarrow as pa
>> > import pyarrow.compute as pc
>> > import pyarrow.dataset as ds  # needed for ds.Scanner in approach 1
>> > import numpy as np
>> >
>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
>> > tab = pa.Table.from_arrays([x], names=['x'])
>> >
>> > Also since this is effectively a string array which is dictionary
>> > encoded, is there any way to use string compute kernels like
>> > starts_with etc. Right now I am aware of two methods and they are not
>> > straightforward.
>> >
>> > approach 1:
>> > Cast to string and then run string kernel
>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>> >     columns={'x': pc.field('x')}, filter=expr).to_table()
>> >
>> > approach 2:
>> > filter using the corresponding indices assuming we have access to the dictionary
>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
>> > pc.is_in(x.indices, filter_)
>> >
>> > Approach 2 is better/faster .. but I am not able to figure out how to
>> > get the dictionary/indices assuming we start from a table read from
>> > parquet/feather.
>> >
>> > Thanks
>> >
