There is some overhead with dataset.to_table(), but I would expect it
to be mostly amortized away if you have 50m rows (assuming this is 50
million).  It may also depend on how many row groups you have.  Feel
free to create a JIRA with a reproducible example.
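
If you want a starting point for the repro, something along these lines
is what I'd compare (just a sketch; "data.feather" is a placeholder for
your 50m-row file):

    import time

    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    path = "data.feather"  # placeholder; substitute your 50m-row file

    t0 = time.perf_counter()
    tab1 = feather.read_table(path)
    print("feather.read_table:", time.perf_counter() - t0, "s")

    t0 = time.perf_counter()
    tab2 = ds.dataset(path, format="feather").to_table()
    print("dataset.to_table:  ", time.perf_counter() - t0, "s")

    # both paths should produce the same number of rows
    assert tab1.num_rows == tab2.num_rows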

On Thu, Apr 21, 2022 at 10:29 AM Suresh V <[email protected]> wrote:
>
> Thanks for the explanation.  But I am now worried that approach 2, using a
> string kernel like starts_with, will become more complicated, as I have to
> run the filter in a Python loop, effectively slowing it down.
>
> When we had access to the indices directly, approach 2 was 100% faster than
> approach 1 for an array of 50m entries.
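>
> Roughly what I mean by the loop (a sketch; assumes tab is the table from
> before and 'x' is the dictionary column):
>
>     import numpy as np
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     masks = []
>     for chunk in tab.column('x').chunks:  # Python-level loop over chunks
>         keep = np.where(pc.starts_with(chunk.dictionary, "a"))[0]
>         masks.append(pc.is_in(chunk.indices, value_set=pa.array(keep)))
>     filtered = tab.filter(pa.chunked_array(masks))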
>
> Completely unrelated, I also noticed that reading a feather file through the
> ipc/feather API is 10x faster (0.2s vs 2s) than dataset.to_table() for 50m
> rows. I wouldn't expect there to be such a big difference. I will create a
> bug if it isn't a known issue.
>
> Thanks
> On Thu, Apr 21, 2022, 4:07 PM Aldrin <[email protected]> wrote:
>>
>> I could be wrong, but in my experience the dictionary is available at the
>> chunk level, because that is where you know it is a DictionaryArray (or at
>> least, an Array). At the column level, you only know it's a ChunkedArray,
>> which seems to roughly be an alias for a vector<Array> (list[Array]), at
>> least type-wise.
>>
>> Also, I think each chunk references the same dictionary, so I think you can 
>> access any chunk's dictionary and get the same one.
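>>
>> For example, I'd expect something like this to hold (just a sketch,
>> assuming tab is your table from before):
>>
>>     dicts = [c.dictionary for c in tab.column('x').chunks]
>>     # if the chunks share one dictionary, these should all compare equal
>>     assert all(d.equals(dicts[0]) for d in dicts)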
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:
>>>
>>> Thank you very much for the response. I was looking directly at tab['x'].
>>> Didn't realize that the dictionary is present at the chunk level.
>>>
>>> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:
>>>>
>>>> > However, I cannot figure out any easy way to get the mapping
>>>> > used to create the dictionary array (vals) back from the table. Can
>>>> > you please let me know the easiest way?
>>>>
>>>> A dictionary is going to be associated with an array, not a table.
>>>> So you first need to get the array from the table.  Tables are made of
>>>> columns, each column is made of chunks, and each chunk is an array.
>>>> Each chunk could have a different mapping, so that is something you
>>>> may need to deal with at some point, depending on your goal.
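>>>>
>>>> If you do end up with chunks that have different mappings, I believe
>>>> you can rewrite them to share a single dictionary first, e.g.:
>>>>
>>>>     tab = tab.unify_dictionaries()
>>>>     # now every chunk of tab.column(0) should use the same mapping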
>>>>
>>>> The table you are creating in your example has one column and that
>>>> column has one chunk so we can get to the mapping with:
>>>>
>>>>     tab.column(0).chunks[0].dictionary
>>>>
>>>> And we can get to the indices with:
>>>>
>>>>     tab.column(0).chunks[0].indices
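>>>>
>>>> For the example below, those should give back:
>>>>
>>>>     tab.column(0).chunks[0].dictionary  # ["aa", "ab", "ac", "ba", "bb", "bc"]
>>>>     tab.column(0).chunks[0].indices     # [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]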
>>>>
>>>> > Also, since this is effectively a string array that is
>>>> > dictionary-encoded, is there any way to use string compute kernels
>>>> > like starts_with, etc.? Right now I am aware of two methods, and
>>>> > they are not straightforward.
>>>>
>>>> Regrettably, I don't think we have kernels in place for string
>>>> functions on dictionary arrays.  At least, that is my reading of [1].
>>>> So the two workarounds you have may be the best there are at the
>>>> moment.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/ARROW-14068
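>>>>
>>>> In the meantime, a compact variant of your approach 1 that skips the
>>>> scanner entirely would be something like (a sketch):
>>>>
>>>>     mask = pc.starts_with(tab.column('x').cast(pa.string()), "a")
>>>>     filtered = tab.filter(mask)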
>>>>
>>>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>>>> >
>>>> > Hi .. I created a pyarrow table from a dictionary array as shown
>>>> > below. However, I cannot figure out any easy way to get the mapping
>>>> > used to create the dictionary array (vals) back from the table. Can
>>>> > you please let me know the easiest way, other than the ones which
>>>> > involve pyarrow.compute or conversion to pandas, as those are
>>>> > expensive operations for large datasets?
>>>> >
>>>> > import numpy as np
>>>> > import pyarrow as pa
>>>> > import pyarrow.compute as pc
>>>> > import pyarrow.dataset as ds  # needed by the Scanner in approach 1
>>>> >
>>>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']  # the dictionary (mapping)
>>>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]    # indices into vals
>>>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
>>>> > tab = pa.Table.from_arrays([x], names=['x'])
>>>> >
>>>> > Also since this is effectively a string array which is dictionary
>>>> > encoded, is there any way to use string compute kernels like
>>>> > starts_with etc. Right now I am aware of two methods and they are not
>>>> > straightforward.
>>>> >
>>>> > approach 1: cast to string and then run the string kernel
>>>> >
>>>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
>>>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>>> >                         columns={'x': pc.field('x')},
>>>> >                         filter=expr).to_table()
>>>> >
>>>> > approach 2: filter using the corresponding indices, assuming we have
>>>> > access to the dictionary
>>>> >
>>>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
>>>> > pc.is_in(x.indices, value_set=pa.array(filter_))
>>>> >
>>>> > Approach 2 is better/faster, but I am not able to figure out how to
>>>> > get the dictionary/indices when we start from a table read from
>>>> > parquet/feather.
>>>> >
>>>> > Thanks
