I'd like to add a question that I am not sure how to answer from looking at
the code:
- When writing a table in IPC format (in my case to a file), how does the
writer write a DictionaryArray if it spans multiple chunks?

My intuition is that if a DictionaryArray is split into multiple chunks,
then the encoded data would be split across those chunks as well. I
would guess it's straightforward to split the indices, and then the
dictionary ("symbol dictionary"?) would either have some duplicates
across chunks (each chunk is independent), or it would be shared across
the chunks (optimizing overall storage/representation, but then chunks
are not independent). I think this is relevant to delta dictionaries,
but I haven't yet worked out what a delta dictionary is.

When writing to storage, I assume that independent dictionaries per
batch are the most straightforward, but I can imagine various trade-offs
for other approaches.
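
For concreteness, here is a minimal sketch of the scenario I have in
mind (untested; I only spotted the unify_dictionaries and
emit_dictionary_deltas options while skimming IpcWriteOptions, so what
they actually do is my assumption):

    import pyarrow as pa

    # Two chunks whose dictionaries differ.
    chunk1 = pa.array(['aa', 'ab', 'aa']).dictionary_encode()
    chunk2 = pa.array(['ac', 'aa', 'ac']).dictionary_encode()
    tab = pa.table({'x': pa.chunked_array([chunk1, chunk2])})

    # unify_dictionaries=True should write one combined dictionary up
    # front; emit_dictionary_deltas=True should send only new entries
    # for later batches instead of a full replacement dictionary.
    opts = pa.ipc.IpcWriteOptions(unify_dictionaries=True)
    with pa.ipc.new_file('example.arrow', tab.schema, options=opts) as w:
        w.write_table(tab)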

Pointers to where to look for this would be appreciated too! I looked at
the IpcPayload code before, when reading the feather readers/writers,
but the hierarchy was initially hard to follow.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Apr 21, 2022 at 2:52 PM Weston Pace <[email protected]> wrote:

> There is some overhead with dataset.to_table() but I would expect it
> to be mostly amortized out if you have 50m rows (assuming this is 50
> million).  It may also depend on how many row groups you have.  Feel
> free to create a JIRA with a reproducible example.
>
> On Thu, Apr 21, 2022 at 10:29 AM Suresh V <[email protected]> wrote:
> >
> > Thanks for the explanation.  But I am now worried that approach 2,
> > using a string kernel like starts_with, will become more complicated,
> > as I have to run the filter in a Python loop over the chunks,
> > effectively slowing it down.
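> >
> > The loop I have in mind is roughly this (untested sketch):
> >
> >     masks = []
> >     for chunk in tab.column('x').chunks:
> >         # positions in this chunk's dictionary that match
> >         keep = np.where(pc.starts_with(chunk.dictionary, "a"))[0]
> >         keep = pa.array(keep, type=chunk.indices.type)
> >         masks.append(pc.is_in(chunk.indices, value_set=keep))
> >     filtered = tab.filter(pa.chunked_array(masks))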
> >
> > With direct access to the indices, approach 2 was 100% faster than
> > approach 1 for an array of 50m entries.
> >
> > Completely unrelated, I also noticed that reading a feather file via
> > the ipc/feather API is 10x faster (0.2s vs 2s) than dataset.to_table()
> > for 50m rows. I wouldn't expect there to be such a big difference.
> > Will create a bug if it isn't a known issue.
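> >
> > For reference, what I'm timing is roughly this (untested sketch;
> > paths are made up):
> >
> >     import time
> >     import pyarrow.feather as feather
> >     import pyarrow.dataset as ds
> >
> >     t0 = time.perf_counter()
> >     feather.read_table('data.feather')
> >     print('feather API:', time.perf_counter() - t0)
> >
> >     t0 = time.perf_counter()
> >     ds.dataset('data.feather', format='feather').to_table()
> >     print('dataset API:', time.perf_counter() - t0)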
> >
> > Thanks
> > On Thu, Apr 21, 2022, 4:07 PM Aldrin <[email protected]> wrote:
> >>
> >> I could be wrong, but in my experience the dictionary is available
> >> at the chunk level, because that is where you know it is a
> >> DictionaryArray (or at least, an Array). At the column level, you
> >> only know it's a ChunkedArray, which seems, type-wise, to be roughly
> >> an alias for a vector<Array> (list[Array]).
> >>
> >> Also, I think each chunk references the same dictionary, so you can
> >> access any chunk's dictionary and get the same one.
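> >>
> >> A quick (untested) way to check what I mean:
> >>
> >>     dicts = [chunk.dictionary for chunk in tab.column('x').chunks]
> >>     # if the chunks share a dictionary, these all compare equal
> >>     print(all(d.equals(dicts[0]) for d in dicts))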
> >>
> >> Aldrin Montana
> >> Computer Science PhD Student
> >> UC Santa Cruz
> >>
> >>
> >> On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:
> >>>
> >>> Thank you very much for the response. I was looking directly at
> >>> tab['x']. Didn't realize that the dictionary is present at the
> >>> chunk level.
> >>>
> >>> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]>
> >>> wrote:
> >>>>
> >>>> > However, I cannot figure out an easy way to get the mapping
> >>>> > used to create the dictionary array (vals) from the table. Can
> >>>> > you please let me know the easiest way?
> >>>>
> >>>> A dictionary is going to be associated with an array and not a table.
> >>>> So you first need to get the array from the table.  Tables are made of
> >>>> columns and each column is made of chunks and each chunk is an array.
> >>>> Each chunk could have a different mapping, so that is something you
> >>>> may need to deal with at some point depending on your goal.
> >>>>
> >>>> The table you are creating in your example has one column and that
> >>>> column has one chunk so we can get to the mapping with:
> >>>>
> >>>>     tab.column(0).chunks[0].dictionary
> >>>>
> >>>> And we can get to the indices with:
> >>>>
> >>>>     tab.column(0).chunks[0].indices
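> >>>>
> >>>> If your table had more than one chunk, you could gather every
> >>>> chunk's mapping with something like this (untested):
> >>>>
> >>>>     for chunk in tab.column(0).chunks:
> >>>>         print(chunk.dictionary, chunk.indices)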
> >>>>
> >>>> > Also, since this is effectively a string array which is
> >>>> > dictionary encoded, is there any way to use string compute
> >>>> > kernels like starts_with etc.? Right now I am aware of two
> >>>> > methods and they are not straightforward.
> >>>>
> >>>> Regrettably, I don't think we have kernels in place for string
> >>>> functions on dictionary arrays.  At least, that is my reading of
> >>>> [1].  So the two workarounds you have may be the best there is at
> >>>> the moment.
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/ARROW-14068
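> >>>>
> >>>> For what it's worth, the cast workaround can also be applied
> >>>> without a scanner, e.g. (untested sketch):
> >>>>
> >>>>     decoded = tab.column('x').cast(pa.string())
> >>>>     mask = pc.starts_with(decoded, 'a')
> >>>>     filtered = tab.filter(mask)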
> >>>>
> >>>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]>
> >>>> wrote:
> >>>> >
> >>>> > Hi... I created a pyarrow table from a dictionary array as shown
> >>>> > below. However, I cannot figure out an easy way to get the
> >>>> > mapping used to create the dictionary array (vals) from the
> >>>> > table. Can you please let me know the easiest way? (Other than
> >>>> > approaches that involve pyarrow.compute or conversion to pandas,
> >>>> > as those are expensive operations for large datasets.)
> >>>> >
> >>>> > import pyarrow as pa
> >>>> > import pyarrow.compute as pc
> >>>> > import pyarrow.dataset as ds
> >>>> > import numpy as np
> >>>> >
> >>>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
> >>>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
> >>>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
> >>>> > tab = pa.Table.from_arrays([x], names=['x'])
> >>>> >
> >>>> > Also, since this is effectively a string array which is
> >>>> > dictionary encoded, is there any way to use string compute
> >>>> > kernels like starts_with etc.? Right now I am aware of two
> >>>> > methods and they are not straightforward.
> >>>> >
> >>>> > Approach 1: cast to string, then run the string kernel.
> >>>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
> >>>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
> >>>> >     columns={'x': pc.field('x')}, filter=expr).to_table()
> >>>> >
> >>>> > Approach 2: filter using the corresponding indices, assuming we
> >>>> > have access to the dictionary.
> >>>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
> >>>> > pc.is_in(x.indices, filter_)
> >>>> >
> >>>> > Approach 2 is better/faster... but I am not able to figure out
> >>>> > how to get the dictionary/indices assuming we start from a table
> >>>> > read from parquet/feather.
> >>>> >
> >>>> > Thanks
>
