Re: Best practice for data iteration over arrays or tabular data

Weston Pace Thu, 30 Sep 2021 15:29:25 -0700

> Please remove me from this email list

Can you send a message to [email protected] with no
subject?  That will remove you from the list.  Most of us do not have
permission to remove people from mailing lists.

> I am curious what the recommended approach is for processing values in arrow 
> arrays/tables? Is a simple for loop fine most of the time, or is the 
> arrow::Iterator recommended, or is it better to somehow define a compute 
> function and let arrow do all of the iteration itself?

Probably the main thing to avoid when getting started is calling
`GetScalar` on every single item.  Something like casting to an
Int32Array and iterating over raw_values should be very efficient.
If you can use an existing compute function then by all means use it.
Defining your own compute function isn't going to magically make
performance better, but it does make it easier to efficiently provide
implementations for many different types of data and, once you're
past the learning curve, it can take a lot of the boilerplate out.
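
As a rough sketch of the raw-pointer loop: the function below uses only
the standard library so it compiles without Arrow, and SumValues is a
made-up name for illustration.  With Arrow, the pointer and length would
come from the Int32Array's raw_values() and length().  It assumes the
array has no nulls, since the raw buffer alone does not say which slots
are null.

```cpp
#include <cassert>
#include <cstdint>

// Sums a contiguous run of int32 values through a raw pointer.
// Hypothetical helper: `values` and `length` stand in for what an
// Int32Array's raw_values() and length() would provide.
int64_t SumValues(const int32_t* values, int64_t length) {
  assert(length >= 0);
  int64_t total = 0;
  for (int64_t i = 0; i < length; ++i) {
    total += values[i];  // plain memory read, no per-element Scalar object
  }
  return total;
}
```

The point is only that the loop body touches memory directly; calling
`GetScalar` inside the same loop would build a Scalar object for every
element.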

> The simple approach I can think of is to: (1) take the column of interest, 
> (2) iterate over it and note the indices of values to drop, and (3) use 
> Table::Slice and arrow::ConcatenateTables to produce a result Table. This 
> feels like I'm missing out on some things that arrow may provide at least 
> for the 2nd step.

Instead of gathering indices to drop I think you can gather indices to
keep.  You can then use compute::Take to get the subset of your table.
Otherwise, making many calls to Table::Slice (e.g. if you end up
needing to keep every other row) might be inefficient.
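
A minimal sketch of the "indices to keep" idea, on plain vectors so it
stands alone.  KeepFirstIndices and Gather are hypothetical names;
Gather only imitates what compute::Take would do for one column, and
the key column is assumed to have no nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Returns the index of the first occurrence of each distinct value,
// in ascending row order. Assumes the key column has no nulls.
std::vector<int64_t> KeepFirstIndices(const int32_t* values, int64_t length) {
  assert(length >= 0);
  std::unordered_set<int32_t> seen;
  std::vector<int64_t> keep;
  for (int64_t i = 0; i < length; ++i) {
    // insert().second is true only the first time a value is seen.
    if (seen.insert(values[i]).second) keep.push_back(i);
  }
  return keep;
}

// Hypothetical stand-in for compute::Take on a single column: copies
// the rows named by `indices` in one pass.
template <typename T>
std::vector<T> Gather(const std::vector<T>& column,
                      const std::vector<int64_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (int64_t i : indices) {
    out.push_back(column.at(static_cast<std::size_t>(i)));
  }
  return out;
}
```

For a key column {5, 3, 5, 7, 3} the kept indices are {0, 1, 3}.  With
Arrow, those indices would go into an Int64Array and be handed to
compute::Take together with the table, so the whole table is subset in
one call rather than many slices.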

> The better approach to step (2) above, would be to use 
> arrow::compute::Unique, but instead of producing unique values, produce 
> indexes. This way I could perhaps also setup a function Options that could 
> choose to keep the first duplicate, or keep the last, etc.

A version of compute::Unique that returns indices sounds like a pretty
useful feature to me.  Even if you don't implement it yourself, I'd
recommend opening a JIRA for it (unless someone knows of one that
already exists).
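
To illustrate what an index-returning unique with a keep-first or
keep-last option might compute, here is a standalone sketch.  It is not
an existing Arrow function: UniqueIndices and KeepPolicy are
hypothetical names, it handles int32 only, and it assumes no nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical option: which row to keep among duplicates.
enum class KeepPolicy { kFirst, kLast };

// Returns one row index per distinct value, in ascending row order.
std::vector<int64_t> UniqueIndices(const int32_t* values, int64_t length,
                                   KeepPolicy policy) {
  assert(length >= 0);
  std::unordered_map<int32_t, int64_t> chosen;  // value -> index kept so far
  for (int64_t i = 0; i < length; ++i) {
    if (policy == KeepPolicy::kLast) {
      chosen[values[i]] = i;         // later rows overwrite earlier ones
    } else {
      chosen.emplace(values[i], i);  // no-op if the value is already present
    }
  }
  std::vector<int64_t> indices;
  indices.reserve(chosen.size());
  for (const auto& entry : chosen) indices.push_back(entry.second);
  std::sort(indices.begin(), indices.end());  // hash map order is arbitrary
  return indices;
}
```

For {5, 3, 5, 7, 3}, kFirst gives {0, 1, 3} and kLast gives {2, 3, 4}.
A real compute kernel would also need to cover other types, nulls, and
chunked input.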

> My C++ is not particularly advanced, so I find it hard to know where to start 
> for adapting an existing compute function (also, it is very hard to search 
> for the unique function because of "unique_ptr").

I find it a little confusing too :).  Eduardo Ponce had started work
on a guide of sorts
(https://github.com/apache/arrow/pull/10296/files).  I'm not sure what
the status is for this.  It might be an ok place to start.

Also, a dirty hack: when searching for compute functions I add _doc to
the end of my search string (e.g. unique.*_doc).  That finds the
docstrings for the functions, and I can usually work back from there.

On Thu, Sep 30, 2021 at 11:48 AM Burke Kaltenberger
<[email protected]> wrote:
>
> Please remove me from this email list
>
> On Thu, Sep 30, 2021, 11:45 AM Aldrin <[email protected]> wrote:
>>
>> Hello!
>>
>> I am curious what the recommended approach is for processing values in arrow 
>> arrays/tables? Is a simple for loop fine most of the time, or is the 
>> arrow::Iterator recommended, or is it better to somehow define a compute 
>> function and let arrow do all of the iteration itself? I provide context 
>> below for what I'm trying to do, and hopefully it makes clear why I'm asking 
>> this question, and what it is I'm asking.
>>
>> For reference, I am trying to remove rows based on duplicates in a 
>> particular column. There doesn't seem to be a compute function that already 
>> does this, and I can't think of a way to compose existing functions to get 
>> what I need. I can think of a simple approach I can implement, and an 
>> approach that requires a slight modification of an existing compute function.
>>
>> The simple approach I can think of is to: (1) take the column of interest, 
>> (2) iterate over it and note the indices of values to drop, and (3) use 
>> Table::Slice and arrow::ConcatenateTables to produce a result Table. This 
>> feels like I'm missing out on some things that arrow may provide at least 
>> for the 2nd step.
>>
>> The better approach to step (2) above, would be to use 
>> arrow::compute::Unique, but instead of producing unique values, produce 
>> indexes. This way I could perhaps also setup a function Options that could 
>> choose to keep the first duplicate, or keep the last, etc.
>>
>> My C++ is not particularly advanced, so I find it hard to know where to 
>> start for adapting an existing compute function (also, it is very hard to 
>> search for the unique function because of "unique_ptr").
>>
>> Thanks for any help and advice!
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
