> Please remove me from this email list

Can you send a message to [email protected] with no subject? That will remove you from the list. Most of us do not have permission to remove people from mailing lists.

> I am curious what the recommended approach is for processing values in arrow
> arrays/tables? Is a simple for loop fine most of the time, or is the
> arrow::Iterator recommended, or is it better to somehow define a compute
> function and let arrow do all of the iteration itself?

Probably the main thing to avoid when getting started is calling `GetScalar` on every single item. Something like casting to an Int32Array and using an iterator over raw_values should be very efficient. If you can use an existing compute function then by all means use it. Defining your own compute function isn't going to magically make performance better, but it does make it easier to efficiently provide implementations for many different types of data and, once you're past the learning curve, compute functions can take a lot of the boilerplate out.

> The simple approach I can think of is to: (1) take the column of interest,
> (2) iterate over it and note the indices of values to drop, and (3) use
> Table::Slice and arrow::ConcatenateTables to produce a result Table. This
> feels like I'm missing out on some things that arrow may provide at least
> for the 2nd step.

Instead of gathering indices to drop I think you can gather indices to keep. You can then use compute::Take to get the subset of your table. Otherwise, making many calls to Table::Slice (e.g. if you end up needing to keep every other row) might be inefficient.

> The better approach to step (2) above, would be to use
> arrow::compute::Unique, but instead of producing unique values, produce
> indexes. This way I could perhaps also setup a function Options that could
> choose to keep the first duplicate, or keep the last, etc.

A version of compute::Unique that returns indices sounds like a pretty useful feature to me. Even if you don't create this I'd recommend creating a JIRA for it (unless someone knows of one that already exists).

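To make the "gather indices to keep" idea concrete, here is a rough sketch of that step. It is not Arrow code: it only uses the standard library and takes a plain `const int32_t*`, which is the kind of pointer `Int32Array::raw_values()` would give you. The names `IndicesToKeep` and `Keep` are made up for illustration, and the sketch assumes the column is a single chunk with no nulls (a raw value pointer knows nothing about the validity bitmap).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical option: which occurrence of a duplicated value survives.
enum class Keep { kFirst, kLast };

// Hypothetical helper: returns the row indices to keep, in ascending order,
// so that each distinct value in `values` appears exactly once.
// Assumes no nulls and a single contiguous buffer of `length` values.
std::vector<int64_t> IndicesToKeep(const int32_t* values, int64_t length,
                                   Keep keep = Keep::kFirst) {
  assert(values != nullptr || length == 0);
  std::vector<int64_t> indices;
  std::unordered_set<int32_t> seen;

  if (keep == Keep::kFirst) {
    // Forward scan: the first time a value is seen, its row is kept.
    for (int64_t i = 0; i < length; ++i) {
      if (seen.insert(values[i]).second) indices.push_back(i);
    }
  } else {
    // Backward scan: the first time a value is seen from the end is its
    // last occurrence; reverse afterwards to restore ascending order.
    for (int64_t i = length - 1; i >= 0; --i) {
      if (seen.insert(values[i]).second) indices.push_back(i);
    }
    std::reverse(indices.begin(), indices.end());
  }
  return indices;
}
```

The resulting indices could then be put into an integer array and handed to compute::Take together with the table, so the row selection happens in one call instead of many Table::Slice calls.
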
> My C++ is not particularly advanced, so I find it hard to know where to start
> for adapting an existing compute function (also, it is very hard to search
> for the unique function because of "unique_ptr").

I find it a little confusing too :). Eduardo Ponce had started work on a guide of sorts (https://github.com/apache/arrow/pull/10296/files). I'm not sure what the status is for this. It might be an ok place to start. Also, a dirty hack: when searching for compute functions I add _doc to the end of my search string (e.g. unique.*_doc) and it finds the docstrings for the functions and I can usually work back from there.

On Thu, Sep 30, 2021 at 11:48 AM Burke Kaltenberger <[email protected]> wrote:
>
> Please remove me from this email list
>
> On Thu, Sep 30, 2021, 11:45 AM Aldrin <[email protected]> wrote:
>>
>> Hello!
>>
>> I am curious what the recommended approach is for processing values in arrow
>> arrays/tables? Is a simple for loop fine most of the time, or is the
>> arrow::Iterator recommended, or is it better to somehow define a compute
>> function and let arrow do all of the iteration itself? I provide context
>> below for what I'm trying to do, and hopefully it makes clear why I'm asking
>> this question, and what it is I'm asking.
>>
>> For reference, I am trying to remove rows based on duplicates in a
>> particular column. There doesn't seem to be a compute function that already
>> does this, and I can't think of a way to compose existing functions to get
>> what I need. I can think of a simple approach I can implement, and an
>> approach that requires a slight modification of an existing compute function.
>>
>> The simple approach I can think of is to: (1) take the column of interest,
>> (2) iterate over it and note the indices of values to drop, and (3) use
>> Table::Slice and arrow::ConcatenateTables to produce a result Table. This
>> feels like I'm missing out on some things that arrow may provide at least
>> for the 2nd step.
>>
>> The better approach to step (2) above, would be to use
>> arrow::compute::Unique, but instead of producing unique values, produce
>> indexes. This way I could perhaps also setup a function Options that could
>> choose to keep the first duplicate, or keep the last, etc.
>>
>> My C++ is not particularly advanced, so I find it hard to know where to
>> start for adapting an existing compute function (also, it is very hard to
>> search for the unique function because of "unique_ptr").
>>
>> Thanks for any help and advice!
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
