[ 
https://issues.apache.org/jira/browse/ARROW-16641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542849#comment-17542849
 ] 

Vladimir commented on ARROW-16641:
----------------------------------

Will, thank you very much for your support!
The solution you proposed works very well.


However, there is an issue applying it to a real dataset. It looks like the 
`{{{}call_function("list_parent_indices"){}}}` does not work with a 
`{{{}FileSystemDataset{}}}` (parquet files opened with 
`{{{}arrow::open_dataset(){}}}`). So to prepare an "exploded table", we need to 
load the full dataset into memory. And the "exploded table" becomes huge if 
there are multiple flags per record. So I'm running out of RAM very quickly 
(>100 GB).

I've also tried the other way to filter the records - using `DuckDB`+`arrow` 
with  SQL query of form "{{{}SELECT * FROM my_table WHERE 'FlagA' != ANY 
(my_table.issue) AND 'FlagB' != ANY (my_table.issue){}}}". But this approach is 
also very RAM-consuming.

So probably currently, there is no simple way to perform this filtering in one 
pass, and we need to split the data into chunks.

It would be pretty cool if one day `arrow` would have a built-in function to 
perform such a task on the fly!

PS. The dataset we are working on is in open access - it's [GBIF occurrence 
records|https://github.com/gbif/occurrence/blob/master/aws-public-data.md]. It 
has almost 2 billion records, and some of the records could have ~10 flags 
(column `issue`).



 

> [R] How to filter array columns?
> --------------------------------
>
>                 Key: ARROW-16641
>                 URL: https://issues.apache.org/jira/browse/ARROW-16641
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: R
>            Reporter: Vladimir
>            Priority: Minor
>             Fix For: 8.0.0
>
>
> In the parquet data we have, there is a column with the array data type 
> ({*}list<array_element <string>>{*}), which flags records that have different 
> issues. For each record, multiple values could be stored in the column. For 
> example, `{_}[A, B, C]{_}`.
> I'm trying to perform a data filtering step and exclude some flagged records.
> Filtering is trivial for the regular columns that contain just a single 
> value. E.g.,
> {code:java}
> flags_to_exclude <- c("A", "B")
> datt %>% filter(! col %in% flags_to_exclude)
> {code}
> Given the array column, is it possible to exclude records with at least one 
> of the flags from `flags_to_exclude` using the arrow R package?
> I really appreciate any advice you can provide!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to