[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531783#comment-17531783 ]

Weston Pace commented on ARROW-15081:
-------------------------------------

Moving to one file instead of many files will save you ~10-11GB of RAM today.  
I don't know if that is enough to prevent a crash.

Serializing the metadata cache should be possible.  I think there is a bigger 
question around whether we should invent a serialization format for the 
datasets API or adopt an existing one (there is some discussion of this in 
ARROW-15317).

I would guess the out-of-core part Hannes is referring to is keeping track of 
the group identities.  The {{group_by(sampling_event_identifier, 
scientific_name)}} operation is the culprit here.  If there are millions or 
billions of distinct {{(sampling_event_identifier, scientific_name)}} 
combinations then the output table is going to be very large.  That is where 
the bulk of the memory usage in your query comes from.
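
For reference, the query under discussion is presumably something along these 
lines (a sketch only, reusing {{obs}} from the reproducer below):

{code:r}
obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  count() |>
  collect()
{code}

Every distinct {{(sampling_event_identifier, scientific_name)}} pair needs its 
own entry in the aggregation state, so memory grows with the number of groups 
rather than with the number of rows per group.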

We also cannot stream the output because the very last row of your data might 
just happen to be in the same group as the very first row of your data.  So we 
don't know that our {{count}} is correct until we've seen every single row.

There are a few ways around this inability to stream, but these are 
longer-term goals.

We can spill the groupings to disk as we perform the query.  If the last row 
does happen to belong to the first group we saw, then we have to load that 
group back from disk, update it, and put it back on disk.  Once we've seen all 
the rows we can stream the result from our temporary disk storage.  This is 
what I assume Hannes is referring to when he describes an out-of-core operator.
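
A minimal sketch of that spill-to-disk idea in plain R (not Arrow's 
implementation; the toy hash, file layout, and function names are all 
illustrative):

{code:r}
library(dplyr)
library(readr)

# Toy out-of-core count: hash-partition rows to temporary CSV files, then
# aggregate each partition separately.
out_of_core_count <- function(batches, keys, n_parts = 8, dir = tempdir()) {
  files <- file.path(dir, sprintf("part-%02d.csv", seq_len(n_parts) - 1))

  # Phase 1: spill each batch, routing rows by a hash of the key columns so
  # that all rows of a given group end up in the same partition file.
  for (batch in batches) {
    key  <- do.call(paste, batch[keys])
    part <- vapply(key, function(k) sum(utf8ToInt(k)) %% n_parts, numeric(1)) + 1
    for (p in unique(part)) {
      write_csv(batch[part == p, , drop = FALSE], files[p],
                append = file.exists(files[p]))
    }
  }

  # Phase 2: count groups one partition at a time and concatenate the
  # (disjoint) partial results.
  bind_rows(lapply(files[file.exists(files)], function(f) {
    read_csv(f, show_col_types = FALSE) |>
      count(across(all_of(keys)))
  }))
}

# usage: out_of_core_count(batches,
#                          c("sampling_event_identifier", "scientific_name"))
{code}

Because every row of a group lands in the same partition, phase 2 only ever 
needs one partition's raw rows and groups in memory at a time; a real operator 
would also stream the per-partition results onward rather than concatenating 
them in memory as this toy does.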

Another approach we could take involves sorting/partitioning.  If the incoming 
data is sorted or partitioned by one of those two columns then we could emit 
results early.  For example, if we know the incoming data is sorted on 
{{sampling_event_identifier}}, then as soon as we stop seeing a given sampling 
event (e.g. once we've finished reading S10054205) we know we can emit all 
combinations of {{(S10054205, scientific_name)}}.
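
A sketch of that early-emission idea, assuming the batches really are sorted 
on {{sampling_event_identifier}} (the function and callback names here are 
made up):

{code:r}
library(dplyr)

# Toy streaming count over input sorted on sampling_event_identifier: once an
# identifier can no longer appear, its counts are emitted immediately.
emit_sorted_counts <- function(batches, emit) {
  carry <- NULL  # rows whose identifier may continue into the next batch
  for (batch in batches) {
    batch   <- bind_rows(carry, batch)
    last_id <- batch$sampling_event_identifier[nrow(batch)]
    done    <- filter(batch, sampling_event_identifier != last_id)
    carry   <- filter(batch, sampling_event_identifier == last_id)
    if (nrow(done) > 0) {
      emit(count(done, sampling_event_identifier, scientific_name))
    }
  }
  if (!is.null(carry) && nrow(carry) > 0) {
    emit(count(carry, sampling_event_identifier, scientific_name))
  }
}

# usage: emit_sorted_counts(batches, emit = print)
{code}

Peak memory is then roughly one batch plus the largest single sampling event, 
rather than the total number of groups.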

Of course, if you are doing {{|> collect()}} to pull the entire result into 
memory, it won't make much difference.  However, if you're able to process 
results incrementally, then either of these two approaches could allow you to 
run this query with much less RAM.
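
Even today, one way to avoid materializing the full result as an R data.frame 
(assuming your arrow version supports writing this query directly; the 
{{observation_counts}} output directory is just an example) is to stream it to 
disk rather than {{collect()}} it:

{code:r}
obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  count() |>
  write_dataset("observation_counts", format = "parquet")
{code}

The aggregation state still has to fit in memory on the C++ side, but the 
result is never converted into an R data.frame.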

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:r}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash
> {code}
> I have attempted to split this large (~100 GB parquet file) into some smaller 
> files, which helps: 
> {code:r}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe, multiple parquet files now
> obs %>% count() 
>  {code}
> (These parquet files were also created by arrow, btw, from a single large 
> csv file provided by the original data provider (eBird).  Unfortunately, 
> generating the partitioned versions is cumbersome because the data is very 
> unevenly distributed: there are few columns that can avoid creating 
> thousands of parquet partition files, and even so the bulk of the ~1 billion 
> rows fall within the same group.  All the same, I think this is a bug, as 
> there's no indication why arrow cannot handle a single 100GB parquet file.)
>  
> Let me know if I can provide more info!  I'm testing in R with the latest 
> CRAN version of arrow on a machine with 200 GB RAM.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
