[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531751#comment-17531751 ]

Carl Boettiger commented on ARROW-15081:
----------------------------------------

Thanks Weston, sounds promising!  Hmm... in this particular case I have control 
of the serialization (the official eBird dataset is distributed as a single giant 
tab-separated-values file inside a tarball), so I wonder if this suggests tweaks 
I can make on my end.  Previously I was actually serializing into a single 
parquet file, which would suggest lower memory use due to the metadata cache 
you mention. 
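
For concreteness, the kind of tweak I have in mind on my end would look roughly 
like this (paths and the partition column are placeholders, and I'm not sure 
which arrow release first exposed max_rows_per_file):
{code:r}
# Sketch only: go straight from the giant tab-separated dump to a set of
# moderately sized parquet files instead of one 100 GB file.
library(arrow)

tsv <- arrow::open_dataset("ebird/observations.txt", format = "tsv")

arrow::write_dataset(
  tsv,
  path = "ebird/partitioned",
  format = "parquet",
  partitioning = "year",          # hypothetical low-cardinality column
  max_rows_per_file = 25000000L   # only if the installed arrow exposes this option
)
{code}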

Would it be possible to serialize the metadata cache to a parquet file in 
tempdir rather than turning it off?  (I'm not sure it would really improve 
things, but it seems impossibly magical that arrow could do these operations 
with no writes to disk and in small memory.) 
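
To make the "no writes to disk" part concrete, the closest manual equivalent I 
can think of is to stream the remote dataset into local parquet under tempdir() 
and query that copy; whether this actually stays in small memory for a single 
100 GB file is exactly the part I'm unsure about:
{code:r}
# Manual "spill to tempdir" sketch: stream the remote dataset into local
# parquet files, then query the local copy instead of the remote one.
library(arrow)
library(dplyr)

server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
obs <- arrow::open_dataset(server$path("Oct-2021/observations"))

local_copy <- file.path(tempdir(), "observations")
arrow::write_dataset(obs, local_copy, format = "parquet")

arrow::open_dataset(local_copy) %>% count()
{code}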

Not sure if it is relevant, but I hit identical OOM issues doing this in pure 
duckdb (with local copies of the parquet); see 
[https://github.com/duckdb/duckdb/issues/3554].  Hannes suggests that on the 
duckdb side this is expected, since some operations are not yet done 
"out-of-core" (and he suggests a similar issue affects Arrow, I think?).

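For reference, the pure-duckdb version has roughly this shape (the local path 
is a placeholder; the exact query is in the linked issue):
{code:r}
# Rough shape of the pure-duckdb reproduction against a local copy of the
# parquet; the path is a placeholder.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbGetQuery(con, "SELECT count(*) FROM read_parquet('/local/path/observations.parquet')")
dbDisconnect(con, shutdown = TRUE)
{code}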
 

Anyway, thanks again for your sleuthing here!

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:r}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls()           # observe -- 1 parquet file
> obs %>% count()     # CRASH
> obs %>% to_duckdb() # also crashes
> {code}
> I have attempted to split this large (~100 GB) parquet file into some smaller 
> files, which helps: 
> {code:r}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, btw, from a single large csv 
> file provided by the original data provider (eBird).  Unfortunately, generating 
> the partitioned versions is cumbersome because the data is very unevenly 
> distributed: there are few columns that can avoid creating thousands of parquet 
> partition files, and even so the bulk of the ~1 billion rows fall within the 
> same group.  All the same, I think this is a bug, as there is no indication of 
> why arrow cannot handle a single 100 GB parquet file.)
>  
> Let me know if I can provide more info! I'm testing in R with the latest CRAN 
> version of arrow on a machine with 200 GB RAM. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
