[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489827#comment-17489827 ]

Weston Pace commented on ARROW-15081:
-------------------------------------

There is definitely something going on here.  When I run this on a 1GB dataset 
stored on local disk, I can verify that we only read ~29MB of the 1GB (which 
still seems a little high).  Extrapolating, I would expect to see something on 
the order of ~300MB read for a 100GB file.  Instead, when running against 
Carl's dataset, RAM usage climbs to 24GB, hits the limit on my system, and the 
process crashes.
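
For reference, here is a rough sketch of how that local-disk number can be 
reproduced (the dataset path is a placeholder, and the /proc/self/io trick is 
Linux-only and subject to page-cache effects, so treat the result as 
approximate):

{code:r}
# Rough, Linux-only check of how many bytes a count() reads from local disk.
# /proc/self/io is per-process, so it should also cover Arrow's C++ I/O
# threads, but page-cache effects make the number approximate.
library(arrow)
library(dplyr)

read_bytes <- function() {
  io <- readLines("/proc/self/io")
  as.numeric(sub("^read_bytes: ", "", grep("^read_bytes:", io, value = TRUE)))
}

ds <- open_dataset("/path/to/local/1gb-dataset")  # placeholder path

before <- read_bytes()
ds %>% count() %>% collect()
after <- read_bytes()

(after - before) / 1e6  # roughly the MB read from disk for the count()
{code}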

Even just opening the dataset seems to take quite a while.  I verified that the 
execution plan is the same whether we are using S3 or local disk.  It is:

{noformat}
ExecPlan with 5 nodes:
4:SinkNode{}
  3:ProjectNode{projection=[n]}
    2:ScalarAggregateNode{aggregates=[
        sum(n, {skip_nulls=true, min_count=1}),
]}
      1:ProjectNode{projection=["n": 1]}
        0:SourceNode{}
{noformat}
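
In dplyr terms, the count() that produces this plan expands to roughly the 
query below (a sketch; "obs" is the dataset opened in the report at the bottom 
of this issue, and the comments map each verb onto the plan nodes above):

{code:r}
obs %>%
  mutate(n = 1L) %>%          # 1:ProjectNode{projection=["n": 1]}
  summarise(n = sum(n)) %>%   # 2:ScalarAggregateNode + 3:ProjectNode{projection=[n]}
  collect()                   # 4:SinkNode pulls the single-row result
{code}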

Earlier I mentioned:

{quote}
I thought we had some special paths in place for this.
{quote}

We do, but the R bindings are not using them.  The special paths live in the 
scanner's "lightweight consumer API", while the dplyr bindings build and 
execute an exec plan directly, using the scanner only as a source of batches.
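
Concretely, something like the first call below hits the scanner fast path 
(which can answer a count from parquet metadata alone), while the dplyr 
pipeline builds and runs an exec plan and misses it.  This is a sketch: I'm 
assuming here that the R bindings expose Scanner$CountRows(); the underlying 
C++ fast path is Scanner::CountRows().

{code:r}
# Fast path: count answered by the scanner, largely from parquet metadata
# (assumes Scanner$CountRows() is exposed in the R bindings).
scanner <- Scanner$create(obs)
scanner$CountRows()

# Exec-plan path: what the dplyr bindings do today -- build and execute a plan
# whose source/scan node does not use the fast path.
obs %>% count() %>% collect()
{code}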

I'm guessing the solution will be to move this fast path out of the scanner 
and into the scan node as a scan option.  That still doesn't explain why we 
read so efficiently from the local filesystem, so I don't think I have the 
full picture yet.

I'll try to find some time to do some more instrumentation & debugging soon, 
probably Friday if I can't get to it earlier.

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash{code}
> I have attempted to split this large (~100 GB) parquet file into some smaller 
> files, which helps: 
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe, multiple parquet files now
> obs %>% count() 
>  {code}
> (These parquet files were also created by arrow, btw, from a single large 
> csv file provided by the original data provider (eBird).  Unfortunately, 
> generating the partitioned versions is cumbersome because the data is very 
> unevenly distributed: there are few columns that can partition it without 
> creating 1000s of parquet files, and even so the bulk of the 1 billion rows 
> fall within the same group.  All the same, I think this is a bug, as there's 
> no indication why arrow cannot handle a single 100GB parquet file.)
>  
> Let me know if I can provide more info! I'm testing in R with the latest 
> CRAN version of arrow on a machine with 200 GB RAM. 


