[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531476#comment-17531476 ]

Weston Pace commented on ARROW-15081:
-------------------------------------

One mystery solved, a few more remain.  I managed to pinpoint the memory usage 
in my reproduction.  When we scan parquet files we store the metadata in the 
parquet fragment.  This was originally added to make it slightly quicker to 
re-read the dataset, since all of the metadata is cached.  However:

 * Parquet is the only format that does this, even though all formats follow a 
similar pattern
 * There is no way to turn it off
 * Parquet metadata can be surprisingly large (~10MB per file in your dataset; 
see the sketch below)
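
For a sense of scale, the footer of a single file can be inspected on its own.  
A minimal sketch (the local path below is hypothetical, not one of the files in 
the bucket):
{noformat}
library(arrow)

f <- "observations/part-0.parquet"   # hypothetical local copy of one file

# Creating the reader only parses the file footer -- the same metadata that
# gets cached on the fragment -- without reading any row data
reader <- ParquetFileReader$create(f)

# Footer size grows roughly with row groups x columns, which is how a wide
# file can end up carrying several MB of metadata
reader$num_row_groups
reader$num_columns
reader$num_rows
{noformat}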

I've created ARROW-16451 to track this.

If I turn this off (clear the metadata from the fragment when we are done 
scanning it) then the following query uses, at most, 292MB of RSS:
{noformat}
library(arrow)
library(dplyr)
packageVersion("arrow")
path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous=TRUE)
obs <- arrow::open_dataset(path)

tmp <- obs |> summarize(count = sum(observation_count, na.rm = TRUE),
                        .groups = "drop")
print(tmp |> collect(as_data_frame = FALSE))
{noformat}
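
The 292MB figure is the RSS of the whole R process.  One way to watch that from 
inside the same session is sketched below; the ps package here is an assumption 
on my part, not something the snippet above relies on:
{noformat}
library(ps)

# Resident set size of the current R process, in MB
ps_memory_info(ps_handle())[["rss"]] / 2^20
{noformat}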

If I run the query that you posted (i.e. with the group-by) I see considerably 
more memory usage.  Some of this is expected: I now need to keep track of the 
key values for the groups, and there are quite a few unique key combinations.  
For example, from the first 20 files (~20 million rows):

{noformat}
# A tibble: 19,997,847 × 3
   sampling_event_identifier scientific_name       count
   <chr>                     <chr>                 <int>
 1 S59303569                 Butorides striata         2
 2 S67152642                 Calidris alpina           1
 3 S51998273                 Setophaga virens          1
 4 S58666072                 Hirundo rustica           8
 5 S61157542                 Laniarius major           0
 6 S22508263                 Leiothlypis peregrina     0
 7 S16296356                 Pluvialis squatarola      0
 8 S47732847                 Progne subis             25
 9 S10054205                 Dendrocygna bicolor       0
10 S48511589                 Empidonax traillii        1
{noformat}
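
For reference, the grouped query behind that output is roughly the following 
(reconstructed from the column names above, so the exact call may differ):
{noformat}
grouped <- obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop")

print(grouped |> collect(as_data_frame = FALSE))
{noformat}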

Some of this memory is freed after garbage collection (ARROW-16452).

After I run gc() I get the amount of memory I would expect (e.g. 
arrow::default_memory_pool()$bytes_allocated == collected_dataframe$nbytes), 
but RSS still seems a little large.  I may do a touch more investigation there.
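
Concretely, that check looks roughly like this (a sketch; tmp is the query from 
the first snippet above):
{noformat}
result <- tmp |> collect(as_data_frame = FALSE)  # an Arrow Table, not an R data frame
gc()                                             # drop references R no longer needs

# Bytes still held by Arrow's default memory pool; after gc() this should be
# close to the size of the collected result
arrow::default_memory_pool()$bytes_allocated
{noformat}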

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird",
>                            endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash
> {code}
> I have attempted to split this large (~100 GB) parquet file into some smaller 
> files, which helps: 
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe, multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, btw, from a single large csv 
> file provided by the original data provider (eBird).  Unfortunately, 
> generating the partitioned versions is cumbersome because the data is very 
> unevenly distributed: there are few columns that avoid creating 1000s of 
> parquet partition files, and even so the bulk of the ~1 billion rows fall 
> within the same group.  All the same, I think this is a bug, as there's no 
> indication of why arrow cannot handle a single 100GB parquet file.)
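> 
> A sketch of the kind of write_dataset() call that produces such a partitioned 
> copy (the partition column and the per-file row cap below are illustrative 
> assumptions, not the exact call used):
> {code:r}
> library(arrow)
> big <- arrow::open_dataset("Oct-2021/observations")      # the single large file
> arrow::write_dataset(big, "partitioned",
>                      format = "parquet",
>                      partitioning = "scientific_name",    # assumed partition column
>                      max_rows_per_file = 10000000)        # assumed cap per output file
> {code}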
>  
> Let me know if I can provide more info! I'm testing in R with the latest CRAN 
> version of arrow on a machine with 200 GB of RAM. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
