[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533976#comment-17533976 ]
Carl Boettiger commented on ARROW-15081:
----------------------------------------

Just a note that even with a single parquet file I see the same crash after exceeding my 50 GB of RAM, so I don't think the per-file parquet metadata is the main culprit here. It's probably the group-identities tracking, as you say. Would it be possible for arrow to spill this to local disk instead of keeping it all in RAM?

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The following should reproduce the crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crashes{code}
> I have attempted to split this large (~100 GB) parquet file into smaller files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files were also created by arrow, from a single large csv file provided by the original data provider (eBird). Unfortunately, generating the partitioned versions is cumbersome because the data is very unevenly distributed: few columns can partition the data without creating thousands of parquet files, and even then the bulk of the 1 billion rows fall within the same group. All the same, I think this is a bug, as there is no obvious reason why arrow cannot handle a single 100 GB parquet file.)
>
> Let me know if I can provide more info!
> I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB of RAM.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
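The repartitioning workaround described in the report requires picking a partition column, which is awkward given the skew the reporter mentions. A minimal sketch of an alternative, using toy local data rather than the eBird bucket, and assuming `write_dataset()`'s `max_rows_per_file` argument (added in a later arrow release than the one in this report) is available: it caps the size of each output file without needing any partition column at all.

```r
library(arrow)

# Hypothetical sketch: split a dataset into several parquet files by
# capping rows per file, instead of partitioning on a skewed column.
# max_rows_per_file is an assumption here (requires arrow >= 8.0.0).
tmp <- tempfile()
write_dataset(data.frame(x = 1:1000), tmp, format = "parquet",
              max_rows_per_file = 250)

# The 1000 rows should land in multiple part-*.parquet files,
# each holding at most 250 rows.
list.files(tmp)
```

Each output file then stays small enough to scan without materializing the whole dataset, regardless of how unevenly the data is distributed across any column.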