[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531476#comment-17531476 ]
Weston Pace commented on ARROW-15081:
-------------------------------------

One mystery solved, a few more remain. I managed to pinpoint the memory usage in my reproduction. When we scan parquet files we store the metadata in the parquet fragment. This was originally added to make it slightly quicker to re-read the dataset, because all of the metadata information is cached. However:
 * Parquet is the only format to do this, despite the fact that all formats follow a similar pattern
 * There is no way to turn this off
 * Parquet metadata can be surprisingly large (~10MB per file in your dataset)

I've created ARROW-16451 to track this. If I turn this off (clear the metadata from the fragment when we are done scanning it) then the following query uses, at most, 292MB of RSS:

{noformat}
library(arrow)
library(dplyr)
packageVersion("arrow")

path <- arrow::s3_bucket("ebird/Mar-2022/observations",
                         endpoint_override = "minio.carlboettiger.info",
                         anonymous = TRUE)
obs <- arrow::open_dataset(path)
tmp <- obs |>
  summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop")
print(tmp |> collect(as_data_frame = FALSE))
{noformat}

If I run the query that you posted (e.g. with the group-by) I see considerably more memory usage. Some of this is expected, because I now need to keep track of the key values for the groups and there are quite a few unique key combinations.
For example, from the first 20 files (~20 million rows):

{noformat}
# A tibble: 19,997,847 × 3
   sampling_event_identifier scientific_name       count
   <chr>                     <chr>                 <int>
 1 S59303569                 Butorides striata         2
 2 S67152642                 Calidris alpina           1
 3 S51998273                 Setophaga virens          1
 4 S58666072                 Hirundo rustica           8
 5 S61157542                 Laniarius major           0
 6 S22508263                 Leiothlypis peregrina     0
 7 S16296356                 Pluvialis squatarola      0
 8 S47732847                 Progne subis             25
 9 S10054205                 Dendrocygna bicolor       0
10 S48511589                 Empidonax traillii        1
{noformat}

Some of this is freed after garbage collection (ARROW-16452). After I run gc() I get the amount of memory I would expect (e.g. arrow::default_memory_pool()$bytes_allocated == collected_dataframe$nbytes) but RSS still seems a little large. I may do a touch more investigation there.

> [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
> ------------------------------------------------------------------------
>
>                 Key: ARROW-15081
>                 URL: https://issues.apache.org/jira/browse/ARROW-15081
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Assignee: Weston Pace
>            Priority: Major
>
> The below should be a reproducible crash:
> {code:java}
> library(arrow)
> library(dplyr)
> server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
> path <- server$path("Oct-2021/observations")
> obs <- arrow::open_dataset(path)
> path$ls() # observe -- 1 parquet file
> obs %>% count() # CRASH
> obs %>% to_duckdb() # also crash{code}
> I have attempted to split this large (~100 GB) parquet file into some smaller files, which helps:
> {code:java}
> path <- server$path("partitioned")
> obs <- arrow::open_dataset(path)
> obs$ls() # observe, multiple parquet files now
> obs %>% count()
> {code}
> (These parquet files have also been created by arrow, btw, from a single large csv file provided by the original data provider (eBird).
> Unfortunately, generating the partitioned versions is cumbersome, as the data is very unevenly distributed; there are few columns that can avoid creating 1000s of parquet partition files, and even so the bulk of the 1-billion rows fall within the same group. But all the same, I think this is a bug, as there's no indication why arrow cannot handle a single 100GB parquet file.)
>
> Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB RAM.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)