[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529181#comment-17529181 ]
Weston Pace commented on ARROW-16320:
-------------------------------------

The writing behavior you described seemed odd, so I modified your script a little (and added a memory printout which, sadly, will only work on Linux):

{noformat}
> print_rss <- function() {
+   print(grep("vmrss", readLines("/proc/self/status"), ignore.case=TRUE, value=TRUE))
+ }
>
> n = 99e6 + as.integer(1e6 * runif(n = 1))
> a =
+   tibble(
+     key1 = sample(datasets::state.abb, size = n, replace = TRUE),
+     key2 = sample(datasets::state.name, size = n, replace = TRUE),
+     subkey1 = sample(LETTERS, size = n, replace = TRUE),
+     subkey2 = sample(letters, size = n, replace = TRUE),
+     value1 = runif(n = n),
+     value2 = as.integer(1000 * runif(n = n)),
+     time = as.POSIXct(1e8 * runif(n = n), tz = "UTC", origin = "2020-01-01")
+   ) |>
+   mutate(
+     subkey1 = if_else(key1 %in% c("WA", "WV", "WI", "WY"),
+                       subkey1, NA_character_),
+     subkey2 = if_else(key2 %in% c("Washington", "West Virginia", "Wisconsin", "Wyoming"),
+                       subkey2, NA_character_),
+   )
> lobstr::obj_size(a)
5,171,792,240 B
> print("Memory usage after creating the tibble")
[1] "Memory usage after creating the tibble"
> print_rss()
[1] "VmRSS:\t 5159276 kB"
>
> readr::write_rds(a, here::here("db", "test100m.rds"))
> print("Memory usage after writing rds")
[1] "Memory usage after writing rds"
> print_rss()
[1] "VmRSS:\t 5161776 kB"
>
> arrow::write_parquet(a, here::here("db", "test100m.parquet"))
> print("Memory usage after writing parquet")
[1] "Memory usage after writing parquet"
> print_rss()
[1] "VmRSS:\t 8990620 kB"
> Sys.sleep(5)
> print("And after sleeping 5 seconds")
[1] "And after sleeping 5 seconds"
> print_rss()
[1] "VmRSS:\t 8990620 kB"
> print(gc())
            used   (Mb) gc trigger    (Mb)   max used   (Mb)
Ncells    892040   47.7    1749524    93.5    1265150   67.6
Vcells 647980229 4943.7 1392905158 10627.1 1240800333 9466.6
> Sys.sleep(5)
> print("And again after a garbage collection and 5 more seconds")
[1] "And again after a garbage collection and 5 more seconds"
> print_rss()
[1] "VmRSS:\t 5377900 kB"
{noformat}

Summarizing...

{noformat}
Create table             ~5.15 GB RAM used
Write RDS                ~5.16 GB RAM used
Write Parquet            ~9 GB RAM used
Wait 5 seconds           ~9 GB RAM used
Run garbage collection
Wait 5 seconds           ~5.38 GB RAM used
{noformat}

This doesn't seem terribly ideal. I think that, after writing, some R objects are holding references (possibly transitively) to shared pointers to record batches in C++. When the garbage collection runs, those R objects are destroyed and the shared pointers (and the buffers behind them) can be freed.
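To see this separately from the R heap, one could watch the Arrow C++ memory pool directly; recent versions of the R package expose it via {{arrow::default_memory_pool()}} and its {{bytes_allocated}} field. A minimal sketch along those lines follows ({{report_arrow_mem}} is a hypothetical helper, and {{a}} is the tibble built in the transcript above):

{code:r}
library(arrow)

# Hypothetical helper: report bytes currently held by Arrow's C++ memory
# pool. These allocations are invisible to R's gc() until the R wrapper
# objects holding the shared_ptrs are finalized.
report_arrow_mem <- function(label) {
  pool <- arrow::default_memory_pool()
  cat(sprintf("%s: %s bytes held by Arrow C++\n",
              label, format(pool$bytes_allocated, big.mark = ",")))
}

report_arrow_mem("before write")
arrow::write_parquet(a, "test100m.parquet")
report_arrow_mem("after write")  # expected to stay elevated here
invisible(gc())                  # finalizes the wrappers, releasing buffers
report_arrow_mem("after gc()")   # should drop once the shared_ptrs are gone
{code}

If {{bytes_allocated}} drops after {{gc()}} while VmRSS stays flat, the remaining gap would point at allocator retention (e.g. jemalloc holding freed pages) rather than live Arrow buffers.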
> Dataset re-partitioning consumes considerable amount of memory
> ---------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg,
>         100m_4_read_rds.jpg, 100m_5_read-parquet.jpg, Rgui_mem.jpg,
>         Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv
> files (a couple of hundred). In the first step the csv files were parsed and
> saved to parquet files, because there were many inconsistencies between the
> csv files. In a subsequent step the dataset was re-partitioned using one
> column (code_key).
>
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder,
>   format = "parquet",
>   unify_schemas = TRUE
> )
> new_dataset |>
>   group_by(code_key) |>
>   write_dataset(
>     folder_repartitioned_dataset,
>     format = "parquet"
>   )
> {code}
>
> This re-partitioning consumed a considerable amount of memory (5 GB).
> * Is this normal behavior? Or a bug?
> * Is there any rule of thumb for estimating the memory requirement of a
> dataset re-partitioning? (This is important when scaling up the approach.)
> The drawback is that this memory is not freed up after the re-partitioning
> (I am using RStudio). {{gc()}} is useless in this situation, and there is no
> object in the {{R}} environment associated with the repartitioning that
> could be removed from memory (using the {{rm()}} function).
> * How can one regain the memory used by the re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding
> is correct, appending is not supported in the current arrow version when
> writing parquet files/datasets. (The original csv files were partly
> partitioned according to a different variable.)
> Can you recommend any better approach?

--
This message was sent by Atlassian Jira
(v8.20.7#820007)