[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528661#comment-17528661 ]
Zsolt Kegyes-Brassai commented on ARROW-16320:
----------------------------------------------

Hi [~westonpace]

I tried to create a reproducible example. As a first step I created a dummy dataset with nearly 100 M rows, with different column types and missing data. When writing this dataset to a parquet file I realized that even {{write_parquet()}} consumes a large amount of memory, which is not returned afterwards.

Here is the data generation part:

{code:java}
library(tidyverse)

n = 99e6 + as.integer(1e6 * runif(n = 1))
# n = 1000

a = tibble(
  key1 = sample(datasets::state.abb, size = n, replace = TRUE),
  key2 = sample(datasets::state.name, size = n, replace = TRUE),
  subkey1 = sample(LETTERS, size = n, replace = TRUE),
  subkey2 = sample(letters, size = n, replace = TRUE),
  value1 = runif(n = n),
  value2 = as.integer(1000 * runif(n = n)),
  time = as.POSIXct(1e8 * runif(n = n), tz = "UTC", origin = "2020-01-01")
) |>
  mutate(
    subkey1 = if_else(key1 %in% c("WA", "WV", "WI", "WY"), subkey1, NA_character_),
    subkey2 = if_else(key2 %in% c("Washington", "West Virginia", "Wisconsin", "Wyoming"), subkey2, NA_character_),
  )

lobstr::obj_size(a)
#> 5,177,583,640 B
{code}

This is the memory utilization after the dataset creation:

!100m_1_create.jpg!

Writing to an *{{rds}}* file

{code:java}
readr::write_rds(a, here::here("db", "test100m.rds"))
{code}

shows no visible increase in memory utilization:

!100m_2_rds.jpg!

Writing to a *parquet* file

{code:java}
arrow::write_parquet(a, here::here("db", "test100m.parquet"))
{code}

causes a drastic increase in memory utilization (10.6 GB -> 15 GB), just for writing the file:

!100m_3_parquet.jpg!

+The memory consumed while writing the parquet file was not returned even after 15 minutes.+

My biggest concern is that the ability to handle datasets larger than the available memory seems increasingly remote. I consider this a critical bug, but it may be affecting only me, as I have no possibility to test elsewhere.
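To narrow down where the growth lives, it can help to check how much is held by Arrow's own allocator rather than by R's heap. A minimal diagnostic sketch, assuming the {{default_memory_pool()}} accessor available in recent arrow R releases:

{code:java}
library(arrow)

# Snapshot the Arrow allocator before the write
pool <- default_memory_pool()
before <- pool$bytes_allocated

arrow::write_parquet(a, here::here("db", "test100m.parquet"))

# How much the Arrow allocator still holds after the write,
# and the peak it reached while writing
cat("Arrow allocator delta:", pool$bytes_allocated - before, "bytes\n")
cat("Arrow allocator peak: ", pool$max_memory, "bytes\n")
{code}

If {{bytes_allocated}} drops back close to the pre-write value while the OS still reports high process memory, the allocator (e.g. jemalloc/mimalloc) may simply be retaining freed pages rather than leaking them.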
> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg, Rgui_mem.jpg, Rstudio_env.jpg, Rstudio_mem.jpg
>
> A short background: I was trying to create a dataset from a big pile of csv files (a couple of hundred). In a first step the csv files were parsed and saved to parquet files, because there were many inconsistencies between the csv files. In a subsequent step the dataset was re-partitioned using one column (code_key).
>
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder,
>   format = "parquet",
>   unify_schemas = TRUE
> )
>
> new_dataset |>
>   group_by(code_key) |>
>   write_dataset(
>     folder_repartitioned_dataset,
>     format = "parquet"
>   )
> {code}
>
> This re-partitioning consumed a considerable amount of memory (5 GB).
> * Is this normal behavior? Or a bug?
> * Is there any rule of thumb to estimate the memory requirement for a dataset re-partitioning? (This is important when scaling up this approach.)
>
> The drawback is that this memory space is not freed up after the re-partitioning (I am using RStudio). {{gc()}} is useless in this situation, and there is no object associated with the repartitioning in the {{R}} environment which could be removed from memory (using the {{rm()}} function).
> * How can one regain the memory space used by the re-partitioning?
>
> The rationale behind choosing dataset re-partitioning: if my understanding is correct, in the current arrow version append is not supported when writing parquet files/datasets. (The original csv files were partly partitioned according to a different variable.)
> Can you recommend any better approach?
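One mitigation worth trying for the repartitioning write (a sketch, not a confirmed fix): recent arrow releases expose knobs on {{write_dataset()}}, such as {{max_rows_per_group}} and {{max_open_files}}, that bound how much data is buffered per open output file. Whether these arguments are available depends on the installed arrow version (the issue above was filed against 7.0.0):

{code:java}
library(arrow)
library(dplyr)

new_dataset <- open_dataset(
  temp_parquet_folder,
  format = "parquet",
  unify_schemas = TRUE
)

new_dataset |>
  group_by(code_key) |>
  write_dataset(
    folder_repartitioned_dataset,
    format = "parquet",
    max_rows_per_group = 1e6,  # smaller row groups -> smaller per-writer buffers
    max_open_files = 64        # fewer concurrently open partition writers
  )
{code}

Since each partition being written holds its pending row group in memory, shrinking the row-group size and the number of simultaneously open files should, in principle, cap the peak memory of the repartitioning step.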
-- This message was sent by Atlassian Jira (v8.20.7#820007)