Jared Lander created ARROW-12529:
------------------------------------

             Summary: Writing to Parquet from tibble Consumes Large Amount of Memory
                 Key: ARROW-12529
                 URL: https://issues.apache.org/jira/browse/ARROW-12529
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Jared Lander
When writing a large `tibble` to a Parquet file, a large amount of memory is consumed. I first discovered this when using `targets::tar_read(obj)` to load an object that had been saved in the Parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory ballooned by 22 GB. I wrote the following code to test this using a regular `tibble`, not `sf`. In this test memory increases dramatically when writing, but not when reading, which I'm still trying to figure out.

{code:r}
library(arrow)
library(dplyr)
library(lobstr)
library(tictoc)

n <- 10000000

# memory before and after building the fake data
system('free -m')
tic()
fake <- tibble(
    ID=seq(n),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70),
    text1=sample(x=state.name, size=n, replace=TRUE),
    text2=sample(x=state.name, size=n, replace=TRUE),
    text3=sample(x=state.division, size=n, replace=TRUE),
    text4=sample(x=state.region, size=n, replace=TRUE),
    text5=sample(x=state.abb, size=n, replace=TRUE),
    num1=sample(x=state.center$x, size=n, replace=TRUE),
    num2=sample(x=state.center$y, size=n, replace=TRUE),
    num3=sample(x=state.area, size=n, replace=TRUE),
    Rand1=rnorm(n=n),
    Rand2=rnorm(n=n, mean=100, sd=3),
    Rand3=rbinom(n=n, size=10, prob=0.4)
)
toc()
system('free -m')

# object size in GB
obj_size(fake)/1024/1024/1024

# memory before and after writing
system('free -m')
tic()
write_parquet(fake, 'data/write_fake.parquet')
toc()
system('free -m')

# memory before and after garbage collection
system('free -m')
gc()
system('free -m')

# memory before and after reading the file back
system('free -m')
tic()
fake_parquet <- read_parquet('data/write_fake.parquet')
toc()
system('free -m')

obj_size(fake_parquet)/1024/1024/1024
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)