[ https://issues.apache.org/jira/browse/ARROW-12529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332559#comment-17332559 ]
Jared Lander commented on ARROW-12529:
--------------------------------------

Ubuntu 18.04 with R 4.0.5 and arrow 3.0.0.

> [R] Writing to Parquet from tibble Consumes Large Amount of Memory
> ------------------------------------------------------------------
>
>                 Key: ARROW-12529
>                 URL: https://issues.apache.org/jira/browse/ARROW-12529
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Jared Lander
>            Priority: Major
>
> When writing a large `tibble` to a parquet file, a large amount of memory is
> consumed. I first discovered this when using `targets::tar_read(obj)` to load
> in an object that had been saved in the parquet format. That particular
> object was an `sf` object with about 20 million rows and 26 columns. For a
> 5-6 GB object, memory ballooned by 22 GB.
> I wrote the following code to test this using a regular `tibble`, not `sf`.
> In this test memory increases dramatically when writing, but not when
> reading, which I'm still trying to figure out.
> {code:java}
> library(arrow)
> library(dplyr)
> library(lobstr)
> library(tictoc)
>
> n <- 10000000
>
> # build the test tibble, checking system memory before and after
> system('free -m')
> tic()
> fake <- tibble(
>     ID=seq(n),
>     x=runif(n=n, min=-170, max=170),
>     y=runif(n=n, min=-60, max=70),
>     text1=sample(x=state.name, size=n, replace=TRUE),
>     text2=sample(x=state.name, size=n, replace=TRUE),
>     text3=sample(x=state.division, size=n, replace=TRUE),
>     text4=sample(x=state.region, size=n, replace=TRUE),
>     text5=sample(x=state.abb, size=n, replace=TRUE),
>     num1=sample(x=state.center$x, size=n, replace=TRUE),
>     num2=sample(x=state.center$y, size=n, replace=TRUE),
>     num3=sample(x=state.area, size=n, replace=TRUE),
>     Rand1=rnorm(n=n),
>     Rand2=rnorm(n=n, mean=100, sd=3),
>     Rand3=rbinom(n=n, size=10, prob=0.4)
> )
> toc()
> system('free -m')
>
> # size of the tibble in GB
> obj_size(fake)/1024/1024/1024
>
> # write to parquet (the data/ directory must already exist)
> system('free -m')
> tic()
> write_parquet(fake, 'data/write_fake.parquet')
> toc()
> system('free -m')
>
> system('free -m')
> gc()
> system('free -m')
>
> # read the same file back
> system('free -m')
> tic()
> fake_parquet <- read_parquet('data/write_fake.parquet')
> toc()
> system('free -m')
> obj_size(fake_parquet)/1024/1024/1024
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)