[ 
https://issues.apache.org/jira/browse/ARROW-12529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331599#comment-17331599
 ] 

Jared Lander commented on ARROW-12529:
--------------------------------------

It appears to get non-linearly worse with larger data. Try setting n to 
20,000,000 and converting x and y into a geometry column. Memory usage 
gets really out of hand.
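
A minimal sketch of that variation, under some assumptions: the comment does not say how the geometry column is built, so this uses `sf::st_as_sf()` with EPSG:4326 as one plausible way, and n is kept small here so the check runs quickly.

```r
library(dplyr)
library(sf)
library(lobstr)

# Small n for a quick check; the comment suggests 20,000,000.
n <- 100000

fake <- tibble(
    ID=seq(n),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70)
)

# Assumption: st_as_sf() with WGS84 (EPSG:4326) stands in for whatever
# conversion was actually used. By default the x and y columns are
# replaced by a single geometry column of points.
fake_sf <- st_as_sf(fake, coords=c('x', 'y'), crs=4326)

# Compare the in-memory size before and after the conversion.
obj_size(fake)
obj_size(fake_sf)
```

The resulting `fake_sf` could then be passed through the same `write_parquet()` timing steps as in the report below; whether a given arrow version serializes the geometry column as-is has not been verified here.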

> [R] Writing to Parquet from tibble Consumes Large Amount of Memory
> ------------------------------------------------------------------
>
>                 Key: ARROW-12529
>                 URL: https://issues.apache.org/jira/browse/ARROW-12529
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Jared Lander
>            Priority: Major
>
> When writing a large `tibble` to a parquet file, a large amount of memory is 
> consumed. I first discovered this when using `targets::tar_read(obj)` to load 
> in an object that had been saved in the parquet format. That particular 
> object was an `sf` object with about 20 million rows and 26 columns. For a 
> 5-6 GB object, memory ballooned by 22 GB.
> I wrote the following code to test this using a regular `tibble`, not `sf`. 
> In this test memory increases dramatically when writing, but not when 
> reading, which I'm still trying to figure out.
> {code:java}
> library(arrow)
> library(dplyr)
> library(lobstr)
> library(tictoc)
>
> n <- 10000000
>
> system('free -m')
> tic()
> fake <- tibble(
>     ID=seq(n),
>     x=runif(n=n, min=-170, max=170),
>     y=runif(n=n, min=-60, max=70),
>     text1=sample(x=state.name, size=n, replace=TRUE),
>     text2=sample(x=state.name, size=n, replace=TRUE),
>     text3=sample(x=state.division, size=n, replace=TRUE),
>     text4=sample(x=state.region, size=n, replace=TRUE),
>     text5=sample(x=state.abb, size=n, replace=TRUE),
>     num1=sample(x=state.center$x, size=n, replace=TRUE),
>     num2=sample(x=state.center$y, size=n, replace=TRUE),
>     num3=sample(x=state.area, size=n, replace=TRUE),
>     Rand1=rnorm(n=n),
>     Rand2=rnorm(n=n, mean=100, sd=3),
>     Rand3=rbinom(n=n, size=10, prob=0.4)
> )
> toc()
> system('free -m')
>
> obj_size(fake)/1024/1024/1024
>
> system('free -m')
> tic()
> write_parquet(fake, 'data/write_fake.parquet')
> toc()
> system('free -m')
>
> system('free -m')
> gc()
> system('free -m')
>
> system('free -m')
> tic()
> fake_parquet <- read_parquet('data/write_fake.parquet')
> toc()
> system('free -m')
> obj_size(fake_parquet)/1024/1024/1024
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
