Hello,
I am a (learning) user of the Arrow R package on Windows. I am currently
focused on using Arrow to do data preparation on a bigger-than-memory set
of csv files, transforming them into parquet files for further analysis
with DuckDB. I have about 600 csv files, totaling about 200 GB, that were
dumped out of a database. I've had some luck with this, but for the
biggest table I am struggling to understand when Arrow may fill memory and
grind to a halt, versus when I should expect that Arrow can iterate
through.
For reproducibility, I worked through some examples with the nyc-taxi
dataset below. These do not fill my memory, but they do use more than I
expected, and I don't know how to free it without restarting the R session.
My questions:
1) When working with dplyr & datasets, are there parameters that determine
whether operations can be performed in the streaming/iterative fashion that
is needed when data is much bigger than memory?
2) I wasn't expecting write_dataset to keep consuming memory after it
finished. I don't think gc() or the pryr functions are able to clear or
measure memory used by Arrow. Are there different tools I should be using
here? Maybe I need to tell Arrow to limit its usage somehow?
3) The current documentation for write_dataset says you can't rename while
writing -- in my experience this did work. Is the reason for this note that,
in order to rename, Arrow will convert the dataset to an in-memory Table?
Based on my test, the memory usage didn't seem any lower without the rename,
but this was one of my theories of what was going on.
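Regarding (2), one thing I tried (I'm not sure this accounts for all of
Arrow's usage, so treat it as a sketch): the arrow package exposes its C++
memory pool, which R's gc() and pryr::mem_used() don't see, so it at least
gives a number to watch:

```r
library(arrow)

# Arrow allocates through its own C++ memory pool rather than the R heap,
# so gc() and pryr can't measure or free it. The pool object reports what
# Arrow itself currently holds:
pool <- default_memory_pool()
pool$bytes_allocated  # bytes currently held by Arrow's allocator
pool$max_memory       # high-water mark, in bytes
```

I watched bytes_allocated before and after write_dataset, but I don't know
whether a large value here is Arrow holding memory it will reuse, or memory
that is truly stuck.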
thanks,
Jameel
```
#### Read dataset -> write dataset ---------
library(tidyverse)
library(arrow)
library(duckdb)
# Do I understand the limitations of out of memory dataset manipulations?
packageVersion("arrow")
# [1] ‘5.0.0.20211016’
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
# The documentation for write_dataset says you can't rename in the process of writing.
# In @param dataset:
# "Note that select()-ed columns may not be renamed."
ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  rename(
    pickup_dttm = pickup_at,
    dropoff_dttm = dropoff_at
  ) %>%
  write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
# Starting memory usage: 420 MB (Task Manager - RStudio/R)
# Ending memory usage: 12,100 MB (Task Manager - RStudio/R)
# It does _work_, but a lot more memory is used. Task Manager sees the
# memory as used by the RStudio session, but RStudio sees the memory as
# used by the system. I am assuming it is Arrow, but I'm not sure how to
# control this, as e.g. there is no gc() for Arrow.
# RESTART R SESSION HERE TO RECOVER MEMORY
# It's possible that out-of-memory dataset operations can't use rename.
# If you do not rename, and only select:
ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
# Starting memory usage: 425 MB (Task Manager - RStudio/R)
# Ending memory usage: 10,600 MB (Task Manager - RStudio/R)
```
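For the original csv-to-parquet problem, my current fallback is to convert
one file at a time instead of opening everything as one dataset. This is
only a sketch -- "csv-dir" and "parquet-dir" are stand-in paths, and it
assumes each individual csv fits in memory:

```r
library(arrow)

# Stand-in paths; each csv is read fully into memory, so this only bounds
# memory at the size of the largest single file, not the whole 200 GB.
csv_files <- list.files("csv-dir", pattern = "\\.csv$", full.names = TRUE)
dir.create("parquet-dir", showWarnings = FALSE)

for (f in csv_files) {
  out <- file.path("parquet-dir",
                   paste0(tools::file_path_sans_ext(basename(f)), ".parquet"))
  tbl <- read_csv_arrow(f)  # returns a data.frame/tibble by default
  write_parquet(tbl, out)
  rm(tbl)
  gc()  # drop the R reference so the memory can be reclaimed between files
}
```

This loses the ability to repartition across files in one pass, which is
why I'd prefer to get the dataset approach working.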
--
Jameel Alsalam
(510) 717-9637
[email protected]