Thank you for your interest! As you are discovering, it can be a
slightly tricky topic. Disclaimer: I am neither an R expert nor a
Windows expert, but I have done some work on the underlying C++ feature.

Some of the limitations you are running into are probably bugs and not
intentional limitations. In particular, there was a regression in
4.0.0 and 5.0.0 [1]: a slow consumer of data (and writing a dataset is
typically a slow consumer) could cause Arrow to use too much RAM and
eventually crash out of memory, because Arrow was not backing off on
the read when the consumer fell behind. This issue has recently been
fixed and, using a very recent build, I confirmed that I was able to
read a 30GB CSV dataset and write it to parquet. The peak RAM usage of
the Arrow process was 1GB. Note that this was on Linux and not Windows,
but that shouldn't matter.
Some of this may be due to the way the OS handles disk I/O. As users,
we are typically used to smallish disk writes (e.g. a couple of GB or
less), and from a user's perspective these writes are often
non-blocking and very fast. In reality, the write only pushes the data
into the OS's disk cache and returns as soon as that memcpy is done.
The actual write to the physical disk happens behind the scenes (and
can even happen outside the lifetime of the process). By default, this
disk cache is (I think; I'm not sure about Windows [2]) allowed to
consume all available RAM. Once that happens, additional writes (and
possibly regular allocations) will be slowed down while the OS waits
for the data to be persisted to disk so that the RAM can be reused.

At the moment, Arrow does nothing to prevent this OS cache from
filling up. This may be something we can investigate in future
releases; it is an interesting question what the best behavior would be.
When working with dplyr & datasets, are there parameters that determine whether
operations can be performed in a streaming/iterative form that is needed when data
is much bigger than memory?
There are a few operations which are not implemented by Arrow, and I
believe that when this situation is encountered Arrow will load the
entire dataset into memory and apply the operation in R. That would
definitely be a bad thing for your goals. I'm not sure of the exact
details of which operations trigger this, but the basic select /
rename operations you have should be fine.

There are also a few operations which are implemented in Arrow but
will force the data to be buffered in memory. These are arrange (or
anything that orders data, like top-k) and join.
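To make the distinction concrete, here is a rough sketch (the output
paths and the filter are placeholders, and it assumes a 6.0.0-era build
where arrange() is available on datasets):

```
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# select / rename / filter become a streaming projection and filter in the
# C++ layer, so batches flow straight through to the dataset writer:
ds %>%
  select(vendor_id, pickup_at, year, month) %>%
  filter(year == 2019) %>%
  write_dataset("nyc-taxi-2019", partitioning = c("year", "month"))

# arrange() (like join) has to see every row before it can emit any output,
# so the data gets buffered in memory first:
ds %>%
  select(vendor_id, pickup_at, year, month) %>%
  arrange(pickup_at) %>%
  write_dataset("nyc-taxi-sorted")
```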
I wasn't expecting write_dataset to continue consuming memory when finished. I
don't think gc() or pryr functions are able to clear or measure memory used by
Arrow. Are there different tools I should be using here? Maybe I need to be
telling Arrow to limit usage somehow?
After write_dataset is finished, the process should not be holding on
to any memory; if it is, that is a bug. However, the OS may still be
holding onto data in the disk cache, waiting for it to be flushed to
disk. A good quick test is to check
"arrow::default_memory_pool()$bytes_allocated", which reports how much
memory Arrow believes it is using. If this is 0, that is a good bet
(though by no means a guarantee) that anything the Arrow library has
allocated has been released. On Windows, the Sysinternals tool
RAMMap [3] might give you some more information about how much data is
in the disk cache.
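As a concrete version of that quick test (just a sketch; the number
printed will depend on your session):

```
library(arrow)

# After write_dataset() has returned, ask Arrow's allocator how much memory
# it still believes it is holding:
arrow::default_memory_pool()$bytes_allocated

# If this is 0 (or very small) but Task Manager still shows a lot of memory
# in use, the difference is most likely the OS disk cache rather than Arrow.
```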
The current documentation for write_dataset says you can't rename while writing
-- in my experience this did work. Is the reason for this note that in order to
rename, Arrow will change the dataset to an in-memory Table? Based on my test,
the memory usage didn't seem less, but this was one of my theories of what was
going on.
The note here is quite old and the functions it describes have changed
a lot in the last year. My guess is that this is a relic from the time
when the dplyr functions were handled differently; maybe someone else
can chime in to verify. From what I can tell, a rename in R is
translated into a select, which is translated into a project in the
C++ layer. A project operation should be able to operate in a
streaming fashion and will not force the data to be buffered in memory.
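In other words, something like the following should stream (a sketch
reusing the nyc-taxi layout from your example; printing the lazy query
only shows the projected output columns, it does not materialize any
data):

```
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# The rename is folded into the projection (a "project" node in the C++ plan):
q <- ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  rename(pickup_dttm = pickup_at, dropoff_dttm = dropoff_at)

print(q)  # shows the query's output columns, not the data

write_dataset(q, "nyc-taxi-renamed", partitioning = c("year", "month"))
```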
In summary, out-of-core processing is something I think many of us are
interested in and want to support. Basic out-of-core manipulation
(repartitioning, transforming from one format to another) should be
pretty well supported in 6.0.0, but it might consume all available OS
RAM as disk cache. Out-of-core work in general is still getting
started, and you will hopefully continue to see improvements as we work
on it. For example, I hope future releases will be able to support
out-of-core joins and ordering by spilling to disk.
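For your 600-file CSV-to-parquet case, the basic pattern would look
something like this (the paths and partitioning columns are
placeholders, and it assumes the csv files share a schema so they can
be opened as a single dataset):

```
library(arrow)

# Open the directory of csv files as one lazy dataset; nothing is read
# into memory at this point:
csv_ds <- open_dataset("path/to/csv-dump", format = "csv")

# Stream the data back out as partitioned parquet files:
write_dataset(csv_ds,
              "path/to/parquet-out",
              format = "parquet",
              partitioning = c("year", "month"))
```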
[1] https://issues.apache.org/jira/browse/ARROW-13611
[2] https://docs.microsoft.com/en-us/windows/win32/fileio/file-caching
[3] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap
On Mon, Oct 18, 2021 at 11:34 AM Jameel Alsalam <[email protected]> wrote:
Hello,
I am a (learning) user of the Arrow R package on Windows. I am currently
focused on using Arrow to do data preparation on a bigger-than-my-memory set of
csv files, transforming them into parquet files for further analysis with
DuckDB. I have about 600 csv files, totaling about 200 GB, which were dumped
out of a database. I've had luck doing some of this, but for the biggest table
I am struggling to understand when Arrow may fill memory and grind to a halt,
versus when I should expect that Arrow can iterate through.

For reproducibility, I did some work with the nyc-taxi dataset below. These
examples do not fill my memory, but they do use up more than I expected, and I
don't know how to free it without restarting the R session.
My questions:
1) When working with dplyr & datasets, are there parameters that determine
whether operations can be performed in a streaming/iterative form that is needed
when data is much bigger than memory?
2) I wasn't expecting write_dataset to continue consuming memory when finished.
I don't think gc() or pryr functions are able to clear or measure memory used
by Arrow. Are there different tools I should be using here? Maybe I need to be
telling Arrow to limit usage somehow?
3) The current documentation for write_dataset says you can't rename while
writing -- in my experience this did work. Is the reason for this note that in
order to rename, Arrow will change the dataset to an in-memory Table? Based on
my test, the memory usage didn't seem less, but this was one of my theories of
what was going on.
thanks,
Jameel
```
#### Read dataset -> write dataset ---------
library(tidyverse)
library(arrow)
library(duckdb)
# Do I understand the limitations of out of memory dataset manipulations?
packageVersion("arrow")
# [1] ‘5.0.0.20211016’
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
# The documentation for write_dataset says you can't rename in the process
# of writing.
# In @param dataset:
# "Note that select()-ed columns may not be renamed."
ds %>%
select(vendor_id, pickup_at, dropoff_at, year, month) %>%
rename(
pickup_dttm = pickup_at,
dropoff_dttm = dropoff_at
) %>%
write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
# Starting memory usage: 420 MB (task manager - RStudio/R)
# Ending memory usage: 12,100 MB (task manager - RStudio/R)
# it does _work_, but a lot more memory is used. Task manager sees the memory
# as used by the RStudio session, but Rstudio sees the memory as used by the
# system. I am assuming it is Arrow, but I'm not sure how to control this,
# as e.g., there is no gc() for Arrow.
# RESTART R SESSION HERE TO RECOVER MEMORY
# It's possible that out-of-memory dataset operations can't use rename.
# If you do not rename, and only select:
ds %>%
select(vendor_id, pickup_at, dropoff_at, year, month) %>%
write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
# starting memory usage: 425 MB (Task manager - for Rstudio/R)
# end usage: 10,600 MB (task manager - for Rstudio/R)
```
--
Jameel Alsalam
(510) 717-9637
[email protected]