Hello, I wanted to share a great experience I had with arrow, Python and R,
and possible contribute a package I wrote.

I work with many colleagues that build engineering systems in python. As a
data scientist I work almost entirely in R and Rcpp. To bridge the
engineering work with data science work my team uses rpy2. Our workflow is
usually this:

   1. Python is used to coordinate with other engineering systems
   2. Python ultimately gets a dataset that needs to be analyzed. This can
   take the form of parquet or arrow formatted data
   3. Using rpy2, python can pass data to R to be analyzed, and receive
   output from R at the end. Python can then coordinate with downstream systems

What we have found is that passing data from python to R can be quite slow.
Recently, I found that if python already has a pyarrow Table object in the
session, it can be passed to Rcpp as a SEXP through rpy2. Rcpp can use the
C++ arrow library to extract the underlying arrow Table object from the
pyarrow Table object, and materialize a data frame out of that. I have
found that I can transfer 20 million row datasets from python to R in ~10
seconds. This is particularly powerful when Python is already the driver of
the engineering systems, and the compute is pushed into R.

This is a huge advantage. Performance wise, this is the fastest way to
transfer data to R that we have seen. Culture wise, it means my engineering
and data science teams can collaborate much better as both teams operate on
arrow types.

I wrote an R package containing an Rcpp function
RcppReceiveArrowTableFromPython (link
<https://github.com/jeffwong-nflx/RcppPyArrow>) which receives the SEXP
from rpy2, unwraps the underlying arrow table, and then produces an R
dataframe. I am interested in contributing this package back to the Arrow
community, as I believe part of the spirit of Arrow is to facilitate
seamless data transfer across languages. Is the Arrow codebase a proper
home for such a package? This package has a dependency on having Python
headers and being able to link to libpython.so, will that complicate this
contribution?

-- 
Jeffrey Wong
Computational Causal Inference

Reply via email to