Hey Everyone!

I've been converting between Parquet <-> Spark DataFrames <-> R data
frames for large data sets. I have found the conversion quite slow on
the Spark <-> R side and am looking for some insight into how to speed
it up (or to figure out what I have failed to do properly)!

In R, "SparkR::collect" and "SparkR::write.df" take much longer than
Spark reading and writing Parquet directly. While these aren't the same
operations, the difference suggests a bottleneck in the translation
between R data frames and Spark DataFrames. A profile of the SparkR
code shows that R spends a large portion of its time in
"SparkR:::readTypedObject", "SparkR:::readBin", and
"SparkR:::readObject". To me, this suggests that the serialization step
accounts for the slow speed, but I don't want to guess too much. Any
thoughts on how to speed up the conversion?
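
For reference, the round trip I'm timing looks roughly like the sketch
below (simplified: the paths and schema are placeholders, the R -> Spark
step is shown via "createDataFrame", and this is the Spark 2.0 API; on
1.6 I go through sparkR.init()/sparkRSQL.init() instead).

    library(SparkR)
    sparkR.session()

    # Parquet -> Spark DataFrame: fast
    sdf <- read.df("data.parquet", source = "parquet")

    # Spark DataFrame -> R data frame: the slow collect
    system.time(rdf <- collect(sdf))

    # R data frame -> Spark DataFrame -> Parquet: the slow write path
    sdf2 <- createDataFrame(rdf)
    system.time(write.df(sdf2, "out.parquet", source = "parquet",
                         mode = "overwrite"))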

Details:
Tried with Spark 2.0 and 1.6.1 (and the corresponding SparkR package) and R 3.3.0.
On a MacBook Pro, 16 GB RAM, quad core.

+--------+-----------+-----------------+------------------+
| # Rows | # Columns | SparkR::collect | SparkR::write.df |
+--------+-----------+-----------------+------------------+
| 600K   | 20        | 3 min           | 6 min            |
+--------+-----------+-----------------+------------------+
| 1.8M   | 20        | 9 min           | 20 min           |
+--------+-----------+-----------------+------------------+
| 600K   | 1         | 40 sec          | 4 min            |
+--------+-----------+-----------------+------------------+
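
In case it helps, the profile mentioned above was collected along these
lines (a sketch; the output file name and the exact collect call are
simplified from my actual job):

    Rprof("collect_profile.out")
    rdf <- collect(sdf)   # sdf is the SparkDataFrame read from Parquet
    Rprof(NULL)
    # readTypedObject / readBin / readObject dominate the summary for me
    summaryRprof("collect_profile.out")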

Thanks!
Jonathan
