Re: Slow collecting of large Spark Data Frames into R

2016-06-11 Thread Sun Rui
Hi, Jonathan,

Thanks for reporting. This is a known issue that the community would like to 
address later.

Please refer to https://issues.apache.org/jira/browse/SPARK-14037. It would help if 
you could profile your use case using the method discussed in that JIRA issue and 
paste the resulting metrics into it. That information would be useful for 
addressing the issue.
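
As a rough sketch, collecting wall-clock timings could look something like this 
(Spark 2.0 SparkR API; the Parquet path is only a placeholder, and the JIRA 
discussion is the authoritative reference for what to measure):

    library(SparkR)

    # Spark 2.0 API; on 1.6.x use sparkR.init()/sparkRSQL.init() instead
    sparkR.session()

    # Placeholder input -- substitute your own data set
    sdf <- read.df("/tmp/example.parquet", source = "parquet")

    # Time the Spark -> R transfer and record the size of the result
    timing <- system.time(local_df <- collect(sdf))
    print(timing)
    print(dim(local_df))

    sparkR.session.stop()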

> On Jun 11, 2016, at 08:31, Jonathan Mortensen  wrote:
> 
> 16 GB



Slow collecting of large Spark Data Frames into R

2016-06-10 Thread Jonathan Mortensen
Hey Everyone!

I've been converting between Parquet <-> Spark Data Frames <-> R data
frames for larger data sets. I have found the conversion quite slow on
the Spark <-> R side and am looking for some insight into how to speed
it up (or to determine what I have failed to do properly)!
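
For context, the round trip looks roughly like the following sketch (paths are 
placeholders; the Spark 2.0 API is shown):

    library(SparkR)
    sparkR.session()

    # Parquet -> Spark DataFrame (fast)
    sdf <- read.df("/tmp/input.parquet", source = "parquet")

    # Spark DataFrame -> local R data frame (one of the slow steps)
    rdf <- collect(sdf)

    # ... work on the data in R ...

    # R data frame -> Spark DataFrame -> Parquet (the other slow step)
    sdf_out <- createDataFrame(rdf)
    write.df(sdf_out, "/tmp/output.parquet", source = "parquet", mode = "overwrite")

    sparkR.session.stop()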

In R, "sparkR::collect" and "sparkR::write.df" take much longer than
Spark reading and writing Parquet. While these aren’t the same
operations, the difference suggests that there is a bottleneck within
the translation between R data frames and Spark Data Frames. A profile
of the SparkR code shows that R is spending a large portion of its
time within "sparkR:::readTypedObject", "sparkR:::readBin", and
"sparkR:::readObject". To me, this suggests that the serialization
step accounts for the slow speed, but I don't want to guess too much.
Any thoughts on how to speed the conversion?
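
If anyone wants to reproduce the profile, something along these lines with base 
R's Rprof (the path is a placeholder) shows where R spends its time during collect:

    library(SparkR)
    sparkR.session()

    sdf <- read.df("/tmp/example.parquet", source = "parquet")  # placeholder path

    # Sample the R call stack while collect() runs
    Rprof("collect.prof")
    rdf <- collect(sdf)
    Rprof(NULL)

    # Summarize self-time per function; readTypedObject/readObject/readBin
    # dominate in my runs
    print(summaryRprof("collect.prof")$by.self)

    sparkR.session.stop()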

Details:
Tried with Spark 2.0 and 1.6.1 (and the associated SparkR package) and R 3.3.0.
On a MacBook Pro, 16 GB RAM, quad-core.

+--------+-----------+-----------------+------------------+
| # Rows | # Columns | sparkR::collect | sparkR::write.df |
+--------+-----------+-----------------+------------------+
| 600K   | 20        | 3 min           | 6 min            |
+--------+-----------+-----------------+------------------+
| 1.8M   | 20        | 9 min           | 20 min           |
+--------+-----------+-----------------+------------------+
| 600K   | 1         | 40 sec          | 4 min            |
+--------+-----------+-----------------+------------------+

Thanks!
Jonathan
