It’s a possible approach, and it does leverage Spark’s parallel execution.
pipeRDD launches external processes in much the same way PySpark and SparkR
do for the RDD API.

The concern is that pipeRDD relies on text-based serialization/deserialization.
Whether the performance is acceptable depends on your workload and cluster
configuration. You can do some profiling to evaluate it.
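For illustration, a minimal PySpark sketch of the pipe approach might look like
the one below. The script name forecast.R and the one-record-per-CSV-line
format are just assumptions for the example, not part of your setup.

    # Minimal sketch (assumptions: an R script named forecast.R is available
    # on every worker node, and each record is serialized as one CSV line).
    from pyspark import SparkContext

    sc = SparkContext(appName="pipe-r-example")

    # Each element becomes one line of text on the R script's stdin.
    series = sc.parallelize([
        "id1,1.0,2.0,3.0",
        "id2,4.0,5.0,6.0",
    ])

    # rdd.pipe() forks the external command once per partition and streams
    # the partition's elements through stdin; every line the script writes
    # to stdout becomes an element of the result RDD.
    results = series.pipe("Rscript forecast.R")

    print(results.collect())

Since pipe() starts one R process per partition and everything crosses the
boundary as lines of text, the serialization cost grows with the volume of
data piped, which is exactly what the profiling should measure.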

From: sujeet jog [mailto:sujeet....@gmail.com]
Sent: Monday, March 21, 2016 2:10 PM
To: user@spark.apache.org
Subject: Run External R script from Spark

Hi,

I have been working on a POC on some time-series analysis. I'm using Python
since I need Spark Streaming and SparkR does not yet have a Spark Streaming
front end. A couple of algorithms I want to use are not yet present in the
Spark-TS package, so I'm thinking of invoking an external R script for the
algorithm part and passing the data from Spark to the R script via pipeRdd.


What I wanted to understand is whether something like this can be used in a
production deployment, since passing the data through an R script would mean a
lot of serialization and might not actually use the power of Spark for
parallel execution.

Has anyone used this kind of workaround: Spark -> pipeRdd -> R script?


Thanks,
Sujeet
