It’s a viable approach, and it still leverages Spark’s parallel execution: PipedRDD launches one external process per partition, much as PySpark and SparkR launch worker processes for the RDD API.
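As a rough sketch of the mechanics (not Spark code itself): pipe() writes each record of a partition to the child process’s stdin as a line of text and reads lines back from its stdout. The snippet below simulates that contract with subprocess, using `cat` as a stand-in for an external R script; the command `["Rscript", "algo.R"]` and the CSV record format are illustrative assumptions, not anything from this thread.

```python
import subprocess

# Records of one partition, already serialized to text (assumed CSV here).
records = ["1.0,2.0", "3.0,4.0"]

# What RDD.pipe() does per partition: stream lines to the external
# command's stdin, collect lines from its stdout. Replace ["cat"] with
# e.g. ["Rscript", "algo.R"] (hypothetical script) in a real job.
proc = subprocess.run(
    ["cat"],
    input="\n".join(records) + "\n",
    capture_output=True,
    text=True,
    check=True,
)
out = proc.stdout.splitlines()
print(out)
```

In PySpark the equivalent would be roughly `rdd.pipe("Rscript algo.R")`; every element crosses the process boundary as text in both directions, which is exactly the serialization cost discussed below.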
The concern is that PipedRDD relies on text-based serialization/deserialization. Whether the performance is acceptable depends on your workload and cluster configuration; some profiling should tell you.

From: sujeet jog [mailto:sujeet....@gmail.com]
Sent: Monday, March 21, 2016 2:10 PM
To: user@spark.apache.org
Subject: Run External R script from Spark

Hi,

I have been working on a POC on some time-series-related stuff. I'm using Python since I need Spark Streaming and SparkR does not yet have a Spark Streaming front end. A couple of algorithms I want to use are not yet present in the Spark-TS package, so I'm thinking of invoking an external R script for the algorithm part and passing the data from Spark to the R script via pipeRdd.

What I wanted to understand is whether something like this can be used in a production deployment, since passing the data through an R script would mean a lot of serialization and would not really use the power of Spark for parallel execution. Has anyone used this kind of workaround: Spark -> pipeRdd -> R script?

Thanks,
Sujeet