http://spark.apache.org/docs/latest/streaming-programming-guide.html

The function passed to foreachRDD is executed on the driver…
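
For illustration, a minimal sketch of that point, in the spirit of the pattern
shown in the guide above: the outer closure runs on the driver once per batch,
while the work on the RDD itself runs on the executors (createConnection and
send are hypothetical helpers here, not from your code or the connector API).

dstream.foreachRDD { rdd =>
  // this block runs on the driver, once per batch interval
  rdd.foreachPartition { partition =>
    // this inner closure is serialized and shipped to the executors
    val connection = createConnection()   // hypothetical: one connection per partition
    partition.foreach(record => send(connection, record))
    connection.close()
  }
}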

mn

> On Oct 20, 2014, at 3:07 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
> 
> Pinging TD -- I'm sure you know :-)
> 
> -kr, Gerard.
> 
> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
> Hi,
> 
> We have been implementing several Spark Streaming jobs that are basically 
> processing data and inserting it into Cassandra, sorting it into different 
> keyspaces.
> 
> We've been following the pattern:
> 
> dstream.foreachRDD { rdd =>
>   val records = rdd.map(elem => record(elem))
>   targets.foreach(target =>
>     records.filter(record => isTarget(target, record)).writeToCassandra(target, table))
> }
> 
> I've been wondering whether there would be a performance difference in 
> transforming the dstream instead of transforming the RDDs within the dstream, 
> with regard to how the transformations get scheduled.
> 
> Instead of the RDD-centric computation, I could transform the dstream up to 
> the last step, where I need an RDD to store.
> For example, the previous transformation could be written as:
> 
> val recordStream = dstream.map(elem => record(elem))
> targets.foreach { target =>
>   recordStream.filter(record => isTarget(target, record))
>     .foreachRDD(_.writeToCassandra(target, table))
> }
> 
> Would there be a difference in execution and/or performance? What would be the 
> preferred way to do this?
> 
> Bonus question: Is there a better (more performant) way to sort the data into 
> different "buckets" than filtering the whole data collection once per bucket?
> 
> thanks,  Gerard.
> 
> 
