http://spark.apache.org/docs/latest/streaming-programming-guide.html
foreachRDD is executed on the driver.

mn

> On Oct 20, 2014, at 3:07 AM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
> Pinging TD -- I'm sure you know :-)
>
> -kr, Gerard.
>
> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas <gerard.m...@gmail.com> wrote:
>
> Hi,
>
> We have been implementing several Spark Streaming jobs that basically
> process data and insert it into Cassandra, sorting it among different
> keyspaces.
>
> We've been following the pattern:
>
> dstream.foreachRDD { rdd =>
>   val records = rdd.map(elem => record(elem))
>   targets.foreach { target =>
>     records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
>   }
> }
>
> I've been wondering whether there would be a performance difference in
> transforming the dstream instead of transforming the RDD within the
> dstream, with regard to how the transformations get scheduled.
>
> Instead of the RDD-centric computation, I could transform the dstream up to
> the last step, where I need an RDD to store.
> For example, the previous transformation could be written as:
>
> val recordStream = dstream.map(elem => record(elem))
> targets.foreach { target =>
>   recordStream.filter(record => isTarget(target, record)).foreachRDD(_.writeToCassandra(target, table))
> }
>
> Would there be a difference in execution and/or performance? What would be
> the preferred way to do this?
>
> Bonus question: Is there a better (more performant) way to sort the data
> into different "buckets" instead of filtering the data collection as many
> times as there are buckets?
>
> thanks, Gerard.
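
To illustrate the reply above ("foreachRDD is executed on the driver"), here
is a minimal sketch; the dstream and the per-element work are assumptions for
illustration, not code from the thread. The function given to foreachRDD runs
on the driver once per batch, while closures passed to operations on the RDD
inside it are shipped to the executors.

import org.apache.spark.streaming.dstream.DStream

def whereItRuns(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    // Driver side: this body runs on the driver, once per batch interval.
    println(s"batch with ${rdd.partitions.length} partitions")

    // Executor side: this closure is serialized and run on the workers.
    rdd.foreachPartition { elems =>
      elems.foreach(_ => ()) // per-element work happens here
    }
  }
}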
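
And a hedged sketch for the bonus question: key each record by its target in
a single pass and group, instead of filtering the whole collection once per
bucket. Record, the target field, and the write placeholder below are
hypothetical stand-ins, not the thread's actual code.

import org.apache.spark.SparkContext._
import org.apache.spark.streaming.dstream.DStream

case class Record(target: String, payload: String)

def writeBuckets(records: DStream[Record]): Unit = {
  records.foreachRDD { rdd =>
    rdd
      .map(rec => (rec.target, rec)) // one pass: key each record by its target
      .groupByKey()                  // shuffle records into per-target buckets
      .foreachPartition { buckets =>
        buckets.foreach { case (target, recs) =>
          // Placeholder for the actual sink, e.g. a Cassandra write per bucket.
          println(s"would write ${recs.size} records to $target")
        }
      }
  }
}

The trade-off is a shuffle (groupByKey) in exchange for scanning the source
RDD once rather than once per target; whether that wins depends on the number
of buckets and the data size.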