Hi,

We have been implementing several Spark Streaming jobs that essentially process incoming data and insert it into Cassandra, sorting the records into different keyspaces.
We've been following this pattern:

    dstream.foreachRDD { rdd =>
      val records = rdd.map(elem => record(elem))
      targets.foreach { target =>
        records.filter(record => isTarget(target, record)).writeToCassandra(target, table)
      }
    }

I've been wondering whether there would be a performance difference in transforming the DStream instead of transforming the RDDs within the DStream, with regard to how the transformations get scheduled.

Instead of the RDD-centric computation above, I could transform the DStream all the way until the last step, where I need an RDD in order to store the data. For example, the previous transformation could be written as:

    val recordStream = dstream.map(elem => record(elem))
    targets.foreach { target =>
      recordStream.filter(record => isTarget(target, record)).foreachRDD(_.writeToCassandra(target, table))
    }

Would there be a difference in execution and/or performance? What would be the preferred way to do this?

Bonus question: is there a better (more performant) way to sort the data into different "buckets", instead of filtering the full collection once per bucket (#buckets passes in total)?

thanks, Gerard.
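P.S. To make the bonus question more concrete, here is a rough, untested sketch of the kind of alternative I had in mind: tag each record with its target in a single pass and cache the tagged RDD, so the upstream mapping is not recomputed once per bucket (the per-target write still filters over the cached data). It reuses the record/isTarget/writeToCassandra helpers from the snippets above and assumes each record maps to exactly one target.

    // Sketch only: same helpers as above (record, isTarget, writeToCassandra);
    // assumes each record belongs to exactly one target.
    dstream.foreachRDD { rdd =>
      // Tag every record with its target in one pass over the data.
      val keyed = rdd.map(elem => record(elem))
                     .map(rec => (targets.find(t => isTarget(t, rec)), rec))
                     .collect { case (Some(target), rec) => (target, rec) }
      keyed.cache() // avoid recomputing the tagging for every bucket

      targets.foreach { target =>
        keyed.filter { case (t, _) => t == target }
             .map { case (_, rec) => rec }
             .writeToCassandra(target, table)
      }
      keyed.unpersist()
    }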