I'm currently doing something like this in my Spark Streaming program (Java):

    dStream.foreachRDD((rdd, batchTime) -> {
        log.info("processing RDD from batch {}", batchTime);
        ....
        // my rdd processing code
        ....
    });

Instead of having my rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition and therefore operate on all of the elements in the batch at once?

My goal here is to do an operation exactly once for every batch. As I understand it, foreachRDD is going to do the operation once for each RDD in the batch, which is not what I want.

I've looked at DStream.repartition(int), but the docs make it sound like it only changes the number of partitions in the batch's existing RDDs, not the number of RDDs.