I should note that the amount of data in each batch is very small, so I'm not concerned with the performance implications of grouping everything into a single RDD.
On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:

> I'm currently doing something like this in my Spark Streaming program
> (Java):
>
>     dStream.foreachRDD((rdd, batchTime) -> {
>         log.info("processing RDD from batch {}", batchTime);
>         ....
>         // my rdd processing code
>         ....
>     });
>
> Instead of having my rdd processing code called once for each RDD in the
> batch, is it possible to essentially group all of the RDDs from the batch
> into a single RDD and single partition and therefore operate on all of the
> elements in the batch at once?
>
> My goal here is to do an operation exactly once for every batch. As I
> understand it, foreachRDD is going to do the operation once for each RDD in
> the batch, which is not what I want.
>
> I've looked at DStream.repartition(int), but the docs make it sound like
> it only changes the number of partitions in the batch's existing RDDs, not
> the number of RDDs.
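For reference, a sketch of the grouping idea described above, assuming the Java Spark Streaming API already set up in the program (the `process` helper is hypothetical, and this needs a live StreamingContext to actually run). Note that in Spark Streaming each batch interval produces exactly one RDD per DStream, so the foreachRDD body already fires once per batch; coalescing that RDD to a single partition then lets one foreachPartition call see every element of the batch together:

```java
// Sketch only — assumes dStream is a JavaDStream<String> from an
// existing JavaStreamingContext; requires a Spark runtime to execute.
dStream.foreachRDD((rdd, batchTime) -> {
    // Each batch interval yields a single RDD, so this runs once per batch.
    rdd.coalesce(1).foreachPartition(batchIterator -> {
        // With one partition, this iterator covers the whole batch,
        // so the per-batch operation happens exactly once here.
        while (batchIterator.hasNext()) {
            process(batchIterator.next()); // hypothetical per-element helper
        }
    });
});
```

The coalesce(1) call moves all of the batch's elements onto one executor, which is only reasonable because the batches are small, as noted above.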