I'm currently doing something like this in my Spark Streaming program
(Java):

        dStream.foreachRDD((rdd, batchTime) -> {
            log.info("processing RDD from batch {}", batchTime);
            ....
            // my rdd processing code
            ....
        });

Instead of having my rdd processing code called once for each RDD in the
batch, is it possible to essentially group all of the RDDs from the batch
into a single RDD and single partition and therefore operate on all of the
elements in the batch at once?

My goal here is to do an operation exactly once for every batch.  As I
understand it, foreachRDD is going to do the operation once for each RDD in
the batch, which is not what I want.

I've looked at DStream.repartition(int), but the docs make it sound like it
only changes the number of partitions in the batch's existing RDDs, not the
number of RDDs.

Reply via email to