I should note that the amount of data in each batch is very small, so I'm
not concerned with performance implications of grouping into a single RDD.
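
For concreteness, here is a minimal local-mode sketch of the grouping idea (assuming a recent Spark version; the class name, the queueStream input, and the collected-results list are mine, purely for illustration). Each batch interval produces exactly one RDD, so foreachRDD already fires once per batch; coalesce(1) then folds that RDD's partitions into one, so the whole batch can be handled in a single pass:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchInOnePartition {
    // Collected results; a static list works here only because local mode
    // runs the driver and the executors in a single JVM.
    static final List<Integer> seen =
        Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("batch-in-one-partition");
        JavaStreamingContext ssc =
            new JavaStreamingContext(conf, Durations.seconds(1));

        // queueStream feeds one pre-built RDD in as a single batch.
        Queue<JavaRDD<Integer>> queue = new LinkedList<>();
        queue.add(ssc.sparkContext().parallelize(Arrays.asList(1, 2, 3, 4), 4));
        JavaDStream<Integer> dStream = ssc.queueStream(queue);

        dStream.foreachRDD((rdd, batchTime) -> {
            // One RDD per batch interval; coalesce(1) collapses its four
            // partitions into one, so a single iterator sees every element.
            rdd.coalesce(1).foreachPartition(elements ->
                elements.forEachRemaining(seen::add));
        });

        ssc.start();
        ssc.awaitTerminationOrTimeout(10_000);
        ssc.stop(true, true);
    }
}
```

Since the data per batch is small, the loss of parallelism from coalesce(1) is the deliberate trade-off here.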

On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:

> I'm currently doing something like this in my Spark Streaming program
> (Java):
>
>         dStream.foreachRDD((rdd, batchTime) -> {
>             log.info("processing RDD from batch {}", batchTime);
>             ....
>             // my rdd processing code
>             ....
>         });
>
> Instead of having my rdd processing code called once for each RDD in the
> batch, is it possible to essentially group all of the RDDs from the batch
> into a single RDD and single partition and therefore operate on all of the
> elements in the batch at once?
>
> My goal here is to do an operation exactly once for every batch.  As I
> understand it, foreachRDD is going to do the operation once for each RDD in
> the batch, which is not what I want.
>
> I've looked at DStream.repartition(int), but the docs make it sound like
> it only changes the number of partitions in the batch's existing RDDs, not
> the number of RDDs.
>
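
One clarifying note on the repartition point in the quoted question: as far as I can tell the docs are right that DStream.repartition(int) only changes the number of partitions in each batch's RDD, never the number of RDDs, because each batch interval yields exactly one RDD regardless. So a fragment along these lines (sketch only, reusing the dStream name from the quoted code) should also put each batch into a single partition:

```java
// Sketch: repartition(1) gives each batch's single RDD one partition;
// foreachRDD still runs once per batch interval.
dStream.repartition(1).foreachRDD((rdd, batchTime) -> {
    rdd.foreachPartition(elements -> {
        // all elements of this batch arrive through one iterator
        elements.forEachRemaining(e -> { /* handle element */ });
    });
});
```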
