Looks like this method should serve Jon's needs:

    def reduceByWindow(
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T]
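As a rough illustration of what reduceByWindow does, here is a minimal sketch in plain Java rather than real Spark code (the class name, the batch data, and measuring window/slide lengths in whole batches are all inventions for the example): every slide interval, all elements from the last window's worth of batches are folded into a single value with the reduce function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class ReduceByWindowSketch {
    // Simulate reduceByWindow over an in-memory list of batches: every
    // `slideLen` batches, reduce all elements of the last `windowLen`
    // batches into one value with `f`.
    static <T> List<T> reduceByWindow(List<List<T>> batches,
                                      BinaryOperator<T> f,
                                      int windowLen, int slideLen) {
        List<T> out = new ArrayList<>();
        for (int end = windowLen; end <= batches.size(); end += slideLen) {
            T acc = null;
            for (List<T> batch : batches.subList(end - windowLen, end)) {
                for (T x : batch) {
                    acc = (acc == null) ? x : f.apply(acc, x);
                }
            }
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three batches; a window of 2 batches sliding by 1 batch.
        List<List<Integer>> batches = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        // Window [1, 2, 3] sums to 6; window [3, 4, 5] sums to 12.
        System.out.println(reduceByWindow(batches, Integer::sum, 2, 1));
    }
}
```

In real Spark the durations would be actual Durations (multiples of the batch interval) rather than batch counts, and the reduce function must be associative for the windowed computation to be correct.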
On Wed, Jul 15, 2015 at 8:23 PM, N B <nb.nos...@gmail.com> wrote:

> Hi Jon,
>
> In Spark Streaming, 1 batch = 1 RDD; the terms are used essentially
> interchangeably. If you are trying to collect multiple batches across a
> DStream into a single RDD, look at the window() operations.
>
> Hope this helps,
> Nikunj
>
> On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>
>> I should note that the amount of data in each batch is very small, so I'm
>> not concerned with the performance implications of grouping it into a
>> single RDD.
>>
>> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>>
>>> I'm currently doing something like this in my Spark Streaming program
>>> (Java):
>>>
>>>     dStream.foreachRDD((rdd, batchTime) -> {
>>>         log.info("processing RDD from batch {}", batchTime);
>>>         ....
>>>         // my rdd processing code
>>>         ....
>>>     });
>>>
>>> Instead of having my RDD processing code called once for each RDD in the
>>> batch, is it possible to group all of the RDDs from the batch into a
>>> single RDD with a single partition, and therefore operate on all of the
>>> elements in the batch at once?
>>>
>>> My goal here is to perform an operation exactly once per batch. As I
>>> understand it, foreachRDD will perform the operation once for each RDD
>>> in the batch, which is not what I want.
>>>
>>> I've looked at DStream.repartition(int), but the docs make it sound like
>>> it only changes the number of partitions in the batch's existing RDDs,
>>> not the number of RDDs.
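To make the window() suggestion above concrete, here is a small sketch of its grouping behavior in plain Java, with no Spark dependency (the class name and data are illustrative, and window/slide lengths are counted in batches instead of Durations): with a window of N batches sliding by M batches, each windowed "RDD" is simply the union of the last N input batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowSketch {
    // Simulate DStream.window over an in-memory list of batches: every
    // `slideLen` batches, emit one combined batch holding the union of
    // the last `windowLen` input batches.
    static <T> List<List<T>> window(List<List<T>> batches,
                                    int windowLen, int slideLen) {
        List<List<T>> out = new ArrayList<>();
        for (int end = windowLen; end <= batches.size(); end += slideLen) {
            List<T> merged = new ArrayList<>();
            for (List<T> batch : batches.subList(end - windowLen, end)) {
                merged.addAll(batch);
            }
            out.add(merged);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three batches; each output groups the two most recent batches.
        List<List<Integer>> batches = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        System.out.println(window(batches, 2, 1));
    }
}
```

In actual Spark Streaming code the call is along the lines of dStream.window(windowDuration, slideDuration), after which foreachRDD sees one RDD per window; a repartition(1) on the windowed stream would then put everything in a single partition, as Jon asked.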