Looks like this method should serve Jon's needs:

    def reduceByWindow(
        reduceFunc: (T, T) => T,
        windowDuration: Duration,
        slideDuration: Duration
      ): DStream[T]
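As a rough illustration of what reduceByWindow does, here is a minimal sketch in plain Java rather than real Spark code (the class name, the batch data, and measuring window/slide lengths in whole batches are all inventions for the example): every slide interval, all elements from the last window's worth of batches are folded into a single value with the reduce function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class ReduceByWindowSketch {
    // Simulate reduceByWindow over an in-memory list of batches: every
    // `slideLen` batches, reduce all elements of the last `windowLen`
    // batches into one value with `f`.
    static <T> List<T> reduceByWindow(List<List<T>> batches,
                                      BinaryOperator<T> f,
                                      int windowLen, int slideLen) {
        List<T> out = new ArrayList<>();
        for (int end = windowLen; end <= batches.size(); end += slideLen) {
            T acc = null;
            for (List<T> batch : batches.subList(end - windowLen, end)) {
                for (T x : batch) {
                    acc = (acc == null) ? x : f.apply(acc, x);
                }
            }
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three batches; a window of 2 batches sliding by 1 batch.
        List<List<Integer>> batches = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        // Window [1, 2, 3] sums to 6; window [3, 4, 5] sums to 12.
        System.out.println(reduceByWindow(batches, Integer::sum, 2, 1));
    }
}
```

In real Spark the durations would be actual Durations (multiples of the batch interval) rather than batch counts, and the reduce function must be associative for the windowed computation to be correct.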
On Wed, Jul 15, 2015 at 8:23 PM, N B <nb.nos...@gmail.com> wrote:

> Hi Jon,
>
> In Spark Streaming, 1 batch = 1 RDD; the terms are used essentially
> interchangeably. If you are trying to collect multiple batches across a
> DStream into a single RDD, look at the window() operations.
>
> Hope this helps,
> Nikunj
>
> On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>
>> I should note that the amount of data in each batch is very small, so I'm
>> not concerned with the performance implications of grouping it into a
>> single RDD.
>>
>> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>>
>>> I'm currently doing something like this in my Spark Streaming program
>>> (Java):
>>>
>>>     dStream.foreachRDD((rdd, batchTime) -> {
>>>         log.info("processing RDD from batch {}", batchTime);
>>>         ....
>>>         // my rdd processing code
>>>         ....
>>>     });
>>>
>>> Instead of having my RDD processing code called once for each RDD in the
>>> batch, is it possible to group all of the RDDs from the batch into a
>>> single RDD with a single partition, and therefore operate on all of the
>>> elements in the batch at once?
>>>
>>> My goal here is to perform an operation exactly once per batch. As I
>>> understand it, foreachRDD will perform the operation once for each RDD
>>> in the batch, which is not what I want.
>>>
>>> I've looked at DStream.repartition(int), but the docs make it sound like
>>> it only changes the number of partitions in the batch's existing RDDs,
>>> not the number of RDDs.
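To make the window() suggestion above concrete, here is a small sketch of its grouping behavior in plain Java, with no Spark dependency (the class name and data are illustrative, and window/slide lengths are counted in batches instead of Durations): with a window of N batches sliding by M batches, each windowed "RDD" is simply the union of the last N input batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowSketch {
    // Simulate DStream.window over an in-memory list of batches: every
    // `slideLen` batches, emit one combined batch holding the union of
    // the last `windowLen` input batches.
    static <T> List<List<T>> window(List<List<T>> batches,
                                    int windowLen, int slideLen) {
        List<List<T>> out = new ArrayList<>();
        for (int end = windowLen; end <= batches.size(); end += slideLen) {
            List<T> merged = new ArrayList<>();
            for (List<T> batch : batches.subList(end - windowLen, end)) {
                merged.addAll(batch);
            }
            out.add(merged);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three batches; each output groups the two most recent batches.
        List<List<Integer>> batches = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        System.out.println(window(batches, 2, 1));
    }
}
```

In actual Spark Streaming code the call is along the lines of dStream.window(windowDuration, slideDuration), after which foreachRDD sees one RDD per window; a repartition(1) on the windowed stream would then put everything in a single partition, as Jon asked.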