Re: Relation between DStream and RDDs

Sanjay Awatramani Thu, 20 Mar 2014 01:52:09 -0700

@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary 
my logic would work fine if there is only 1 RDD. But then the description for 
functions like reduce & count (Return a new DStream of single-element RDDs by 
counting the number of elements in each RDD of the source DStream.) left me 
confused whether I should account for the fact that a DStream can have multiple 
RDDs. My streaming code processes a batch every hour. In the 2nd batch, i 
checked that the DStream contains only 1 RDD, i.e. the 2nd batch's RDD. I 
verified this using sysout in foreachRDD. Does that mean that the DStream will 
always contain only 1 RDD ? Is there a way to access the RDD of the 1st batch 
in the 2nd batch ? The 1st batch may contain some records which were not 
relevant to the first batch and are to be processed in the 2nd batch. I know i 
can use the sliding window mechanism of streaming, but if i'm not using it and 
there is no way to access the previous
 batch's RDD, then it means that functions like count will always return a 
DStream containing only 1 RDD, am i correct ?

@Pascal, yes your answer resolves my question partially, but the other part of 
the question(which i've clarified in above paragraph) still remains.

Thanks for your answers !

Regards,
Sanjay

On Thursday, 20 March 2014 1:27 PM, Pascal Voitot Dev 
<pascal.voitot....@gmail.com> wrote:

If I may add my contribution to this discussion if I understand well your 
question...

DStream is discretized stream. It discretized the data stream over windows of 
time (according to the project code I've read and paper too). so when you write:

JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 
* 60 * 1000)); //1 hour

It means you are discretizing over a 1h window. Each batch so each RDD of the 
dstream will collect data for 1h before going to next RDD.

So if you want to have more RDD, you should reduce batch size/duration...

Pascal

On Thu, Mar 20, 2014 at 7:51 AM, Tathagata Das <tathagata.das1...@gmail.com> 
wrote:

That is a good question. If I understand correctly, you need multiple RDDs from 
a DStream in *every batch*. Can you elaborate on why do you need multiple RDDs 
every batch?
>
>
>TD
>
>
>
>On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani <sanjay_a...@yahoo.com> 
>wrote:
>
>Hi,
>>
>>
>>As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will 
>>run a given func on each and every RDD inside a DStream.
>>
>>
>>I created a simple program which reads log files from a folder every hour:
>>JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new 
>>Duration(60 * 60 * 1000)); //1 hour
>>JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");
>>
>>
>>When the interval is reached, Spark reads all the files and creates one and 
>>only one RDD (as i verified from a sysout inside foreachRDD).
>>
>>
>>The streaming doc at a lot of places gives an indication that many operations 
>>(e.g. flatMap) on a DStream are applied individually to a RDD and the 
>>resulting DStream consists of the mapped RDDs in the same number as the input 
>>DStream.
>>ref: 
>>https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams
>>
>>
>>If that is the case, how can i generate a scenario where in I have multiple 
>>RDDs inside a DStream in my example ?
>>
>>
>>Regards,
>>Sanjay
>

Re: Relation between DStream and RDDs

Reply via email to