Re: Relation between DStream and RDDs

Pascal Voitot Dev Thu, 20 Mar 2014 04:20:08 -0700

On Thu, Mar 20, 2014 at 11:57 AM, andy petrella <andy.petre...@gmail.com> wrote:


> Also consider creating pairs and using the *byKey* operators; the key is
> then the structure that will be used to consolidate or deduplicate your
> data.
> my2c
>
>
One thing I wonder: imagine I want to sub-divide the RDDs in a DStream into
several RDDs, but not according to a time window. I don't see any trivial
way to do it...


>
>
> On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev <
> pascal.voitot....@gmail.com> wrote:
>
>> Actually it's quite simple...
>>
>> DStream[T] is a stream of RDD[T].
>> So applying count on a DStream just applies count to each RDD of that
>> DStream.
>> So after count, you have a DStream[Long] containing the same number of
>> RDDs as before, but each RDD contains a single element: the count for
>> the corresponding original RDD.
>>
>> For reduce, it's the same, using the reduce operation...
>>
>> The only operations that are a bit more complex are reduceByWindow &
>> countByValueAndWindow, which union the RDDs over the time window...
>>
>> On Thu, Mar 20, 2014 at 9:51 AM, Sanjay Awatramani <sanjay_a...@yahoo.com> wrote:
>>
>>> @TD: I do not need multiple RDDs in a DStream in every batch. On the
>>> contrary, my logic would work fine if there is only 1 RDD. But the
>>> description of functions like reduce & count (Return a new DStream of
>>> single-element RDDs by counting the number of elements in each RDD of the
>>> source DStream.) left me unsure whether I should account for the fact
>>> that a DStream can have multiple RDDs. My streaming code processes a batch
>>> every hour. In the 2nd batch, I checked that the DStream contains only 1
>>> RDD, i.e. the 2nd batch's RDD. I verified this using sysout in foreachRDD.
>>> Does that mean that the DStream will always contain only 1 RDD?
>>>
>>
>> A DStream creates an RDD for each window corresponding to your batch
>> duration (maybe no RDD is created if there is no data in the current time
>> window, but I'm not sure about that).
>> So no, a DStream does not hold one single RDD: it produces one RDD per
>> batch interval, so what you see depends on the batch duration and the
>> collected data.
>>
>>
>>
>>> Is there a way to access the RDD of the 1st batch in the 2nd batch? The
>>> 1st batch may contain some records which were not relevant to the first
>>> batch and are to be processed in the 2nd batch. I know I can use the
>>> sliding window mechanism of streaming, but if I'm not using it and there is
>>> no way to access the previous batch's RDD, then it means that functions
>>> like count will always return a DStream containing only 1 RDD, am I
>>> correct?
>>>
>>>
>> count will be executed for each RDD in the DStream, as explained above.
>>
>> If you want to do operations on several RDDs in the same DStream, you
>> should try using reduceByWindow, for example, to "union" several RDDs and
>> perform operations on them. But it really depends on what you want to do,
>> and I advise you to test different approaches.
>>
>> Maybe other people more skilled than me will have better answers?
>>
>>
>>> @Pascal, yes, your answer resolves my question partially, but the other
>>> part of the question (which I've clarified in the paragraph above) still
>>> remains.
>>>
>>> Thanks for your answers!
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>>   On Thursday, 20 March 2014 1:27 PM, Pascal Voitot Dev <
>>> pascal.voitot....@gmail.com> wrote:
>>>   If I may add my contribution to this discussion, if I understand your
>>> question well...
>>>
>>> A DStream is a discretized stream. It discretizes the data stream over
>>> windows of time (according to the project code I've read, and the paper
>>> too). So when you write:
>>>
>>> JavaStreamingContext stcObj =
>>>     new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000)); // 1 hour
>>>
>>> It means you are discretizing over a 1h window. Each batch, and so each
>>> RDD of the DStream, will collect data for 1h before moving on to the next
>>> RDD. So if you want to have more RDDs, you should reduce the batch
>>> duration...
>>>
>>> Pascal
>>>
>>>
>>> On Thu, Mar 20, 2014 at 7:51 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> That is a good question. If I understand correctly, you need multiple
>>> RDDs from a DStream in *every batch*. Can you elaborate on why you need
>>> multiple RDDs every batch?
>>>
>>> TD
>>>
>>>
>>> On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani <
>>> sanjay_a...@yahoo.com> wrote:
>>>
>>> Hi,
>>>
>>> As I understand, a DStream consists of 1 or more RDDs. And foreachRDD
>>> will run a given func on each and every RDD inside a DStream.
>>>
>>> I created a simple program which reads log files from a folder every
>>> hour:
>>> JavaStreamingContext stcObj =
>>>     new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000)); // 1 hour
>>> JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");
>>>
>>> When the interval is reached, Spark reads all the files and creates one
>>> and only one RDD (as I verified from a sysout inside foreachRDD).
>>>
>>> In many places, the streaming doc indicates that many operations (e.g.
>>> flatMap) on a DStream are applied individually to each RDD, and that the
>>> resulting DStream consists of the mapped RDDs, in the same number as in
>>> the input DStream.
>>> ref:
>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams
>>>
>>> If that is the case, how can I create a scenario in which I have
>>> multiple RDDs inside a DStream in my example?
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
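
To make the "DStream[T] is a stream of RDD[T]" description above concrete,
here is a minimal plain-Java sketch of the model. It does not use the Spark
API at all: each inner list is only a stand-in for the RDD of one batch
interval, and the class and method names (BatchCountSketch, countPerBatch)
are made up for illustration. It shows why count yields one single-element
result per batch rather than one total for the whole stream.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only (an assumption for illustration, not Spark code):
// the outer list plays the role of the DStream, each inner list the RDD of one batch.
class BatchCountSketch {

    // Models the described behaviour of count(): one result per batch,
    // so the output has as many entries as there are input batches.
    static List<Long> countPerBatch(List<List<String>> batches) {
        List<Long> counts = new ArrayList<>();
        for (List<String> batch : batches) {
            counts.add((long) batch.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> batches = List.of(
                List.of("a", "b"),        // records read in hour 1
                List.of("c"),             // records read in hour 2
                List.of("d", "e", "f"));  // records read in hour 3
        System.out.println(countPerBatch(batches)); // prints [2, 1, 3]
    }
}
```

With a 1 hour batch duration, code inside foreachRDD only ever sees one of
these inner lists at a time, which matches the single RDD per batch that
Sanjay observed.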

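The window operations mentioned in the thread (reduceByWindow and friends)
are described there as a union of the RDDs that fall inside the time window.
The following plain-Java sketch models only that description, under the
assumption that a window covers a fixed number of consecutive batches. It is
not how Spark implements windows, and WindowUnionSketch and sumOverWindow are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only (an assumption for illustration, not Spark code):
// a window is treated as the union of the last `windowLength` batches.
class WindowUnionSketch {

    // Unions the batches in [end - windowLength + 1, end] (clamped at 0)
    // and reduces the union with a sum.
    static int sumOverWindow(List<List<Integer>> batches, int end, int windowLength) {
        int start = Math.max(0, end - windowLength + 1);
        List<Integer> union = new ArrayList<>();
        for (int i = start; i <= end; i++) {
            union.addAll(batches.get(i));
        }
        int sum = 0;
        for (int value : union) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = List.of(
                List.of(1, 2),   // batch 0
                List.of(3),      // batch 1
                List.of(4, 5));  // batch 2
        // A window of 2 batches lets batch 1 also see the records of batch 0.
        for (int end = 0; end < batches.size(); end++) {
            System.out.println(sumOverWindow(batches, end, 2)); // prints 3, then 6, then 12
        }
    }
}
```

In this model, a window spanning two batches is what gives the 2nd batch
access to the records of the 1st batch, which is the case Sanjay asked about.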