That is a good question. If I understand correctly, you need multiple RDDs from a DStream in *every batch*. Can you elaborate on why you need multiple RDDs every batch?
TD

On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani <sanjay_a...@yahoo.com> wrote:

> Hi,
>
> As I understand, a DStream consists of 1 or more RDDs, and foreachRDD will
> run a given func on each and every RDD inside a DStream.
>
> I created a simple program which reads log files from a folder every hour:
>
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000)); // 1 hour
> JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");
>
> When the interval is reached, Spark reads all the files and creates one
> and only one RDD (as I verified from a sysout inside foreachRDD).
>
> The streaming doc in several places gives the impression that many
> operations (e.g. flatMap) on a DStream are applied individually to each RDD,
> and that the resulting DStream contains as many mapped RDDs as the input
> DStream had.
> ref: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams
>
> If that is the case, how can I generate a scenario in which I have
> multiple RDDs inside a DStream in my example?
>
> Regards,
> Sanjay
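
A minimal, self-contained sketch of the setup described above, for reference. The class name, app name, and master setting are illustrative placeholders I have added, and it assumes the Spark 1.x-era Java foreachRDD signature; the sysout inside foreachRDD is the kind of check Sanjay describes, firing once per RDD in each batch:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class HourlyLogReader {
  public static void main(String[] args) throws Exception {
    // Placeholder config; app name and master are illustrative.
    SparkConf confObj = new SparkConf().setAppName("HourlyLogReader").setMaster("local[2]");
    // 1-hour batch interval, as in the original snippet.
    JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000));
    JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");

    // This function runs once per RDD in each batch; with textFileStream you
    // should see it fire exactly once per interval, covering all new files.
    obj.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        System.out.println("Got an RDD with " + rdd.count() + " lines");
        return null;
      }
    });

    stcObj.start();
    stcObj.awaitTermination();
  }
}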