I am sure some Spark gurus can explain this much better than I can, but here we go.
A DStream is an abstraction that breaks a continuous stream of data into small chunks. This is called "micro-batching", and Spark Streaming is all about micro-batching. You have a batch interval, a window length and a sliding interval. At each batch interval, the data received during that interval is passed to Spark for further processing in the form of an RDD, so there is exactly one RDD produced per DStream per batch interval. That is exactly what the doc says. An RDD is a set of pointers to where the actual data lives in the cluster, so the RDD is not the data itself but a construct.

Now DStream.foreachRDD is basically an output operator. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, work out the incoming pricing, choose a security and recommend it as a good buy, and so on. Then of course you can push that particular data to HDFS if you wish. Each such RDD may have many, many rows of pricing for many securities (IBM, Microsoft, Oracle, JPM etc.), so at this stage it is all in one RDD with many elements. In our example DStream.foreachRDD gives an RDD of (Security, Prices) pairs, not a single security and its price. To access individual elements of the collection, we need to operate further on the RDD. So this becomes:

    dstream.foreachRDD { pricesRDD =>
      if (pricesRDD.count > 0) {                // RDD has data
        for (line <- pricesRDD.collect) {       // Look at each record in the RDD
          val fields    = line._2.split(',')    // Split the CSV record once
          val index     = fields(0).toInt       // That is the index
          val timestamp = fields(1)             // This is the timestamp from source
          val security  = fields(2)             // This is the name of the security
          val price     = fields(3).toFloat     // This is the price of the security
          if (price > 90.0f) {
            // Do something here
            // Send notification, write to HDFS etc
          }
        }
      }
    }

I trust that this makes it clearer.
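As an aside, the per-record parsing in the loop above can be factored into a small helper so it stays in one testable place. Here is a sketch in plain Scala (no Spark dependency); the Tick case class and parse helper are my own illustration, with the field layout (index, timestamp, security, price) assumed from the example above:

    // Assumed record layout from the example: index,timestamp,security,price
    case class Tick(index: Int, timestamp: String, security: String, price: Float)

    object TickParser {
      // Split the CSV record once and validate, instead of calling split(',') per field
      def parse(line: String): Option[Tick] =
        line.split(',') match {
          case Array(i, ts, sec, p) =>
            try Some(Tick(i.toInt, ts, sec, p.toFloat))
            catch { case _: NumberFormatException => None } // non-numeric index/price
          case _ => None                                    // malformed record
        }
    }

Inside foreachRDD you could then write something like pricesRDD.collect.flatMap(line => TickParser.parse(line._2)) and test the price, and malformed records are dropped rather than throwing on the driver.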
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 7 September 2016 at 18:13, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

> I have checked that doc, sir.
>
> My understanding is that every batch interval of data always generates one RDD, so why do we need to use foreachRDD when there is only one?
>
> Sorry for this question, but it is a bit confusing to me.
>
> Thanks
>
>
> On Wednesday, 7 September 2016, 18:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> What is so confusing about RDD? Have you checked this doc?
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On 7 September 2016 at 11:39, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>
> Hi,
>
> A bit confusing to me.
>
> How many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once?
> I mean DStream.foreachRDD { rdd => ... }
>
> I am trying to get individual lines in the RDD.
>
> Thanks