I am sure some Spark gurus can explain this much better than I can, but here we go.
A DStream is an abstraction that breaks a continuous stream of data into small chunks. This is called "micro-batching", and Spark Streaming is all about micro-batching. You have a batch interval, a window length and a sliding interval. At each batch interval, the data received during that interval is passed to Spark for further processing in the form of an RDD, so there is exactly one RDD produced per DStream per batch interval. That is exactly what the doc says. An RDD is a set of pointers to where the actual data lives in the cluster, so the RDD is not the data itself but a construct.

Now DStream.foreachRDD is basically an output operator. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, work out the incoming pricing, choose a security and recommend it as a good buy, and so on. Then of course you can push that particular data to HDFS if you wish. Each such RDD may have many, many rows of pricing for many securities (IBM, Microsoft, Oracle, JPM etc.), so at this stage it is all in one RDD with many elements. In our example DStream.foreachRDD gives an RDD of (Security, Prices) pairs, not a single security and its price. To access individual elements of the collection, we need to operate further on the RDD. So this becomes:

    dstream.foreachRDD { pricesRDD =>
      if (pricesRDD.count > 0) {                // RDD has data
        for (line <- pricesRDD.collect) {       // Look at each record in the RDD
          val fields    = line._2.split(',')    // Split the CSV record once
          val index     = fields(0).toInt       // That is the index
          val timestamp = fields(1)             // This is the timestamp from source
          val security  = fields(2)             // This is the name of the security
          val price     = fields(3).toFloat     // This is the price of the security
          if (price > 90.0f) {
            // Do something here
            // Send notification, write to HDFS etc
          }
        }
      }
    }

I trust that this makes it clearer.
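As an aside, the per-record parsing in the loop above can be factored into a small helper so it stays in one testable place. Here is a sketch in plain Scala (no Spark dependency); the Tick case class and parse helper are my own illustration, with the field layout (index, timestamp, security, price) assumed from the example above:

    // Assumed record layout from the example: index,timestamp,security,price
    case class Tick(index: Int, timestamp: String, security: String, price: Float)

    object TickParser {
      // Split the CSV record once and validate, instead of calling split(',') per field
      def parse(line: String): Option[Tick] =
        line.split(',') match {
          case Array(i, ts, sec, p) =>
            try Some(Tick(i.toInt, ts, sec, p.toFloat))
            catch { case _: NumberFormatException => None } // non-numeric index/price
          case _ => None                                    // malformed record
        }
    }

Inside foreachRDD you could then write something like pricesRDD.collect.flatMap(line => TickParser.parse(line._2)) and test the price, and malformed records are dropped rather than throwing on the driver.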
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 7 September 2016 at 18:13, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

> I have checked that doc, sir.
>
> My understanding is that every batch interval of data always generates one RDD, so why do we need to use foreachRDD when there is only one?
>
> Sorry for this question, but it is a bit confusing to me.
>
> Thanks
>
>
> On Wednesday, 7 September 2016, 18:05, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> What is so confusing about RDD? Have you checked this doc?
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On 7 September 2016 at 11:39, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>
> Hi,
>
> A bit confusing to me.
>
> How many layers are involved in DStream.foreachRDD? Do I need to loop over it more than once?
> I mean DStream.foreachRDD { rdd => ... }
>
> I am trying to get individual lines in the RDD.
>
> Thanks