How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
I have created a custom receiver to fetch records pertaining to a specific query from Elasticsearch and have implemented streaming RDD transformations to process the data generated by the receiver. The final RDD is a sorted list of name-value pairs and I want to read the top 20 results

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Polisetty hpoli...@icloud.com To: Tathagata Das t...@databricks.com Cc: user user@spark.apache.org Sent: Monday, April 6, 2015 2:02 PM Subject: Re: How to restrict foreach on a streaming RDD only once upon receiver

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
YouTube version cued to that place: http://www.youtube.com/watch?v=W5Uece_JmNs&t=23m18s From: Hari Polisetty hpoli...@icloud.com To: Tathagata Das t...@databricks.com Cc: user user@spark.apache.org Sent: Monday, April 6, 2015 2:02 PM Subject: Re: How to restrict foreach on a streaming RDD

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Yes, I’m using updateStateByKey and it works. But then I need to perform further computation on this stateful RDD (see code snippet below). I perform foreach on the final RDD and get the top 10 records. I just don’t want the foreach to be performed every time a new batch is received. Only when

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Tathagata Das
So you want to sort based on the total count of all the records received through the receiver? In that case, you have to combine all the counts using updateStateByKey (
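
The idea discussed in this thread (fold each batch's counts into a running total, then sort and take the top N only once, after the last batch) can be sketched without Spark. This is a minimal illustration, not the poster's code: `update_count` mirrors the general shape of an update function passed to updateStateByKey (new values for a key plus the previous state), while `run_batches`, the sample batches, and the key names are hypothetical stand-ins for the streaming job and receiver.

```python
from collections import defaultdict


def update_count(new_values, running_count):
    """Fold this batch's values for one key into the previous total.

    running_count is None the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


def run_batches(batches, top_n):
    """Hypothetical stand-in for the streaming loop (not Spark's API).

    State is updated on every batch, but the sort and top-N selection
    run only once, after the final batch has been processed.
    """
    state = {}
    for batch in batches:
        # Group this batch's (key, value) pairs by key.
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        # Apply the update function per key, as a stateful stream would.
        for key, values in grouped.items():
            state[key] = update_count(values, state.get(key))
    # Receiver is finished: rank by total count, breaking ties by key.
    ranked = sorted(state.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Assumed sample data: three batches of (name, count) pairs.
batches = [
    [("alice", 1), ("bob", 1), ("alice", 1)],
    [("bob", 1), ("carol", 1)],
    [("bob", 1), ("alice", 1)],
]
top = run_batches(batches, top_n=2)
print(top)
```

In a real streaming job the "last batch" condition has to come from somewhere, for example a signal that the receiver has stopped; how that signal is delivered is outside what this sketch assumes.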