You could use updateStateByKey. There's a stateful word count example on GitHub:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
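
For reference, here is a minimal sketch along the lines of that example (the object name, the socket source on localhost:9999, and the checkpoint path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CumulativeWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("CumulativeWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // updateStateByKey needs a checkpoint directory to persist the running state
    ssc.checkpoint("checkpoint")

    // Add this batch's counts for a word to its running total
    val updateTotal = (newCounts: Seq[Int], total: Option[Int]) =>
      Some(newCounts.sum + total.getOrElse(0))

    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // Cumulative count per word across all batches, not just the current window
    val totals = wordCounts.updateStateByKey[Int](updateTotal)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The state survives across batches, so the printed totals keep growing as new text arrives, which is the "overall count so far" behavior you describe.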
________________________________
From: Sandeep Giri <sand...@knowbigdata.com>
Sent: 10/29/2015 6:08 PM
To: user <user@spark.apache.org>; dev <d...@spark.apache.org>
Subject: Maintaining overall cumulative data in Spark Streaming

Dear All,

If a continuous stream of text is coming in and you have to keep publishing the 
overall word count so far since 0:00 today, what would you do?

Publishing the results for a window is easy, but if we have to keep aggregating 
the results, how should we go about it?

I have tried keeping a StreamRDD with the aggregated count and repeatedly doing a 
fullOuterJoin against it, but it didn't work. It seems the StreamRDD gets reset.

Kindly help.

Regards,
Sandeep Giri
