Thanks Anwar.
On Tue, Jun 17, 2014 at 11:54 AM, Anwar Rizal <anriza...@gmail.com> wrote:
>
> On Tue, Jun 17, 2014 at 5:39 PM, Chen Song <chen.song...@gmail.com> wrote:
>>
>> Hey,
>>
>> I am new to Spark Streaming and apologize if these questions have been
>> asked before.
>>
>> * In StreamingContext, reduceByKey() seems to only work on the RDDs of
>> the current batch interval, not on RDDs of previous batches. Is my
>> understanding correct?
>
> That's correct.
>
>> * If the above is correct, what functions should one use to process the
>> continuous stream of batches? I see two functions, reduceByKeyAndWindow
>> and updateStateByKey, that serve this purpose.
>
> I presume you need to keep state that spans more than one batch. In that
> case, yes, updateStateByKey is the one to use. Basically, updateStateByKey
> wraps the state into an RDD.
>
>> My use case is an aggregation and doesn't fit a windowing scenario.
>>
>> * As for updateStateByKey, I have a few questions.
>> ** Over time, will Spark stage the original data somewhere to replay in
>> case of failures? If the Spark job runs for weeks, I am wondering how
>> that is sustained.
>> ** Say my reduce key space is partitioned by some date field, and I would
>> like to stop processing old dates after a period of time (this is not a
>> simple windowing scenario, since the date the data belongs to is not the
>> same as the date it arrives). How can I tell Spark to discard state for
>> old dates?
>
> You will need to call checkpoint (see
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#rdd-checkpointing),
> which persists the RDD metadata that would otherwise consume memory (and,
> eventually, the execution stack). You can set a checkpoint interval that
> suits your needs.
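[The updateStateByKey semantics discussed above can be sketched in plain Python, without Spark: on every batch interval, an update function receives the key's new values from that batch plus its previous state, and returns the new state. The function and driver-loop names below are illustrative, not Spark API.]

```python
def update_state_by_key(batches, update_func):
    """Fold a sequence of (key, value) batches into per-key state,
    calling update_func(new_values, prev_state) for every key that
    appears in the batch or already has state -- mimicking how
    updateStateByKey is applied on each batch interval."""
    state = {}
    for batch in batches:
        # Group this batch's values by key, as Spark does before the update.
        grouped = {}
        for k, v in batch:
            grouped.setdefault(k, []).append(v)
        # Every key with new values or existing state gets an update call.
        for k in set(grouped) | set(state):
            new_state = update_func(grouped.get(k, []), state.get(k))
            if new_state is None:
                state.pop(k, None)   # returning None discards the key
            else:
                state[k] = new_state
    return state

def running_sum(new_values, prev_sum):
    # prev_sum is None the first time a key is seen.
    return (prev_sum or 0) + sum(new_values)

# Three batch intervals; unlike a per-batch reduceByKey, the sums
# accumulate across all batches.
batches = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("c", 5)]]
print(update_state_by_key(batches, running_sum))
# → {'a': 4, 'b': 6, 'c': 5}
```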
> Now, if you also want to reset your state after some time, there is no
> immediate way I can think of, but you can do it through updateStateByKey,
> perhaps by book-keeping a timestamp per key.
>
>> Thank you,
>>
>> Best,
>> Chen

--
Chen Song
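[The timestamp book-keeping idea can be sketched the same way, again in plain Python rather than Spark: keep a (running_sum, last_seen_batch) pair as each key's state, and return None to expire keys that have been idle too long. MAX_IDLE_BATCHES and all helper names are hypothetical, chosen for illustration.]

```python
MAX_IDLE_BATCHES = 2  # assumed expiry threshold, in batch intervals

def make_update(current_batch):
    """Build an update function for one batch interval. State per key
    is (running_sum, last_seen_batch); returning None drops the key."""
    def update(new_values, prev_state):
        total, last_seen = prev_state if prev_state else (0, current_batch)
        if new_values:                                  # key active this batch
            return (total + sum(new_values), current_batch)
        if current_batch - last_seen >= MAX_IDLE_BATCHES:
            return None                                 # stale: expire the key
        return (total, last_seen)                       # idle, not yet expired
    return update

def run(batches):
    state = {}
    for i, batch in enumerate(batches):
        grouped = {}
        for k, v in batch:
            grouped.setdefault(k, []).append(v)
        update = make_update(i)
        for k in set(grouped) | set(state):
            s = update(grouped.get(k, []), state.get(k))
            if s is None:
                state.pop(k, None)
            else:
                state[k] = s
    return state

# Keys are date strings; the 06-15 key sees no data for two batches
# and is expired, while 06-16 keeps accumulating.
batches = [[("2014-06-15", 1)], [("2014-06-16", 2)], [("2014-06-16", 3)]]
print(run(batches))
# → {'2014-06-16': (5, 2)}
```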