Hey I am new to spark streaming and apologize if these questions have been asked.
* In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct? * If the above statement is correct, what functions to use if one wants to do processing on the continuous stream batches of data? I see 2 functions, reduceByKeyAndWindow and updateStateByKey which serve this purpose. My use case is an aggregation and doesn't fit a windowing scenario. * As for updateStateByKey, I have a few questions. ** Over time, will spark stage original data somewhere to replay in case of failures? Say the Spark job run for weeks, I am wondering how that sustains? ** Say my reduce key space is partitioned by some date field and I would like to stop processing old dates after a period time (this is not a simply windowing scenario as which date the data belongs to is not the same thing when the data arrives). How can I handle this to tell spark to discard data for old dates? Thank you, Best Chen -- Chen Song