I have a question about the paper "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica, available at http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
Specifically, I'm interested in Section 3.2 on page 5, "Timing Considerations", which discusses external timestamps. I'm looking to use the second method described there and correct for late records at the application level. The paper says:

"[The application] could output a new count for time interval [t, t+1) at time t+5, based on the records for this interval received between t and t+5. This computation can be performed with an efficient incremental reduce operation that adds the old counts computed at t+1 to the counts of new records since then, avoiding wasted work."

Q1: If data on the same DStream arrives 24 hours late (i.e., at t = t0 + 24 hr) and I'm recording per-minute aggregates, wouldn't the RDD holding the data from 24 hours ago already have been deleted from disk by Spark? (I'd hope so, since otherwise it would run out of space.)

Q2: The paper mentions an "incremental reduce", and I'd like to know what that is. I currently use reduce to get an aggregate of counts. How does an incremental reduce differ?

Thanks -A
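For context, here is a minimal sketch of what I *think* the paper means by incremental reduce. This is plain Python, not actual Spark code, and all the names are mine; I'd like to confirm whether this matches the paper's intent and, if so, how to express it on a DStream:

```python
from collections import Counter

def incremental_reduce(old_counts, late_records):
    """Merge the counts of newly arrived (late) records into the counts
    previously computed for the same interval, instead of recomputing the
    aggregate over all records from scratch."""
    updated = Counter(old_counts)
    updated.update(late_records)  # adds only the counts of the new records
    return dict(updated)

# Counts computed at t+1 for interval [t, t+1):
counts_at_t1 = {"pageA": 10, "pageB": 4}
# Late records for the same interval, arriving before t+5:
late = ["pageA", "pageC", "pageA"]

print(incremental_reduce(counts_at_t1, late))
# → {'pageA': 12, 'pageB': 4, 'pageC': 1}
```

If that is the idea, my question reduces to whether Spark still holds the old per-interval counts when the late records show up.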
