I have a question about the following paper,
"Discretized Streams: Fault-Tolerant Streaming Computation at Scale"
by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker,
and Ion Stoica, available at
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf

Specifically, I'm interested in Section 3.2 on page 5, "Timing
Considerations". This section discusses external timestamps. I'm looking to
use the second method it describes and correct for late records at the
application level.
The paper says: "[application] could output a new count for time interval
[t, t+1) at time t+5, based on the records for this interval received
between t and t+5. This computation can be performed with an efficient
incremental reduce operation that adds the old counts computed at t+1 to
the counts of new records since then, avoiding wasted work."
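
To check that I'm reading this right, here is a rough sketch of how I
imagine the correction looking in Spark Streaming. All of the names, the
socket source, the field layout, and the durations below are my own
guesses, not something the paper specifies:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object LateCountSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("LateCountSketch")
      val ssc  = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint("hdfs:///tmp/checkpoints") // stateful ops need this

      // Hypothetical source of "externalTimestampMillis,payload" lines.
      val lines = ssc.socketTextStream("localhost", 9999)

      // Key each record by the minute bucket of its *external* timestamp,
      // so a late record still lands in its original interval.
      val perMinute = lines.map { line =>
        val ts = line.split(",")(0).toLong
        (ts / 60000 * 60000, 1L)
      }

      // Keep a running count per interval as state; each batch adds the
      // newly arrived (possibly late) records to the count output earlier.
      val corrected = perMinute.updateStateByKey[Long] {
        (newCounts: Seq[Long], oldCount: Option[Long]) =>
          Some(oldCount.getOrElse(0L) + newCounts.sum)
      }

      corrected.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }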

Q1:
If my data arrives 24 hours late on the same DStream (i.e. t = t0 + 24 hr)
and I'm recording per-minute aggregates, wouldn't the RDD holding the data
that arrived 24 hours earlier already have been deleted from disk by Spark?
(I'd hope so; otherwise it would run out of space.)
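
In other words, do I need to keep old RDDs around longer with something
like the call below (the duration is purely illustrative, and ssc is the
StreamingContext from the sketch above), or is the idea to keep only the
aggregated counts alive as state, as updateStateByKey does?

  import org.apache.spark.streaming.Minutes

  // Illustrative only: retain each generated RDD for at least 30 minutes
  // instead of Spark Streaming's default, so later batches can still
  // reach back to it. 30 minutes is a made-up value.
  ssc.remember(Minutes(30))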

Q2:
The paper talks about an "incremental reduce", and I'd like to know what
it is. I already use reduce to get an aggregate of counts; what makes a
reduce "incremental"?
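
My guess is that it refers to the invertible form of reduceByKeyAndWindow,
where each slide adds the values that entered the window and subtracts the
ones that left instead of re-reducing the whole window, but I'd appreciate
confirmation. A rough sketch of what I mean, reusing perMinute from the
sketch above (the window and slide durations are made up):

  import org.apache.spark.streaming.{Minutes, Seconds}

  // Supply both a reduce function and its inverse so that each slide only
  // adds counts entering the window and subtracts counts leaving it,
  // rather than recomputing the entire window. This variant requires
  // checkpointing to be enabled (as in the sketch above).
  val windowedCounts = perMinute.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b, // counts entering the window
    (a: Long, b: Long) => a - b, // counts leaving the window
    Minutes(5),                  // window length (illustrative)
    Seconds(60)                  // slide interval (illustrative)
  )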

Thanks
-A
