That makes sense. I wonder if any of the authors of that paper could comment.
From: dachuan [mailto:[email protected]] Sent: February-19-14 3:55 PM To: [email protected] Subject: Re: Q: Discretized Streams: Fault-Tolerant Streaming Computation paper It seems StreamingContext has a function: def remember(duration: Duration) { graph.remember(duration) } and in my opinion, incremental reduce means: 1 2 3 4 5 6 window_size =5 sum_of_first_window = 1+2+3+4+5=15 sum_of_second_window_method_1=2+3+4+5+6=20 sum_of_second_window_method_2=15+6-1=20 On Wed, Feb 19, 2014 at 3:45 PM, Adrian Mocanu <[email protected]<mailto:[email protected]>> wrote: I have a question on the following paper "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" written by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica and available at http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf Specifically I'm interested in Section 3.2 on page 5 called "Timing Considerations". This section talks about external timestamp. For me I'm looking to use method 2 and correct for late records at the application level. The paper says "[application] could output a new count for time interval [t, t+1) at time t+5, based on the records for this interval received between t and t+5. This computation can be performed with an efficient incremental reduce operation that adds the old counts computed at t+1 to the counts of new records since then, avoiding wasted work." Q1: If my data comes 24 hours late on the same DStream (ie: t=t0+24hr) and I'm recording per minute aggregates, wouldn't the RDD with data which came 24 hours ago be already deleted from disk by Spark? (I'd hope so otherwise it runs out of space) Q2: The paper talks about "incremental reduce". I'd like to know what it is. I do use reduce so I could get an aggregate of counts. What is this incremental reduce? Thanks -A -- Dachuan Huang Cellphone: 614-390-7234 2015 Neil Avenue Ohio State University Columbus, Ohio U.S.A. 43210
