Assaf, I think you are using the term "window" differently than Structured Streaming,... Also, you didn't consider groupBy. Here is an example: I want to maintain, for every minute over the last six hours, a computation (trend or average or stddev) on a five-minute window (from t-4 to t). So, 1. My window size is 5 minutes 2. The window slides every 1 minute (so, there is a new 5-minute window for every minute) 3. Old windows should be purged if they are 6 hours old (based on event time vs. clock?) Option 3 is currently missing - the streaming job keeps all windows forever, as the app may want to access very old windows, unless it would explicitly say otherwise.
Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson <assaf.mendel...@rsa.com> wrote: > Hi, > > Should comments come here or in the JIRA? > > Any, I am a little confused on the need to expose this as an API to begin > with. > > Let’s consider for a second the most basic behavior: We have some input > stream and we want to aggregate a sum over a time window. > > This means that the window we should be looking at would be the maximum > time across our data and back by the window interval. Everything older can > be dropped. > > When new data arrives, the maximum time cannot move back so we generally > drop everything tool old. > > This basically means we save only the latest time window. > > This simpler model would only break if we have a secondary aggregation > which needs the results of multiple windows. > > Is this the use case we are trying to solve? > > If so, wouldn’t just calculating the bigger time window across the entire > aggregation solve this? > > Am I missing something here? > > > > *From:* Michael Armbrust [via Apache Spark Developers List] [mailto: > ml-node+[hidden email] > <http:///user/SendEmail.jtp?type=node&node=19591&i=0>] > *Sent:* Thursday, October 27, 2016 3:04 AM > *To:* Mendelson, Assaf > *Subject:* Re: Watermarking in Structured Streaming to drop late data > > > > And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124 > > > > On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email] > <http:///user/SendEmail.jtp?type=node&node=19590&i=0>> wrote: > > Hey all, > > > > We are planning implement watermarking in Structured Streaming that would > allow us handle late, out-of-order data better. Specially, when we are > aggregating over windows on event-time, we currently can end up keeping > unbounded amount data as state. We want to define watermarks on the event > time in order mark and drop data that are "too late" and accordingly age > out old aggregates that will not be updated any more. > > > > To enable the user to specify details like lateness threshold, we are > considering adding a new method to Dataset. We would like to get more > feedback on this API. Here is the design doc > > > > https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z > LIS03xhkfCQ/ > > > > Please comment on the design and proposed APIs. > > > > Thank you very much! > > > > TD > > > > > ------------------------------ > > *If you reply to this email, your message will be added to the discussion > below:* > > http://apache-spark-developers-list.1001551.n3.nabble.com/Watermarking-in- > Structured-Streaming-to-drop-late-data-tp19589p19590.html > > To start a new topic under Apache Spark Developers List, email [hidden > email] <http:///user/SendEmail.jtp?type=node&node=19591&i=1> > To unsubscribe from Apache Spark Developers List, click here. > NAML > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> > > ------------------------------ > View this message in context: RE: Watermarking in Structured Streaming to > drop late data > <http://apache-spark-developers-list.1001551.n3.nabble.com/Watermarking-in-Structured-Streaming-to-drop-late-data-tp19589p19591.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. >