Re: Watermarking in Structured Streaming to drop late data

Ofir Manor Thu, 27 Oct 2016 01:53:26 -0700

Assaf,
I think you are using the term "window" differently than Structured
Streaming does. Also, you didn't consider groupBy. Here is an example:
I want to maintain, for every minute over the last six hours, a computation
(trend or average or stddev) on a five-minute window (from t-4 to t). So,
1. My window size is 5 minutes
2. The window slides every 1 minute (so, there is a new 5-minute window for
every minute)
3. Old windows should be purged once they are 6 hours old (based on event
time or on the wall clock?)
Item 3 is currently missing: the streaming job keeps all windows forever,
since the app may want to access very old windows, unless it explicitly
says otherwise.
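
To make the three items concrete, here is a rough sketch in plain Python. It
is not the Spark API, and the names (windows_for, SlidingAverages, RETENTION)
are made up for illustration. I am assuming windows are half-open intervals
aligned to whole minutes, and that purging is driven by the maximum event
time seen so far rather than by the wall clock, which is exactly the open
question in item 3.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)    # item 1: window size
SLIDE = timedelta(minutes=1)     # item 2: a new window starts every minute
RETENTION = timedelta(hours=6)   # item 3: purge threshold (assumed event-time based)
EPOCH = datetime(1970, 1, 1)


def windows_for(event_time):
    """Return the start times of every window [start, start + WINDOW)
    that contains event_time, oldest first."""
    # Align the event time down to the nearest slide boundary.
    start = event_time - ((event_time - EPOCH) % SLIDE)
    starts = []
    while start > event_time - WINDOW:
        starts.append(start)
        start -= SLIDE
    return sorted(starts)


class SlidingAverages:
    """Hypothetical in-memory state: one running average per window."""

    def __init__(self):
        self.state = {}             # window start -> [sum, count]
        self.max_event_time = None  # highest event time seen so far

    def add(self, event_time, value):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Each event updates every window it falls into (5 of them here).
        for start in windows_for(event_time):
            acc = self.state.setdefault(start, [0.0, 0])
            acc[0] += value
            acc[1] += 1
        self.purge()

    def purge(self):
        # Drop windows that ended more than RETENTION before the max event time.
        cutoff = self.max_event_time - RETENTION
        for start in [s for s in self.state if s + WINDOW <= cutoff]:
            del self.state[start]

    def average(self, start):
        total, count = self.state[start]
        return total / count
```

Without the purge step, the state dictionary grows by one entry per minute
forever, which is the behavior I am describing above.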



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: [email protected]

On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson <[email protected]>
wrote:

> Hi,
>
> Should comments come here or in the JIRA?
>
> Anyway, I am a little confused about the need to expose this as an API to
> begin with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back, so we generally
> drop everything that is too old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]> wrote:
>
> Hey all,
>
>
>
> We are planning to implement watermarking in Structured Streaming, which
> would allow us to handle late, out-of-order data better. Specifically, when
> we are aggregating over windows on event time, we can currently end up
> keeping an unbounded amount of data as state. We want to define watermarks
> on the event time in order to mark and drop data that is "too late" and
> accordingly age out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc:
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>
