Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Assaf, thanks for the feedback!

On Thu, Oct 27, 2016 at 3:28 AM, assaf.mendelson wrote:

> [quoted message elided; it appears in full below]

RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
Thanks.
This article is excellent. It completely explains everything.
I would add it as a reference to any and all explanations of structured 
streaming (and in the case of watermarking, I simply didn’t understand the 
definition before reading this).

Thanks,
Assaf.


From: kostas papageorgopoylos [via Apache Spark Developers List]
Sent: Thursday, October 27, 2016 10:17 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data

[quoted message elided; it appears in full below]

Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Hello Assaf,

I think you are missing the fact that we want to compute over the event
time of the data (i.e., the data generation time), which may arrive at
Spark out of order and late. And we still want to aggregate over that late
data. The watermark is an estimate made by the system that no data with an
event time older than the watermark will arrive from now on.

If this basic context is clear, then please read the design doc for
further details, and please comment in the doc for more specific design
discussions.
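To make that definition concrete, here is a toy sketch of the semantics (all names invented; this is not Spark code and not the proposed API): the watermark trails the maximum event time seen so far by an allowed-lateness threshold, and records whose event time falls behind it are dropped.

```python
# Toy model of the watermark semantics described above (not Spark code).
# The watermark trails the max event time seen by a lateness threshold;
# records with event time older than the watermark are considered "too late".

def process(events, lateness):
    """events: iterable of event-time integers, in arrival order."""
    max_event_time = float("-inf")
    kept, dropped = [], []
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - lateness
        if t < watermark:
            dropped.append(t)   # arrived too late: older than the watermark
        else:
            kept.append(t)      # on time, or late but within the threshold
    return kept, dropped

# Arrival order 10, 12, 5, 11 with lateness 3: the watermark is 9 when
# event time 5 arrives, so 5 is dropped; 11 is late but within threshold.
kept, dropped = process([10, 12, 5, 11], lateness=3)
```

Note that the watermark never moves backwards, because the maximum event time never does, which is what lets the system finalize old state.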

On Thu, Oct 27, 2016 at 1:52 AM, Ofir Manor wrote:

> [quoted message elided; it appears in full below]

Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Ofir Manor
Assaf,
I think you are using the term "window" differently than Structured
Streaming does. Also, you didn't consider groupBy. Here is an example:
I want to maintain, for every minute over the last six hours, a computation
(trend, average, or stddev) on a five-minute window (from t-4 to t). So:
1. My window size is 5 minutes.
2. The window slides every 1 minute (so there is a new 5-minute window for
every minute).
3. Old windows should be purged once they are 6 hours old (based on event
time vs. the clock?).
Item 3 is currently missing: the streaming job keeps all windows forever,
since the app may want to access very old windows, unless it explicitly
says otherwise.
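To make the numbers in the example above concrete, here is a standalone sketch (invented names, not the Structured Streaming API): each event falls into several overlapping 5-minute windows sliding by 1 minute, and windows older than the retention horizon would be purged.

```python
# Sketch of the sliding-window bookkeeping in the example above (toy code).

def windows_for(t, size=5, slide=1):
    """All sliding windows [s, s + size) with start a multiple of `slide`
    that contain event time t (times in minutes)."""
    last = (t // slide) * slide               # latest window start <= t
    return sorted((s, s + size) for s in range(last, t - size, -slide))

def purge(windows, max_event_time, retention=6 * 60):
    """Drop windows that ended more than `retention` minutes before the
    max event time seen (item 3 in the list above)."""
    return [w for w in windows if w[1] > max_event_time - retention]

# An event at minute t=7 falls into the five windows [3,8) .. [7,12).
ws = windows_for(7)
```

Whether "6 hours old" is measured against event time or the wall clock is exactly the open question raised in item 3; the sketch uses event time.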


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson wrote:

> [quoted message elided; it appears in full below]

Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread kostas papageorgopoylos
Hi all

I would highly recommend that all users and devs interested in the design
suggestions and discussions for Structured Streaming watermarking take a
look at the following links along with the design document. They help in
understanding the notions of watermark, out-of-order data, and possible
use cases.

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Kind Regards


2016-10-27 9:46 GMT+03:00 assaf.mendelson:

> [quoted message elided; it appears in full below]

RE: Watermarking in Structured Streaming to drop late data

2016-10-26 Thread assaf.mendelson
Hi,
Should comments come here or in the JIRA?
Anyway, I am a little confused about the need to expose this as an API to
begin with.
Let’s consider for a second the most basic behavior: we have some input
stream and we want to aggregate a sum over a time window.
This means that the window we should be looking at spans from the maximum
event time across our data back by the window interval. Everything older
can be dropped.
When new data arrives, the maximum time cannot move back, so we generally
drop everything too old.
This basically means we save only the latest time window.
This simpler model would only break if we have a secondary aggregation
which needs the results of multiple windows.
Is this the use case we are trying to solve?
If so, wouldn’t just calculating the bigger time window across the entire
aggregation solve this?
Am I missing something here?
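The simpler model described above can be sketched as follows (a toy illustration with invented names, to make the proposal concrete). Note in the comments that it silently discards any record whose event time has fallen behind the single retained window, which is the crux of the discussion in this thread.

```python
# Toy version of the "keep only the latest time window" model above.
# State is a single running sum over [max_time - window, max_time].

def run(events, window):
    max_time = float("-inf")
    retained = []            # (event_time, value) pairs still in the window
    for t, v in events:
        max_time = max(max_time, t)
        cutoff = max_time - window
        # the maximum time never moves back, so older records are dropped
        retained = [(et, ev) for et, ev in retained if et >= cutoff]
        if t >= cutoff:
            retained.append((t, v))
    return sum(ev for _, ev in retained)

# With window=5: the record at t=2 is dropped once t=10 has been seen,
# even though it may merely have arrived late and out of order.
total = run([(10, 1.0), (2, 1.0), (11, 1.0)], window=5)
```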

From: Michael Armbrust [via Apache Spark Developers List]
Sent: Thursday, October 27, 2016 3:04 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data

[quoted message elided; it appears in full below]

Re: Watermarking in Structured Streaming to drop late data

2016-10-26 Thread Michael Armbrust
And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124

On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das wrote:

> [quoted message elided; the original announcement appears in full below]


Watermarking in Structured Streaming to drop late data

2016-10-26 Thread Tathagata Das
Hey all,

We are planning to implement watermarking in Structured Streaming that
would allow us to handle late, out-of-order data better. Specifically,
when we are aggregating over windows on event time, we can currently end
up keeping an unbounded amount of data as state. We want to define
watermarks on the event time in order to mark and drop data that is "too
late" and accordingly age out old aggregates that will not be updated any
more.

To enable the user to specify details like the lateness threshold, we are
considering adding a new method to Dataset. We would like to get more
feedback on this API. Here is the design doc:

https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/

Please comment on the design and proposed APIs.

Thank you very much!

TD
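For readers skimming the thread, the bounded-state idea in the announcement can be illustrated with a small standalone sketch (pure Python, invented names; the actual Dataset method was still under design at this point). Windowed counts are kept as open state, and the watermark is used to age out windows that can no longer receive late data.

```python
# Sketch of aging out windowed aggregates with a watermark (not Spark code).
from collections import defaultdict

def windowed_counts(events, window, lateness):
    """events: event times in arrival order. Tumbling windows of `window`
    units. Windows ending at or before the watermark are finalized and
    evicted, so the in-memory state stays bounded."""
    state = defaultdict(int)      # window start -> count (open aggregates)
    finalized = {}                # window start -> count (aged out)
    max_time = float("-inf")
    for t in events:
        max_time = max(max_time, t)
        watermark = max_time - lateness
        start = (t // window) * window
        if start + window <= watermark:
            continue              # too late: its window was already aged out
        state[start] += 1
        # age out every open window that ends at or before the watermark
        for s in [s for s in state if s + window <= watermark]:
            finalized[s] = state.pop(s)
    return dict(state), finalized

# Windows of 10 units, 5 units of allowed lateness: the late record at t=3
# still lands in window [0, 10), which is only finalized once the
# watermark (max event time minus 5) passes the window's end.
open_state, done = windowed_counts([1, 2, 11, 3, 25], window=10, lateness=5)
```

The lateness threshold trades completeness for state size: a larger threshold admits more late data but keeps more open windows in memory.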