Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-11-20 Thread David Morávek
The paper looks interesting, but it might not deliver the described
benefit in practice:

1. It forces you to remember all keys in broadcast operator state
(partitioned state is impossible without timeouts, etc.). Forever. This
alone is a blocker for a bunch of pipelines. The main reason the state is
needed at all is that you can't simply recompute a per-key WM the way you
can with the global one.
2. It's super easy to run into idleness issues (it's almost guaranteed).
3. Thinking about multiple chained aggregations on different keys is just
... 🤯
4. This is a significant change to public APIs.

The main problem the paper needs to address is idleness and stuck per-key
watermarks (pipeline not making progress).
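
For reference, this is roughly how a stuck global watermark is handled today
(a minimal sketch against the current WatermarkStrategy API; Event and
getEventTime() are placeholders), and there is no per-key equivalent of this
escape hatch:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Global watermark with an idleness timeout: a source that stops emitting is
// marked idle and no longer holds the watermark back. With per-key watermarks,
// an idle key would keep its own watermark (and everything keyed on it) stuck
// unless a similar per-key timeout is introduced.
WatermarkStrategy<Event> strategy =
    WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, recordTs) -> event.getEventTime())
        .withIdleness(Duration.ofMinutes(1));

events.assignTimestampsAndWatermarks(strategy);   // events: DataStream<Event> (placeholder)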

What do you think about these issues?

Best,
D.


On Sat, Nov 18, 2023 at 6:41 PM Tawfek Yasser Tawfek 
wrote:

> Hello Alexander,
>
> Will we continue the discussion?
>
>
>
> Thanks & BR,
>
> Tawfik
>
> 
> From: Tawfek Yasser Tawfek 
> Sent: 30 October 2023 15:32
> To: dev@flink.apache.org 
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
> Hi Alexander,
>
> Thank you for your reply.
>
> Yes. As you showed, the keyed-watermarks mechanism is mainly required for
> the case when we need a fine-grained calculation for each partition
> [calculation over data produced by each individual sensor]. Since
> scalability factors require partitioning the calculations, the
> keyed-watermarks mechanism is designed for exactly this type of problem.
>
> Thanks,
> Tawfik
> ________
> From: Alexander Fedulov 
> Sent: 30 October 2023 13:37
> To: dev@flink.apache.org 
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
>
> Hi Tawfek,
>
> > The idea is to generate a watermark for each key (sub-stream), in order
> to avoid the fast progress of the global watermark which affects low-rate
> sources.
>
> Let's consider the sensors example from the paper. Shouldn't it be about
> the delay between the time of taking the measurement and its arrival at
> Flink, rather than the rate at which the measurements are produced? If a
> particular sensor produces no data during a specific time window, it
> doesn't make sense to wait for it—there won't be any corresponding
> measurement arriving because none was produced. Thus, I believe we should
> be talking about situations where data from certain sensors can arrive with
> significant delay compared to most other sensors.
>
> From the perspective of data aggregation, there are two main scenarios:
> 1) Calculation over data produced by multiple sensors
> 2) Calculation over data produced by an individual sensor
>
> In scenario 1), there are two subcategories:
> a) Meaningful results cannot be produced without data from those delayed
> sensors; hence, you need to wait longer.
>   => Time is propagated by the mix of all sources. You just need to set
> a bounded watermark with enough lag to accommodate the delayed results.
> This is precisely what event time processing and bounded watermarks are for
> (no keyed watermarking is required).
> b) You need to produce the results as they are and perhaps patch them later
> when the delayed data arrives.
>  => Time is propagated by the mix of all sources. You produce the
> results as they are but utilize allowedLateness to patch the aggregates if
> needed (no keyed watermarking is required).
>
> So, is it correct to say that keyed watermarking is applicable only in
> scenario 2)?
>
> Best,
> Alexander Fedulov
>
> On Sat, 28 Oct 2023 at 14:33, Tawfek Yasser Tawfek 
> wrote:
>
> > Thanks, Alexander for your reply.
> >
> > Our solution originated from this inquiry on Stack Overflow:
> >
> >
> https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i
> >
> > The idea is to generate a watermark for each key (sub-stream), in order
> to
> > avoid the fast progress of the global watermark which affects low-rate
> > sources.
> >
> > Instead of using only one watermark (vanilla/global watermark), we
> changed
> > the API to allow moving the keyBy() before the
> > assignTimestampsAndWatermarks() so the stream will be partitioned then
> the
> > TimestampsAndWatermarkOperator will handle the generation of each
> watermark
> > for each key (source/sub-stream/partition).
> >
> > *Let's discuss more if you want I have a presentation at a conference, we
> > c

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-11-18 Thread Tawfek Yasser Tawfek
Hello Alexander,

Will we continue the discussion?



Thanks & BR,

Tawfik


From: Tawfek Yasser Tawfek 
Sent: 30 October 2023 15:32
To: dev@flink.apache.org 
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink

Hi Alexander,

Thank you for your reply.

Yes. As you showed, the keyed-watermarks mechanism is mainly required for the
case when we need a fine-grained calculation for each partition [calculation
over data produced by each individual sensor]. Since scalability factors
require partitioning the calculations, the keyed-watermarks mechanism is
designed for exactly this type of problem.

Thanks,
Tawfik

From: Alexander Fedulov 
Sent: 30 October 2023 13:37
To: dev@flink.apache.org 
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink


Hi Tawfek,

> The idea is to generate a watermark for each key (sub-stream), in order
to avoid the fast progress of the global watermark which affects low-rate
sources.

Let's consider the sensors example from the paper. Shouldn't it be about
the delay between the time of taking the measurement and its arrival at
Flink, rather than the rate at which the measurements are produced? If a
particular sensor produces no data during a specific time window, it
doesn't make sense to wait for it—there won't be any corresponding
measurement arriving because none was produced. Thus, I believe we should
be talking about situations where data from certain sensors can arrive with
significant delay compared to most other sensors.

From the perspective of data aggregation, there are two main scenarios:
1) Calculation over data produced by multiple sensors
2) Calculation over data produced by an individual sensor

In scenario 1), there are two subcategories:
a) Meaningful results cannot be produced without data from those delayed
sensors; hence, you need to wait longer.
  => Time is propagated by the mix of all sources. You just need to set
a bounded watermark with enough lag to accommodate the delayed results.
This is precisely what event time processing and bounded watermarks are for
(no keyed watermarking is required).
b) You need to produce the results as they are and perhaps patch them later
when the delayed data arrives.
 => Time is propagated by the mix of all sources. You produce the
results as they are but utilize allowedLateness to patch the aggregates if
needed (no keyed watermarking is required).

So, is it correct to say that keyed watermarking is applicable only in
scenario 2)?

Best,
Alexander Fedulov

On Sat, 28 Oct 2023 at 14:33, Tawfek Yasser Tawfek 
wrote:

> Thanks, Alexander for your reply.
>
> Our solution originated from this inquiry on Stack Overflow:
>
> https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i
>
> The idea is to generate a watermark for each key (sub-stream), in order to
> avoid the fast progress of the global watermark which affects low-rate
> sources.
>
> Instead of using only one watermark (vanilla/global watermark), we changed
> the API to allow moving the keyBy() before the
> assignTimestampsAndWatermarks() so the stream will be partitioned then the
> TimestampsAndWatermarkOperator will handle the generation of each watermark
> for each key (source/sub-stream/partition).
>
> *Let's discuss more if you want. I have a presentation at a conference, so
> we can meet, or whatever is suitable.*
>
> Also, I contacted David Anderson one year ago and he followed me step by
> step and helped me a lot.
>
> I attached some messages with David.
>
>
> *Thanks & BR,*
>
>
> <http://www.nu.edu.eg/>
>
>
>
>
>  Tawfik Yasser Tawfik
>
> * Teaching Assistant | AI-ITCS-NU*
>
>  Office: UB1-B, Room 229
>
>  26th of July Corridor, Sheikh Zayed City, Giza, Egypt
> ----------
> *From:* Alexander Fedulov 
> *Sent:* 27 October 2023 20:09
> *To:* dev@flink.apache.org 
> *Subject:* Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
>
> Hi Tawfek,
>
> Thanks for sharing. I am trying to understand what exact real-life problem
> you are tackling with this approach. My understanding from skimming through
> the paper is that you are concerned about some outlier event producers from
> which the events can be delayed beyond what is expected in the overall
> system.
> Do I get it correctly that the keyed watermarking only targets scenarios of
> calculating keyed windows (whic

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-10-30 Thread Tawfek Yasser Tawfek
Hi Alexander,

Thank you for your reply.

Yes. As you showed, the keyed-watermarks mechanism is mainly required for the
case when we need a fine-grained calculation for each partition [calculation
over data produced by each individual sensor]. Since scalability factors
require partitioning the calculations, the keyed-watermarks mechanism is
designed for exactly this type of problem.
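
To illustrate scenario 2) concretely, this is a rough sketch of the vanilla
setup that suffers from the problem (Event, getSensorId(), getEventTime(), and
AverageAggregate are illustrative placeholders, not from the paper):

// A per-sensor aggregate keyed by sensor id, using today's single global
// watermark. Events from a sensor that lags behind the fastest sensors arrive
// "late" relative to the global watermark and are silently dropped from their
// windows; keyed watermarks let each sensor's windows progress at its own pace.
events                                                    // DataStream<Event> (placeholder)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((e, ts) -> e.getEventTime()))
    .keyBy(Event::getSensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AverageAggregate());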

Thanks,
Tawfik

From: Alexander Fedulov 
Sent: 30 October 2023 13:37
To: dev@flink.apache.org 
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink


Hi Tawfek,

> The idea is to generate a watermark for each key (sub-stream), in order
to avoid the fast progress of the global watermark which affects low-rate
sources.

Let's consider the sensors example from the paper. Shouldn't it be about
the delay between the time of taking the measurement and its arrival at
Flink, rather than the rate at which the measurements are produced? If a
particular sensor produces no data during a specific time window, it
doesn't make sense to wait for it—there won't be any corresponding
measurement arriving because none was produced. Thus, I believe we should
be talking about situations where data from certain sensors can arrive with
significant delay compared to most other sensors.

From the perspective of data aggregation, there are two main scenarios:
1) Calculation over data produced by multiple sensors
2) Calculation over data produced by an individual sensor

In scenario 1), there are two subcategories:
a) Meaningful results cannot be produced without data from those delayed
sensors; hence, you need to wait longer.
  => Time is propagated by the mix of all sources. You just need to set
a bounded watermark with enough lag to accommodate the delayed results.
This is precisely what event time processing and bounded watermarks are for
(no keyed watermarking is required).
b) You need to produce the results as they are and perhaps patch them later
when the delayed data arrives.
 => Time is propagated by the mix of all sources. You produce the
results as they are but utilize allowedLateness to patch the aggregates if
needed (no keyed watermarking is required).

So, is it correct to say that keyed watermarking is applicable only in
scenario 2)?

Best,
Alexander Fedulov

On Sat, 28 Oct 2023 at 14:33, Tawfek Yasser Tawfek 
wrote:

> Thanks, Alexander for your reply.
>
> Our solution originated from this inquiry on Stack Overflow:
>
> https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i
>
> The idea is to generate a watermark for each key (sub-stream), in order to
> avoid the fast progress of the global watermark which affects low-rate
> sources.
>
> Instead of using only one watermark (vanilla/global watermark), we changed
> the API to allow moving the keyBy() before the
> assignTimestampsAndWatermarks() so the stream will be partitioned then the
> TimestampsAndWatermarkOperator will handle the generation of each watermark
> for each key (source/sub-stream/partition).
>
> *Let's discuss more if you want. I have a presentation at a conference, so
> we can meet, or whatever is suitable.*
>
> Also, I contacted David Anderson one year ago and he followed me step by
> step and helped me a lot.
>
> I attached some messages with David.
>
>
> *Thanks & BR,*
>
>
> <http://www.nu.edu.eg/>
>
>
>
>
>  Tawfik Yasser Tawfik
>
> * Teaching Assistant | AI-ITCS-NU*
>
>  Office: UB1-B, Room 229
>
>  26th of July Corridor, Sheikh Zayed City, Giza, Egypt
> ----------
> *From:* Alexander Fedulov 
> *Sent:* 27 October 2023 20:09
> *To:* dev@flink.apache.org 
> *Subject:* Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
>
> Hi Tawfek,
>
> Thanks for sharing. I am trying to understand what exact real-life problem
> you are tackling with this approach. My understanding from skimming through
> the paper is that you are concerned about some outlier event producers from
> which the events can be delayed beyond what is expected in the overall
> system.
> Do I get it correctly that the keyed watermarking only targets scenarios of
> calculating keyed windows (which are also keyed by the same producer ids)?
>
> Best,
> Alexander Fedulov
>
> On Fri, 27 Oct 2023 at 19:07, Tawfek Yasser Tawfek 
> wrote:
>
> > Dear Apache Flink Development Team,
> >
> > I hope this email finds you well. I pro

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-10-30 Thread Alexander Fedulov
Hi Tawfek,

> The idea is to generate a watermark for each key (sub-stream), in order
to avoid the fast progress of the global watermark which affects low-rate
sources.

Let's consider the sensors example from the paper. Shouldn't it be about
the delay between the time of taking the measurement and its arrival at
Flink, rather than the rate at which the measurements are produced? If a
particular sensor produces no data during a specific time window, it
doesn't make sense to wait for it—there won't be any corresponding
measurement arriving because none was produced. Thus, I believe we should
be talking about situations where data from certain sensors can arrive with
significant delay compared to most other sensors.

From the perspective of data aggregation, there are two main scenarios:
1) Calculation over data produced by multiple sensors
2) Calculation over data produced by an individual sensor

In scenario 1), there are two subcategories:
a) Meaningful results cannot be produced without data from those delayed
sensors; hence, you need to wait longer.
  => Time is propagated by the mix of all sources. You just need to set
a bounded watermark with enough lag to accommodate the delayed results.
This is precisely what event time processing and bounded watermarks are for
(no keyed watermarking is required).
b) You need to produce the results as they are and perhaps patch them later
when the delayed data arrives.
 => Time is propagated by the mix of all sources. You produce the
results as they are but utilize allowedLateness to patch the aggregates if
needed (no keyed watermarking is required).
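
For 1a) and 1b) together, something like this is usually enough (just a
sketch; Event, getRegion(), getEventTime(), and AverageAggregate are
hypothetical placeholders):

// One aggregate over data from many sensors, driven by the global watermark.
// 1a): a bounded out-of-orderness delay large enough to cover lagging sensors.
// 1b): allowedLateness so windows fire on time and late data patches the result.
DataStream<Result> results = events                  // events: DataStream<Event> (placeholder)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
            .withTimestampAssigner((e, ts) -> e.getEventTime()))
    .keyBy(Event::getRegion)                         // keyed by region, not by individual sensor
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(30))
    .aggregate(new AverageAggregate());              // AverageAggregate: a user AggregateFunction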

So, is it correct to say that keyed watermarking is applicable only in
scenario 2)?

Best,
Alexander Fedulov

On Sat, 28 Oct 2023 at 14:33, Tawfek Yasser Tawfek 
wrote:

> Thanks, Alexander for your reply.
>
> Our solution originated from this inquiry on Stack Overflow:
>
> https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i
>
> The idea is to generate a watermark for each key (sub-stream), in order to
> avoid the fast progress of the global watermark which affects low-rate
> sources.
>
> Instead of using only one watermark (vanilla/global watermark), we changed
> the API to allow moving the keyBy() before the
> assignTimestampsAndWatermarks() so the stream will be partitioned then the
> TimestampsAndWatermarkOperator will handle the generation of each watermark
> for each key (source/sub-stream/partition).
>
> *Let's discuss more if you want. I have a presentation at a conference, so
> we can meet, or whatever is suitable.*
>
> Also, I contacted David Anderson one year ago and he followed me step by
> step and helped me a lot.
>
> I attached some messages with David.
>
>
> *Thanks & BR,*
>
>
> <http://www.nu.edu.eg/>
>
>
>
>
>  Tawfik Yasser Tawfik
>
> * Teaching Assistant | AI-ITCS-NU*
>
>  Office: UB1-B, Room 229
>
>  26th of July Corridor, Sheikh Zayed City, Giza, Egypt
> --------------
> *From:* Alexander Fedulov 
> *Sent:* 27 October 2023 20:09
> *To:* dev@flink.apache.org 
> *Subject:* Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
>
> Hi Tawfek,
>
> Thanks for sharing. I am trying to understand what exact real-life problem
> you are tackling with this approach. My understanding from skimming through
> the paper is that you are concerned about some outlier event producers from
> which the events can be delayed beyond what is expected in the overall
> system.
> Do I get it correctly that the keyed watermarking only targets scenarios of
> calculating keyed windows (which are also keyed by the same producer ids)?
>
> Best,
> Alexander Fedulov
>
> On Fri, 27 Oct 2023 at 19:07, Tawfek Yasser Tawfek 
> wrote:
>
> > Dear Apache Flink Development Team,
> >
> > I hope this email finds you well. I propose an exciting new feature for
> > Apache Flink that has the potential to significantly enhance its
> > capabilities in handling unbounded streams of events, particularly in the
> > context of event-time windowing.
> >
> > As you may be aware, Apache Flink has been at the forefront of Big Data
> > Stream processing engines, leveraging windowing techniques to manage
> > unbounded event streams effectively. The accuracy of the results obtained
> > from these streams relies heavily on the ability to gather all relevant
> > input within a window. At the core of this process are watermarks, which
> > serve as unique timestamps marking the progression of even

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-10-28 Thread Tawfek Yasser Tawfek
Thanks, Alexander for your reply.

Our solution originated from this inquiry on Stack Overflow:
https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i

The idea is to generate a watermark for each key (sub-stream), in order to 
avoid the fast progress of the global watermark which affects low-rate sources.

Instead of using only one watermark (the vanilla/global watermark), we changed
the API to allow moving the keyBy() before the assignTimestampsAndWatermarks(),
so the stream is partitioned first and the TimestampsAndWatermarkOperator then
handles the generation of a separate watermark for each key
(source/sub-stream/partition).
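
To make the API change concrete, the difference in usage is roughly the
following (a sketch only; Event, getSensorId(), and the window/sum calls are
illustrative placeholders, and the second variant only works with our modified
API, not upstream Flink):

// Upstream Flink today: watermarks are assigned before keyBy(), so all keys
// share the watermark of their parallel source/operator instance.
stream                                               // stream: DataStream<Event> (placeholder)
    .assignTimestampsAndWatermarks(strategy)         // strategy: WatermarkStrategy<Event>
    .keyBy(Event::getSensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sum("reading");

// Our modified API (sketch): keyBy() first, then assignTimestampsAndWatermarks(),
// so the TimestampsAndWatermarkOperator generates and tracks one watermark per key.
stream
    .keyBy(Event::getSensorId)
    .assignTimestampsAndWatermarks(strategy)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sum("reading");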

Let's discuss more if you want. I have a presentation at a conference, so we
can meet, or whatever is suitable.

Also, I contacted David Anderson one year ago and he followed me step by step 
and helped me a lot.

I attached some messages with David.


Thanks & BR,


<http://www.nu.edu.eg/>


 Tawfik Yasser Tawfik

 Teaching Assistant | AI-ITCS-NU

 Office: UB1-B, Room 229

 26th of July Corridor, Sheikh Zayed City, Giza, Egypt


From: Alexander Fedulov 
Sent: 27 October 2023 20:09
To: dev@flink.apache.org 
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink


Hi Tawfek,

Thanks for sharing. I am trying to understand what exact real-life problem
you are tackling with this approach. My understanding from skimming through
the paper is that you are concerned about some outlier event producers from
which the events can be delayed beyond what is expected in the overall
system.
Do I get it correctly that the keyed watermarking only targets scenarios of
calculating keyed windows (which are also keyed by the same producer ids)?

Best,
Alexander Fedulov

On Fri, 27 Oct 2023 at 19:07, Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I propose an exciting new feature for
> Apache Flink that has the potential to significantly enhance its
> capabilities in handling unbounded streams of events, particularly in the
> context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Please check this preprint on Research Square:
> https://www.researchsquare.com/article/rs-3395909/v1
>

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-10-27 Thread Alexander Fedulov
Hi Tawfek,

Thanks for sharing. I am trying to understand what exact real-life problem
you are tackling with this approach. My understanding from skimming through
the paper is that you are concerned about some outlier event producers from
which the events can be delayed beyond what is expected in the overall
system.
Do I get it correctly that the keyed watermarking only targets scenarios of
calculating keyed windows (which are also keyed by the same producer ids)?

Best,
Alexander Fedulov

On Fri, 27 Oct 2023 at 19:07, Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I propose an exciting new feature for
> Apache Flink that has the potential to significantly enhance its
> capabilities in handling unbounded streams of events, particularly in the
> context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Please check this preprint on Research Square:
> https://www.researchsquare.com/article/rs-3395909/v1
>
> Thank you for your time and attention to this proposal. We look forward to
> the opportunity to contribute to the continued success and evolution of
> Apache Flink.
>
> Best Regards,
>
> Tawfik Yasser
> Senior Teaching Assistant @ Nile University, Egypt
> Email: tyas...@nu.edu.eg
> LinkedIn: https://www.linkedin.com/in/tawfikyasser/
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-06 Thread Jane Chan
Hi Tawfik,

> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>

Sounds like a significant improvement! Looking forward to the details of
your research.

Best,
Jane

On Thu, Sep 7, 2023 at 9:50 AM liu ron  wrote:

> Hi Tawfik,
>
> Fast and slow streams in distributed scenarios cause the watermark to
> advance too fast, which leads to lost data and is a headache in Flink.
> Can't wait to read your research paper!
>
> Best,
> Ron
>
> Yun Tang  wrote on Wednesday, September 6, 2023 at 14:46:
>
> > Hi Tawfik,
> >
> > Thanks for offering such a proposal, looking forward to your research
> > paper!
> >
> > You could also ask for edit permission on the Flink Improvement Proposals
> > wiki and create a new proposal if you want to contribute this to the
> > community yourself.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >
> > Best
> > Yun Tang
> > ________________
> > From: yuxia 
> > Sent: Wednesday, September 6, 2023 12:31
> > To: dev 
> > Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
> >
> > Hi, Tawfik Yasser.
> > Thanks for the proposal.
> > It sounds exciting. I can't wait to read the research paper for more details.
> >
> > Best regards,
> > Yuxia
> >
> > - Original Message -
> > From: "David Morávek" 
> > To: "dev" 
> > Sent: Tuesday, September 5, 2023, 4:36:51 PM
> > Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
> >
> > Hi Tawfik,
> >
> > It's exciting to see any ongoing research that tries to push Flink
> forward!
> >
> > To get the discussion started, can you please share your paper with the
> > community? Assessing the proposal without further context is tough.
> >
> > Best,
> > D.
> >
> > On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
> > wrote:
> >
> > > Dear Apache Flink Development Team,
> > >
> > > I hope this email finds you well. I am writing to propose an exciting
> new
> > > feature for Apache Flink that has the potential to significantly
> enhance
> > > its capabilities in handling unbounded streams of events, particularly
> in
> > > the context of event-time windowing.
> > >
> > > As you may be aware, Apache Flink has been at the forefront of Big Data
> > > Stream processing engines, leveraging windowing techniques to manage
> > > unbounded event streams effectively. The accuracy of the results
> obtained
> > > from these streams relies heavily on the ability to gather all relevant
> > > input within a window. At the core of this process are watermarks,
> which
> > > serve as unique timestamps marking the progression of events in time.
> > >
> > > However, our analysis has revealed a critical issue with the current
> > > watermark generation method in Apache Flink. This method, which
> operates
> > at
> > > the input stream level, exhibits a bias towards faster sub-streams,
> > > resulting in the unfortunate consequence of dropped events from slower
> > > sub-streams. Our investigations showed that Apache Flink's conventional
> > > watermark generation approach led to an alarming data loss of
> > approximately
> > > 33% when 50% of the keys around the median experienced delays. This
> loss
> > > further escalated to over 37% when 50% of random keys were delayed.
> > >
> > > In response to this issue, we have authored a research paper outlining
> a
> > > novel strategy named "keyed watermarks" to address data loss and
> > > substantially enhance data processing accuracy, achieving at least 99%
> > > accuracy in most scenarios.
> > >
> > > Moreover, we have conducted comprehensive comparative studies to
> evaluate
> > > the effectiveness of our strategy against the conventional watermark
> > > generation method, specifically in terms of event-time tracking
> accuracy.
> > >
> > > We believe that implementing keyed watermarks in Apache Flink can
> greatly
> > > enhance its performance and reliability, making it an even more
> valuable
> > > tool for organizations dealing with complex, high-throughput data
> > > processing tasks.
> > >
> > > We kindly request your consideration of this proposal. We would be
> eager
> > > to discuss further details, provide the full research paper, or
> > collaborate
> > > closely to facilitate the integration of this feature into Apache
> Flink.
> > >
> > > Thank you for your time and attention to this proposal. We look forward
> > to
> > > the opportunity to contribute to the continued success and evolution of
> > > Apache Flink.
> > >
> > > Best Regards,
> > >
> > > Tawfik Yasser
> > > Senior Teaching Assistant @ Nile University, Egypt
> > > Email: tyas...@nu.edu.eg
> > > LinkedIn: https://www.linkedin.com/in/tawfikyasser/
> > >
> >
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-06 Thread liu ron
Hi Tawfik,

Fast and slow streams in distributed scenarios cause the watermark to advance
too fast, which leads to lost data and is a headache in Flink.
Can't wait to read your research paper!

Best,
Ron

Yun Tang  wrote on Wednesday, September 6, 2023 at 14:46:

> Hi Tawfik,
>
> Thanks for offering such a proposal, looking forward to your research
> paper!
>
> You could also ask for edit permission on the Flink Improvement Proposals
> wiki and create a new proposal if you want to contribute this to the
> community yourself.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> Best
> Yun Tang
> 
> From: yuxia 
> Sent: Wednesday, September 6, 2023 12:31
> To: dev 
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
> Hi, Tawfik Yasser.
> Thanks for the proposal.
> It sounds exciting. I can't wait to read the research paper for more details.
>
> Best regards,
> Yuxia
>
> - Original Message -
> From: "David Morávek" 
> To: "dev" 
> Sent: Tuesday, September 5, 2023, 4:36:51 PM
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
> Hi Tawfik,
>
> It's exciting to see any ongoing research that tries to push Flink forward!
>
> To get the discussion started, can you please share your paper with the
> community? Assessing the proposal without further context is tough.
>
> Best,
> D.
>
> On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
> wrote:
>
> > Dear Apache Flink Development Team,
> >
> > I hope this email finds you well. I am writing to propose an exciting new
> > feature for Apache Flink that has the potential to significantly enhance
> > its capabilities in handling unbounded streams of events, particularly in
> > the context of event-time windowing.
> >
> > As you may be aware, Apache Flink has been at the forefront of Big Data
> > Stream processing engines, leveraging windowing techniques to manage
> > unbounded event streams effectively. The accuracy of the results obtained
> > from these streams relies heavily on the ability to gather all relevant
> > input within a window. At the core of this process are watermarks, which
> > serve as unique timestamps marking the progression of events in time.
> >
> > However, our analysis has revealed a critical issue with the current
> > watermark generation method in Apache Flink. This method, which operates
> at
> > the input stream level, exhibits a bias towards faster sub-streams,
> > resulting in the unfortunate consequence of dropped events from slower
> > sub-streams. Our investigations showed that Apache Flink's conventional
> > watermark generation approach led to an alarming data loss of
> approximately
> > 33% when 50% of the keys around the median experienced delays. This loss
> > further escalated to over 37% when 50% of random keys were delayed.
> >
> > In response to this issue, we have authored a research paper outlining a
> > novel strategy named "keyed watermarks" to address data loss and
> > substantially enhance data processing accuracy, achieving at least 99%
> > accuracy in most scenarios.
> >
> > Moreover, we have conducted comprehensive comparative studies to evaluate
> > the effectiveness of our strategy against the conventional watermark
> > generation method, specifically in terms of event-time tracking accuracy.
> >
> > We believe that implementing keyed watermarks in Apache Flink can greatly
> > enhance its performance and reliability, making it an even more valuable
> > tool for organizations dealing with complex, high-throughput data
> > processing tasks.
> >
> > We kindly request your consideration of this proposal. We would be eager
> > to discuss further details, provide the full research paper, or
> collaborate
> > closely to facilitate the integration of this feature into Apache Flink.
> >
> > Thank you for your time and attention to this proposal. We look forward
> to
> > the opportunity to contribute to the continued success and evolution of
> > Apache Flink.
> >
> > Best Regards,
> >
> > Tawfik Yasser
> > Senior Teaching Assistant @ Nile University, Egypt
> > Email: tyas...@nu.edu.eg
> > LinkedIn: https://www.linkedin.com/in/tawfikyasser/
> >
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-05 Thread Yun Tang
Hi Tawfik,

Thanks for offering such a proposal, looking forward to your research paper!

You could also ask for edit permission on the Flink Improvement Proposals wiki
and create a new proposal if you want to contribute this to the community
yourself.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Best
Yun Tang

From: yuxia 
Sent: Wednesday, September 6, 2023 12:31
To: dev 
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink

Hi, Tawfik Yasser.
Thanks for the proposal.
> It sounds exciting. I can't wait to read the research paper for more details.

Best regards,
Yuxia

- Original Message -
From: "David Morávek" 
To: "dev" 
Sent: Tuesday, September 5, 2023, 4:36:51 PM
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink

Hi Tawfik,

It's exciting to see any ongoing research that tries to push Flink forward!

To get the discussion started, can you please share your paper with the
community? Assessing the proposal without further context is tough.

Best,
D.

On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I am writing to propose an exciting new
> feature for Apache Flink that has the potential to significantly enhance
> its capabilities in handling unbounded streams of events, particularly in
> the context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Thank you for your time and attention to this proposal. We look forward to
> the opportunity to contribute to the continued success and evolution of
> Apache Flink.
>
> Best Regards,
>
> Tawfik Yasser
> Senior Teaching Assistant @ Nile University, Egypt
> Email: tyas...@nu.edu.eg
> LinkedIn: https://www.linkedin.com/in/tawfikyasser/
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-05 Thread yuxia
Hi, Tawfik Yasser.
Thanks for the proposal. 
It sounds exciting. I can't wait to read the research paper for more details.

Best regards,
Yuxia

- Original Message -
From: "David Morávek" 
To: "dev" 
Sent: Tuesday, September 5, 2023, 4:36:51 PM
Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink

Hi Tawfik,

It's exciting to see any ongoing research that tries to push Flink forward!

To get the discussion started, can you please share your paper with the
community? Assessing the proposal without further context is tough.

Best,
D.

On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I am writing to propose an exciting new
> feature for Apache Flink that has the potential to significantly enhance
> its capabilities in handling unbounded streams of events, particularly in
> the context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Thank you for your time and attention to this proposal. We look forward to
> the opportunity to contribute to the continued success and evolution of
> Apache Flink.
>
> Best Regards,
>
> Tawfik Yasser
> Senior Teaching Assistant @ Nile University, Egypt
> Email: tyas...@nu.edu.eg
> LinkedIn: https://www.linkedin.com/in/tawfikyasser/
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-05 Thread David Morávek
Hi Tawfik,

It's exciting to see any ongoing research that tries to push Flink forward!

To get the discussion started, can you please share your paper with the
community? Assessing the proposal without further context is tough.

Best,
D.

On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I am writing to propose an exciting new
> feature for Apache Flink that has the potential to significantly enhance
> its capabilities in handling unbounded streams of events, particularly in
> the context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Thank you for your time and attention to this proposal. We look forward to
> the opportunity to contribute to the continued success and evolution of
> Apache Flink.
>
> Best Regards,
>
> Tawfik Yasser
> Senior Teaching Assistant @ Nile University, Egypt
> Email: tyas...@nu.edu.eg
> LinkedIn: https://www.linkedin.com/in/tawfikyasser/
>