RE: Windows and data loss.

2021-12-01 Thread Schwalbe Matthias
Hi John,

Sorry for the delay … I’m a little tight on spare time for user@flink currently.
If you are still interested we could pick up the discussion and continue.
However, I don’t exactly understand what you want to achieve:

  1.  Would processing time windows be enough for you (with misplacement of
events into the wrong window acceptable)?
  2.  Do you want to use event time windows, but cannot afford to lose late
events? (We can work out a scheme so that this works.)
  3.  How do you currently organize your input events in Kafka?
     *   One event per log row?
     *   Is the Kafka-event timestamp extracted from/per the log row?
     *   You mentioned shuffling (random assignment) to Kafka partitions:
         i.   Is this per log row, or per log file?
         ii.  Do you Kafka-key by log file, or even by log application?
     *   Do you select log files to be collected in file-timestamp order?
  4.  I assume your windows are keyed by application, or do you use another
keyBy()?
  5.  What watermarking strategy did you configure?
     *   You mentioned that watermarks advance even if file-ingress is blocked.
     *   Can you publish/share the 3 odd lines of code for your watermark
strategy setup? (A typical shape is sketched after this list.)

As said before, ignoring late events is the default strategy; it can be
adjusted by means of a custom window trigger, which trades off latency,
state size, and correctness of the final results.
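
For comparison, a watermark strategy setup in the DataStream API is typically
only a few lines; a minimal sketch (the LogEvent type and its timestamp field
are placeholders, not your actual code):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// The watermark trails the highest timestamp seen so far by 5 minutes
// (bounded out-of-orderness).
WatermarkStrategy<LogEvent> watermarks = WatermarkStrategy
        .<LogEvent>forBoundedOutOfOrderness(Duration.ofMinutes(5))
        .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis)
        // Without an idleness setting, a single idle Kafka partition
        // stalls the watermark for the whole job.
        .withIdleness(Duration.ofMinutes(1));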

Thias

Re: Windows and data loss.

2021-11-26 Thread John Smith
Or, as an example: we have a 5-minute window and a lateness of 5 minutes.

We have the following events in the logs
10:00:01 PM > Already pushed to Kafka
10:00:30 PM > Already pushed to Kafka
10:01:00 PM > Already pushed to Kafka
10:03:45 PM > Already pushed to Kafka
10:04:00 PM > Log agent crashed for 30 minutes, not delivered to Kafka yet
10:05:10 PM > Pushed to Kafka because I came from a log agent that isn't dead.

Flink window of 10:00:00
10:00:01 PM > Received
10:00:30 PM > Received
10:01:00 PM > Received
10:03:45 PM > Received
10:04:00 PM > Still nothing

Flink window of 10:00:00: the 5 minutes of lateness are up.
10:00:01 PM > Counted
10:00:30 PM > Counted
10:01:00 PM > Counted
10:03:45 PM > Counted
10:04:00 PM > Still nothing

Flink window of 10:05:00 started
10:05:10 PM > I'm new because I came from a log agent that isn't dead.
10:04:00 PM > Still nothing

Flink window of 10:05:00: the 5 minutes of lateness are up.
10:05:10 PM > I have been counted, I'm happy!
10:04:00 PM > Still nothing

And so on...

Flink window of 10:30:00 started
10:04:00 PM > Hi guys, sorry I'm late 30 minutes, I ran into log agent
problems. Sorry, you're late: you missed the Flink bus.
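
For what it's worth, the scenario above corresponds roughly to the following
setup, in which the 10:04:00 PM event would land in a side output instead of
being silently dropped (a sketch; events, LogEvent, and CountAggregate are
placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so the element type survives generic type erasure.
final OutputTag<LogEvent> lateTag = new OutputTag<LogEvent>("late-events") {};

SingleOutputStreamOperator<Long> counts = events
        .keyBy(e -> e.application)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .allowedLateness(Time.minutes(5))   // keep window state 5 extra minutes
        .sideOutputLateData(lateTag)        // anything later than that goes here
        .aggregate(new CountAggregate());

// The events that "missed the Flink bus", recoverable instead of lost:
DataStream<LogEvent> missedTheBus = counts.getSideOutput(lateTag);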


Re: Windows and data loss.

2021-11-26 Thread John Smith
Ok,

So with processing time we get 100% accuracy, because we don't care when the
event comes; we just count and move along.

As for event time processing, what I meant to say is that if, for example, the
log shipper is late pushing events into Kafka, Flink will not notice
this; the watermarks will keep advancing. So given that, let's say we
have a window of 5 minutes and a lateness of 5 minutes: it means we will
see counts on the "dashboard" every 10 minutes. But say the log shipper
fails/falls behind for 30 minutes or more. The Flink Kafka consumer will
simply not see any events and will continue chugging along; after 30
minutes a late event comes in, already two windows too late, and that event is
discarded.

Or did I miss the point on the last part?




RE: Windows and data loss.

2021-11-26 Thread Schwalbe Matthias
Actually not, because processing time does not matter at all.
Event-time timers are always compared against watermark-time progress.
If the system happens to be compromised for (say) 4 hours, the watermarks won’t
progress either; hence the windows are not evicted and wait for the watermarks
to pick up from where the system crashed.

Your watermark strategy can decide how strictly you handle time progress:

  *   Super strict: the watermark time indicates that there will be no events
with an older timestamp
  *   Semi strict: you accept late events and give a time range within which
this can happen (processing time still put aside)
     *   You need to configure acceptable lateness in your windowing operator
     *   Accepted lateness implies higher overall latency
  *   Custom strategy
     *   Use a combination of accepted lateness and a custom trigger in your
windowing operator
     *   The trigger decides when and how often window results are emitted
     *   The downstream operator would then probably implement some
idempotence/updating scheme for the window values
     *   This way you get immediate low-latency results and allow for later
corrections if late events arrive

My favorite source on this is Tyler Akidau’s book [1] and the excerpt blog
posts [2] [3].
I believe his code uses Beam, but the same ideas can be implemented directly
with the Flink API.
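
For illustration, a trigger along these lines might look like the sketch
below. It fires once at the window end and then once more for each late
element that arrives within the allowed lateness, so the downstream operator
can update its results. This mirrors what Flink's built-in EventTimeTrigger
already does; a custom variant could instead batch or throttle the late
firings:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class FireOnLateTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) {
        if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
            // Late element: emit a corrected result immediately.
            return TriggerResult.FIRE;
        }
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        // First, on-time firing when the watermark passes the window end.
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE; // processing time plays no role here
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}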

[1] https://www.oreilly.com/library/view/streaming-systems/9781491983867/
[2] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
[3] https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

… happy to discuss further 😊

Thias




Re: Windows and data loss.

2021-11-26 Thread John Smith
But if we use event time and a failure happens, potentially those events
can't be delivered within their window; they will be dropped if they come
after the lateness and watermark settings, no?



RE: Windows and data loss.

2021-11-25 Thread Schwalbe Matthias
Hi John,

Going with processing time is perfectly sound if the results meet your
requirements and you can easily live with events misplaced into the wrong time
window.
It is also quite a bit cheaper resource-wise.
However, you might want to keep in mind situations when things break down
(network interrupt, datacenter flooded, etc. 😊). With processing time, events
count into the time window in which they are processed; with event time, they
count into the time window in which they were originally created at the source
… even if processed much later …
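
In code, the difference is just the window assigner; a sketch (stream and the
key are placeholders):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Processing time: an event counts into whichever window is open
// when the event happens to be processed.
stream.keyBy(e -> e.application)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(5)));

// Event time: an event counts into the window its own timestamp
// belongs to, even if it is processed much later.
stream.keyBy(e -> e.application)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)));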

Thias




Re: Windows and data loss.

2021-11-25 Thread John Smith
Well, what I'm thinking for 100% accuracy / no data loss is just to base the
count on processing time. So whatever arrives in that window is counted. If
I get some events of the "current" window late and they go into another
window, it's OK.

My pipeline is like so:

browser(user)->REST API-->log file-->Filebeat-->Kafka (18
partitions)->flink->destination

Filebeat inserts into Kafka; it's kind of a big bucket of "logs" which I use
Flink to filter for the specific app and do the counts. The logs are round
robined into the topic/partitions. Where I foresee a delay is if Filebeat
can't push fast enough into Kafka AND/OR the Flink consumer has not read all
events for that window from all partitions.
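
For reference, consuming that topic with per-partition watermarking might look
like the sketch below (assuming the newer KafkaSource; the servers, topic,
group id, and the watermarks strategy are placeholders):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setTopics("filebeat-logs")
        .setGroupId("app-counts")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// Watermarks are generated per Kafka partition and merged (minimum wins),
// so one slow or idle partition holds back the overall event-time clock.
DataStream<String> logLines = env.fromSource(source, watermarks, "filebeat-logs");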



RE: Windows and data loss.

2021-11-25 Thread Schwalbe Matthias
Hi John,

… just a short hint:
With the datastream API you can

  *   hand-craft a trigger that decides when and how often to emit
intermediate, punctual, and late window results, and when to evict the window
and stop processing late events
  *   in order to process late events you also need to specify for how long
you will extend the window processing (or is that done in the trigger … I
don’t remember right now)
  *   overall window state grows if you extend window processing to after it
is finished …
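
Wired together it would look roughly like this (a sketch; the trigger and
aggregate are placeholders, and it is allowedLateness() on the windowed
stream, not the trigger, that extends the window's lifetime):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

events.keyBy(e -> e.application)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .trigger(new MyEarlyAndLateTrigger()) // decides when/how often to emit
      .allowedLateness(Time.minutes(5))     // how long window state is kept
      .aggregate(new CountAggregate());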

Hope this helps 😊

Thias



Re: Windows and data loss.

2021-11-25 Thread John Smith
Thanks. Using the DataStream API.



Re: Windows and data loss.

2021-11-24 Thread Caizhi Weng
Hi!

Are you using the DataStream API or the Table / SQL API? I don't know if
the DataStream API has this functionality, but in the Table / SQL API we have
the following configurations [1].

   - table.exec.emit.late-fire.enabled: Emit window results for late
   records;
   - table.exec.emit.late-fire.delay: How often shall we emit results for
   late records (for example, once per 10 minutes or for every record).


[1]
https://github.com/apache/flink/blob/601ef3b3bce040264daa3aedcb9d98ead8303485/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/plan/utils/WindowEmitStrategy.scala#L214
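
For reference, these would be set on the table environment's configuration,
roughly like so (a sketch; note these emit options come from the blink planner
and are not part of the stable, documented configuration):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// Re-fire window results when late records arrive, at most once per 10 minutes.
tEnv.getConfig().getConfiguration().setString("table.exec.emit.late-fire.enabled", "true");
tEnv.getConfig().getConfiguration().setString("table.exec.emit.late-fire.delay", "10 min");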

John Smith wrote on Thu, 25 Nov 2021 at 12:45 AM:

> Hi, I understand that when using windows, with the watermark and
> lateness configs set, an event that comes too late is lost, though we can
> output it to a side output.
>
> But I'm wondering: is there a way to do it without the loss?
>
> I'm guessing an "all" window with a custom trigger that just fires every X
> period, and whatever is in that bucket is in that bucket?