Hi Mason,

In Flink 1.14 we also changed the timeout behavior: instead of checking
the alignment duration, it now simply checks how old the checkpoint
barrier is (so it also accounts for the start delay) [1]. This was done
to solve problems like the one you are describing. Unfortunately, we
cannot backport this change to 1.13.x, as it is a breaking change.
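
To illustrate the difference, here is a minimal sketch in plain Java. It is
not Flink code: the class and method names are hypothetical, and the barrier
age is simplified to "start delay + alignment duration". It only shows why a
high start delay never trips the old check but does trip the new one.

```java
import java.time.Duration;

// Hypothetical model of the two timeout checks; not part of the Flink API.
final class TimeoutSemantics {

    // Up to 1.13.x: only the time spent aligning is compared to the timeout.
    static boolean switchesByAlignmentDuration(
            Duration startDelay, Duration alignment, Duration timeout) {
        return alignment.compareTo(timeout) > 0;
    }

    // From 1.14: the age of the barrier is compared to the timeout.
    // Assumption: age is approximated here as start delay + alignment.
    static boolean switchesByBarrierAge(
            Duration startDelay, Duration alignment, Duration timeout) {
        return startDelay.plus(alignment).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: long start delay, short alignment.
        Duration startDelay = Duration.ofSeconds(30);
        Duration alignment = Duration.ofMillis(200);
        Duration timeout = Duration.ofSeconds(1);

        // Old check: 200ms is not above 1s, so the checkpoint stays aligned.
        System.out.println(switchesByAlignmentDuration(startDelay, alignment, timeout));
        // New check: 30.2s is above 1s, so the checkpoint becomes unaligned.
        System.out.println(switchesByBarrierAge(startDelay, alignment, timeout));
    }
}
```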

Anyway, from our experience I would recommend going all in with
unaligned checkpoints, i.e. setting the timeout back to its default value
of 0ms. With a timeout you gain very little (a slightly smaller state
size when there is no backpressure; only slightly, because without
backpressure, even with the timeout set to 0ms, the amount of captured
in-flight data is basically insignificant), while in practice you slow
down the propagation of checkpoint barriers by quite a lot.
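
As a rough sketch of that setup in the DataStream API (assuming Flink 1.13 on
the classpath; the class name and the 60s checkpoint interval are only
placeholders, and as far as I know setAlignmentTimeout is the 1.13 name of
the setter, with newer versions calling it setAlignedCheckpointTimeout):

```java
import java.time.Duration;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class; only the checkpoint settings matter here.
public class UnalignedCheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder interval; use whatever your job already has.
        env.enableCheckpointing(60_000L);

        // Go all in on unaligned checkpoints.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // 0 is the default, so this just undoes a custom timeout such as 1s.
        env.getCheckpointConfig().setAlignmentTimeout(Duration.ZERO);

        // Job definition and env.execute(...) omitted.
    }
}
```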

Best,
Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-23041

On Tue, 14 Dec 2021 at 22:04, Mason Chen <mas.chen6...@gmail.com> wrote:

> Hi all,
>
> I'm using Flink 1.13 and my job is experiencing high start delay, more so
> than high alignment time (our FLIP-27 Kafka source is heavily
> backpressured). Since our alignment timeout is set to 1s, the unaligned
> checkpoint never triggers, because the alignment delay is always below the
> threshold.
>
> It seems there is only a configuration for the alignment timeout, but should
> there also be one for a start delay timeout?
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/#aligned-checkpoint-timeout
>
> I'm interested to know the reasoning why there isn't a timeout for start
> delay as well--was it because it was deemed too complex for the user to
> configure two parameters for unaligned checkpoints?
>
> I'm aware of buffer debloating in 1.14 that could help but I'm trying to
> see how far unaligned checkpointing can take me.
>
> Best,
> Mason
>
