+1 to this change!

While working on the reactive mode blog post [1], I ran into this issue as
well. It leads to a poor "out of the box" experience when scaling down.
For my experiments I chose a heartbeat timeout of 8 seconds, and the cluster
has been running on Kubernetes for 76 days so far.
I also consider this change fairly low-risk, because we can provide a
quick fix for people who run into problems.
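
As a rough sketch of what such a quick fix could look like, assuming the new
defaults ship as proposed: anyone who sees spurious heartbeat timeouts could
pin the previous values in flink-conf.yaml. The keys are the ones named in the
proposal; I am assuming here that the values are given in milliseconds, so
please check this against the configuration docs of your Flink version.

```yaml
# flink-conf.yaml (sketch, not an official recommendation)
# Restore the pre-change defaults of 50s timeout / 10s interval.
# Assumption: both options take plain millisecond values.
heartbeat.timeout: 50000
heartbeat.interval: 10000
```

Both the old and the proposed defaults keep the timeout at five times the
interval, so a custom setting probably wants to preserve a similar ratio.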

[1] https://flink.apache.org/2021/05/06/reactive-mode.html


On Fri, Jul 16, 2021 at 7:05 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi everyone,
>
> Since Flink 1.5 we have had the same default values for the heartbeat
> timeout and interval: heartbeat.timeout: 50s and heartbeat.interval: 10s.
> These values were mainly chosen to compensate for lengthy GC pauses
> and blocking operations that were executed in the main threads of Flink's
> components. Since then, the JVM's GCs have advanced considerably and we
> have also removed many of the blocking calls that were executed in the
> main thread. Moreover, a long heartbeat.timeout causes long recovery
> times in case of a TaskManager loss because the system can only properly
> recover after the dead TaskManager has been removed from the scheduler.
> Hence, I wanted to propose to change the timeout and interval to:
>
> heartbeat.timeout: 15s
> heartbeat.interval: 3s
>
> Since there is no perfect solution that fits all use cases, I would really
> like to hear from you what you think about it and how you configure these
> heartbeat options. Based on your experience we might actually come up with
> better default values that allow us to be resilient but also to detect
> failed components fast. FLIP-185 can be found here [1].
>
> [1] https://cwiki.apache.org/confluence/x/GAoBCw
>
> Cheers,
> Till
>
