Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default
values
Thanks for your input, Gen and Arnaud.
I do agree with you, Gen, that we need better guidance for our users on when to
change the heartbeat configuration. I think this should happen in any case. I
am, however
>> > system is under heavy load they may block more than a few seconds, and having our app
>> > killed because of a short timeout is not an option.
>> >
>> > That’s why I’m not in favor of very short timeouts… Because
> *From:* Gen Luo <luogen...@gmail.com>
> *Sent:* Thursday, 22 July 2021 05:46
> *To:* Till Rohrmann <trohrm...@apache.org>
> *Cc:* Yang Wang <danrtsey...@gmail.com>; dev <dev@flink.apache.org>;
> …understand that normally, as user code is not a JVM-blocking activity such
> as a GC, it should have no impact on heartbeats, but from experience, it
> really does)
>
>
>
> Cheers,
>
> Arnaud
From: Gen Luo
Sent: Thursday, 22 July 2021 05:46
To: Till Rohrmann
Cc: Yang Wang; dev; user
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default
values
Hi,
Thanks for driving this, @Till Rohrmann. I would
give +1 on reducing the heartbeat timeout and interval, though I'm not sure
if 15s and 3s would be enough either.
IMO, except for the standalone cluster, where Flink relies entirely on its
heartbeat mechanism, reducing the heartbeat can also
Thanks for sharing these insights.
I think it is no longer true that the ResourceManager notifies the
JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
Given the GC pauses, would you then be ok with decreasing the heartbeat
timeout to 20 seconds? This should give enough
Thanks @Till Rohrmann for starting this discussion.
First, I am trying to understand the benefit of a shorter heartbeat timeout.
IIUC, it will make the JobManager aware of a lost TaskManager faster. However,
it seems that only the standalone cluster could benefit from this. For Yarn and
native Kubernetes
+1 to this change!
When I was working on the reactive mode blog post [1] I also ran into this
issue, leading to a poor "out of the box" experience when scaling down.
For my experiments, I've chosen a timeout of 8 seconds, and the cluster has
been running for 76 days (so far) on Kubernetes.
I also
Hi everyone,
Since Flink 1.5, the default heartbeat timeout and interval have stayed the
same: heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were
mainly chosen to compensate for lengthy GC pauses and blocking operations
that were executed in the main threads of