RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-23 Thread LINZ, Arnaud
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values Thanks for your inputs, Gen and Arnaud. I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Gen Luo
system is under heavy load they may block more than a few seconds, and having our app killed because of a short timeout is not an option. That’s why I’m not in favor of very short timeouts… Because

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Chesnay Schepler
t; > > *De :* Gen Luo mailto:luogen...@gmail.com>> > *Envoyé :* jeudi 22 juillet 2021 05:46 > *À :* Till Rohrmann mailto:trohrm...@apache.org>> > *Cc :* Yang Wang mailto:danrtsey...@gmail.com>>; dev mailto:dev@flink.apache.org>>;

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread 刘建刚
mpact on heartbeats, but from experience, it really does) Cheers, Arnaud *From:* Gen Luo *Sent:* Thursday, 22 July 2021 05:46 *To:* Till Rohrmann

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
erstand that normally, as user code is not a JVM-blocking activity such as a GC, it should have no impact on heartbeats, but from experience, it really does) Cheers, Arnaud *From:* Gen Luo *Sent:* Thursday, 22 July 2021 05:46

RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread LINZ, Arnaud
) Cheers, Arnaud From: Gen Luo Sent: Thursday, 22 July 2021 05:46 To: Till Rohrmann Cc: Yang Wang; dev; user Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values Hi, Thanks for driving this, @Till Rohrmann <trohrm...@apache.org>. I would g

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Gen Luo
Hi, Thanks for driving this, @Till Rohrmann. I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure if 15s and 3s would be enough either. IMO, except for the standalone cluster, where the heartbeat mechanism in Flink is relied on entirely, reducing the heartbeat can also

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Till Rohrmann
Thanks for sharing these insights. I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details. Given the GC pauses, would you then be ok with decreasing the heartbeat timeout to 20 seconds? This should give enough

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Yang Wang
Thanks, @Till Rohrmann, for starting this discussion. Firstly, I am trying to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of TaskManager failures faster. However, it seems that only the standalone cluster could benefit from this. For Yarn and native Kubernetes

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Robert Metzger
+1 to this change! When I was working on the reactive mode blog post [1] I also ran into this issue, leading to a poor "out of the box" experience when scaling down. For my experiments, I've chosen a timeout of 8 seconds, and the cluster has been running for 76 days (so far) on Kubernetes. I also

[DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-16 Thread Till Rohrmann
Hi everyone, Since Flink 1.5 we have the same heartbeat timeout and interval default values that are defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of
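For readers who want to try shorter values before any default changes, here is a minimal configuration sketch. It assumes the 3s interval and 15s timeout mentioned elsewhere in this thread as candidate values, and it assumes the settings go into the cluster's Flink configuration file with values given in milliseconds. Treat the numbers as a starting point to test against your own GC pauses and load, not as a recommendation.

```yaml
# Sketch only: candidate values taken from the discussion, not confirmed defaults.
# Both options are expressed in milliseconds.

# How often heartbeat requests are sent (current default: 10000, i.e. 10s).
heartbeat.interval: 3000

# How long to wait for a heartbeat before the peer is considered lost
# (current default: 50000, i.e. 50s).
heartbeat.timeout: 15000
```

With these values, as with the current defaults, the timeout spans five heartbeat intervals. Deployments that see long GC pauses or heavy I/O load, as described in the replies, may need a larger timeout.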