RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-23 Thread LINZ, Arnaud
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values Thanks for your inputs Gen and Arnaud. I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Gen Luo
g. > The heartbeat system right now is fairly binary. It works fine while the > configuration is suitable, until it no longer is and everything goes up in > flames. > > If we were to log warnings if Flink was close to hitting the heartbeat > timeout, or even expose metrics for heart

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Chesnay Schepler
thing goes up in flames. If we were to log warnings if Flink was close to hitting the heartbeat timeout, or even expose metrics for heartbeat round-trip times or similar, I think we could alleviate many concerns that people have. Or we just provide 2 configs with the distribution, one f
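The idea of warning before the heartbeat timeout is actually hit can be sketched roughly as follows. This is an illustration only, not Flink code: the class `HeartbeatWatch`, its `warn_fraction` parameter, and the status strings are all hypothetical names chosen for this example, and the 80% warning threshold is an arbitrary assumption.

```python
import time


class HeartbeatWatch:
    """Hypothetical helper: classifies the silence since the last heartbeat."""

    def __init__(self, timeout_s, warn_fraction=0.8, clock=time.monotonic):
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        if not 0 < warn_fraction < 1:
            raise ValueError("warn_fraction must be between 0 and 1")
        self.timeout_s = timeout_s
        # Silence longer than this (but shorter than the timeout) is a warning.
        self.warn_after_s = timeout_s * warn_fraction
        self._clock = clock
        self._last_heartbeat = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat response arrives."""
        self._last_heartbeat = self._clock()

    def status(self):
        """Return "OK", "WARN" (close to the timeout), or "TIMED_OUT"."""
        silence = self._clock() - self._last_heartbeat
        if silence >= self.timeout_s:
            return "TIMED_OUT"
        if silence >= self.warn_after_s:
            return "WARN"
        return "OK"
```

With a 50s timeout and the assumed 80% threshold, a target that has been silent for more than 40s would be reported as `WARN`, which gives operators a signal before the job actually fails.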

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread 刘建刚
only thing that can happen is that the > user code starves the heartbeat in terms of CPU cycles or causes a lot of > GC pauses. If you are observing the former problem, then we might think > about changing the priorities of the respective threads. This should then > improve Flink's

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
ve Flink's stability for these workloads and a shorter heartbeat timeout should be possible. Also for the RAM-cached repositories, what exactly is causing the heartbeat to time out? Is it because you have a lot of GC or that the heartbeat thread does not get enough CPU cycles? Cheers, Till

RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread LINZ, Arnaud
) Cheers, Arnaud From: Gen Luo Sent: Thursday, 22 July 2021 05:46 To: Till Rohrmann Cc: Yang Wang; dev; user Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values Hi, Thanks for driving this @Till Rohrmann. I would give

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Gen Luo
Hi, Thanks for driving this @Till Rohrmann. I would give +1 on reducing the heartbeat timeout and interval, though I'm not sure if 15s and 3s would be enough either. IMO, except for the standalone cluster, where Flink's heartbeat mechanism is relied on entirely, reducing the heartbeat can

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Till Rohrmann
Thanks for sharing these insights. I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details. Given the GC pauses, would you then be ok with decreasing the heartbeat timeout to 20 seconds? This should give enough

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Yang Wang
Thanks @Till Rohrmann for starting this discussion. First, I am trying to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of TaskManager failures faster. However, it seems that only the standalone cluster could benefit from this. For Yarn and native Kubernetes

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-20 Thread Robert Metzger
ernetes. I also consider this change somewhat low-risk, because we can provide a quick fix for people running into problems. [1] https://flink.apache.org/2021/05/06/reactive-mode.html On Fri, Jul 16, 2021 at 7:05 PM Till Rohrmann wrote: > Hi everyone, > > Since Flink 1.5 we have the sa

[DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-16 Thread Till Rohrmann
Hi everyone, Since Flink 1.5 we have the same heartbeat timeout and interval default values that are defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of
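As a rough illustration of what these values mean in practice, the sketch below computes how many whole heartbeat intervals fit into one timeout window. It is plain arithmetic, not Flink code; the function name `heartbeat_intervals_per_timeout` is hypothetical. The 50s/10s pair is the default quoted above, and the 15s/3s pair is the one mentioned in Gen Luo's reply in this thread.

```python
def heartbeat_intervals_per_timeout(timeout_s, interval_s):
    """Return how many whole heartbeat intervals fit into one timeout window."""
    if timeout_s <= 0 or interval_s <= 0:
        raise ValueError("timeout and interval must be positive")
    if interval_s >= timeout_s:
        raise ValueError("interval must be shorter than timeout")
    return int(timeout_s // interval_s)


# Defaults since Flink 1.5, as quoted in the message above.
print(heartbeat_intervals_per_timeout(50, 10))
# Shorter values discussed in the thread.
print(heartbeat_intervals_per_timeout(15, 3))
```

Both pairs keep the same 5:1 ratio between timeout and interval, so the shorter values reduce detection time without changing how many heartbeats may go unanswered before the timeout fires.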

Re: Heartbeat Timeout

2021-05-28 Thread Robert Cullen
Matthias, I increased the JVM Heap size as Jan suggested and it appears to be a memory leak in the user code (although I'm not sure why since this is a simple job that uses a loop to simulate data being written to an S3 data store). Yes, the logs show no apparent problem but the timestamp corresp

Re: Heartbeat Timeout

2021-05-28 Thread Matthias Pohl
Hi Robert, increasing heap memory usage could be due to some memory leak in the user code. Have you analyzed a heap dump? About the TM logs you shared: I don't see anything suspicious there. Nothing about memory problems. Are those the correct logs? Best, Matthias On Thu, May 27, 2021 at 6:01 PM

Re: Heartbeat Timeout

2021-05-27 Thread wangwj
Hi, I have encountered the same problem. Check the GC log and the jstack output, and you should be able to resolve it. Good luck.

Re: Heartbeat Timeout

2021-05-27 Thread JING ZHANG
Hi Robert Cullen, 1. You may solve the problem by increasing the timeout via `heartbeat.timeout` [1]; however, I do not recommend this because it will hide the real problem. 2. Please find the GC log and the log of the timed-out TaskManager (10.42.0.49:6122-e26293) and check: 1. is there a memory leak

Re: Heartbeat Timeout

2021-05-27 Thread Yangze Guo
Hi Robert, To mitigate this issue, you can increase "heartbeat.interval" and "heartbeat.timeout". However, I think we should first figure out the root cause. Could you provide the log of 10.42.0.49:6122-e26293? Best, Yangze Guo On Thu, May 27, 2021 at 10:44 PM Robert Cullen wrote: >

Re: Heartbeat Timeout

2021-05-27 Thread Jan Brusch
Hi Robert, that sounds like a case of either your application state ultimately being bigger than the available RAM or a memory leak in your application (e.g., some states are not properly cleaned out after they are not needed anymore). If you have the available resources you could try and in

Re: Heartbeat Timeout

2021-05-27 Thread Robert Cullen
Hello Jan, My Flink cluster is running on a single-node Kubernetes cluster (rke). I have the JVM Heap Size set at 2.08 GB and the Managed Memory at 2.93 GB. The TaskManager reaches the max JVM Heap Size after about one hour and then fails. Here is a snippet from the TaskManager log: 2021-05-27 15:36:36,040 IN

Fwd: Heartbeat Timeout

2021-05-27 Thread Jan Brusch
Hi Robert, do you have some additional info? For example, the last log message of the unreachable TaskManagers. Is the job running in Kubernetes? What backend are you using? At first glance, I have seen this behaviour mostly in cases where one or more TaskManagers shut down due to

Heartbeat Timeout

2021-05-27 Thread Robert Cullen
I have a job that fails after about 1 hour due to a TaskManager timeout. How can I prevent this from happening? 2021-05-27 10:24:21 org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailu

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
everal HDFS operations in the rpc > main thread when preparing the TM context, which may block the main thread > when HDFS is slow. > > Unfortunately, I don't see any out-of-box approach that fixes the problem > at the moment, except for increasing the heartbeat timeout. >

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Paul Lam
> main thread when preparing the TM context, which may block the main thread >> when HDFS is slow. >> >> Unfortunately, I don't see any out-of-box approach that fixes the problem at >> the moment, except for increasing the heartbeat timeout. >> >> >>

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Paul Lam
> Indeed, Flink's RM currently performs several HDFS operations in the rpc main > thread when preparing the TM context, which may block the main thread when > HDFS is slow. > > Unfortunately, I don't see any out-of-box approach that fixes the problem at > the moment, except for

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
erforms several HDFS operations in the rpc > main thread when preparing the TM context, which may block the main thread > when HDFS is slow. > > Unfortunately, I don't see any out-of-box approach that fixes the problem > at the moment, except for increasing the heartbeat timeout. >

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Xintong Song
the moment, except for increasing the heartbeat timeout. As for the long-term solution, I think there's an easier approach. We can move the creation of the TM contexts away from the rpc main thread. Ideally, we should try to avoid performing any heavy operations which do not modify the RM's intern

TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Paul Lam
Hi, Even after FLINK-13184 was implemented (even with Flink 1.11), jobs with high parallelism still occasionally get TM-RM heartbeat timeouts when the RM is busy creating TM contexts during cluster initialization and HDFS is slow at that moment. Apart from increasing the TM heartbeat

Frequent Heartbeat timeout

2019-02-11 Thread sohimankotia
Hi, I am using Flink 1.5.5. I have a streaming job with a parallelism of 25 * 6 (150). I am seeing very frequent heartbeat timeouts, even during off-peak hours (to rule out memory issues). I also enabled debug logs for Flink and observed that a heartbeat request is triggered every 5 seconds. * Flink