Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
Thanks for your inputs Gen and Arnaud.
I do agree with you, Gen, that we need better guidance for our users on when to
change the heartbeat configuration. I think this should happen in any case. I
am, however, …
> The heartbeat system right now is fairly binary. It works fine while the
> configuration is suitable, until it no longer is and everything goes up in
> flames.
>
> If we were to log warnings if Flink was close to hitting the heartbeat
> timeout, or even expose metrics for heartbeat round-trip times or
> similar, I think we could alleviate many concerns that people have.
>
> Or we just provide 2 configs with the distribution, one for …
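To make the warning idea above concrete, here is a minimal generic sketch
(not Flink's actual heartbeat implementation; the class name and the 80%
warning threshold are invented for illustration) that records heartbeat
round-trip times and warns once they approach the configured timeout:

    import java.util.concurrent.atomic.AtomicLong;

    public class HeartbeatRoundTripMonitor {
        private final long timeoutMillis;        // the configured heartbeat timeout
        private final long warnThresholdMillis;  // warn at 80% of the timeout
        private final AtomicLong worstRoundTrip = new AtomicLong();

        public HeartbeatRoundTripMonitor(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
            this.warnThresholdMillis = (long) (timeoutMillis * 0.8);
        }

        // called with the measured round-trip time of each heartbeat
        public void onHeartbeatRoundTrip(long roundTripMillis) {
            worstRoundTrip.accumulateAndGet(roundTripMillis, Math::max);
            if (roundTripMillis > warnThresholdMillis) {
                System.err.printf(
                    "WARN heartbeat round-trip %d ms is close to the timeout of %d ms%n",
                    roundTripMillis, timeoutMillis);
            }
        }

        // the worst observed round-trip could equally be exposed as a metric
        public long worstRoundTripMillis() {
            return worstRoundTrip.get();
        }
    }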
… the only thing that can happen is that the user code starves the heartbeat
in terms of CPU cycles or causes a lot of GC pauses. If you are observing the
former problem, then we might think about changing the priorities of the
respective threads. This should then improve Flink's stability for these
workloads and a shorter heartbeat timeout should be possible.
Also for the RAM-cached repositories, what exactly is causing the heartbeat
to time out? Is it because you have a lot of GC, or because the heartbeat
thread does not get enough CPU cycles?
Cheers,
Till
Cheers,
Arnaud
From: Gen Luo
Sent: Thursday, July 22, 2021 05:46
To: Till Rohrmann
Cc: Yang Wang; dev; user
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
Hi,
Thanks for driving this @Till Rohrmann<mailto:trohrm...@apache.org>. I would
give +1 on reducing the heartbeat timeout and interval, though I'm not sure
if 15s and 3s would be enough either.
IMO, except for the standalone cluster, where Flink relies entirely on its
own heartbeat mechanism, reducing the heartbeat can …
Thanks for sharing these insights.
I think it is no longer true that the ResourceManager notifies the
JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
Given the GC pauses, would you then be ok with decreasing the heartbeat
timeout to 20 seconds? This should give enough …
Thanks @Till Rohrmann for starting this discussion.
Firstly, I try to understand the benefit of a shorter heartbeat timeout.
IIUC, it will make the JobManager aware of TaskManager failures faster.
However, it seems that only the standalone cluster could benefit from this.
For Yarn and native Kubernetes …
I also consider this change somewhat low-risk, because we can provide a
quick fix for people running into problems.
[1] https://flink.apache.org/2021/05/06/reactive-mode.html
On Fri, Jul 16, 2021 at 7:05 PM Till Rohrmann wrote:
Hi everyone,
Since Flink 1.5 we have the same heartbeat timeout and interval default
values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
10s. These values were mainly chosen to compensate for lengthy GC pauses
and blocking operations that were executed in the main threads of Flink's
components …
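For reference, these defaults map onto the following flink-conf.yaml entries
(duration syntax as written in the mail; older Flink versions expect plain
milliseconds, e.g. 50000):

    # flink-conf.yaml -- the defaults discussed in this thread
    heartbeat.timeout: 50s     # how long until a silent TM/JM is considered lost
    heartbeat.interval: 10s    # how often heartbeats are sent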
Matthias, I increased the JVM Heap size as Jan suggested and it appears to
be a memory leak in the user code (although I'm not sure why since this is
a simple job that uses a loop to simulate data being written to an S3 data
store). Yes, the logs show no apparent problem but the timestamp
corresponds …
Hi Robert,
increasing heap memory usage could be due to some memory leak in the user
code. Have you analyzed a heap dump (see the command sketched after this
message)? About the TM logs you shared: I don't see anything suspicious
there, nothing about memory problems. Are those the correct logs?
Best,
Matthias
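(A heap dump of the TaskManager JVM can be captured with standard JDK
tooling; the PID and file name below are placeholders.)

    # take a heap dump of the TaskManager JVM, then open it in Eclipse MAT or VisualVM
    jmap -dump:live,format=b,file=taskmanager-heap.hprof 12345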
On Thu, May 27, 2021 at 6:01 PM …
Hi, I have encountered the same problem.
Check the GC log and jstack output (see the commands sketched after this
message); that should help you resolve it.
Good luck
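(A generic sketch of those two checks with standard JDK tools; the PID 12345
is a placeholder.)

    # print GC utilization every second to spot long or frequent collections
    jstat -gcutil 12345 1000
    # dump all thread stacks to see whether the heartbeat threads are blocked
    jstack 12345 > taskmanager-threads.txt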
Hi Robert Cullen,
1. You may work around the problem by increasing the timeout via
`heartbeat.timeout` [1] (a sketch follows this list); however, I do not
recommend this, because it will hide the real problem.
2. Please find the GC log and the log of the timed-out TaskManager
(10.42.0.49:6122-e26293) and check: (1) is there a memory leak …
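(A hedged example of that stop-gap in flink-conf.yaml; the values are
illustrative, not a recommendation.)

    # flink-conf.yaml -- raise the timeout only while debugging the root cause
    heartbeat.timeout: 120s
    heartbeat.interval: 10s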
Hi Robert,
To mitigate this issue, you can increase the "heartbeat.interval" and
"heartbeat.timeout". However, I think we should first figure out the
root cause; would you like to provide the log of
10.42.0.49:6122-e26293?
Best,
Yangze Guo
On Thu, May 27, 2021 at 10:44 PM Robert Cullen wrote:
Hi Robert,
that sounds like a case of either your application state ultimately
being bigger than the available RAM or a memory leak in your application
(e.g., some states are not properly cleaned out after they are not
needed anymore).
If you have the available resources you could try and increase …
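(A sketch of raising the TaskManager memory in flink-conf.yaml; the sizes are
placeholders, the option names are the standard Flink 1.10+ memory options.)

    # flink-conf.yaml -- give the TM more heap and managed memory (illustrative)
    taskmanager.memory.task.heap.size: 3g
    taskmanager.memory.managed.size: 3g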
Hello Jan,
My flink cluster is running on a kubernetes single node (rke). I have the
JVM Heap Size set at 2.08 GB and the Managed Memory at 2.93 GB. The
TaskManager reaches the max JVM Heap Size after about one hour and then
fails. Here is a snippet from the TaskManager log:
2021-05-27 15:36:36,040 IN…
Hi Robert,
do you have some additional info? For example the last log message of
the unreachable TaskManagers. Is the Job running in kubernetes? What
backend are you using?
From the first look of it, I have seen this behaviour mostly in cases
where one or more taskmanagers shut down due to …
I have a job that fails after about 1 hour due to a TaskManager timeout. How
can I prevent this from happening?
2021-05-27 10:24:21
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailu…
> Indeed, Flink's RM currently performs several HDFS operations in the rpc
> main thread when preparing the TM context, which may block the main thread
> when HDFS is slow.
>
> Unfortunately, I don't see any out-of-box approach that fixes the problem
> at the moment, except for increasing the heartbeat timeout.
As for the long-run solution, I think there's an easier approach. We can
move the creation of the TM contexts away from the rpc main thread. Ideally,
we should try to avoid performing any heavy operations which do not modify
the RM's internal state …
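(To illustrate that approach with a generic sketch; this is not Flink's
actual ResourceManager code, and the executor and method names are invented.
The blocking HDFS-style work runs on an I/O pool, and only the cheap
state-mutating completion is re-enqueued onto the single main thread.)

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    public class MainThreadOffloadSketch {
        // single-threaded executor standing in for the rpc main thread
        private final Executor mainThreadExecutor = Executors.newSingleThreadExecutor();
        // pool for blocking I/O, e.g. the slow HDFS calls mentioned above
        private final Executor ioExecutor = Executors.newFixedThreadPool(4);

        public void registerTaskManager(String taskManagerId) {
            CompletableFuture
                // heavy, blocking preparation runs off the main thread
                .supplyAsync(() -> prepareTaskManagerContext(taskManagerId), ioExecutor)
                // only the internal-state update is applied back on the main thread
                .thenAcceptAsync(this::applyContext, mainThreadExecutor);
        }

        private String prepareTaskManagerContext(String id) {
            return "context-for-" + id; // placeholder for the slow HDFS access
        }

        private void applyContext(String context) {
            // mutate internal state; safe, as this always runs on the main thread
        }
    }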
Hi,
Even after FLINK-13184 is implemented (even with Flink 1.11), jobs with high
parallelism would occasionally still get TM-RM heartbeat timeouts when the RM
is busy creating TM contexts during cluster initialization and HDFS is slow
at that moment. Apart from increasing the TM heartbeat …
Hi,
I am using Flink 1.5.5. I have a streaming job with 25 * 6 (150) parallelism.
I am seeing heartbeat timeouts far too frequently, even during off-peak
hours, which should rule out memory issues.
Also, I enabled debug logs for Flink and observed that a heartbeat request is
triggered every 5 seconds.