The log shows “Diagnostics Cluster entrypoint has been closed 
externally..”. Are you killing the YARN cluster entrypoint process 
directly in the terminal using “kill <pid>”? To kill a TM, you should go to 
the machine where the TM process resides and kill the TM process itself. 
The cluster entrypoint is the driver that launches the Flink cluster on YARN, 
not a JM or TM process.
ZK HA is for the JM (i.e. starting a new JM when the previous JM fails). The 
TM is managed by the JM and, IIUC, does not directly interact with ZK. It is 
possible for the JM to be restarted repeatedly (see the details in this 
doc <https://help.aliyun.com/document_detail/411149.html#section-cco-ygc-hfe>) 
due to a wrong configuration, but that may not be your case here.

Best,
Biao Geng

From: SmileSmile <a511955...@163.com>
Date: Monday, July 18, 2022 at 11:08 PM
To: biaogeng7 <biaoge...@gmail.com>
Cc: user <user@flink.apache.org>
Subject: Re: flink on yarn job always restart
Thanks for the reply. Our scenario was a failure test to see whether the job 
would recover on its own after a TM was killed.
It turns out that the job receives SIGNAL 15 and hangs during the switch from 
DEPLOYING to INITIALIZING. Because of ZK HA, it appears to restart repeatedly.

My questions:
1. Why does it receive SIGNAL 15?
2. Is it because of some configuration (e.g. a deploy timeout causing the kill)?

---- Replied Message ----
From: Geng Biao <biaoge...@gmail.com>
Date: 07/18/2022 22:36
To: SmileSmile <a511955...@163.com>, user <user@flink.apache.org>
Subject: Re: flink on yarn job always restart
Hi,

One possible direction is to check your YARN log or TM log to see whether the 
YARN RM kills the TM for some reason (e.g. physical memory is over the limit). 
If so, the JM will try to recover the TM repeatedly according to your restart 
strategy.
The snippet of the JM log you provided usually does not show the root cause.
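
To make that log check concrete, here is a minimal sketch in Python that scans a log file for two markers. The first pattern is the SIGTERM line quoted in this thread. The second is the wording I would expect from the YARN NodeManager when it kills a container for exceeding its memory limit; treat it as an assumption and verify it against your own logs. The names `MARKERS` and `scan_log` are made up for this illustration. The input could be the aggregated output of `yarn logs -applicationId <application id>` saved to a file.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Patterns to search for. "sigterm" is taken from the JM log in this thread;
# "yarn_memory_kill" is an assumed NodeManager message, check your own logs.
MARKERS = {
    "sigterm": re.compile(r"RECEIVED SIGNAL 15: SIGTERM"),
    "yarn_memory_kill": re.compile(
        r"is running beyond (physical|virtual) memory limits"
    ),
}


def scan_log(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, marker name, line text) for every matching line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((lineno, name, line.rstrip()))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py <path to saved log file>
    with open(sys.argv[1], errors="replace") as f:
        for lineno, name, text in scan_log(f):
            print(f"{lineno}: [{name}] {text}")
```

If a memory-limit hit on a TM container shows up shortly before the SIGTERM line, that would point at container memory rather than at the JM itself.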

Best,
Biao Geng

From: SmileSmile <a511955...@163.com>
Date: Monday, July 18, 2022 at 8:46 PM
To: user <user@flink.apache.org>
Subject: flink on yarn job always restart
Hi all,
We have run into a situation: with parallelism 3000 and a job that contains 
multiple agg operations, the job cannot recover from a checkpoint or 
savepoint and restarts repeatedly.
JM error log:
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
Flink version: 1.14.5
Do you have any good ideas for troubleshooting?




