[
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xingsuo-zbz updated FLINK-39984:
--------------------------------
Issue Type: Bug (was: Improvement)
> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
> Key: FLINK-39984
> URL: https://issues.apache.org/jira/browse/FLINK-39984
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST
> Affects Versions: 1.17.2, 2.3.0, 1.20.5
> Reporter: xingsuo-zbz
> Priority: Critical
>
>
> Both `Dispatcher#requestThreadDump` (JobManager) and
> `TaskExecutor#requestThreadDump` (TaskManager) currently execute
> `ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main
> thread:
> {code:java}
> return CompletableFuture.completedFuture(
>     ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
> {code}
> `dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`,
> which on a JVM with many threads (Netty, RocksDB, async I/O, and user
> threads, easily ~10k in production) can take several seconds to tens of
> seconds, especially when collecting monitor and synchronizer info.
> While this call is in progress, the RPC actor cannot process any other
> message, including:
> * heartbeat pings from the JobManager / ResourceManager,
> * task lifecycle messages,
> * checkpoint trigger / confirm / abort messages.
>
> If the dump takes longer than `heartbeat.timeout` (default 50s), the JM
> declares the TM dead and triggers a failover, even though the TM itself is
> fully functional: the heartbeat request was simply queued behind the dump
> request in the actor mailbox.
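
The queuing effect described above can be reproduced outside Flink. The following is a minimal sketch, not Flink code: a single-threaded executor stands in for the RPC actor main thread, a sleep stands in for the synchronous thread dump, and the class and method names (`MailboxBlockingSketch`, `heartbeatDelayMillis`) are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MailboxBlockingSketch {

    /** Returns how many milliseconds a "heartbeat" waited behind a slow task. */
    static long heartbeatDelayMillis(long slowTaskMillis) {
        // One thread stands in for the RPC actor main thread (an assumption of this sketch).
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            long enqueuedAt = System.nanoTime();
            // Stand-in for the synchronous thread dump occupying the main thread.
            mainThread.execute(() -> sleepQuietly(slowTaskMillis));
            // The heartbeat is enqueued immediately but can only run after the slow task.
            CompletableFuture<Long> heartbeat = CompletableFuture.supplyAsync(
                    () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt),
                    mainThread);
            return heartbeat.join();
        } finally {
            mainThread.shutdown();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat delayed by ~" + heartbeatDelayMillis(500) + " ms");
    }
}
```

In this model the heartbeat's delay tracks the duration of the blocking task, which is the same relationship as between dump duration and `heartbeat.timeout` in the report.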
>
> We have observed this in production:
> {code:java}
> Heartbeat of TaskManager with id <tm-id> timed out.
> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ...
> timed out.{code}
> This is essentially a self-inflicted failure caused by a diagnostic tool —
> clicking "Thread Dump" in the Web UI of a large-state job can kill the job.
> Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g.
> `requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are
> already dispatched onto `ioExecutor` / the scheduled executor.
> `requestThreadDump` should follow the same pattern.
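
One possible shape of the proposed change is sketched below using only JDK classes. This is an illustration under stated assumptions, not the actual Flink patch: `AsyncThreadDumpSketch` and its `dumpAllThreads` helper are hypothetical stand-ins for the handler and for `ThreadDumpInfo.dumpAndCreate(...)`, and the `ioExecutor` parameter stands in for whatever I/O executor the endpoint already has.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncThreadDumpSketch {

    /** Hypothetical stand-in for ThreadDumpInfo.dumpAndCreate(...): the expensive call. */
    static ThreadInfo[] dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request monitor/synchronizer info where the JVM supports it.
        return bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
    }

    /**
     * Runs the dump on the given executor instead of the caller's thread,
     * so the caller (the main thread, in the report's terms) returns immediately.
     */
    static CompletableFuture<ThreadInfo[]> requestThreadDump(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(AsyncThreadDumpSketch::dumpAllThreads, ioExecutor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            ThreadInfo[] infos = requestThreadDump(ioExecutor).join();
            System.out.println("dumped " + infos.length + " threads off the calling thread");
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

The caller gets a future back right away, so messages queued behind the request (heartbeats, checkpoint triggers) would no longer wait for the dump to finish. Whether the real handlers need additional timeout or concurrency limits is left open here.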
--
This message was sent by Atlassian Jira
(v8.20.10#820010)