xingsuo-zbz created FLINK-39984:
------------------------------------
Summary: `requestThreadDump` blocks the JM/TM main thread and can
cause heartbeat timeout / job failure
Key: FLINK-39984
URL: https://issues.apache.org/jira/browse/FLINK-39984
Project: Flink
Issue Type: Improvement
Components: Runtime / REST
Affects Versions: 1.20.5, 2.3.0, 1.17.2
Reporter: xingsuo-zbz
Both `Dispatcher#requestThreadDump` (JobManager) and
`TaskExecutor#requestThreadDump` (TaskManager)
currently execute `ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC
actor main thread:
{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));{code}
`dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`,
which on a JVM with many threads (Netty, RocksDB, async I/O, user threads;
easily around 10k in production) can take several seconds to tens of seconds,
especially when collecting monitor and synchronizer info.
While this call is in progress, the RPC actor cannot process any other message,
including:
* heartbeat pings from the JobManager / ResourceManager,
* task lifecycle messages,
* checkpoint trigger / confirm / abort messages.

If the dump takes longer than `heartbeat.timeout` (default 50s), the JM
declares the TM dead and triggers a failover, even though the TM itself is
fully functional: the heartbeat message was simply queued behind the dump
request in the actor mailbox.
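
The queuing effect can be reproduced outside Flink. The following is a minimal sketch, assuming a single-threaded executor as a stand-in for the RPC actor main thread and a `sleep` as a stand-in for the synchronous dump. The class and method names are hypothetical and are not Flink APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MailboxBlockingSketch {

    /** Returns how long (ms) a "heartbeat" message waited behind a slow handler. */
    static long heartbeatDelayMillis(long slowHandlerMillis) throws Exception {
        // Stand-in for the RPC actor main thread: one thread, FIFO mailbox.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            long enqueuedAt = System.nanoTime();

            // Stand-in for a synchronous thread dump running on the main thread.
            mainThread.execute(() -> sleepQuietly(slowHandlerMillis));

            // The heartbeat is enqueued right away, but it cannot be handled
            // until the slow handler ahead of it in the mailbox has returned.
            CompletableFuture<Long> handledAt =
                    CompletableFuture.supplyAsync(() -> System.nanoTime(), mainThread);

            long waitedNanos = handledAt.get(30, TimeUnit.SECONDS) - enqueuedAt;
            return TimeUnit.NANOSECONDS.toMillis(waitedNanos);
        } finally {
            mainThread.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heartbeat waited ~" + heartbeatDelayMillis(200) + " ms");
    }
}
```

The heartbeat's wait time tracks the duration of the handler ahead of it, so a dump that runs longer than `heartbeat.timeout` is enough to make the sender give up.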

We have observed this in production:
{code:java}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.{code}
This is essentially a self-inflicted failure caused by a diagnostic tool:
clicking "Thread Dump" in the Web UI of a large-state job can kill the job.

Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g.
`requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are
already dispatched onto `ioExecutor` / the scheduled executor.
`requestThreadDump` should follow the same pattern.
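
A possible shape of the change, as a JDK-only sketch rather than an actual patch: the handler returns a future right away and the dump runs on a separate executor. `ThreadDumpOffloadSketch` and its `ioExecutor` parameter are hypothetical stand-ins. The sketch calls `ThreadMXBean#dumpAllThreads` directly instead of Flink's `ThreadDumpInfo.dumpAndCreate`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class ThreadDumpOffloadSketch {

    /**
     * Hypothetical handler: returns a future immediately and runs the
     * expensive dump on ioExecutor instead of the calling (main) thread.
     */
    static CompletableFuture<ThreadInfo[]> requestThreadDump(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(
                ThreadDumpOffloadSketch::dumpAllThreads, ioExecutor);
    }

    private static ThreadInfo[] dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request monitor/synchronizer details if this JVM supports them;
        // dumpAllThreads throws UnsupportedOperationException otherwise.
        return bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
    }
}
```

In the Flink handlers this would presumably amount to `CompletableFuture.supplyAsync(() -> ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth), ioExecutor)`, assuming a suitable `ioExecutor` is reachable from both `Dispatcher` and `TaskExecutor`.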