[
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xingsuo-zbz updated FLINK-39984:
--------------------------------
Component/s: Runtime / Web Frontend
Description:
h2. Summary
Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
cause the targeted process to miss heartbeats and be killed as failed, taking
down the running job. The diagnostic feature itself triggers the failure.
Observed in production with errors such as:
{quote}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
{quote}
h2. Root cause
Two _independent_ issues compound:
*1. The RPC handler runs synchronously on the main actor thread.*
{{TaskExecutor#requestThreadDump}}
(flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
and {{Dispatcher#requestThreadDump}}
(flink-runtime/.../dispatcher/Dispatcher.java:1858)
both return:
{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}
While the dump is being constructed, the actor mailbox does not advance, so
heartbeat replies, task lifecycle messages, and checkpoint coordination
messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
(e.g. {{requestLogList}}, {{requestFileUploadByFilePath}},
{{updatePartitions}})
are already offloaded to {{ioExecutor}} or the scheduled executor — this one
was not.
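The mailbox effect in (1) can be illustrated with a toy model. This is only a sketch, not Flink code: a single-threaded executor stands in for the actor main thread, a second one for {{ioExecutor}}, and a latch simulates a dump that has not finished yet. The class name, the method name and the 300 ms wait are assumptions made for the illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of an RPC endpoint: one "main thread" executor handles all messages in order. */
class MailboxStallSketch {

    /** Returns true if a heartbeat queued after the dump request is handled while the dump still runs. */
    static boolean heartbeatAnsweredDuringDump(boolean offloadDump) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor(); // stand-in for the actor main thread
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor(); // stand-in for ioExecutor
        CountDownLatch dumpMayFinish = new CountDownLatch(1);
        Runnable slowDump = () -> {
            try {
                dumpMayFinish.await(); // simulates a dump that is still in progress
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            // The request handler itself always starts on the main thread.
            mainThread.execute(() -> {
                if (offloadDump) {
                    ioExecutor.execute(slowDump); // handler returns at once, mailbox keeps moving
                } else {
                    slowDump.run(); // handler blocks, every later message waits
                }
            });
            // The heartbeat is just another message behind the dump request.
            CompletableFuture<Void> heartbeat = CompletableFuture.runAsync(() -> {}, mainThread);
            try {
                heartbeat.get(300, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            dumpMayFinish.countDown(); // let the simulated dump finish so both executors can stop
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("synchronous handler: heartbeat answered = " + heartbeatAnsweredDuringDump(false));
        System.out.println("offloaded handler:   heartbeat answered = " + heartbeatAnsweredDuringDump(true));
    }
}
```

With the synchronous handler the heartbeat should still be pending when the wait expires, while the offloaded variant should answer it promptly. The model covers only mailbox queueing; it says nothing about safepoint pauses, which stop the heartbeat thread regardless of executor.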
*2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*
{{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:
{code:java}
threadMxBean.dumpAllThreads(true, true); // lockedMonitors + lockedSynchronizers
{code}
Collecting locked monitors and AQS synchronizers requires walking every
thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
+ async I/O + user threads, easily 10k+ threads in production), this can
take several seconds to tens of seconds. During the safepoint, _every_ Java
thread in the JVM is paused, including the heartbeat dispatcher itself.
If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
(default 50s), the JM marks the TM dead and triggers a failover — even
though the TM is functional.
Note: fixing only (1) helps short dumps but not long ones, because the
safepoint pauses the heartbeat thread regardless of which executor the
caller runs on. Fixing only (2) helps long dumps but still allows the
mailbox to stall briefly. Both fixes are needed to fully address the issue.
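To get a rough feel for the cost difference in (2) on a given JVM, both flag combinations can be timed with the standard {{java.lang.management.ThreadMXBean}} API. The sketch below is an assumption-laden micro-experiment, not a benchmark: the class name, the helper names and the 500 lock-holding threads are made up for illustration, and absolute numbers will vary by JVM version, thread count and lock usage.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class DumpCostSketch {

    /** Takes one dump with the given flags and returns {elapsedNanos, threadCount}. */
    static long[] timeDump(boolean lockedMonitors, boolean lockedSynchronizers) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = System.nanoTime();
        // Throws UnsupportedOperationException if the JVM cannot report the requested lock info.
        ThreadInfo[] infos = bean.dumpAllThreads(lockedMonitors, lockedSynchronizers);
        return new long[] {System.nanoTime() - start, infos.length};
    }

    /** Starts daemon threads that each hold a ReentrantLock until the returned latch is released. */
    static CountDownLatch startLockHolders(int count) {
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(count);
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                ReentrantLock lock = new ReentrantLock();
                lock.lock(); // gives the dump an ownable synchronizer to report
                try {
                    started.countDown();
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            }, "lock-holder-" + i);
            t.setDaemon(true);
            t.start();
        }
        try {
            started.await(); // make sure every thread holds its lock before dumping
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return release;
    }

    public static void main(String[] args) {
        CountDownLatch release = startLockHolders(500);
        long[] full = timeDump(true, true);
        long[] light = timeDump(false, false);
        release.countDown();
        System.out.printf("with lock info:    %d threads, %d us%n", full[1], full[0] / 1_000);
        System.out.printf("without lock info: %d threads, %d us%n", light[1], light[0] / 1_000);
    }
}
```

A single run of each variant is noisy, so repeated measurements on a production-like thread population would be needed before drawing conclusions.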
h2. Reproduction
# Start a TaskManager with a job that creates many threads (e.g. a
high-parallelism job with RocksDB state backend and async I/O operators).
# In the Web UI, navigate to the TaskManager → Thread Dump tab.
# Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
within ~50s; the job enters failover.
h2. Proposed fix
*Step 1 (this ticket): purely additive changes, no default-behavior change.*
* Move the dump computation off the RPC main thread onto
{{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
single-flight (cache the in-flight future) so repeated UI clicks share
one running dump instead of queueing several.
* Introduce {{ThreadDumpMode \{FULL, SAFE\}}}:
** {{FULL}} — {{dumpAllThreads(true, true)}}, current behavior, retains
locked-monitor / synchronizer info, useful for deadlock analysis.
** {{SAFE}} — {{dumpAllThreads(false, false)}}, skips monitor /
synchronizer collection; safepoint is dramatically shorter on busy JVMs.
* Surface the mode through:
** REST query parameter: {{GET /taskmanagers/\{id\}/thread-dump?mode=safe|full}}
(and the analogous JM endpoint).
** Cluster config {{cluster.thread-dump.mode}} (default {{FULL}} — same as
today; this ticket does not change observable defaults).
** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab,
with a popconfirm on {{Full}}.
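A minimal sketch of how the Step 1 pieces could fit together, assuming a generic executor in place of {{ioExecutor}}. Only the {{FULL}} / {{SAFE}} semantics come from this ticket; {{SingleFlightDumper}}, {{collectLockInfo}}, {{parse}} and the choice to cache one in-flight future per mode are hypothetical and not existing Flink API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Locale;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** FULL keeps monitor/synchronizer info, SAFE skips it (names of members are hypothetical). */
enum ThreadDumpMode {
    FULL(true), SAFE(false);

    final boolean collectLockInfo;

    ThreadDumpMode(boolean collectLockInfo) { this.collectLockInfo = collectLockInfo; }

    /** Maps a "mode" query value to a mode; a missing value falls back to the configured default. */
    static ThreadDumpMode parse(String queryValue, ThreadDumpMode configuredDefault) {
        if (queryValue == null || queryValue.trim().isEmpty()) return configuredDefault;
        // valueOf throws IllegalArgumentException for unknown values.
        return valueOf(queryValue.trim().toUpperCase(Locale.ROOT));
    }
}

/** Runs dumps on a separate executor and shares one in-flight future per mode. */
class SingleFlightDumper<T> {
    private final Executor ioExecutor;
    private final Function<ThreadDumpMode, T> dumpFn;
    private final ConcurrentHashMap<ThreadDumpMode, CompletableFuture<T>> inFlight = new ConcurrentHashMap<>();

    SingleFlightDumper(Executor ioExecutor, Function<ThreadDumpMode, T> dumpFn) {
        this.ioExecutor = ioExecutor;
        this.dumpFn = dumpFn;
    }

    CompletableFuture<T> request(ThreadDumpMode mode) {
        CompletableFuture<T> created = new CompletableFuture<>();
        CompletableFuture<T> existing = inFlight.putIfAbsent(mode, created);
        if (existing != null) return existing; // a dump for this mode is already running
        try {
            ioExecutor.execute(() -> {
                try {
                    T result = dumpFn.apply(mode);
                    inFlight.remove(mode, created); // later callers trigger a fresh dump
                    created.complete(result);
                } catch (Throwable t) {
                    inFlight.remove(mode, created);
                    created.completeExceptionally(t);
                }
            });
        } catch (RuntimeException rejected) { // e.g. the executor was shut down
            inFlight.remove(mode, created);
            created.completeExceptionally(rejected);
        }
        return created;
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        SingleFlightDumper<ThreadInfo[]> dumper = new SingleFlightDumper<>(io, mode ->
                ManagementFactory.getThreadMXBean().dumpAllThreads(mode.collectLockInfo, mode.collectLockInfo));
        ThreadDumpMode mode = ThreadDumpMode.parse("safe", ThreadDumpMode.FULL);
        System.out.println(dumper.request(mode).join().length + " threads dumped in mode " + mode);
        io.shutdown();
    }
}
```

Keying the cache by mode means a {{SAFE}} request does not wait behind a running {{FULL}} dump. Whether that is preferable to a single shared future is a design choice this sketch does not settle.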
*Step 2 (separate \[DISCUSS\] on dev@): consider flipping the default to
{{SAFE}}.*
Splitting the default change out keeps Step 1 strictly additive and easy
to review/merge; the default flip can be argued with production data on
its own merits.
was:
Both `Dispatcher#requestThreadDump` (JobManager) and
`TaskExecutor#requestThreadDump` (TaskManager) currently execute
`ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main thread:
{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}
`dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`,
which on a JVM with many threads (Netty, RocksDB, async I/O, user threads —
easily ~ 10k in production) can take several seconds to tens of seconds,
especially when collecting monitor and synchronizer info.
While this call is in progress, the RPC actor cannot process any other message,
including:
* heartbeat pings from the JobManager / ResourceManager,
* task lifecycle messages,
* checkpoint trigger / confirm / abort messages.
If the dump takes longer than `heartbeat.timeout` (default 50s), the JM
declares the TM dead and triggers a failover, even though the TM itself is
fully functional: the heartbeat message was simply queued behind the dump
request in the actor mailbox.
We have observed this in production:
{code:java}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.
{code}
This is essentially a self-inflicted failure caused by a diagnostic tool —
clicking "Thread Dump" in the Web UI of a large-state job can kill the job.
Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g.
`requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are already
dispatched onto `ioExecutor` / the scheduled executor. `requestThreadDump`
should follow the same pattern.
> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
> Key: FLINK-39984
> URL: https://issues.apache.org/jira/browse/FLINK-39984
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST, Runtime / Web Frontend
> Affects Versions: 1.17.2, 2.3.0, 1.20.5
> Reporter: xingsuo-zbz
> Priority: Critical
--
This message was sent by Atlassian Jira
(v8.20.10#820010)