[ 
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Tang reassigned FLINK-39984:
--------------------------------

    Assignee:  xingsuo-zbz

> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
>                 Key: FLINK-39984
>                 URL: https://issues.apache.org/jira/browse/FLINK-39984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST, Runtime / Web Frontend
>    Affects Versions: 1.17.2, 2.3.0, 1.20.5
>            Reporter:  xingsuo-zbz
>            Assignee:  xingsuo-zbz
>            Priority: Critical
>
> h2. Summary
> Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
> cause the targeted process to miss heartbeats and be killed as failed, taking
> down the running job. The diagnostic feature itself triggers the failure.
> Observed in production with errors such as:
> {quote}Heartbeat of TaskManager with id <tm-id> timed out.
> java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
> {quote}
> h2. Root cause
> Two _independent_ issues compound:
> *1. The RPC handler runs synchronously on the main actor thread.*
> {{TaskExecutor#requestThreadDump}} 
> (flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
> and {{Dispatcher#requestThreadDump}} 
> (flink-runtime/.../dispatcher/Dispatcher.java:1858)
> both return:
> {code:java}
>   return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
>   {code}
> While the dump is being constructed, the actor mailbox does not advance, so
> heartbeat replies, task lifecycle messages, and checkpoint coordination
> messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
> (e.g. {{requestLogList}}, {{requestFileUploadByFilePath}}, {{updatePartitions}})
> are already offloaded to {{ioExecutor}} or the scheduled executor — this one
> was not.
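
The stall described above can be reproduced with a toy model. This is a sketch, not Flink code: a single-threaded executor stands in for the actor mailbox, the hypothetical `slowDump` stands in for `ThreadDumpInfo.dumpAndCreate`, and a cached thread pool stands in for `ioExecutor`. All class and method names here are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of an actor mailbox: one thread processes messages strictly in order. */
final class MailboxStallSketch {

    /** Hypothetical stand-in for an expensive thread dump; it only sleeps. */
    static String slowDump(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "dump";
    }

    /**
     * Enqueues a dump request followed by a heartbeat and returns how long (ms)
     * passed before the mailbox thread got around to the heartbeat.
     */
    static long heartbeatDelayMillis(boolean offload, long dumpMillis) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor(); // the "mailbox"
        ExecutorService ioExecutor = Executors.newCachedThreadPool();     // assumed I/O pool
        try {
            long start = System.nanoTime();
            if (offload) {
                // The handler returns at once; the dump runs on the I/O pool.
                mainThread.execute(
                        () -> CompletableFuture.supplyAsync(() -> slowDump(dumpMillis), ioExecutor));
            } else {
                // The handler computes the dump inline before wrapping it in a future.
                mainThread.execute(
                        () -> CompletableFuture.completedFuture(slowDump(dumpMillis)));
            }
            // The heartbeat message arrives right behind the dump request.
            CompletableFuture<Long> heartbeat =
                    CompletableFuture.supplyAsync(() -> System.nanoTime() - start, mainThread);
            return TimeUnit.NANOSECONDS.toMillis(heartbeat.join());
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline handler, heartbeat delay (ms):    "
                + heartbeatDelayMillis(false, 500));
        System.out.println("offloaded handler, heartbeat delay (ms): "
                + heartbeatDelayMillis(true, 500));
    }
}
```

With the inline handler the heartbeat waits for the whole dump; with the offloaded handler it is processed almost immediately. The model covers only issue (1): a real safepoint would pause the mailbox thread no matter where the dump is computed.
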
> *2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*
> {{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:
> {code:java}
>   threadMxBean.dumpAllThreads(true, true);  // lockedMonitors + lockedSynchronizers
>   {code}
> Collecting locked monitors and AQS synchronizers requires walking every
> thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
> + async I/O + user threads — easily 10k+ threads in production), this can
> take many seconds to tens of seconds. During the safepoint, _every_ thread
> in the JVM is paused, including the heartbeat dispatcher itself.
> If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
> (default 50s), the JM marks the TM dead and triggers a failover — even
> though the TM is functional.
> Note: fixing only (1) helps short dumps but not long ones, because the
> safepoint pauses the heartbeat thread regardless of which executor the
> caller runs on. Fixing only (2) helps long dumps but still allows the
> mailbox to stall briefly. Both fixes are needed to fully address the issue.
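
To gauge what the two flags cost and what they buy on a given JVM, a small standalone probe along these lines may help. It uses only the standard `ThreadMXBean` API; the class and method names are hypothetical, and the timings it prints depend entirely on thread count and load, so it is a measurement aid rather than evidence for any particular number.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Probe comparing dumpAllThreads(true, true) with dumpAllThreads(false, false). */
final class DumpFlagsSketch {

    private static final Object LOCK = new Object();

    /**
     * Takes a dump while the calling thread holds LOCK and returns how many
     * locked monitors the dump reports for the calling thread (-1 if not found).
     */
    static int lockedMonitorsSeen(boolean withLocks) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        synchronized (LOCK) {
            long self = Thread.currentThread().getId();
            for (ThreadInfo info : bean.dumpAllThreads(withLocks, withLocks)) {
                if (info.getThreadId() == self) {
                    return info.getLockedMonitors().length;
                }
            }
        }
        return -1;
    }

    /** Returns the wall-clock time (ms) of a single dump with the given flags. */
    static long timeDumpMillis(boolean withLocks) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = System.nanoTime();
        bean.dumpAllThreads(withLocks, withLocks);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("monitors reported with (true, true):   " + lockedMonitorsSeen(true));
        System.out.println("monitors reported with (false, false): " + lockedMonitorsSeen(false));
        System.out.println("dump time (ms) with (true, true):      " + timeDumpMillis(true));
        System.out.println("dump time (ms) with (false, false):    " + timeDumpMillis(false));
    }
}
```

The first pair of lines shows the trade-off in information: without the flags, the held monitor is simply absent from the dump. The second pair only becomes interesting when run inside a process with a realistic number of threads.
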
> h2. Reproduction
>  # Start a TaskManager with a job that creates many threads (e.g. a
> high-parallelism job with RocksDB state backend and async I/O operators).
>  # In the Web UI, navigate to the TaskManager → Thread Dump tab.
>  # Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
> within ~50s; the job enters failover.
> h2. Proposed fix
> *Step 1 (this ticket): purely additive changes, no default-behavior change.*
>  * Offload the dump computation off the RPC main thread, using
> {{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
> single-flight (cache the in-flight future) so repeated UI clicks do
> not queue multiple dumps.
>  * Introduce {{ThreadDumpMode}} with values {{FULL}} and {{SAFE}}:
>  ** {{FULL}}: {{dumpAllThreads(true, true)}}, current behavior; retains
> locked-monitor / synchronizer info, useful for deadlock analysis.
>  ** {{SAFE}}: {{dumpAllThreads(false, false)}}, skips monitor /
> synchronizer collection; safepoint is dramatically shorter on busy JVMs.
>  * Surface the mode through:
>  ** REST query parameter: {{GET /taskmanagers/\{id}/thread-dump?mode=safe|full}}
> (and the analogous JM endpoint).
>  ** Cluster config {{cluster.thread-dump.default.mode}} (default {{FULL}}, same
> as today; this ticket does not change observable defaults).
>  ** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab,
> with a popconfirm on {{Full}}.
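
One possible shape for the Step 1 pieces, as a sketch only: the ticket names `ThreadDumpMode`, `FULL` and `SAFE`, but the `parse` helper, the `SingleFlightDumper` class and its method signature are assumptions for illustration and are not taken from the Flink code base. The sketch keeps one in-flight future per mode, which is a design choice of this example; the ticket only says to cache the in-flight future.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Mode enum mirroring the FULL / SAFE split; the field and parse helper are hypothetical. */
enum ThreadDumpMode {
    FULL(true),
    SAFE(false);

    final boolean collectLocks;

    ThreadDumpMode(boolean collectLocks) {
        this.collectLocks = collectLocks;
    }

    /** Parses a "mode" query value; returns the fallback when the value is absent. */
    static ThreadDumpMode parse(String raw, ThreadDumpMode fallback) {
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        // Throws IllegalArgumentException for unknown values.
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}

/** Runs dumps on a separate executor and shares one in-flight future per mode. */
final class SingleFlightDumper {
    private final Executor ioExecutor;
    private final Map<ThreadDumpMode, CompletableFuture<ThreadInfo[]>> inFlight =
            new EnumMap<>(ThreadDumpMode.class);

    SingleFlightDumper(Executor ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    synchronized CompletableFuture<ThreadInfo[]> requestThreadDump(ThreadDumpMode mode) {
        CompletableFuture<ThreadInfo[]> running = inFlight.get(mode);
        if (running != null) {
            return running; // repeated clicks join the dump that is already running
        }
        CompletableFuture<ThreadInfo[]> fresh = new CompletableFuture<>();
        inFlight.put(mode, fresh);
        try {
            ioExecutor.execute(() -> runDump(mode, fresh));
        } catch (RuntimeException rejected) {
            // The executor refused the task: fail the future and forget it.
            inFlight.remove(mode, fresh);
            fresh.completeExceptionally(rejected);
        }
        return fresh;
    }

    private void runDump(ThreadDumpMode mode, CompletableFuture<ThreadInfo[]> target) {
        try {
            target.complete(ManagementFactory.getThreadMXBean()
                    .dumpAllThreads(mode.collectLocks, mode.collectLocks));
        } catch (Throwable t) {
            target.completeExceptionally(t);
        } finally {
            synchronized (this) {
                inFlight.remove(mode, target); // the next request starts a new dump
            }
        }
    }
}
```

A caller would resolve the mode first, for example `ThreadDumpMode.parse(queryValue, ThreadDumpMode.FULL)`, and then return the future from `requestThreadDump` without blocking the calling thread. Inside an RPC endpoint whose handlers all run on one main thread, the `synchronized` blocks would be unnecessary; they are here so the sketch is safe when called from arbitrary threads.
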
> *Step 2 (separate [DISCUSS] on dev@): consider flipping the default to {{SAFE}}.*
> Splitting the default change out keeps Step 1 strictly additive and easy
> to review/merge; the default flip can be argued with production data on
> its own merits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
