[ 
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingsuo-zbz updated FLINK-39984:
--------------------------------
    Description: 
h2. Summary

Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
cause the targeted process to miss heartbeats and be killed as failed, taking
down the running job. The diagnostic feature itself triggers the failure.

Observed in production with errors such as:
{quote}Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
{quote}
h2. Root cause

Two _independent_ issues compound:

*1. The RPC handler runs synchronously on the main actor thread.*

{{TaskExecutor#requestThreadDump}} (flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
and {{Dispatcher#requestThreadDump}} (flink-runtime/.../dispatcher/Dispatcher.java:1858)
both return:
{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}
While the dump is being constructed, the actor mailbox does not advance, so
heartbeat replies, task lifecycle messages, and checkpoint coordination
messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
(e.g. {{requestLogList}}, {{requestFileUploadByFilePath}}, {{updatePartitions}})
are already offloaded to {{ioExecutor}} or the scheduled executor; this one
was not.
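
The stall described above can be reproduced outside Flink with a toy model. This is a minimal sketch, not Flink code: a single-threaded executor stands in for the actor mailbox, a sleep stands in for the dump, and all names ({{MailboxStallSketch}}, {{slowDump}}, {{heartbeatDelayMillis}}) are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of an actor mailbox: one thread processes all messages in order. */
final class MailboxStallSketch {

    /** Stand-in for an expensive thread dump; it sleeps instead of dumping. */
    static String slowDump(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "dump";
    }

    /**
     * Queues a dump request and then a heartbeat on a single-threaded "mailbox".
     * Returns how long the heartbeat waited before it ran, in milliseconds.
     */
    static long heartbeatDelayMillis(boolean offload, long dumpMillis) {
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newCachedThreadPool();
        try {
            long start = System.nanoTime();
            mailbox.execute(() -> {
                if (offload) {
                    // The handler returns at once; the work continues on ioExecutor.
                    CompletableFuture.supplyAsync(() -> slowDump(dumpMillis), ioExecutor);
                } else {
                    // Same shape as completedFuture(dumpAndCreate(...)): the work
                    // runs inline, before the already-completed future is created.
                    CompletableFuture.completedFuture(slowDump(dumpMillis));
                }
            });
            // The heartbeat is queued behind the dump request.
            CompletableFuture<Long> heartbeat = CompletableFuture.supplyAsync(
                    () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start), mailbox);
            return heartbeat.join();
        } finally {
            mailbox.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline:    " + heartbeatDelayMillis(false, 500) + " ms");
        System.out.println("offloaded: " + heartbeatDelayMillis(true, 500) + " ms");
    }
}
```

In the inline case the heartbeat cannot run until the simulated dump finishes; in the offloaded case it runs almost immediately. The sketch only models mailbox queueing, not the safepoint pause described in the second issue.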

*2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*

{{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:
{code:java}
threadMxBean.dumpAllThreads(true, true);  // lockedMonitors + lockedSynchronizers
{code}
Collecting locked monitors and AQS synchronizers requires walking every
thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
+ async I/O + user threads, easily 10k+ threads in production), this can
take anywhere from several seconds to tens of seconds. During the safepoint,
_every_ Java thread in the JVM is paused, including the heartbeat dispatcher itself.

If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
(default 50s), the JM marks the TM dead and triggers a failover — even
though the TM is functional.
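
To check the cost difference on a given JVM, the two variants can be timed directly. The following is a rough sketch using the standard {{java.lang.management.ThreadMXBean}} API; the class and method names ({{ThreadDumpCostSketch}}, {{timeDump}}) are hypothetical, and on a small idle JVM both calls will likely return in a few milliseconds, so it only becomes informative inside a process with a realistic thread count and heap.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumpCostSketch {

    /** Takes one dump and returns {threadCount, elapsedMillis}. */
    static long[] timeDump(boolean lockedMonitors, boolean lockedSynchronizers) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Requesting lock info on a JVM that does not support it throws
        // UnsupportedOperationException, so downgrade the request if needed.
        boolean monitors = lockedMonitors && bean.isObjectMonitorUsageSupported();
        boolean synchronizers = lockedSynchronizers && bean.isSynchronizerUsageSupported();

        long start = System.nanoTime();
        ThreadInfo[] infos = bean.dumpAllThreads(monitors, synchronizers);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new long[] {infos.length, elapsedMillis};
    }

    public static void main(String[] args) {
        long[] full = timeDump(true, true);
        long[] light = timeDump(false, false);
        System.out.println("with lock info:    " + full[0] + " threads, " + full[1] + " ms");
        System.out.println("without lock info: " + light[0] + " threads, " + light[1] + " ms");
    }
}
```

The measured time is wall-clock time seen by the caller, which is only an approximation of the safepoint duration; safepoint logging would give a more direct reading.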

Note: fixing only (1) helps short dumps but not long ones, because the
safepoint pauses the heartbeat thread regardless of which executor the
caller runs on. Fixing only (2) helps long dumps but still allows the
mailbox to stall briefly. Both fixes are needed to fully address the issue.
h2. Reproduction
 # Start a TaskManager with a job that creates many threads (e.g. a
high-parallelism job with RocksDB state backend and async I/O operators).
 # In the Web UI, navigate to the TaskManager → Thread Dump tab.
 # Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
within ~50s; the job enters failover.

h2. Proposed fix

*Step 1 (this ticket): purely additive changes, no default-behavior change.*
 * Move the dump computation off the RPC main thread and onto
{{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
single-flight (cache the in-flight future) so repeated UI clicks do
not queue multiple dumps.
 * Introduce {{ThreadDumpMode \{FULL, SAFE\}}}:
 ** {{FULL}}: {{dumpAllThreads(true, true)}}, current behavior, retains
locked-monitor / synchronizer info, useful for deadlock analysis.
 ** {{SAFE}}: {{dumpAllThreads(false, false)}}, skips monitor /
synchronizer collection, so the safepoint should be much shorter on busy JVMs.
 * Surface the mode through:
 ** REST query parameter: {{GET /taskmanagers/\{id\}/thread-dump?mode=safe|full}}
(and the analogous JM endpoint).
 ** Cluster config {{cluster.thread-dump.default.mode}} (default {{FULL}}, same as
today; this ticket does not change observable defaults).
 ** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab,
with a popconfirm on {{Full}}.
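
One possible shape for the offload plus single-flight part of Step 1 is sketched below. It is an illustration under stated assumptions, not the proposed patch: {{ThreadDumpMode}} is the enum named in this ticket and does not exist in Flink today, {{SingleFlightThreadDump}} is a hypothetical helper, and the raw {{ThreadInfo[]}} result stands in for {{ThreadDumpInfo}}.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical mode enum mirroring the proposal; not an existing Flink type. */
enum ThreadDumpMode { FULL, SAFE }

/** Hypothetical helper: at most one dump runs at a time, off the caller's thread. */
final class SingleFlightThreadDump {
    private final Executor ioExecutor;
    private final AtomicReference<CompletableFuture<ThreadInfo[]>> inFlight =
            new AtomicReference<>();

    SingleFlightThreadDump(Executor ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    CompletableFuture<ThreadInfo[]> requestThreadDump(ThreadDumpMode mode) {
        CompletableFuture<ThreadInfo[]> fresh = new CompletableFuture<>();
        while (!inFlight.compareAndSet(null, fresh)) {
            CompletableFuture<ThreadInfo[]> current = inFlight.get();
            if (current != null) {
                return current; // join the dump that is already running
            }
        }
        // Clear the cache once this dump finishes, successfully or not.
        fresh.whenComplete((result, error) -> inFlight.compareAndSet(fresh, null));
        try {
            ioExecutor.execute(() -> {
                try {
                    // FULL collects locked monitors and synchronizers; SAFE skips both.
                    boolean withLocks = mode == ThreadDumpMode.FULL;
                    fresh.complete(ManagementFactory.getThreadMXBean()
                            .dumpAllThreads(withLocks, withLocks));
                } catch (Throwable t) {
                    fresh.completeExceptionally(t);
                }
            });
        } catch (RuntimeException rejected) {
            // The executor refused the task; fail the future so it is not cached forever.
            fresh.completeExceptionally(rejected);
        }
        return fresh;
    }
}
```

A simplification worth noting: a caller that asks for {{FULL}} while a {{SAFE}} dump is in flight receives the {{SAFE}} result. Keeping one in-flight future per mode would avoid that, at the cost of allowing two concurrent dumps.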

*Step 2 (separate [DISCUSS] on dev@): consider flipping the default to {{SAFE}}.*

Splitting the default change out keeps Step 1 strictly additive and easy
to review/merge; the default flip can be argued with production data on
its own merits.

  was:
  h2. Summary

  Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
  cause the targeted process to miss heartbeats and be killed as failed, taking
  down the running job. The diagnostic feature itself triggers the failure.

  Observed in production with errors such as:
  {quote}
  Heartbeat of TaskManager with id <tm-id> timed out.
  java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
  {quote}

  h2. Root cause

  Two _independent_ issues compound:

  *1. The RPC handler runs synchronously on the main actor thread.*

  {{TaskExecutor#requestThreadDump}} (flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
  and {{Dispatcher#requestThreadDump}} (flink-runtime/.../dispatcher/Dispatcher.java:1858)
  both return:

  {code:java}
  return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
  {code}

  While the dump is being constructed, the actor mailbox does not advance, so
  heartbeat replies, task lifecycle messages, and checkpoint coordination
  messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
  (e.g. {{requestLogList}}, {{requestFileUploadByFilePath}}, {{updatePartitions}})
  are already offloaded to {{ioExecutor}} or the scheduled executor — this one
  was not.

  *2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*

  {{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:

  {code:java}
  threadMxBean.dumpAllThreads(true, true);  // lockedMonitors + lockedSynchronizers
  {code}

  Collecting locked monitors and AQS synchronizers requires walking every
  thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
  + async I/O + user threads — easily 10k+ threads in production), this can
  take many seconds to tens of seconds. During the safepoint, _every_ thread
  in the JVM is paused, including the heartbeat dispatcher itself.

  If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
  (default 50s), the JM marks the TM dead and triggers a failover — even
  though the TM is functional.

  Note: fixing only (1) helps short dumps but not long ones, because the
  safepoint pauses the heartbeat thread regardless of which executor the
  caller runs on. Fixing only (2) helps long dumps but still allows the
  mailbox to stall briefly. Both fixes are needed to fully address the issue.

  h2. Reproduction

  # Start a TaskManager with a job that creates many threads (e.g. a
    high-parallelism job with RocksDB state backend and async I/O operators).
  # In the Web UI, navigate to the TaskManager → Thread Dump tab.
  # Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
    within ~50s; the job enters failover.

  h2. Proposed fix

  *Step 1 (this ticket): purely additive changes, no default-behavior change.*

  * Offload the dump computation off the RPC main thread, using
    {{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
    single-flight (cache the in-flight future) so repeated UI clicks do
    not queue multiple dumps.
  * Introduce {{ThreadDumpMode \{FULL, SAFE\}}}:
  ** {{FULL}} — {{dumpAllThreads(true, true)}}, current behavior, retains
     locked-monitor / synchronizer info, useful for deadlock analysis.
  ** {{SAFE}} — {{dumpAllThreads(false, false)}}, skips monitor /
     synchronizer collection; safepoint is dramatically shorter on busy JVMs.
  * Surface the mode through:
  ** REST query parameter: {{GET /taskmanagers/\{id\}/thread-dump?mode=safe|full}}
     (and the analogous JM endpoint).
  ** Cluster config {{cluster.thread-dump.mode}} (default {{FULL}} — same as
     today; this ticket does not change observable defaults).
  ** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab,
     with a popconfirm on {{Full}}.

  *Step 2 (separate \[DISCUSS\] on dev@): consider flipping the default to {{SAFE}}.*

  Splitting the default change out keeps Step 1 strictly additive and easy
  to review/merge; the default flip can be argued with production data on
  its own merits.



> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
>                 Key: FLINK-39984
>                 URL: https://issues.apache.org/jira/browse/FLINK-39984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST, Runtime / Web Frontend
>    Affects Versions: 1.17.2, 2.3.0, 1.20.5
>            Reporter:  xingsuo-zbz
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
