xingsuo-zbz created FLINK-39984:
------------------------------------

             Summary: `requestThreadDump` blocks the JM/TM main thread and can 
cause heartbeat timeout / job failure
                 Key: FLINK-39984
                 URL: https://issues.apache.org/jira/browse/FLINK-39984
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / REST
    Affects Versions: 1.20.5, 2.3.0, 1.17.2
            Reporter:  xingsuo-zbz


 
Both `Dispatcher#requestThreadDump` (JobManager) and
`TaskExecutor#requestThreadDump` (TaskManager) currently execute
`ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main thread:

{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}

`dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`,
which on a JVM with many threads (Netty, RocksDB, async I/O, user threads;
easily ~10k in production) can take anywhere from several seconds to tens of
seconds, especially when collecting monitor and synchronizer info.

While this call is in progress, the RPC actor cannot process any other message,
including:
 * heartbeat pings from the JobManager / ResourceManager,
 * task lifecycle messages,
 * checkpoint trigger / confirm / abort messages.

If the dump takes longer than `heartbeat.timeout` (default 50s), the JM
declares the TM dead and triggers a failover, even though the TM itself is
fully functional: the heartbeat message was simply queued behind the dump
request in the actor mailbox.
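
The queueing effect can be reproduced outside Flink with a plain single-threaded executor standing in for the RPC main thread. This is only an illustrative sketch: `MailboxBlockingSketch` and `heartbeatDelayMillis` are hypothetical names, and `Thread.sleep` stands in for the slow thread dump.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class MailboxBlockingSketch {

    // Measures how long a "heartbeat" message waits when it is enqueued
    // behind a slow handler on a single-threaded mailbox.
    static long heartbeatDelayMillis(long slowHandlerMillis) throws Exception {
        // Assumption: one thread processes all messages in order,
        // similar to an RPC endpoint's main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            long enqueuedAt = System.nanoTime();

            // Slow handler (stand-in for a synchronous thread dump).
            Runnable slowHandler = () -> {
                try {
                    Thread.sleep(slowHandlerMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            mainThread.submit(slowHandler);

            // Heartbeat handler: records the time at which it finally runs.
            Callable<Long> heartbeatHandler = System::nanoTime;
            Future<Long> handledAt = mainThread.submit(heartbeatHandler);

            return TimeUnit.NANOSECONDS.toMillis(handledAt.get() - enqueuedAt);
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Heartbeat waited ~" + heartbeatDelayMillis(500) + " ms");
    }
}
```

The heartbeat handler itself is trivial, yet its latency is dominated by the handler queued ahead of it, which is the situation described above.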


We have observed this in production:

{code:java}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.
{code}

This is essentially a self-inflicted failure caused by a diagnostic tool:
clicking "Thread Dump" in the Web UI of a large-state job can kill the job.

Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g.
`requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are
already dispatched onto `ioExecutor` / the scheduled executor.
`requestThreadDump` should follow the same pattern.
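
A minimal sketch of the proposed pattern, using only JDK classes so that it runs standalone. It is not Flink's actual code: `ThreadDumpOffloadSketch`, `dumpAllThreadsBlocking`, and the single-thread `ioExecutor` are illustrative stand-ins, and the real change would call `ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth)` inside the supplier.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadDumpOffloadSketch {

    // Stand-in for the blocking work inside ThreadDumpInfo.dumpAndCreate(...):
    // collects all threads, including locked monitors and synchronizers.
    static ThreadInfo[] dumpAllThreadsBlocking() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        return threadMxBean.dumpAllThreads(true, true);
    }

    // Hypothetical handler: instead of computing the dump on the calling
    // (main) thread and wrapping it in completedFuture, run it on ioExecutor.
    static CompletableFuture<ThreadInfo[]> requestThreadDump(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(
                ThreadDumpOffloadSketch::dumpAllThreadsBlocking, ioExecutor);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<ThreadInfo[]> dump = requestThreadDump(ioExecutor);
            // The calling thread returns immediately and stays free for other
            // messages; here we only wait in order to print the result.
            System.out.println("Dumped " + dump.get().length + " threads");
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

With this shape the handler returns an incomplete future right away, so heartbeats and other messages are not delayed by the dump itself.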



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
