Chao Sun created SPARK-56952:
--------------------------------
Summary: Preserve executor heartbeat timeout loss reason when
executor removal is reported as ExecutorKilled
Key: SPARK-56952
URL: https://issues.apache.org/jira/browse/SPARK-56952
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 5.0.0
Reporter: Chao Sun
When Spark expires an executor due to heartbeat timeout, `HeartbeatReceiver`
creates a specific loss reason:
{code:java}
ExecutorProcessLost("Executor heartbeat timed out ...")
{code}
However, for coarse-grained backends, the executor removal path can later
report the executor as `ExecutorKilled`. In that case, the more specific
heartbeat-timeout reason is lost and Spark surfaces only the generic backend
reason.
This discards useful failure context: downstream handling or diagnostics may
then treat the removal as an ordinary kill rather than as the heartbeat
timeout that the driver originally detected.
The issue is especially visible in flows where Spark requests executor
replacement after heartbeat expiry, while the backend later confirms the
removal with a generic `ExecutorKilled` reason.
We should preserve the original heartbeat-timeout loss reason across the
kill-and-remove flow when the backend provides only `ExecutorKilled`, while
still respecting any concrete backend-provided loss reason when one exists.
Proposed behavior:
- Carry the heartbeat-timeout `ExecutorProcessLost` reason through executor
replacement.
- Use it only when the backend reports generic `ExecutorKilled`.
- Do not override more specific backend reasons such as `ExecutorExited`.
- Clear any pending preserved loss reason if the kill request is rejected or
fails.
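The proposed behavior above can be sketched roughly as follows. This is a simplified, self-contained model and not Spark's actual code: the class `PreservedLossReasonSketch`, the `Kind` enum, the `LossReason` holder, and all method names are hypothetical stand-ins chosen for illustration. Only the reason names (`ExecutorKilled`, `ExecutorExited`, `ExecutorProcessLost`) come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed bookkeeping; not Spark's real classes. */
class PreservedLossReasonSketch {

    /** Stand-ins for the loss reasons named in this issue. */
    public enum Kind { EXECUTOR_KILLED, EXECUTOR_EXITED, EXECUTOR_PROCESS_LOST }

    /** Minimal loss-reason holder (assumption: a kind plus a message is enough here). */
    public static final class LossReason {
        public final Kind kind;
        public final String message;

        public LossReason(Kind kind, String message) {
            this.kind = kind;
            this.message = message;
        }
    }

    // executorId -> heartbeat-timeout reason awaiting the backend's removal report
    private final Map<String, LossReason> pending = new HashMap<>();

    /** Called when the driver expires an executor and requests kill-and-replace. */
    public void onKillRequestedAfterHeartbeatTimeout(String executorId, LossReason reason) {
        pending.put(executorId, reason);
    }

    /** Called when the kill request is rejected or fails: drop the preserved reason. */
    public void onKillRejectedOrFailed(String executorId) {
        pending.remove(executorId);
    }

    /**
     * Called when the backend reports the removal. The preserved reason is used
     * only if the backend reports the generic ExecutorKilled; any other backend
     * reason (for example ExecutorExited) is returned unchanged.
     */
    public LossReason onBackendRemoval(String executorId, LossReason backendReason) {
        // Always consume the pending entry so it cannot leak into a later removal.
        LossReason preserved = pending.remove(executorId);
        if (backendReason.kind == Kind.EXECUTOR_KILLED && preserved != null) {
            return preserved;
        }
        return backendReason;
    }
}
```

In this sketch the pending entry is consumed on every removal report, whether or not it is used, so a stale heartbeat-timeout reason cannot be attached to an unrelated later removal of the same executor ID. Whether the real fix keeps this state in the scheduler backend or elsewhere is left open here.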