Chao Sun created SPARK-56952:
--------------------------------

             Summary: Preserve executor heartbeat timeout loss reason when 
executor removal is reported as ExecutorKilled
                 Key: SPARK-56952
                 URL: https://issues.apache.org/jira/browse/SPARK-56952
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 5.0.0
            Reporter: Chao Sun


When Spark expires an executor due to heartbeat timeout, `HeartbeatReceiver` 
creates a specific loss reason:


{code:java}
  ExecutorProcessLost("Executor heartbeat timed out ...")
{code}


However, for coarse-grained backends, the executor removal path can later 
report the executor as `ExecutorKilled`. In that case, the more specific 
heartbeat-timeout reason is lost and Spark surfaces only the generic backend 
reason.

This discards useful failure context: downstream handling and diagnostics see 
only a generic kill rather than the heartbeat timeout that the driver 
originally detected.

The issue is especially visible in flows where Spark requests executor 
replacement after heartbeat expiry, while the backend later confirms the 
removal with a generic `ExecutorKilled` reason.

We should preserve the original heartbeat-timeout loss reason across the 
kill-and-remove flow when the backend provides only `ExecutorKilled`, while 
still respecting any concrete backend-provided loss reason when one exists.

Proposed behavior:
- Carry the heartbeat-timeout `ExecutorProcessLost` reason through executor 
replacement.
- Use it only when the backend reports generic `ExecutorKilled`.
- Do not override more specific backend reasons such as `ExecutorExited`.
- Clear any pending preserved loss reason if the kill request is rejected or 
fails.
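
The proposed behavior above can be sketched as a small per-executor table of 
pending loss reasons. This is a minimal standalone model, not Spark's actual 
scheduler code: the `LossReason` types are simplified stand-ins for Spark's 
Scala classes, and `PendingLossReasons`, `preserve`, `clear`, and `resolve` are 
hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for Spark's loss-reason types (assumed shapes, not the real classes).
interface LossReason {
    String message();
}

// Generic reason a backend may report after a driver-requested kill.
final class ExecutorKilled implements LossReason {
    public String message() { return "Executor killed"; }
}

// Specific reason created when the heartbeat times out.
final class ExecutorProcessLost implements LossReason {
    private final String message;
    ExecutorProcessLost(String message) { this.message = message; }
    public String message() { return message; }
}

// Example of a concrete backend-provided reason that must not be overridden.
final class ExecutorExited implements LossReason {
    private final int exitCode;
    ExecutorExited(int exitCode) { this.exitCode = exitCode; }
    public String message() { return "Executor exited with code " + exitCode; }
}

// Hypothetical helper: remembers the heartbeat-timeout reason across kill-and-remove.
final class PendingLossReasons {
    private final Map<String, LossReason> pending = new HashMap<>();

    // Called when the driver requests a kill/replacement after heartbeat expiry.
    synchronized void preserve(String executorId, LossReason reason) {
        pending.put(executorId, reason);
    }

    // Called when the kill request is rejected or fails.
    synchronized void clear(String executorId) {
        pending.remove(executorId);
    }

    // Called when the backend confirms removal. The pending entry is always
    // consumed, but it is used only if the backend reports generic ExecutorKilled.
    synchronized LossReason resolve(String executorId, LossReason backendReason) {
        LossReason preserved = pending.remove(executorId);
        if (preserved != null && backendReason instanceof ExecutorKilled) {
            return preserved;
        }
        return backendReason;
    }
}
```

In this sketch `resolve` removes the pending entry on every removal, so a stale 
heartbeat-timeout reason cannot leak into a later, unrelated removal of the 
same executor ID. Whether the real change keeps this state in the scheduler 
backend or elsewhere is left open here.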



