[ 
https://issues.apache.org/jira/browse/SPARK-56952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56952:
-----------------------------------
    Labels: pull-request-available  (was: )

> Preserve executor heartbeat timeout loss reason when executor removal is 
> reported as ExecutorKilled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56952
>                 URL: https://issues.apache.org/jira/browse/SPARK-56952
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 5.0.0
>            Reporter: Chao Sun
>            Priority: Major
>              Labels: pull-request-available
>
> When Spark expires an executor due to heartbeat timeout, `HeartbeatReceiver` 
> creates a specific loss reason:
> {code:java}
>   ExecutorProcessLost("Executor heartbeat timed out ...")
> {code}
> However, for coarse-grained backends, the executor removal path can later 
> report the executor as `ExecutorKilled`. In that case, the more specific 
> heartbeat-timeout reason is lost and Spark surfaces only the generic backend 
> reason.
> Dropping it discards useful failure context, and downstream handling or 
> diagnostics may then treat a heartbeat-timeout removal differently from the 
> original driver-side failure condition.
> The issue is especially visible in flows where Spark requests executor 
> replacement after heartbeat expiry, while the backend later confirms the 
> removal with a generic `ExecutorKilled` reason.
> We should preserve the original heartbeat-timeout loss reason across the 
> kill-and-remove flow when the backend provides only `ExecutorKilled`, while 
> still respecting any concrete backend-provided loss reason when one exists.
> Proposed behavior:
> - Carry the heartbeat-timeout `ExecutorProcessLost` reason through executor 
> replacement.
> - Use it only when the backend reports generic `ExecutorKilled`.
> - Do not override more specific backend reasons such as `ExecutorExited`.
> - Clear any pending preserved loss reason if the kill request is rejected or 
> fails.
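
The proposed behavior above can be sketched as follows. This is a minimal, self-contained illustration and not the actual Spark change: the class `LossReasonSketch`, its methods `rememberHeartbeatTimeout`, `onKillRequestFailed`, and `resolve`, and the simplified loss-reason types are all hypothetical stand-ins that only borrow the names used in the issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: these types only mirror the names in the issue text;
// they are not Spark's real classes.
final class LossReasonSketch {

    interface LossReason { String message(); }

    // Generic reason a backend may report after the driver asked for a kill.
    static final class ExecutorKilled implements LossReason {
        public String message() { return "Executor killed"; }
    }

    // A concrete backend-provided reason that must never be overridden.
    static final class ExecutorExited implements LossReason {
        private final int exitCode;
        ExecutorExited(int exitCode) { this.exitCode = exitCode; }
        public String message() { return "Executor exited with code " + exitCode; }
    }

    // The specific reason created on heartbeat expiry.
    static final class ExecutorProcessLost implements LossReason {
        private final String message;
        ExecutorProcessLost(String message) { this.message = message; }
        public String message() { return message; }
    }

    // Loss reasons waiting for the backend to confirm removal, keyed by executor id.
    private final Map<String, LossReason> pending = new ConcurrentHashMap<>();

    // Record the heartbeat-timeout reason before requesting the kill/replacement.
    void rememberHeartbeatTimeout(String executorId, LossReason reason) {
        pending.put(executorId, reason);
    }

    // The kill request was rejected or failed: drop the pending reason.
    void onKillRequestFailed(String executorId) {
        pending.remove(executorId);
    }

    // Pick the reason to surface when the backend confirms the removal.
    // The pending entry is consumed either way so it cannot leak to a later removal.
    LossReason resolve(String executorId, LossReason backendReason) {
        LossReason preserved = pending.remove(executorId);
        if (preserved != null && backendReason instanceof ExecutorKilled) {
            return preserved;      // generic reason: prefer the preserved one
        }
        return backendReason;      // concrete backend reason wins
    }
}
```

In this sketch the preserved reason is removed on every `resolve` call, including when the backend reports a more specific reason such as `ExecutorExited`, so a stale entry cannot be applied to an unrelated later removal of the same executor id. Whether the real change keeps the state in a map like this is an assumption.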



