[
https://issues.apache.org/jira/browse/SPARK-55974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
angerszhu updated SPARK-55974:
------------------------------
Description:
In YARN mode, executors can get stuck during launch (e.g., because of a slow
node, resource contention, or network issues). Without a timeout, the
ApplicationMaster (AM) keeps waiting indefinitely, which can:
* Block progress when executors never register.
* Prevent new executors from being requested.
* Cause jobs to hang or run with fewer executors than expected.
This change adds a configurable timeout so the AM can detect stuck launches and
request replacement executors, improving reliability and resource utilization.
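
The detection logic described above can be sketched roughly as follows. This is
only an illustration, not the code in the pull request: the class LaunchTracker,
its methods, and the launch_timeout_s setting are hypothetical names, and the
real change would live in Spark's YARN allocator (Scala). The idea is to record
when each executor launch starts, clear the record when the executor registers,
and periodically collect launches that have exceeded the timeout so that
replacements can be requested.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LaunchTracker:
    """Tracks executors that were launched but have not registered yet.

    Hypothetical helper for illustration; not part of Spark.
    """
    launch_timeout_s: float  # assumed timeout setting, in seconds
    pending: Dict[str, float] = field(default_factory=dict)  # executor id -> launch time

    def on_launch(self, executor_id: str, now: float) -> None:
        # Remember when the launch was started.
        self.pending[executor_id] = now

    def on_registered(self, executor_id: str) -> None:
        # The executor came up, so it is no longer a stuck-launch candidate.
        self.pending.pop(executor_id, None)

    def expire_stuck(self, now: float) -> List[str]:
        # Return (and forget) every launch that has been pending longer than
        # the timeout; the caller would request one replacement per id.
        stuck = sorted(
            eid for eid, started in self.pending.items()
            if now - started > self.launch_timeout_s
        )
        for eid in stuck:
            del self.pending[eid]
        return stuck


tracker = LaunchTracker(launch_timeout_s=120.0)
tracker.on_launch("exec-1", now=0.0)
tracker.on_launch("exec-2", now=0.0)
tracker.on_registered("exec-1")          # exec-1 registered in time
stuck = tracker.expire_stuck(now=300.0)  # exec-2 never registered
```

Passing the current time in explicitly keeps the sketch deterministic; a real
implementation would read a clock and run the check on a periodic schedule.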
> Relaunch new executors if an executor launch takes too long
> -----------------------------------------------------------
>
> Key: SPARK-55974
> URL: https://issues.apache.org/jira/browse/SPARK-55974
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, YARN
> Affects Versions: 3.2.4, 3.5.8, 4.1.1
> Reporter: angerszhu
> Priority: Major
> Labels: pull-request-available