[ https://issues.apache.org/jira/browse/SPARK-44609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44609:
-----------------------------------
    Labels: pull-request-available  (was: )

> ExecutorPodsAllocator doesn't create new executors if no pod snapshot 
> captured pod creation
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-44609
>                 URL: https://issues.apache.org/jira/browse/SPARK-44609
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Scheduler
>    Affects Versions: 3.4.1
>            Reporter: Alibi Yeslambek
>            Priority: Major
>              Labels: pull-request-available
>
> The following race condition exists in ExecutorPodsAllocator when running a 
> Spark application with static allocation on Kubernetes with numExecutors >= 1:
>  * The driver requests an executor.
>  * exec-1 is created and registers with the driver.
>  * exec-1 is moved from {{newlyCreatedExecutors}} to 
> {{schedulerKnownNewlyCreatedExecs}}.
>  * exec-1 is deleted very quickly (~1-30 sec) after registration.
>  * {{ExecutorPodsWatchSnapshotSource}} fails to observe the creation of the 
> pod (e.g. the websocket connection was reset, the k8s-apiserver was down, 
> etc.).
>  * {{ExecutorPodsPollingSnapshotSource}} also misses the creation because it 
> polls only every 30 seconds, and the executor was removed sooner than that.
>  * exec-1 is therefore never removed from {{schedulerKnownNewlyCreatedExecs}}.
>  * {{ExecutorPodsAllocator}} never requests a replacement executor because its 
> slot is still occupied by exec-1, since {{schedulerKnownNewlyCreatedExecs}} is 
> never cleared (see the sketch below).
>  
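> To make the bookkeeping concrete, below is a minimal, self-contained sketch 
> of the state machine described above. The field names mirror the real ones 
> in {{ExecutorPodsAllocator}}, but the methods and the {{outstanding}} helper 
> are simplified stand-ins, not the actual Spark code:
> {code:scala}
> import scala.collection.mutable
> 
> // Hypothetical, simplified model of the allocator's bookkeeping; the field
> // names mirror the real ones, but the logic is only illustrative.
> object AllocatorRaceSketch {
>   val numExecutors = 1
>   // execs requested but not yet registered with the scheduler backend
>   val newlyCreatedExecutors = mutable.LinkedHashMap.empty[Long, Long]
>   // execs the scheduler knows about but no pod snapshot has confirmed yet
>   val schedulerKnownNewlyCreatedExecs = mutable.LinkedHashSet.empty[Long]
> 
>   def onExecutorRegistered(id: Long): Unit = {
>     // registration moves the exec out of newlyCreatedExecutors...
>     newlyCreatedExecutors.remove(id)
>     // ...and parks it here until a pod snapshot confirms the pod exists
>     schedulerKnownNewlyCreatedExecs += id
>   }
> 
>   def onSnapshot(podIds: Set[Long]): Unit = {
>     // only ids seen in a snapshot are ever cleared; if both snapshot
>     // sources miss the pod's short life, the id stays here forever
>     schedulerKnownNewlyCreatedExecs --= podIds
>   }
> 
>   // slots considered "in flight"; the allocator requests no more while > 0
>   def outstanding: Int =
>     newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size
> 
>   def main(args: Array[String]): Unit = {
>     newlyCreatedExecutors(1L) = System.currentTimeMillis() // request exec-1
>     onExecutorRegistered(1L) // exec-1 registers with the driver
>     // exec-1's pod is created and deleted between two 30s polls, and the
>     // watch missed it, so onSnapshot is never called with pod id 1L.
>     // The allocator sees no free slot and never requests a replacement:
>     assert(numExecutors - outstanding == 0)
>   }
> }
> {code}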
> I've put up a fix here: https://github.com/apache/spark/pull/42297
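> For illustration only, one plausible shape of a mitigation (continuing the 
> sketch above) is to also clear the stale id when the scheduler backend 
> reports the executor as removed, instead of waiting for a pod snapshot that 
> may never arrive. This is an assumption about the general approach, not 
> necessarily what the linked PR implements:
> {code:scala}
> // Hypothetical mitigation: a scheduler-reported removal also releases the
> // slot, so a missed pod snapshot cannot pin it forever.
> def onExecutorRemovedByScheduler(id: Long): Unit = {
>   AllocatorRaceSketch.newlyCreatedExecutors.remove(id)
>   AllocatorRaceSketch.schedulerKnownNewlyCreatedExecs -= id
> }
> {code}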


