[ 
https://issues.apache.org/jira/browse/SPARK-36577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikas Garg updated SPARK-36577:
-------------------------------
    Affects Version/s: 2.4.8

> Spark on K8s: Driver pod keeps running when executor allocator terminates 
> with Fatal exception
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36577
>                 URL: https://issues.apache.org/jira/browse/SPARK-36577
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Vikas Garg
>            Priority: Critical
>
> In Spark on Kubernetes, the class {{ExecutorPodsSnapshotsStoreImpl}} creates a 
> thread that is responsible for creating new executor pods. The thread 
> catches all {{NonFatal}} exceptions, logs them, and otherwise ignores them. 
> The only exception it treats specially is {{IllegalArgumentException}}, for 
> which it terminates the driver pod. Fatal exceptions are not handled at all, 
> which means that if such an exception occurs, the executor creation thread 
> terminates abruptly while the main thread keeps running. The Spark 
> application/job then keeps running indefinitely without making any progress.
> Two possible fixes are:
> *Option#1*: Fail the driver pod whenever any Fatal exception happens. 
> However, this approach has the following disadvantages:
>  # A few executors may already have been created when the Fatal 
> exception happens. These executors can still take the job to completion, 
> although more slowly than expected because not all executors were launched. 
> Thus, we would fail the job instead of letting it succeed slowly.
>  # The JVM can sometimes recover from Fatal exceptions on its own. Thus, we 
> would not give the driver pod a chance to recover from the failure; instead, 
> we would kill it on the first occurrence of a {{Fatal}} exception.
> *Option#2*: Fail the driver pod only when no executors are currently 
> running.
> In this approach, we fail the driver pod only when the number of currently 
> running executors is 0, so that we don’t kill a job that can potentially 
> complete. Thus, a single running executor can keep the driver pod from 
> dying. This may mean that the job makes very slow progress when the number 
> of requested executors is very large.
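
The two options above can be sketched as a small decision function. This is only an illustration in Java, not Spark's code: the class AllocatorFailurePolicy, the Action enum, and the methods isNonFatal, optionOne, and optionTwo are hypothetical names, and isNonFatal only approximates Scala's scala.util.control.NonFatal (it leaves out ThreadDeath and ControlThrowable). The running-executor count is assumed to be supplied by the caller.

```java
// Hypothetical sketch of the two proposed policies; none of these names exist in Spark.
final class AllocatorFailurePolicy {

    /** What the driver does after the executor-creation thread catches a throwable. */
    enum Action { KEEP_RUNNING, FAIL_DRIVER }

    /**
     * Rough stand-in for Scala's NonFatal check (an approximation):
     * JVM errors, linkage errors, and interrupts are treated as fatal.
     */
    static boolean isNonFatal(Throwable t) {
        return !(t instanceof VirtualMachineError
                || t instanceof LinkageError
                || t instanceof InterruptedException);
    }

    /** Option #1: any fatal throwable fails the driver pod. */
    static Action optionOne(Throwable t) {
        return isNonFatal(t) ? Action.KEEP_RUNNING : Action.FAIL_DRIVER;
    }

    /** Option #2: a fatal throwable fails the driver pod only if no executor is running. */
    static Action optionTwo(Throwable t, int runningExecutors) {
        if (isNonFatal(t)) {
            return Action.KEEP_RUNNING;
        }
        return runningExecutors == 0 ? Action.FAIL_DRIVER : Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        Throwable fatal = new OutOfMemoryError("simulated");
        Throwable nonFatal = new IllegalStateException("simulated");

        System.out.println(optionOne(nonFatal));      // KEEP_RUNNING
        System.out.println(optionOne(fatal));         // FAIL_DRIVER
        System.out.println(optionTwo(fatal, 3));      // KEEP_RUNNING
        System.out.println(optionTwo(fatal, 0));      // FAIL_DRIVER
    }
}
```

Under Option #2 the sketch keeps the driver alive as long as at least one executor is running, which matches the trade-off described above: the job is not killed, but it may proceed with far fewer executors than requested.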



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
