Vikas Garg created SPARK-36577:
----------------------------------

             Summary: Spark on K8s: Driver pod keeps running when executor 
allocator terminates with Fatal exception
                 Key: SPARK-36577
                 URL: https://issues.apache.org/jira/browse/SPARK-36577
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 3.0.0
            Reporter: Vikas Garg


In Spark on Kubernetes, the class ExecutorPodsSnapshotsStoreImpl creates a 
thread which is responsible for creating new executor pods. The thread catches 
all {{NonFatal}} exceptions, logging and ignoring them. Among Fatal 
exceptions, however, it handles only {{IllegalArgumentException}}, terminating 
the driver pod in that case. Other Fatal exceptions are not handled at all, 
which means that if such a Fatal exception occurs, the executor creation 
thread terminates abruptly while the main thread keeps running. The Spark 
application/job would then keep running indefinitely without making any 
progress.
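The failure mode can be sketched in a minimal, self-contained way (this is illustrative Java, not the Spark source): a worker thread catches only non-fatal exceptions, so a fatal {{Error}} escapes and kills the thread, while the main thread carries on as if nothing happened.

```java
// Hypothetical sketch of the failure mode described above (not Spark code):
// a worker thread swallows non-fatal exceptions but dies on a fatal Error,
// while the main thread keeps running.
class FatalInPoller {
    static volatile boolean pollerDied = false;

    public static void main(String[] args) throws InterruptedException {
        Thread poller = new Thread(() -> {
            try {
                // Simulate the executor-allocation work hitting a fatal error.
                throw new OutOfMemoryError("simulated fatal error");
            } catch (RuntimeException nonFatal) {
                // Only non-fatal exceptions are caught and logged here;
                // the Error above is NOT caught and kills this thread.
                System.err.println("ignored non-fatal: " + nonFatal);
            }
        });
        poller.start();
        poller.join();
        pollerDied = !poller.isAlive();
        // The "driver" (main thread) is still running even though the
        // allocator thread is dead: no new executors will ever be created.
        System.out.println("poller died=" + pollerDied + ", main still running");
    }
}
```

Run as written, the poller thread terminates with an uncaught {{OutOfMemoryError}} while {{main}} continues, which mirrors the hang described in this issue.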

To fix this, two possible options are (looking for feedback on these):

*Option#1*: Fail the driver pod whenever any Fatal exception happens. However, 
this approach has the following disadvantages:
 # A few executors may already have been created when the Fatal exception 
happens. These executors can still take the job to completion, although more 
slowly than expected since not all executors were launched. We would thus fail 
the job instead of letting it succeed slowly.
 # The JVM can sometimes recover from Fatal exceptions on its own. We would be 
killing the driver pod on the first occurrence of a {{Fatal}} exception rather 
than giving it a chance to recover.

*Option#2*: Fail the driver pod only when there are currently 0 executors 
running.

In this approach, we fail the driver pod only when the number of currently 
running executors is 0, so that we don't kill a job which can potentially 
complete. A single running executor can thus keep the driver pod from dying. 
This may mean the job makes very slow progress when the actual number of 
requested executors is very large.
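Option #2 boils down to a small policy check on fatal errors; a minimal sketch, assuming a hypothetical policy class that consults a running-executor counter (the names are illustrative, not Spark API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of Option #2 (illustrative names, not Spark API):
// on a fatal error in the allocator thread, stop the driver only if no
// executors are currently running; otherwise keep the job limping along.
class AllocatorFailurePolicy {
    private final AtomicInteger runningExecutors;

    AllocatorFailurePolicy(AtomicInteger runningExecutors) {
        this.runningExecutors = runningExecutors;
    }

    /** Returns true if the driver pod should be terminated for this fatal error. */
    boolean shouldStopDriver(Throwable fatal) {
        // With zero executors the job can make no progress, so fail fast;
        // with at least one executor, let the driver live (slow progress).
        return runningExecutors.get() == 0;
    }
}
```

With zero executors the policy fails the driver immediately; once even one executor is running, the driver is kept alive, which is exactly the trade-off (possible very slow progress) noted above.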



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
