[ https://issues.apache.org/jira/browse/SPARK-36577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vikas Garg updated SPARK-36577:
-------------------------------
    Affects Version/s: 2.4.8

> Spark on K8s: Driver pod keeps running when executor allocator terminates with Fatal exception
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36577
>                 URL: https://issues.apache.org/jira/browse/SPARK-36577
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Vikas Garg
>            Priority: Critical
>
> In Spark on Kubernetes, the class ExecutorPodsSnapshotsStoreImpl creates a thread that is responsible for creating new executor pods. The thread catches all NonFatal exceptions, logs them, and ignores them. Among Fatal exceptions, however, it only handles IllegalArgumentException, terminating the driver pod in that case. Other Fatal exceptions are not handled at all, which means that if such a Fatal exception occurs, the executor creation thread terminates abruptly while the main thread keeps running. The Spark application/job would therefore keep running indefinitely without making any progress.
>
> To fix this, two possible options are:
>
> *Option #1*: Fail the driver pod whenever any Fatal exception happens. However, this approach has the following disadvantages:
> # A few executors may already have been created when the Fatal exception happens. These executors can still take the job to completion, although more slowly than expected since not all executors were launched. We would thus fail the job instead of letting it succeed slowly.
> # The JVM can sometimes recover from Fatal exceptions on its own. We would not be giving the driver pod a chance to recover from the failure; instead we would kill it on the first occurrence of a Fatal exception.
>
> *Option #2*: Fail the driver pod only when there are 0 executors currently running.
> In this approach, we fail the driver pod only when the number of currently running executors is 0, so that we don't kill a job that could potentially still complete. A single running executor is thus enough to keep the driver pod alive, which may mean the job makes very slow progress when the actual number of requested executors is very large. A sketch of this approach follows below.
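> Below is a minimal Scala sketch of the error-handling pattern described in Option #2. It is not the actual ExecutorPodsSnapshotsStoreImpl code; {{runningExecutorCount}} and {{stopDriver}} are hypothetical placeholders for however the allocator tracks live executor pods and terminates the driver pod.
> {code:scala}
> import scala.util.control.NonFatal
>
> // Illustrative sketch of Option #2 only, not the real Spark source.
> // `runningExecutorCount` and `stopDriver` are hypothetical placeholders.
> object AllocatorErrorHandling {
>
>   def runningExecutorCount(): Int = 0          // placeholder: number of live executor pods
>   def stopDriver(cause: Throwable): Unit =     // placeholder: terminate the driver pod
>     println(s"Stopping driver: fatal allocator error with no running executors: $cause")
>
>   /** Wraps one iteration of the executor-allocation task. */
>   def guardedAllocatorTask(task: () => Unit): Unit = {
>     try {
>       task()
>     } catch {
>       case NonFatal(e) =>
>         // Non-fatal errors are logged and ignored, as the existing code already does.
>         println(s"Ignoring non-fatal allocator error: $e")
>       case fatal: Throwable =>
>         // Fatal error: fail the driver only if no executor is currently running;
>         // otherwise let the executors that were already launched finish the job.
>         if (runningExecutorCount() == 0) {
>           stopDriver(fatal)
>         }
>         throw fatal
>     }
>   }
> }
> {code}
> With this wrapper, a fatal error in the allocator thread no longer kills that thread silently: either the driver is stopped (when nothing is running) or the job is allowed to limp along on the executors it already has.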