Matt Cheah created SPARK-24135:
----------------------------------

             Summary: [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
                 Key: SPARK-24135
                 URL: https://issues.apache.org/jira/browse/SPARK-24135
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.3.0
            Reporter: Matt Cheah
In KubernetesClusterSchedulerBackend, we detect when executors disconnect after having started, or when executor pods hit the {{ERROR}} or {{DELETED}} states. When executors fail in these ways, they are removed from the pending executors pool and the driver retries requesting them. However, the driver does not handle a different class of error: when the pod enters the {{Init:Error}} state, which occurs when one of the executor pod's init-containers fails. Spark itself does not attach any init-containers to executors, but custom webhooks running on the cluster can attach init-containers to executor pods, and pod presets can also specify init-containers for these pods. Spark should therefore handle the {{Init:Error}} case regardless of whether Spark itself is aware of any init-containers.

This class of error is particularly bad because when a pod hits this state, the failed executor will never start, yet the executor allocator still sees it as pending. The allocator will not request further rounds of executors because its current batch has not resolved to either running or failed. As a result, we are stuck with however many executors successfully started before the faulty one failed, potentially creating an artificial resource bottleneck.
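For illustration, here is a minimal sketch of how a pod-watch handler could classify this failure mode, assuming the fabric8 Kubernetes client that the Spark K8s backend uses. This is not existing Spark code; the helper name {{isInitContainerError}} is hypothetical:

{code:scala}
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Hypothetical helper: returns true if any init-container of the given pod
// has failed, i.e. the condition that kubectl renders as "Init:Error".
def isInitContainerError(pod: Pod): Boolean = {
  val initStatuses = Option(pod.getStatus)
    .flatMap(s => Option(s.getInitContainerStatuses))
    .map(_.asScala.toSeq)
    .getOrElse(Seq.empty)
  initStatuses.exists { status =>
    val state = Option(status.getState)
    // A failed init-container surfaces either as a terminated state with a
    // non-zero exit code, or as a waiting state with an error reason.
    val terminatedWithError = state
      .flatMap(s => Option(s.getTerminated))
      .exists(t => Option(t.getExitCode).exists(_.intValue() != 0))
    val waitingWithError = state
      .flatMap(s => Option(s.getWaiting))
      .flatMap(w => Option(w.getReason))
      .exists(r => r == "Error" || r == "CrashLoopBackOff")
    terminatedWithError || waitingWithError
  }
}
{code}

And a toy model (hypothetical names, not the actual allocator code) of why one pod stuck in {{Init:Error}} caps the pool: the allocator only requests the next batch once the previous batch has fully resolved to running or failed:

{code:scala}
// If a pod from the last batch is stuck as "pending" (e.g. in Init:Error),
// pendingFromLastBatch never drops to zero and no further executors are
// ever requested, leaving the pool frozen at its current size.
def executorsToRequest(
    totalExpected: Int,
    running: Int,
    pendingFromLastBatch: Int,
    batchSize: Int): Int = {
  if (pendingFromLastBatch > 0) 0
  else math.min(batchSize, totalExpected - running)
}
{code}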