[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

Matt Cheah (JIRA) Tue, 01 May 2018 12:38:19 -0700

    [ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460047#comment-16460047 ]


Matt Cheah commented on SPARK-24135:
------------------------------------

_> But I'm not sure how much this buys us because very likely the newly 
requested executors will fail to be initialized,_

That's entirely up to the behavior of the init container itself - there are many 
reasons to believe that a given init container's logic can be flaky. 
But it's not immediately obvious to me whether the init container's 
failure should count towards a job failure. Job failures shouldn't be caused by 
failures in the framework, and in this case, the framework has added the 
init-container for these pods - in other words, the user's code didn't directly 
cause the job failure.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24135
>                 URL: https://issues.apache.org/jira/browse/SPARK-24135
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore, Spark should handle the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore, we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating an artificial resource bottleneck.
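
The stall described in the issue can be illustrated with a small, self-contained sketch. This is a simplified model, not Spark's KubernetesClusterSchedulerBackend and not the Kubernetes client API: the names ExecutorPod, InitContainerStatus, classify, and should_request_more are all hypothetical, and the sketch assumes that a pod whose init-container exited non-zero is otherwise still seen as pending.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InitContainerStatus:
    """Hypothetical, simplified view of one init-container's status."""
    name: str
    exit_code: Optional[int] = None  # None means it has not terminated yet


@dataclass
class ExecutorPod:
    """Hypothetical, simplified view of an executor pod."""
    name: str
    phase: str  # simplified to "Pending", "Running", or "Failed"
    init_containers: List[InitContainerStatus] = field(default_factory=list)


def init_container_failed(pod: ExecutorPod) -> bool:
    # An init-container that terminated with a non-zero exit code models
    # the Init:Error condition from the issue description.
    return any(c.exit_code not in (None, 0) for c in pod.init_containers)


def classify(pod: ExecutorPod, check_init: bool) -> str:
    """Resolve a pod to "running", "failed", or "pending"."""
    if pod.phase == "Running":
        return "running"
    if pod.phase == "Failed" or (check_init and init_container_failed(pod)):
        return "failed"
    return "pending"


def should_request_more(pods: List[ExecutorPod], target: int, check_init: bool) -> bool:
    """Model of an allocator that waits for the current batch to resolve."""
    states = [classify(p, check_init) for p in pods]
    if "pending" in states:
        return False  # batch unresolved, so no new round is requested
    return states.count("running") < target


pods = [
    ExecutorPod("exec-1", "Running"),
    ExecutorPod("exec-2", "Running"),
    # Assumption: an injected init-container failed, pod never leaves pending.
    ExecutorPod("exec-3", "Pending", [InitContainerStatus("injected-init", exit_code=1)]),
]

print(should_request_more(pods, target=5, check_init=False))  # prints False
print(should_request_more(pods, target=5, check_init=True))   # prints True
```

Without the init-container check, exec-3 stays pending forever and the model never asks for more executors, so the pool is capped at two. With the check, exec-3 resolves to failed and the allocator is free to request replacements. Whether such failures should also count towards a job failure is the open question raised in the comment above and is deliberately left out of the sketch.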



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
