[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462540#comment-16462540 ]

Imran Rashid commented on SPARK-24135:
--------------------------------------

Honestly I don't understand the failure mode described here at all, but I can 
make some comparisons to YARN's handling of executor failures at the allocator 
level.

In YARN mode, Spark already has a check for the number of executor failures, and 
it fails the entire application if there are too many.  It's controlled by 
"spark.yarn.max.executor.failures".  The failures expire over time, controlled 
by "spark.yarn.executor.failuresValidityInterval", so really long-running apps 
are not penalized by a few errors spread out over a long period of time.  See 
the code in ApplicationMaster & YarnAllocator.
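
For reference, a minimal sketch of how those two settings could be supplied on a 
SparkConf; the values here are purely illustrative, not recommendations:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values: tolerate up to 20 executor failures overall, but let
// failures older than one hour expire so a long-running app is not killed by
// a handful of errors spread out over days.
val conf = new SparkConf()
  .set("spark.yarn.max.executor.failures", "20")
  .set("spark.yarn.executor.failuresValidityInterval", "1h")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}

The same keys can of course be passed with --conf on spark-submit instead.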

There is also ongoing work to have Spark detect when container initialization 
has failed and then request that other nodes be selected instead (SPARK-16630). 
There is a PR under review for that now.

From the bug description, I do think there should be some better error 
handling than what there is now, so the user at least knows what is going on, 
but it sounds like you're all in agreement about that already :).

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24135
>                 URL: https://issues.apache.org/jira/browse/SPARK-24135
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
>
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom webhooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init-containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of whether 
> Spark itself is aware of init-containers or not.
>
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
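
Purely as an illustration of the state the description above calls out (this is 
not the actual KubernetesClusterSchedulerBackend code, and the helper name is 
made up), detecting {{Init:Error}} with the fabric8 Kubernetes client amounts to 
noticing that one of the pod's init-containers has terminated with a non-zero 
exit code:

{code:scala}
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// Hypothetical helper: returns true if any init-container on the given pod has
// terminated with a non-zero exit code, which kubectl reports as Init:Error.
def hasFailedInitContainer(pod: Pod): Boolean = {
  val initStatuses = Option(pod.getStatus)
    .flatMap(status => Option(status.getInitContainerStatuses))
    .map(_.asScala)
    .getOrElse(Seq.empty)
  initStatuses.exists { containerStatus =>
    Option(containerStatus.getState)
      .flatMap(state => Option(state.getTerminated))
      .exists(terminated => Option(terminated.getExitCode).exists(_.intValue() != 0))
  }
}
{code}

A pod watch that already reacts to {{ERROR}} and {{DELETED}} could apply a check 
like this and mark the executor as failed instead of leaving it pending.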


