[ https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281134#comment-17281134 ]

Attila Zsolt Piros commented on SPARK-34389:
--------------------------------------------

[~ranju] I understand your concern, but Spark could get the missing resources at 
any time (depending on the load of the k8s cluster): it might take just a few 
minutes, several hours, or they may never arrive. Waiting for those resources 
while using the executor(s) already allocated is also a valid answer to this 
problem.

And from the log you can see that Spark keeps using the single successfully 
allocated executor: 89 jobs have already finished on it.

From the Spark side the request is sent and we assume the resource manager will 
eventually satisfy it. This is very similar to why we do not have retry logic 
for unsatisfied requests:
 
[https://github.com/apache/spark/blob/c03258ebef8fdd1c3d83c1fe5b77732a2069aa53/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L69-L70]
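
Since Spark itself neither retries nor times out here, an application that must 
fail fast can do that check on its own side. A minimal sketch of such a guard, 
using the driver's status tracker; the helper name, polling interval, timeout 
and the -1 driver adjustment are illustrative, not Spark defaults:

import org.apache.spark.SparkContext

object MinExecutorGuard {
  // Poll the driver's status tracker and give up if the requested minimum
  // number of executors has not registered before the deadline.
  // Purely a user-side workaround, not something Spark does by itself.
  def awaitExecutors(sc: SparkContext, minExecutors: Int, timeoutMs: Long): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      // getExecutorInfos usually includes an entry for the driver as well,
      // hence the -1; adjust if your Spark version lists only executors.
      val registered = sc.statusTracker.getExecutorInfos.length - 1
      if (registered >= minExecutors) return true
      Thread.sleep(5000)
    }
    false
  }
}

// e.g. if (!MinExecutorGuard.awaitExecutors(sc, 4, 10 * 60 * 1000L)) sc.stop()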

So this problem is external to Spark. In this sense minExecutors means the 
minimum number of executors that will be requested (and during downscaling that 
minimum is protected from being killed by the driver).
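
To make that concrete, this is roughly how the relevant settings look on 
Kubernetes; the values are illustrative, and shuffle tracking is enabled because 
there is no external shuffle service on k8s:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: minExecutors is only the floor of what the driver asks
// for; whether the executor pods ever leave Pending is up to the k8s scheduler.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true") // no external shuffle service on k8s
  .set("spark.dynamicAllocation.minExecutors", "4")
  .set("spark.dynamicAllocation.maxExecutors", "8")
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(conf)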

If you accept my explanation, please close this issue.

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34389
>                 URL: https://issues.apache.org/jira/browse/SPARK-34389
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1
>            Reporter: Ranju
>            Priority: Major
>         Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the 
> minimum number of executors, the executors stay in Pending state for an 
> indefinite time until resources become free.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> spark.executor.memory=10g
> spark.dynamicAllocation.minExecutors=4
> spark.dynamicAllocation.maxExecutors=8
> Rather, the job should be cancelled if the requested minimum number of 
> executors is not available at that point in time because of resource 
> unavailability.
> Currently Spark does partial scheduling or no scheduling and waits 
> indefinitely, and the job gets stuck.



