[
https://issues.apache.org/jira/browse/SPARK-52933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-52933:
----------------------------------
Parent Issue: SPARK-54016 (was: SPARK-51166)
> Verify if the executor pod cpu request exceeds limit
> ----------------------------------------------------
>
> Key: SPARK-52933
> URL: https://issues.apache.org/jira/browse/SPARK-52933
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 4.1.0
> Reporter: Zemin Piao
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> h2. *Context*
> By mistake, *spark.kubernetes.executor.request.cores* or
> *spark.executor.cores* can be set to a value that exceeds
> {*}spark.kubernetes.executor.limit.cores{*}. This can cause unwanted
> behavior.
> N.B. The scope of this ticket is not fixing the misconfiguration itself,
> given there are plenty of hints for troubleshooting and resolving such
> issues (e.g. logs), but rather the failure mode when the error happens.
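> As a hedged illustration (the values below are made up), the
> misconfiguration looks like this in a Scala `SparkConf`: the cpu request
> (4 cores) exceeds the cpu limit (2 cores), so every executor pod the
> driver asks for will be rejected.
> {code:scala}
> import org.apache.spark.SparkConf
>
> // Hypothetical values: cpu request (4) exceeds cpu limit (2).
> val conf = new SparkConf()
>   .set("spark.kubernetes.executor.request.cores", "4")
>   .set("spark.kubernetes.executor.limit.cores", "2")
> {code}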
> h3. Current behaviour:
> Assuming the driver pod is scheduled correctly:
> When the driver pod sends a POST request to the Kubernetes API server to
> create executor pods and *spark.executor.cores* or
> *spark.kubernetes.executor.request.cores* is larger than
> {*}spark.kubernetes.executor.limit.cores{*}, the request fails with a 422
> status code and no executor pods are created, because *request.cores*
> exceeds {*}limit.cores{*}.
> The driver then retries the POST request for executor creation every
> {*}spark.kubernetes.allocation.batch.delay{*} (1 second by default) and
> fails continuously. There is no limit on the number of retries for such
> requests.
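> To make the failure concrete, here is a hedged sketch of the rejected pod
> creation using the fabric8 client that Spark ships with (the pod name,
> image, and exact builder calls are assumptions for illustration, not
> Spark's actual code path):
> {code:scala}
> import io.fabric8.kubernetes.api.model.{PodBuilder, Quantity}
> import io.fabric8.kubernetes.client.{KubernetesClientBuilder, KubernetesClientException}
>
> val client = new KubernetesClientBuilder().build()
>
> // A pod spec whose cpu request exceeds its cpu limit is invalid.
> val pod = new PodBuilder()
>   .withNewMetadata().withName("spark-exec-1").endMetadata()
>   .withNewSpec()
>     .addNewContainer()
>       .withName("spark-kubernetes-executor")
>       .withImage("spark:4.1.0")
>       .withNewResources()
>         .addToRequests("cpu", new Quantity("4")) // exceeds the limit below
>         .addToLimits("cpu", new Quantity("2"))
>       .endResources()
>     .endContainer()
>   .endSpec()
>   .build()
>
> try client.pods().resource(pod).create()
> catch {
>   case e: KubernetesClientException if e.getCode == 422 =>
>     // The API server rejects the spec during validation; no pod exists,
>     // so the driver's allocator simply tries again after the batch delay.
>     println(s"Rejected: ${e.getMessage}")
> }
> {code}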
> Two problems arise:
> * From the Spark job status perspective (as seen in the Spark UI): the job
> is in running status, yet in theory it stays stuck indefinitely if nothing
> interrupts it.
> * When a great number of Spark jobs hit this issue, the load on the
> Kubernetes API server increases wastefully, given that drivers keep sending
> pod creation requests that are guaranteed to fail.
> h3. Expected behaviour:
> Fail the Spark job as early as possible when
> *spark.kubernetes.executor.request.cores* or *spark.executor.cores* exceeds
> {*}spark.kubernetes.executor.limit.cores{*}, given that the job cannot make
> progress without executors being scheduled.
> In this case:
> * No wasteful load on the Kubernetes API server, since the whole job fails
> up front.
> * The Spark job ends up in failed status: direct feedback to users that
> something is not right.
> h2. *Proposal:*
> In `KubernetesClusterManager.scala`, add a fail-fast step (e.g. a
> `require`) ensuring that `spark.kubernetes.executor.limit.cores`, when set,
> is at least as high as `spark.kubernetes.executor.request.cores` and
> `spark.executor.cores`.
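> A minimal sketch of such a check (the object and method names, their
> placement, and the simplified cpu-quantity parsing are assumptions;
> millicore values like "500m" are handled crudely here):
> {code:scala}
> import org.apache.spark.SparkConf
>
> object ExecutorCoresCheck {
>   // Crude cpu-quantity parser: "500m" -> 0.5, "2" -> 2.0.
>   private def parseCpu(quantity: String): Double =
>     if (quantity.endsWith("m")) quantity.dropRight(1).toDouble / 1000
>     else quantity.toDouble
>
>   // Fail fast at scheduler-backend setup instead of looping on 422s.
>   def verifyExecutorCores(conf: SparkConf): Unit = {
>     conf.getOption("spark.kubernetes.executor.limit.cores").foreach { limit =>
>       val requested = conf
>         .getOption("spark.kubernetes.executor.request.cores")
>         .getOrElse(conf.get("spark.executor.cores", "1"))
>       require(parseCpu(requested) <= parseCpu(limit),
>         s"Executor cpu request ($requested) must not exceed " +
>           s"spark.kubernetes.executor.limit.cores ($limit)")
>     }
>   }
> }
> {code}
> With the example misconfiguration shown earlier, the `require` would throw
> an IllegalArgumentException at submission time, so the job fails immediately
> instead of retrying pod creation forever.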
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]