[
https://issues.apache.org/jira/browse/SPARK-38079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17966966#comment-17966966
]
Junghyun Kim commented on SPARK-38079:
--------------------------------------
I’ve encountered a similar issue before. In my case, Kubernetes kept retrying
the resource mount, and eventually the mount succeeded.
This issue happens because the necessary resources (like the configmap) are
created _after_ the pod is created, which introduces a race condition.
You can see this behavior in the Spark codebase:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L175]
Some resources can be created as {*}pre-resources{*}, before the pod is created
— but not all are.
For example, {{KerberosConfDriverFeatureStep}} creates its secret _after_ the
pod is created, because it only implements
{{getAdditionalKubernetesResources()}}:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala]
I think certain critical resources should instead implement
{{getAdditionalPreKubernetesResources()}} to ensure they are available _before_
pod creation.
But I’m not entirely sure which resources should follow this pattern:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala#L94]
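To sketch what that could look like: a feature step whose configmap must exist before the driver starts could publish it through the pre-resources hook instead of the post-pod one. This is a rough, untested sketch against the {{KubernetesFeatureConfigStep}} trait; the step name, configmap name, and contents below are made up, not taken from the Spark codebase:
{code:scala}
// Rough sketch (hypothetical step, not from the Spark codebase): emit the
// ConfigMap as a *pre*-resource, so the API server has it before the driver
// pod is created and the kubelet never races against ConfigMap creation.
import io.fabric8.kubernetes.api.model.{ConfigMapBuilder, HasMetadata}
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

class CriticalConfigFeatureStep extends KubernetesFeatureConfigStep {

  // Hypothetical ConfigMap; a real step would derive name and data
  // from its KubernetesConf.
  private def buildConfigMap(): HasMetadata =
    new ConfigMapBuilder()
      .withNewMetadata().withName("critical-conf").endMetadata()
      .addToData("example.conf", "key=value")
      .build()

  // Created before the driver pod (see KubernetesClientApplication).
  override def getAdditionalPreKubernetesResources(): Seq[HasMetadata] =
    Seq(buildConfigMap())

  // Nothing left for the after-pod path.
  override def getAdditionalKubernetesResources(): Seq[HasMetadata] = Nil

  // Unchanged here for brevity; a real step would mount the ConfigMap
  // as a volume in this method.
  override def configurePod(pod: SparkPod): SparkPod = pod
}
{code}
The volume-mounting logic in {{configurePod()}} would stay the same either way; only where the resource object is emitted moves, which is what closes the race window.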
In my case, Kubernetes eventually mounted the resources after retries, so the
impact was minor.
But in environments where this race causes real driver startup failures, it
would be worth fixing.
(As a side note, I heard this issue doesn't occur when using the Spark
Kubernetes Operator — though I haven’t tested it myself.)
> Not waiting for configmap before starting driver
> ------------------------------------------------
>
> Key: SPARK-38079
> URL: https://issues.apache.org/jira/browse/SPARK-38079
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.0, 3.2.1
> Reporter: Ben
> Priority: Major
>
> *The problem*
> When you spark-submit to Kubernetes in cluster mode:
> # Kubernetes creates the driver
> # Kubernetes creates a configmap that the driver depends on
> This is a race condition. If the configmap is not created quickly enough,
> then the driver will fail to start up properly.
> See [this stackoverflow post|https://stackoverflow.com/a/58508313] for an
> alternate description of this problem.
>
> *To Reproduce*
> # Download spark 3.2.0 or 3.2.1 from
> [https://spark.apache.org/downloads.html]
> # Create an image with
> {code:bash}
> bin/docker-image-tool.sh -r <repo> -t <tag> build{code}
> # Spark submit one of the examples to some kubernetes instance
> # Observe the race condition
--
This message was sent by Atlassian Jira
(v8.20.10#820010)