[ 
https://issues.apache.org/jira/browse/SPARK-38079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17966966#comment-17966966
 ] 

Junghyun Kim edited comment on SPARK-38079 at 6/13/25 6:03 AM:
---------------------------------------------------------------

I’ve encountered a similar issue before. In my case, Kubernetes kept retrying 
the resource mount, and eventually the mount succeeded.

This issue happens because the necessary resources (like the configmap) are 
created _after_ the pod is created, which introduces a race condition.

You can see this behavior in the Spark codebase:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L175]
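
For context, the submission flow in the linked {{Client.run()}} is roughly the following (a simplified sketch, paraphrased; names and client calls are approximate, not the literal code):
{code:scala}
// Simplified sketch of the driver submission ordering (not the literal Spark code).
val spec = builder.buildFromFeatures(conf, kubernetesClient)

// 1. Pre-resources are created before the driver pod exists.
kubernetesClient.resourceList(spec.driverPreKubernetesResources: _*).createOrReplace()

// 2. The driver pod is created; the kubelet may already start mounting its volumes.
val createdDriverPod = kubernetesClient.pods().create(spec.pod)

// 3. The remaining resources (e.g. the spark-conf ConfigMap, the Kerberos secret)
//    are created only after the pod, owner-referenced to it. If the kubelet tries
//    to mount one of them before this step completes, the race condition hits.
addOwnerReference(createdDriverPod, spec.driverKubernetesResources)
kubernetesClient.resourceList(spec.driverKubernetesResources: _*).createOrReplace()
{code}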

Some resources can be created as *pre-resources*, before the pod is created, but 
not all of them are.
For example, {{KerberosConfDriverFeatureStep}} creates its secret _after_ the 
pod is created, because it only implements 
{{getAdditionalKubernetesResources()}}:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala]
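
In rough terms, the step only overrides the post-pod hook, so its Secret is created after the driver pod. A minimal toy version (the trait and names below are stand-ins for illustration, not Spark's actual classes):
{code:scala}
import io.fabric8.kubernetes.api.model.{HasMetadata, SecretBuilder}

// Toy stand-in for Spark's KubernetesFeatureConfigStep, showing only the two hooks.
trait FeatureStep {
  def getAdditionalPreKubernetesResources(): Seq[HasMetadata] = Seq.empty // before the pod
  def getAdditionalKubernetesResources(): Seq[HasMetadata] = Seq.empty    // after the pod
}

// Current shape (simplified): the Kerberos-style step returns its Secret from the
// post-pod hook only, so the Secret does not exist yet when the pod is created.
class KerberosLikeStep(secretName: String, tokens: String) extends FeatureStep {
  override def getAdditionalKubernetesResources(): Seq[HasMetadata] = Seq(
    new SecretBuilder()
      .withNewMetadata().withName(secretName).endMetadata()
      .addToStringData("hadoop-tokens", tokens) // placeholder key
      .build())
}
{code}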

I think some critical resources should instead implement 
{{getAdditionalPreKubernetesResources()}} to ensure they are available before 
pod creation:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala#L94]
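
For reference, the linked builder code folds each step's contributions into the driver spec roughly as follows (paraphrased; field names are approximate), which is why anything returned from the pre hook gets created before the pod:
{code:scala}
// Paraphrased sketch of how KubernetesDriverBuilder accumulates resources per step.
val resolvedSpec = features.foldLeft(initialSpec) { case (spec, feature) =>
  spec.copy(
    pod = feature.configurePod(spec.pod),
    // collected here -> created before the driver pod
    driverPreKubernetesResources =
      spec.driverPreKubernetesResources ++ feature.getAdditionalPreKubernetesResources(),
    // collected here -> created after the driver pod, owner-referenced to it
    driverKubernetesResources =
      spec.driverKubernetesResources ++ feature.getAdditionalKubernetesResources())
}
{code}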

Of course, even if a resource is created early, that alone doesn't guarantee it 
will be successfully mounted by the time the pod starts; there could still be 
timing issues.

In my case, Kubernetes eventually handled the retries, so the impact was minor.
But if others are struggling with this more severely, I hope someone considers 
addressing it.
(As a side note, I heard that this issue doesn’t occur when using the Spark 
Kubernetes Operator — though I haven’t tested it myself.)



> Not waiting for configmap before starting driver
> ------------------------------------------------
>
>                 Key: SPARK-38079
>                 URL: https://issues.apache.org/jira/browse/SPARK-38079
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Ben
>            Priority: Major
>
> *The problem*
> When you spark-submit to kubernetes in cluster-mode:
>  # Kubernetes creates the driver
>  # Kubernetes creates a configmap that the driver depends on
> This is a race condition. If the configmap is not created quickly enough, 
> then the driver will fail to start up properly.
> See [this stackoverflow post|https://stackoverflow.com/a/58508313] for an 
> alternate description of this problem.
>  
> *To Reproduce*
>  # Download spark 3.2.0 or 3.2.1 from 
> [https://spark.apache.org/downloads.html]
>  # Create an image with 
> {code:bash}
> bin/docker-image-tool.sh{code}
>  # Spark submit one of the examples to some kubernetes instance
>  # Observe the race condition


