[ https://issues.apache.org/jira/browse/SPARK-38079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17966966#comment-17966966 ]
Junghyun Kim edited comment on SPARK-38079 at 6/13/25 6:03 AM:
---------------------------------------------------------------
I’ve encountered a similar issue before. In my case, Kubernetes kept retrying the resource mount, and the mount eventually succeeded.

This issue happens because the necessary resources (such as the ConfigMap) are created _after_ the driver pod, which introduces a race condition. You can see this behavior in the Spark codebase:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L175]

Some resources can be created as *pre-resources*, before the pod is created, but not all of them are. For example, {{KerberosConfDriverFeatureStep}} creates its secret _after_ the pod is created, because it only implements {{getAdditionalKubernetesResources()}}:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala]

I think such critical resources should instead implement {{getAdditionalPreKubernetesResources()}} to ensure they are available before pod creation:
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala#L94]

Of course, even if a resource is created early, that does not guarantee it will be successfully mounted when the pod starts; there could still be timing issues. In my case, Kubernetes eventually handled the retries, so the impact was minor. But if others are struggling with this more severely, I hope someone considers addressing it.

(As a side note, I heard that this issue doesn’t occur when using the Spark Kubernetes Operator, though I haven’t tested it myself.)
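To make the ordering concrete, here is a minimal, self-contained Scala sketch of the mechanism described above. This is not Spark's actual code: {{Resource}}, {{FeatureStep}}, and {{creationOrder}} are simplified stand-ins for fabric8's {{HasMetadata}}, Spark's {{KubernetesFeatureConfigStep}}, and the submission logic in {{KubernetesClientApplication}}, respectively.

```scala
// Stand-in for an arbitrary Kubernetes resource (the real code works
// with io.fabric8.kubernetes.api.model.HasMetadata).
final case class Resource(kind: String, name: String)

// Simplified mirror of Spark's feature-step interface: steps may emit
// resources created BEFORE the driver pod (pre-resources) and/or AFTER it.
trait FeatureStep {
  def getAdditionalPreKubernetesResources(): Seq[Resource] = Seq.empty
  def getAdditionalKubernetesResources(): Seq[Resource] = Seq.empty
}

// A step like KerberosConfDriverFeatureStep today: the secret is only
// emitted via the post-pod hook, so it races with pod startup.
object KerberosLikeStep extends FeatureStep {
  override def getAdditionalKubernetesResources(): Seq[Resource] =
    Seq(Resource("Secret", "kerberos-keytab"))
}

// The suggested change: emit the same secret as a pre-resource instead,
// so it exists before the driver pod tries to mount it.
object EarlySecretStep extends FeatureStep {
  override def getAdditionalPreKubernetesResources(): Seq[Resource] =
    Seq(Resource("Secret", "kerberos-keytab"))
}

// Simulated submission order: pre-resources, then the driver pod, then
// the remaining resources.
def creationOrder(steps: Seq[FeatureStep]): Seq[String] = {
  val pre  = steps.flatMap(_.getAdditionalPreKubernetesResources()).map(_.name)
  val post = steps.flatMap(_.getAdditionalKubernetesResources()).map(_.name)
  pre ++ Seq("driver-pod") ++ post
}
```

With the Kerberos-like step, {{creationOrder}} yields the pod before the secret (the race); with the early-secret variant, the secret comes first. Creating the resource early still doesn't guarantee the kubelet sees it on the first mount attempt, but it removes the window where the resource does not exist at all.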
> Not waiting for configmap before starting driver
> ------------------------------------------------
>
>                 Key: SPARK-38079
>                 URL: https://issues.apache.org/jira/browse/SPARK-38079
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Ben
>            Priority: Major
>
> *The problem*
> When you spark-submit to kubernetes in cluster-mode:
> # Kubernetes creates the driver
> # Kubernetes creates a configmap that the driver depends on
> This is a race condition. If the configmap is not created quickly enough, then the driver will fail to start up properly.
> See [this stackoverflow post|https://stackoverflow.com/a/58508313] for an alternate description of this problem.
>
> *To Reproduce*
> # Download spark 3.2.0 or 3.2.1 from [https://spark.apache.org/downloads.html]
> # Create an image with
> {code}
> bin/docker-image-tool.sh
> {code}
> # Spark submit one of the examples to some kubernetes instance
> # Observe the race condition