Lucca Sergi created SPARK-46310:
-----------------------------------

             Summary: Cannot deploy Spark application using VolcanoFeatureStep to specify podGroupTemplate file
                 Key: SPARK-46310
                 URL: https://issues.apache.org/jira/browse/SPARK-46310
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.4.1
            Reporter: Lucca Sergi
I'm trying to deploy a Spark application (version 3.4.1) on Kubernetes using Volcano as the scheduler. I define a VolcanoJob that represents the Spark driver: it has a single task whose pod specification includes the driver container, which invokes the spark-submit command. Following the official Spark documentation ("[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]"), I set the configuration parameters needed to use Volcano as the scheduler for my Spark workload:

{code:java}
/opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
  --class "org.apache.spark.examples.SparkPi" \
  --conf spark.executor.instances="1" \
  --conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="/var/template/podgroup.yaml" \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
{code}

In the block above, I omitted some Kubernetes configuration parameters that aren't relevant to this example. The parameter *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}* points to a file mounted in the driver container, with the following contents:

{code:yaml}
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: some-existing-queue
{code}

I manually verified that the file "/var/template/podgroup.yaml" exists in the container before the "spark-submit" command is issued. I also granted all the RBAC permissions the driver pod needs to interact with the relevant Kubernetes objects (pods, VolcanoJobs, podgroups, queues, etc.).
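For reference, the RBAC grant is along these lines (a sketch only: the role name, namespace placeholder, and exact verb lists are illustrative, not the manifests from my environment):

{code:yaml}
# Illustrative Role sketch; names and verb lists are examples, not my actual manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-volcano   # illustrative name
  namespace: <spark-namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["scheduling.volcano.sh"]
    resources: ["podgroups", "queues"]
    verbs: ["get", "list", "watch", "create", "delete"]
{code}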
When I execute this VolcanoJob, only the driver pod is created, and inspecting its logs I see the following error:

{code:java}
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:838)
{code}

The error is triggered when the driver attempts to deploy the executors of my Spark application: the Volcano admission webhook rejects the executor pods because the PodGroup "spark-5ad570e340934d3997065fa6d504910e-podgroup" cannot be found. I was expecting the driver and executors to be assigned to the same PodGroup object, created by the VolcanoFeatureStep from the template file I provided through *{{spark.kubernetes.scheduler.volcano.podGroupTemplateFile}}*. That would give proper batch scheduling of my Spark application, since the driver and executor pods would reside in the same pod group and be scheduled together by Volcano.
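For context on what I expected: as I understand it, Volcano associates a pod with its PodGroup via the {{scheduling.k8s.io/group-name}} annotation, which the admission webhook then resolves. The sketch below is illustrative (not taken from my cluster) of what I expected both driver and executor pods to carry:

{code:yaml}
# Illustrative sketch, not an actual manifest from my environment:
# the annotation I expected Volcano/Spark to place on both the driver
# and executor pods, pointing at a PodGroup created from my template.
apiVersion: v1
kind: Pod
metadata:
  name: volcano-spark-1-driver   # likewise on each executor pod
  annotations:
    scheduling.k8s.io/group-name: spark-5ad570e340934d3997065fa6d504910e-podgroup
spec:
  schedulerName: volcano
{code}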
But instead, only the driver pod is deployed, and the error above appears in its logs. The documentation "[Using Volcano as Customized Scheduler for Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-volcano-as-customized-scheduler-for-spark-on-kubernetes]" leads me to understand that, by providing the PodGroup template file, my Spark application (i.e., driver and executors) would be allocated to a single PodGroup object following the specification I provided. That doesn't seem to be the case: the PodGroup isn't created from the provided template, and the executors cannot be created.

Some more details about the environment I used:
- Volcano version: v1.8.0
- Spark version: 3.4.1
- Kubernetes version: v1.26.7
- Cloud provider: GCP

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org