Anton Ippolitov created SPARK-40817:
---------------------------------------
Summary: Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
Key: SPARK-40817
URL: https://issues.apache.org/jira/browse/SPARK-40817
Project: Spark
Issue Type: Bug
Components: Kubernetes, Spark Submit
Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.0.0
Environment: Spark 3.1.3
Kubernetes 1.21
Ubuntu 20.04.1
Reporter: Anton Ippolitov
Attachments: image-2022-10-17-10-44-46-862.png


I discovered that remote URIs in {{spark.jars}} get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.

h1. Reproduction

Here is an example reproduction, with S3 used for remote JAR storage. I first created two JARs:
 * {{/opt/my-local-jar.jar}} on the host where I'm running spark-submit
 * {{s3a://$BUCKET_NAME/my-remote-jar.jar}} in an S3 bucket I own

I then ran the following spark-submit command, with {{spark.jars}} pointing to both the local JAR and the remote JAR:
{code:java}
spark-submit \
  --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
  --deploy-mode cluster \
  --name=spark-submit-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
  --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
  [...]
  /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
{code}
Once the driver and the executors had started, I confirmed that there was no trace of {{my-remote-jar.jar}} anymore. For example, looking at the Spark History Server, I could see that {{spark.jars}} had been transformed into this:

!image-2022-10-17-10-17-03-697.png|width=744,height=60!

There was no mention of {{my-remote-jar.jar}} on the classpath or anywhere else.

Note that I ran all my tests with Spark 3.1.3; however, the code which handles these dependencies appears to be the same in more recent versions of Spark as well.

h1. Root cause description

I believe the issue comes from [this logic|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L163-L186] in {{BasicDriverFeatureStep.getAdditionalPodSystemProperties()}}. Specifically, this logic takes all URIs in {{spark.jars}}, [keeps only the local URIs|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L165], [uploads|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L173] those local files to {{spark.kubernetes.file.upload.path}} and then [*replaces*|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L182] the value of {{spark.jars}} with those newly uploaded JARs. By overwriting the previous value of {{spark.jars}}, we lose all remote JARs that were specified there. Consequently, when the Spark driver starts afterwards, it only downloads the JARs that were uploaded to {{spark.kubernetes.file.upload.path}}.
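For illustration, here is a self-contained sketch of that overwriting behavior (simplified: {{isLocal}} and {{upload}} are hypothetical stand-ins for {{KubernetesUtils.isLocalAndResolvable}} and the real upload step):
{code:java}
object JarsOverwriteSketch {
  // Hypothetical stand-in for KubernetesUtils.isLocalAndResolvable:
  // treats bare paths and file:// URIs as local, everything else as remote.
  private def isLocal(uri: String): Boolean =
    !uri.contains("://") || uri.startsWith("file://")

  // Hypothetical stand-in for the upload step: pretends the local JAR was
  // copied under spark.kubernetes.file.upload.path.
  private def upload(uri: String, uploadPath: String): String =
    uploadPath.stripSuffix("/") + "/" + uri.split('/').last

  def main(args: Array[String]): Unit = {
    val jars = Seq("/opt/my-local-jar.jar", "s3a://my-bucket/my-remote-jar.jar")
    val uploadPath = "s3a://my-bucket/my-upload-path/"

    // Current behavior: keep only the local URIs, upload them, then
    // *replace* spark.jars with just the uploaded copies.
    val replaced = jars.filter(isLocal).map(upload(_, uploadPath))

    // Prints only the uploaded local JAR: my-remote-jar.jar is gone.
    println(s"spark.jars = ${replaced.mkString(",")}")
  }
}
{code}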
h1. Possible solution

I think a possible fix would be to not fully overwrite the value of {{spark.jars}} but to make sure that the remote URIs are kept there. The new logic would look something like this:
{code:java}
Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
  val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
  // Save remote URIs
  val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
  val value = {
    if (key == ARCHIVES) {
      uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
    } else {
      uris
    }
  }
  val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
  if (resolved.nonEmpty) {
    val resolvedValue = if (key == ARCHIVES) {
      uris.zip(resolved).map { case (uri, r) =>
        UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
      }
    } else {
      resolved
    }
    // Don't forget to re-append the remote URIs that were filtered out above
    additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
  }
}
{code}
I tested it out in my environment and it worked: {{s3a://$BUCKET_NAME/my-remote-jar.jar}} was kept in {{spark.jars}} and the Spark driver was able to download it. I don't know the codebase well enough, though, to assess whether I am missing something else or whether this is enough to fix the issue.
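As a quick illustration of the expected end state after this change (the values below are made up, including the upload subdirectory name):
{code:java}
object MergeCheck {
  def main(args: Array[String]): Unit = {
    // Illustrative values: the uploaded copy of the local JAR plus the
    // untouched remote URI that the patched logic re-appends.
    val resolvedValue = Seq("s3a://my-bucket/my-upload-path/spark-upload-1234/my-local-jar.jar")
    val remoteUris = Seq("s3a://my-bucket/my-remote-jar.jar")

    val sparkJars = (resolvedValue ++ remoteUris).mkString(",")
    assert(sparkJars.contains("my-remote-jar.jar"), "remote JAR must survive the rewrite")
    println(s"spark.jars = $sparkJars")
  }
}
{code}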