Dongjoon Hyun resolved SPARK-40817.
-----------------------------------
    Fix Version/s: 3.3.2
                   3.4.0
       Resolution: Fixed

> Remote spark.jars URIs ignored for Spark on Kubernetes in cluster mode
> -----------------------------------------------------------------------
>
>                 Key: SPARK-40817
>                 URL: https://issues.apache.org/jira/browse/SPARK-40817
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Submit
>    Affects Versions: 3.0.0, 3.1.3, 3.2.2, 3.3.0, 3.4.0
>         Environment: Spark 3.1.3
>                      Kubernetes 1.21
>                      Ubuntu 20.04.1
>            Reporter: Anton Ippolitov
>            Priority: Major
>             Fix For: 3.3.2, 3.4.0
>
>         Attachments: image-2022-10-17-10-44-46-862.png
>
>
> I discovered that remote URIs in {{spark.jars}} get discarded when launching Spark on Kubernetes in cluster mode via spark-submit.
> h1. Reproduction
> Here is an example reproduction with S3 used for remote JAR storage. I first created two JARs:
> * {{/opt/my-local-jar.jar}} on the host where I'm running spark-submit
> * {{s3a://$BUCKET_NAME/my-remote-jar.jar}} in an S3 bucket I own
> I then ran the following spark-submit command, with {{spark.jars}} pointing to both the local JAR and the remote JAR:
> {code:java}
> spark-submit \
>   --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \
>   --deploy-mode cluster \
>   --name=spark-submit-test \
>   --class org.apache.spark.examples.SparkPi \
>   --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \
>   --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \
>   [...]
>   /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
> {code}
> Once the driver and the executors had started, I confirmed that there was no trace of {{my-remote-jar.jar}} anymore. For example, looking at the Spark History Server, I could see that {{spark.jars}} had been transformed into this:
> !image-2022-10-17-10-44-46-862.png|width=991,height=80!
> There was no mention of {{my-remote-jar.jar}} on the classpath or anywhere else.
> Note that I ran all tests with Spark 3.1.3; however, the code that handles these dependencies appears to be the same in more recent versions of Spark as well.
> h1. Root cause description
> I believe the issue comes from [this logic|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L163-L186] in {{BasicDriverFeatureStep.getAdditionalPodSystemProperties()}}.
> Specifically, this logic takes all URIs in {{spark.jars}}, [filters only on local URIs|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L165], [uploads|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L173] those local files to {{spark.kubernetes.file.upload.path}}, and then [*replaces*|https://github.com/apache/spark/blob/d1f8a503a26bcfb4e466d9accc5fa241a7933667/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L182] the value of {{spark.jars}} with the newly uploaded JARs. By overwriting the previous value of {{spark.jars}}, we lose all mention of the remote JARs that were previously specified there.
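> In abridged form (only the {{JARS}} case shown, simplified from the linked code), the current logic does something like this:
> {code:java}
> // Abridged sketch of the current logic (JARS case only):
> val uris = conf.get(JARS).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
> // Only local URIs survive the filter above; remote URIs are silently dropped.
> val resolved = KubernetesUtils.uploadAndTransformFileUris(uris, Some(conf.sparkConf))
> if (resolved.nonEmpty) {
>   // spark.jars is overwritten with just the uploaded JARs, so
>   // s3a://$BUCKET_NAME/my-remote-jar.jar disappears from the config.
>   additionalProps.put(JARS.key, resolved.mkString(","))
> }
> {code}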
> Consequently, when the Spark driver starts afterwards, it only downloads the JARs from {{spark.kubernetes.file.upload.path}}.
> h1. Possible solution
> I think a possible fix would be to not fully overwrite the value of {{spark.jars}}, but to make sure that remote URIs are kept there.
> The new logic would look something like this:
> {code:java}
> Seq(JARS, FILES, ARCHIVES, SUBMIT_PYTHON_FILES).foreach { key =>
>   val uris = conf.get(key).filter(uri => KubernetesUtils.isLocalAndResolvable(uri))
>   // Save the remote URIs
>   val remoteUris = conf.get(key).filter(uri => !KubernetesUtils.isLocalAndResolvable(uri))
>   val value = {
>     if (key == ARCHIVES) {
>       uris.map(UriBuilder.fromUri(_).fragment(null).build()).map(_.toString)
>     } else {
>       uris
>     }
>   }
>   val resolved = KubernetesUtils.uploadAndTransformFileUris(value, Some(conf.sparkConf))
>   if (resolved.nonEmpty) {
>     val resolvedValue = if (key == ARCHIVES) {
>       uris.zip(resolved).map { case (uri, r) =>
>         UriBuilder.fromUri(r).fragment(new java.net.URI(uri).getFragment).build().toString
>       }
>     } else {
>       resolved
>     }
>     // Don't forget to add the remote URIs back
>     additionalProps.put(key.key, (resolvedValue ++ remoteUris).mkString(","))
>   }
> }
> {code}
> I tested this out in my environment and it worked: {{s3a://$BUCKET_NAME/my-remote-jar.jar}} was kept in {{spark.jars}} and the Spark driver was able to download it.
> I don't know the codebase well enough, though, to assess whether I am missing something else or whether this is enough to fix the issue.
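> To make the intended merge behaviour concrete, here is a small standalone illustration (hypothetical bucket names; {{isLocal}} and {{uploaded}} are crude stand-ins for {{KubernetesUtils.isLocalAndResolvable}} and {{uploadAndTransformFileUris}}, not the real implementations):
> {code:java}
> object MergeIllustration extends App {
>   val jars = Seq("/opt/my-local-jar.jar", "s3a://bucket/my-remote-jar.jar")
>
>   // Crude stand-in for KubernetesUtils.isLocalAndResolvable: treat bare paths
>   // and file: URIs as local, anything with another scheme as remote.
>   def isLocal(uri: String): Boolean = !uri.contains("://") || uri.startsWith("file:")
>
>   val (local, remote) = jars.partition(isLocal)
>
>   // Stand-in for uploadAndTransformFileUris: pretend the local JARs were
>   // uploaded to spark.kubernetes.file.upload.path.
>   val uploaded = local.map(p => s"s3a://bucket/my-upload-path/${p.split('/').last}")
>
>   // With the fix, spark.jars keeps both the uploaded and the remote JARs.
>   val sparkJars = (uploaded ++ remote).mkString(",")
>   assert(sparkJars == "s3a://bucket/my-upload-path/my-local-jar.jar,s3a://bucket/my-remote-jar.jar")
>   println(sparkJars)
> }
> {code}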