This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 37b7d32  [SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn
37b7d32 is described below

commit 37b7d32dbd3546c303d31305ed40c6435390bb4d
Author: Shanyu Zhao <shz...@microsoft.com>
AuthorDate: Mon Jun 8 15:55:49 2020 -0500

    [SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn

    ### What changes were proposed in this pull request?
    When using spark-submit to submit a pyspark app on Yarn, with this set in spark-env.sh:

    export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip

    you can see that these local archives are still uploaded to the Yarn distributed cache:

    yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip

    This PR fixes the issue by checking the files specified in PYSPARK_ARCHIVES_PATH: if they are local archives, they are not distributed to the Yarn dist cache.

    ### Why are the changes needed?
    So that pyspark apps support local pyspark archives set in PYSPARK_ARCHIVES_PATH.

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    Existing tests and manual tests.

    Closes #27598 from shanyu/shanyu-30845.
Authored-by: Shanyu Zhao <shz...@microsoft.com>
Signed-off-by: Thomas Graves <tgra...@apache.org>
---
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index fc429d6..7b12119 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -635,7 +635,12 @@ private[spark] class Client(
       distribute(args.primaryPyFile, appMasterOnly = true)
     }

-    pySparkArchives.foreach { f => distribute(f) }
+    pySparkArchives.foreach { f =>
+      val uri = Utils.resolveURI(f)
+      if (uri.getScheme != Utils.LOCAL_SCHEME) {
+        distribute(f)
+      }
+    }

     // The python files list needs to be treated especially. All files that are not an
     // archive need to be placed in a subdirectory that will be added to PYTHONPATH.
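For context, the check added by the diff above can be sketched as a standalone snippet. This is not Spark's actual code: `LocalArchiveFilter`, `resolveURI`, and `archivesToDistribute` are hypothetical names, and the `"local"` constant stands in for Spark's internal `Utils.LOCAL_SCHEME`. The idea is that a `local:` URI marks an archive as already present on every node, so it is filtered out before distribution to the YARN cache.

```scala
import java.net.URI

// A sketch (not Spark's actual implementation) of the scheme check applied
// in this commit: archives whose URI uses the "local" scheme are assumed to
// exist on every node, so they are skipped instead of being uploaded to the
// YARN distributed cache.
object LocalArchiveFilter {
  // Stand-in for Spark's Utils.LOCAL_SCHEME constant.
  val LocalScheme = "local"

  /** Resolve a path string to a URI, defaulting to the "file" scheme. */
  def resolveURI(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri else new URI("file", null, path, null)
  }

  /** Keep only the archives that actually need to be distributed. */
  def archivesToDistribute(archives: Seq[String]): Seq[String] =
    archives.filterNot(a => resolveURI(a).getScheme == LocalScheme)
}

object Main extends App {
  val archives = Seq(
    "local:/opt/spark/python/lib/pyspark.zip",   // skipped: already on nodes
    "hdfs:///user/test1/extra.zip",              // kept: remote archive
    "/opt/spark/python/lib/py4j.zip")            // kept: plain path -> file scheme
  println(LocalArchiveFilter.archivesToDistribute(archives))
}
```

Before this fix, the equivalent of `archivesToDistribute` was effectively an identity function for pyspark archives, which is why `local:` entries from PYSPARK_ARCHIVES_PATH still showed up in the staging directory upload log.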