This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 37b7d32  [SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn
37b7d32 is described below

commit 37b7d32dbd3546c303d31305ed40c6435390bb4d
Author: Shanyu Zhao <shz...@microsoft.com>
AuthorDate: Mon Jun 8 15:55:49 2020 -0500

    [SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn
    
    ### What changes were proposed in this pull request?
    Use spark-submit to submit a pyspark app on Yarn, and set this in spark-env.sh:
    export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip
    
    You can see that these local archives are still uploaded to the Yarn distributed cache:
    yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip
    
    This PR fixes the issue by checking the files specified in PYSPARK_ARCHIVES_PATH: if an archive uses the local: scheme, it is not distributed to the Yarn distributed cache.
    
    ### Why are the changes needed?
    For pyspark apps to support local pyspark archives set in PYSPARK_ARCHIVES_PATH.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Existing tests and manual tests.
    
    Closes #27598 from shanyu/shanyu-30845.
    
    Authored-by: Shanyu Zhao <shz...@microsoft.com>
    Signed-off-by: Thomas Graves <tgra...@apache.org>
---
 .../yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala  | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index fc429d6..7b12119 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -635,7 +635,12 @@ private[spark] class Client(
       distribute(args.primaryPyFile, appMasterOnly = true)
     }
 
-    pySparkArchives.foreach { f => distribute(f) }
+    pySparkArchives.foreach { f =>
+      val uri = Utils.resolveURI(f)
+      if (uri.getScheme != Utils.LOCAL_SCHEME) {
+        distribute(f)
+      }
+    }
 
     // The python files list needs to be treated especially. All files that are not an
     // archive need to be placed in a subdirectory that will be added to PYTHONPATH.
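
The gist of the change above is a URI-scheme check: archives whose URI scheme is Spark's `local:` (meaning the file already exists at the same path on every node) are skipped instead of uploaded to the YARN distributed cache. A minimal Python sketch of the same idea, for illustration only (the helper names here are hypothetical; Spark's actual code uses `Utils.resolveURI` and `Utils.LOCAL_SCHEME` in Scala, as shown in the diff):

```python
from urllib.parse import urlparse

def is_local_archive(path: str) -> bool:
    """Return True when the path uses the 'local:' scheme, i.e. the
    archive is already present on every node and should not be
    re-uploaded to the YARN distributed cache."""
    return urlparse(path).scheme == "local"

# Example values mirroring PYSPARK_ARCHIVES_PATH entries.
archives = [
    "local:/opt/spark/python/lib/pyspark.zip",
    "hdfs:///user/test1/py4j-0.10.7-src.zip",
]

# Only non-local archives would be handed to the distributed cache.
to_distribute = [a for a in archives if not is_local_archive(a)]
```

Here `to_distribute` keeps only the HDFS entry; the `local:` archive is left alone, which is exactly the behavior the patch introduces for `pySparkArchives`.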


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
