Jonathan Kelly created SPARK-10789:
--------------------------------------

             Summary: Cluster mode SparkSubmit classpath only includes Spark classpath
                 Key: SPARK-10789
                 URL: https://issues.apache.org/jira/browse/SPARK-10789
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.0
            Reporter: Jonathan Kelly


When using cluster deploy mode, the classpath of the SparkSubmit process that 
gets launched only includes the Spark assembly and not 
spark.driver.extraClassPath. This is of course by design, since the driver 
actually runs on the cluster and not inside the SparkSubmit process.

However, if the SparkSubmit process, minimal as it may be, needs any extra 
libraries that are not part of the Spark assembly, there is no good way to 
include them. (I say "no good way" because including them in the 
SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
include them, but this is not acceptable because this environment variable has 
long been deprecated, and it prevents the use of spark.driver.extraClassPath.)

An example of when this matters is on Amazon EMR when using an S3 path for the 
application JAR and running in yarn-cluster mode. The SparkSubmit process needs 
the EmrFileSystem implementation and its dependencies in the classpath in order 
to download the application JAR from S3, so it fails with a 
ClassNotFoundException. (EMR currently gets around this by setting 
SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
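
To illustrate the failure mode: the stack trace below shows Hadoop resolving the filesystem implementation by class name (Configuration.getClassByName), so the lookup can only succeed if that class is on the classpath of the JVM doing the lookup. The following is a minimal, simplified sketch of such a by-name lookup in plain Java. It is not Hadoop's actual code, and the class FileSystemLookupSketch and method isOnClasspath are hypothetical names used only for illustration.

```java
// Hypothetical illustration only; Hadoop's real lookup lives in
// org.apache.hadoop.conf.Configuration and is more involved.
class FileSystemLookupSketch {

    // Returns true if the named class can be found by this JVM's class loader,
    // i.e. if it is on the classpath the process was launched with.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, FileSystemLookupSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the exception in this report.
        String impl = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem";
        System.out.println(impl + " loadable: " + isOnClasspath(impl));
    }
}
```

Run inside a JVM whose classpath contains only the Spark assembly, a check like this would be expected to report that the EMR class is not loadable, which matches the ClassNotFoundException seen in the SparkSubmit process.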

I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
classpath whether it's client mode or cluster mode, and this seems to work, but 
I don't know if there is any downside to this.
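
A rough sketch of the idea behind that change, under stated assumptions: this is not the actual SparkSubmitCommandBuilder code or the patch I tried, just a self-contained model of the decision. The class LauncherClasspathSketch, the method launcherClasspath, the flag includeExtraInClusterMode, the ordering of entries, and all paths are hypothetical; only the spark.driver.extraClassPath key comes from Spark.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of how the classpath of the local SparkSubmit JVM is chosen.
class LauncherClasspathSketch {

    // Builds the classpath string for the SparkSubmit process.
    // includeExtraInClusterMode=false models the behavior described in this report;
    // includeExtraInClusterMode=true models the proposed change.
    static String launcherClasspath(String assemblyJar,
                                    Map<String, String> conf,
                                    String deployMode,
                                    boolean includeExtraInClusterMode) {
        List<String> entries = new ArrayList<>();
        String extra = conf.get("spark.driver.extraClassPath");
        boolean clientMode = !"cluster".equals(deployMode);

        // The extra classpath is added in client mode, and in cluster mode
        // only when the proposed change is enabled.
        if (extra != null && !extra.isEmpty() && (clientMode || includeExtraInClusterMode)) {
            entries.add(extra); // placing it before the assembly is an assumption of this sketch
        }
        entries.add(assemblyJar);
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.driver.extraClassPath", "/hypothetical/emrfs/lib/*");
        String assembly = "/hypothetical/spark/lib/spark-assembly.jar";

        // Behavior described in this report: extra classpath is dropped in cluster mode.
        System.out.println(launcherClasspath(assembly, conf, "cluster", false));
        // Proposed behavior: extra classpath is included regardless of deploy mode.
        System.out.println(launcherClasspath(assembly, conf, "cluster", true));
    }
}
```

The open question remains whether putting the driver's extra classpath on the SparkSubmit JVM in cluster mode could have side effects, for example entries that exist on cluster nodes but not on the submitting host.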

Example that fails on emr-4.0.0 (if you switch to setting 
spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH):

    spark-submit --deploy-mode cluster \
      --class org.apache.spark.examples.JavaWordCount \
      s3://my-bucket/spark-examples.jar \
      s3://my-bucket/word-count-input.txt

Resulting Exception:
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
        at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
        at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
        at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
        at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
        ... 27 more


