[ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066797#comment-15066797 ]
Jonathan Kelly commented on SPARK-10789:
----------------------------------------

[~roireshef], sure, here's a patch for my workaround: https://issues.apache.org/jira/secure/attachment/12778871/SPARK-10789.diff

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---------------------------------------------------------------
>
>                 Key: SPARK-10789
>                 URL: https://issues.apache.org/jira/browse/SPARK-10789
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0
>            Reporter: Jonathan Kelly
>         Attachments: SPARK-10789.diff
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that
> gets launched only includes the Spark assembly and not
> spark.driver.extraClassPath. This is of course by design, since the driver
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra
> libraries that are not part of the Spark assembly, there is no good way to
> include them. (I say "no good way" because including them in the
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to
> include them, but this is not acceptable because this environment variable
> has long been deprecated, and it prevents the use of
> spark.driver.extraClassPath.)
>
> An example of when this matters is on Amazon EMR when using an S3 path for
> the application JAR and running in yarn-cluster mode. The SparkSubmit process
> needs the EmrFileSystem implementation and its dependencies in the classpath
> in order to download the application JAR from S3, so it fails with a
> ClassNotFoundException. (EMR currently gets around this by setting
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
>
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra
> classpath whether it's client mode or cluster mode, and this seems to work,
> but I don't know if there is any downside to this.
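The deploy-mode-dependent behavior described in the issue can be sketched as a small toy in Java. Note this is a paraphrase of the issue text, not the actual SparkSubmitCommandBuilder source or the attached diff; the class, method, and jar path below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathSketch {
    /**
     * Current (reported) behavior: the driver extra classpath is only
     * honored when launching in client mode. The workaround described in
     * the issue amounts to dropping the clientMode check so the
     * SparkSubmit process also sees the extra entries in cluster mode.
     */
    static List<String> buildClasspath(boolean clientMode, String sparkAssembly,
                                       String driverExtraClassPath) {
        List<String> cp = new ArrayList<>();
        cp.add(sparkAssembly);
        if (clientMode && driverExtraClassPath != null) {
            cp.add(driverExtraClassPath);
        }
        return cp;
    }

    public static void main(String[] args) {
        String emrFsJar = "/usr/share/aws/emr/emrfs/lib/emrfs.jar"; // hypothetical path
        // Client mode: both entries. Cluster mode: assembly only, so any
        // class needed to submit (e.g. EmrFileSystem) is missing.
        System.out.println(buildClasspath(true, "spark-assembly.jar", emrFsJar));
        System.out.println(buildClasspath(false, "spark-assembly.jar", emrFsJar));
    }
}
```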
> Example that fails on emr-4.0.0 (if you switch to setting
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH):
>
> spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
>
> Resulting Exception:
>
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>     at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>     at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>     at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>     at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>     at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>     at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>     at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>     at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
>     ... 27 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
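The failing step in the trace above is Hadoop resolving the FileSystem implementation for the s3:// scheme by class name, which requires that class to already be on the submitting JVM's classpath. A minimal sketch of that resolution (the helper is hypothetical, not Hadoop's actual code):

```java
public class FsResolveSketch {
    /**
     * Hadoop's Configuration.getClass ultimately does a by-name class
     * lookup like this one; if the implementation jar is not on the
     * classpath, the lookup fails. (In Hadoop the ClassNotFoundException
     * is wrapped in the RuntimeException seen in the trace above.)
     */
    static Class<?> resolveFs(String implClassName) {
        try {
            return Class.forName(implClassName);
        } catch (ClassNotFoundException e) {
            return null; // implementation jar missing from the classpath
        }
    }

    public static void main(String[] args) {
        // A JDK class always resolves; the EMR class does not in a plain JVM.
        System.out.println(resolveFs("java.lang.String") != null);
        System.out.println(resolveFs("com.amazon.ws.emr.hadoop.fs.EmrFileSystem") != null);
    }
}
```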