[ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074246#comment-15074246 ]

Jonathan Kelly commented on SPARK-10789:
----------------------------------------

Thanks, [~srowen], that makes sense and was something else I was considering as 
well. I suppose it doesn't necessarily make sense to add a new Spark property 
when this is really more of a Hadoop-related issue than a Spark-related one (in 
that the extra library needed is for a custom Hadoop FileSystem).

One downside, though, is that the EMRFS libraries would then be duplicated
inside the Spark assembly, making it even more massive than it already is.
But the assembly already contains so much duplication (of other jars already
present elsewhere on the cluster) that this doesn't seem so bad. Also, I
think this would only require some changes to the pom.xml rather than a code
patch, which is nice. Lastly, I saw that removing the need for a Spark
assembly is under consideration for Spark 2.x, so hopefully any downside to
adding more libraries to the assembly now will be mitigated once the assembly
is no longer necessary; at that point we could just make sure that the EMRFS
libraries are in whatever list of jars is included in the Spark classpath.
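
To make the pom.xml idea concrete, a minimal sketch of the kind of change
follows. The coordinates below are placeholders (the EMRFS jars are not
published to Maven Central), so a real change would need to point at
wherever the artifacts actually live:

    <!-- Hypothetical dependency added to the assembly module's pom.xml.
         The groupId/artifactId/version are placeholders, not real
         published coordinates. -->
    <dependency>
      <groupId>com.amazon.emr</groupId>
      <artifactId>emrfs-hadoop</artifactId>
      <version>${emrfs.version}</version>
    </dependency>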

And yes, the title of this JIRA issue has been bugging me too. I'm not sure why 
I gave it such a specific title without referencing the actual problem. I'll 
fix it.

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---------------------------------------------------------------
>
>                 Key: SPARK-10789
>                 URL: https://issues.apache.org/jira/browse/SPARK-10789
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Jonathan Kelly
>         Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
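> For illustration, the configuration-based alternative to SPARK_CLASSPATH
> looks like the following, where /path/to/extra/* is a placeholder for the
> directory holding the extra jars. The bug is that in cluster mode this
> setting never reaches the SparkSubmit process itself:
>
>   spark-submit --deploy-mode cluster \
>     --conf spark.driver.extraClassPath='/path/to/extra/*' \
>     --conf spark.executor.extraClassPath='/path/to/extra/*' \
>     --class org.apache.spark.examples.JavaWordCount \
>     s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt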
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
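> A minimal sketch of that change, based on the general shape of
> buildSparkSubmitCommand in the launcher module around 1.5/1.6 (exact
> method and constant names may differ slightly between versions):
>
>   // launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java
>   Map<String, String> config = getEffectiveConfig();
>   boolean isClientMode = isClientMode(config);
>
>   // Current behavior: the driver extra classpath is applied only in
>   // client mode, so the SparkSubmit JVM never sees it in cluster mode.
>   // String extraClassPath =
>   //     isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
>
>   // Proposed: apply it in cluster mode too, so the SparkSubmit process
>   // can load classes like EmrFileSystem when distributing the app jar.
>   String extraClassPath = config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH);
>
>   List<String> cmd = buildJavaCommand(extraClassPath);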
> Example that fails on emr-4.0.0 (if you switch to setting
> spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH):
>
>   spark-submit --deploy-mode cluster \
>     --class org.apache.spark.examples.JavaWordCount \
>     s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>       at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>       at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>       at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>       at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>       at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>       at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>       at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>       at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>       at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>       at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>       at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
>       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
>       ... 27 more


