[ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066797#comment-15066797 ]
Jonathan Kelly commented on SPARK-10789:
----------------------------------------

[~roireshef], sure, here's a patch for my workaround: https://issues.apache.org/jira/secure/attachment/12778871/SPARK-10789.diff

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---------------------------------------------------------------
>
>                 Key: SPARK-10789
>                 URL: https://issues.apache.org/jira/browse/SPARK-10789
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0
>            Reporter: Jonathan Kelly
>         Attachments: SPARK-10789.diff
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that
> gets launched only includes the Spark assembly and not
> spark.driver.extraClassPath. This is of course by design, since the driver
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra
> libraries that are not part of the Spark assembly, there is no good way to
> include them. (I say "no good way" because including them in the
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to
> include them, but this is not acceptable because this environment variable
> has long been deprecated, and it prevents the use of
> spark.driver.extraClassPath.)
>
> An example of when this matters is on Amazon EMR when using an S3 path for
> the application JAR and running in yarn-cluster mode. The SparkSubmit process
> needs the EmrFileSystem implementation and its dependencies in the classpath
> in order to download the application JAR from S3, so it fails with a
> ClassNotFoundException. (EMR currently gets around this by setting
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
>
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra
> classpath whether it's client mode or cluster mode, and this seems to work,
> but I don't know if there is any downside to this.
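The deploy-mode-dependent behavior described in the issue can be sketched as a small toy in Java. Note this is a paraphrase of the issue text, not the actual SparkSubmitCommandBuilder source or the attached diff; the class, method, and jar path below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathSketch {
    /**
     * Current (reported) behavior: the driver extra classpath is only
     * honored when launching in client mode. The workaround described in
     * the issue amounts to dropping the clientMode check so the
     * SparkSubmit process also sees the extra entries in cluster mode.
     */
    static List<String> buildClasspath(boolean clientMode, String sparkAssembly,
                                       String driverExtraClassPath) {
        List<String> cp = new ArrayList<>();
        cp.add(sparkAssembly);
        if (clientMode && driverExtraClassPath != null) {
            cp.add(driverExtraClassPath);
        }
        return cp;
    }

    public static void main(String[] args) {
        String emrFsJar = "/usr/share/aws/emr/emrfs/lib/emrfs.jar"; // hypothetical path
        // Client mode: both entries. Cluster mode: assembly only, so any
        // class needed to submit (e.g. EmrFileSystem) is missing.
        System.out.println(buildClasspath(true, "spark-assembly.jar", emrFsJar));
        System.out.println(buildClasspath(false, "spark-assembly.jar", emrFsJar));
    }
}
```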
> Example that fails on emr-4.0.0 (if you switch to setting
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH):
>
> spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
>
> Resulting Exception:
>
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>     at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>     at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>     at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>     at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>     at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>     at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>     at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>     at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
>     ... 27 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
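The failing step in the trace above is Hadoop resolving the FileSystem implementation for the s3:// scheme by class name, which requires that class to already be on the submitting JVM's classpath. A minimal sketch of that resolution (the helper is hypothetical, not Hadoop's actual code):

```java
public class FsResolveSketch {
    /**
     * Hadoop's Configuration.getClass ultimately does a by-name class
     * lookup like this one; if the implementation jar is not on the
     * classpath, the lookup fails. (In Hadoop the ClassNotFoundException
     * is wrapped in the RuntimeException seen in the trace above.)
     */
    static Class<?> resolveFs(String implClassName) {
        try {
            return Class.forName(implClassName);
        } catch (ClassNotFoundException e) {
            return null; // implementation jar missing from the classpath
        }
    }

    public static void main(String[] args) {
        // A JDK class always resolves; the EMR class does not in a plain JVM.
        System.out.println(resolveFs("java.lang.String") != null);
        System.out.println(resolveFs("com.amazon.ws.emr.hadoop.fs.EmrFileSystem") != null);
    }
}
```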