[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-28 Thread Jonathan Kelly (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Kelly updated SPARK-10789:
---
Attachment: SPARK-10789.v1.6.0.diff

Here's another patch that can be applied to v1.6.0.

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
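> (For concreteness, the switch referred to below looks roughly like this; the
> sketch is illustrative and the exact EMR jar locations may differ:
>   # Deprecated workaround, spark-env.sh:
>   export SPARK_CLASSPATH='/usr/share/aws/emr/emrfs/lib/*'
>   # Preferred form, spark-defaults.conf:
>   spark.driver.extraClassPath   /usr/share/aws/emr/emrfs/lib/*
>   spark.executor.extraClassPath /usr/share/aws/emr/emrfs/lib/*
> In cluster mode the extraClassPath properties never reach the SparkSubmit
> JVM itself, which is exactly the problem described here.)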
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
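> For illustration, the modification is along these lines (a sketch against the
> 1.5/1.6-era SparkSubmitCommandBuilder.buildSparkSubmitCommand(); the attached
> diffs are authoritative, and the surrounding code here is paraphrased):
>   // Before: the driver extra classpath reaches the SparkSubmit JVM only in
>   // client mode, where that JVM actually hosts the driver.
>   String extraClassPath = isClientMode ?
>     config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
>   List<String> cmd = buildJavaCommand(extraClassPath);
>   // After: pass it through in cluster mode too, so submission-side code
>   // (e.g. the YARN Client downloading the application JAR from S3) can load
>   // classes outside the Spark assembly, such as EmrFileSystem.
>   String extraClassPath = config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH);
>   List<String> cmd = buildJavaCommand(extraClassPath);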
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-21 Thread Jonathan Kelly (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Kelly updated SPARK-10789:
---
Attachment: SPARK-10789.diff


[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-21 Thread Jonathan Kelly (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Kelly updated SPARK-10789:
---
Affects Version/s: 1.6.0


[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-09-24 Thread Jonathan Kelly (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Kelly updated SPARK-10789:
---
Summary: Cluster mode SparkSubmit classpath only includes Spark assembly  
(was: Cluster mode SparkSubmit classpath only includes Spark classpath)
