[jira] [Comment Edited] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-29 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074225#comment-15074225
 ] 

Jonathan Kelly edited comment on SPARK-10789 at 12/29/15 8:13 PM:
--

Yes, using this patch requires a rebuild. If you are using Spark on YARN, the 
Spark assembly should only need to be on the master node, but yes, you'd need 
to distribute the new assembly across your cluster if you are using Spark 
Standalone.

Also, yes, the spark.{driver,executor}.extra{ClassPath,LibraryPath} settings 
let you add extra .jar and .so files without a rebuild, but the point of this 
JIRA issue is that spark.driver.extraClassPath is applied to the SparkSubmit 
process in client deploy-mode but not in cluster deploy-mode. This means that 
if you need any extra jars for accessing a custom Hadoop FileSystem to get the 
application jar (e.g., EMRFS), they'll either need to be included in the Spark 
assembly jar, or you'll need this patch.
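
To make the client/cluster distinction concrete, here is a minimal sketch of 
the decision being discussed. It is a hypothetical, simplified model and not 
Spark's actual launcher code: the class name, method names, and file paths are 
all invented for illustration. It only shows which entries end up on the 
classpath of the local SparkSubmit JVM under the reported behavior and under 
the proposed change.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, simplified model of the classpath that the local SparkSubmit
 * JVM ends up with. It is not Spark's actual launcher code.
 */
public class SubmitClasspathSketch {

    /** Behavior reported in this issue: extra entries are used in client mode only. */
    public static List<String> reported(String assembly, String driverExtraClassPath,
                                        boolean clientMode) {
        List<String> classpath = new ArrayList<>();
        if (clientMode && driverExtraClassPath != null) {
            classpath.add(driverExtraClassPath);
        }
        classpath.add(assembly);
        return classpath;
    }

    /** Behavior with the proposed change: extra entries are used in both modes. */
    public static List<String> patched(String assembly, String driverExtraClassPath) {
        List<String> classpath = new ArrayList<>();
        if (driverExtraClassPath != null) {
            classpath.add(driverExtraClassPath);
        }
        classpath.add(assembly);
        return classpath;
    }

    public static void main(String[] args) {
        // Both paths are made-up placeholders.
        String assembly = "/opt/spark/lib/spark-assembly.jar";
        String extra = "/opt/emrfs/lib/*";

        System.out.println("client,  reported: " + reported(assembly, extra, true));
        System.out.println("cluster, reported: " + reported(assembly, extra, false));
        System.out.println("cluster, patched:  " + patched(assembly, extra));
    }
}
```

In the cluster-mode "reported" case only the assembly is on the classpath, 
which is why a FileSystem implementation that lives outside the assembly 
cannot be loaded by the SparkSubmit process.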


was (Author: jonathak):
Yes, using this patch requires a rebuild. If you are using Spark on YARN, the 
Spark assembly should only need to be on the master node, but yes, you'd need 
to distribute the new assembly across your cluster if you are using Spark 
Standalone.

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 1.5.0, 1.6.0
> Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH):
>   spark-submit --deploy-mode cluster \
>     --class org.apache.spark.examples.JavaWordCount \
>     s3://my-bucket/spark-examples.jar s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> 

[jira] [Comment Edited] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-29 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073832#comment-15073832
 ] 

Roi Reshef edited comment on SPARK-10789 at 12/29/15 11:56 AM:
---

Thanks [~jonathak]. That requires rebuilding Spark and redistributing it across 
my cluster, right? I finally figured out a way to import external jars 
without rebuilding Spark. One can set two configuration properties inside 
spark-env.sh (this worked at least for the Netlib package, which includes both 
*.jar and *.so files):
spark.{driver,executor}.extraClassPath - for *.jar
spark.{driver,executor}.extraLibraryPath - for *.so

Spark (I'm using v1.5.2) will then pick them up automatically.
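
Since these four names are ordinary Spark configuration properties, they can 
also be passed to spark-submit with its --conf flag. The following is a rough 
sketch of assembling such a command line. The class name, helper method, 
directory paths, and application jar are hypothetical placeholders, and the 
directories are assumed to exist at the same location on every node.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper only; the names and paths here are not from the issue. */
public class ExtraPathArgsSketch {

    /**
     * Builds a spark-submit argument list that sets the extraClassPath and
     * extraLibraryPath properties for both the driver and the executors.
     */
    public static List<String> buildCommand(String jarDir, String nativeDir, String appJar) {
        List<String> command = new ArrayList<>();
        command.add("spark-submit");
        for (String role : new String[] {"driver", "executor"}) {
            // Directory of extra *.jar files, expressed as a classpath wildcard.
            command.add("--conf");
            command.add("spark." + role + ".extraClassPath=" + jarDir + "/*");
            // Directory of extra native *.so libraries.
            command.add("--conf");
            command.add("spark." + role + ".extraLibraryPath=" + nativeDir);
        }
        command.add(appJar);
        return command;
    }

    public static void main(String[] args) {
        // Placeholder locations; adjust to wherever the extra files actually live.
        List<String> command = buildCommand("/opt/netlib/jars", "/opt/netlib/native", "my-app.jar");
        System.out.println(String.join(" ", command));
    }
}
```

As the later comment on this issue points out, this covers the driver and 
executors but does not help the SparkSubmit process itself in cluster 
deploy-mode.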


was (Author: roireshef):
Thanks [~jonathak]. That requires rebuilding spark and redistributing it across 
my cluster, right? I finally figured out a solution to import external jars 
without rebuilding spark. One can modify two configurations inside spark-env.sh 
(at least for Netlib package, which include *.jar and *.so):
spark. { driver,executor } .extraClassPath - for *.jar
spark. { driver,executor } .extraLibraryPath - for *.so

And spark (I'm using v1.5.2) will pick them up automatically


[jira] [Comment Edited] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-12-29 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073832#comment-15073832
 ] 

Roi Reshef edited comment on SPARK-10789 at 12/29/15 11:55 AM:
---

Thanks [~jonathak]. That requires rebuilding Spark and redistributing it across 
my cluster, right? I finally figured out a way to import external jars 
without rebuilding Spark. One can set two configuration properties inside 
spark-env.sh (this worked at least for the Netlib package, which includes both 
*.jar and *.so files):
spark. { driver,executor } .extraClassPath - for *.jar
spark. { driver,executor } .extraLibraryPath - for *.so

Spark (I'm using v1.5.2) will then pick them up automatically.


was (Author: roireshef):
Thanks [~jonathak]. That requires rebuilding spark and redistributing it across 
my cluster, right? I finally figured out a solution to import external jars 
without rebuilding spark. One can modify two configurations inside spark-env.sh 
(at least for Netlib package, which include *.jar and *.so):
spark.{driver,executor}.extraClassPath - for *.jar
spark.{driver,executor}.extraLibraryPath - for *.so

And spark (I'm using v1.5.2) will pick them up automatically
