We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but 
we want to add a custom jar file to the driver's classpath.

When running on a local one-node standalone cluster, we just use 
spark.driver.extraClassPath and everything works:

spark-submit --conf spark.driver.extraClassPath=/path/to/our/custom/jar/* \
  our-python-script.py

But on EMR, this property comes pre-populated with paths that their Spark 
installation needs. Pointing it at our custom jar overwrites that original 
value rather than appending to it, and that breaks Spark.

Our current workaround is to capture whatever EMR sets 
spark.driver.extraClassPath to once, then reuse that path with our jar file 
added to the front. Of course this breaks whenever EMR changes the path in 
their cluster settings, and we wouldn't necessarily notice that easily. This 
is how it looks:

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/our/custom/jar/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* \
  our-python-script.py

We prefer not to do this...
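
The least brittle variant of this that we can think of is to read EMR's value 
at submit time instead of hard-coding it. A rough sketch (it assumes EMR keeps 
its defaults in /etc/spark/conf/spark-defaults.conf, and that location could of 
course change as well):

# Sketch only: pull EMR's own driver classpath out of spark-defaults.conf
# and prepend our jar, instead of hard-coding EMR's value.
# Assumes the defaults file lives at /etc/spark/conf/spark-defaults.conf.
EMR_DRIVER_CP=$(awk '$1 == "spark.driver.extraClassPath" {print $2}' /etc/spark/conf/spark-defaults.conf)
spark-submit \
  --conf spark.driver.extraClassPath="/path/to/our/custom/jar/*:${EMR_DRIVER_CP}" \
  our-python-script.py

Quoting the value keeps the shell from expanding the * so it reaches the JVM as 
a classpath wildcard. But this still depends on knowing where EMR keeps its 
configuration, which is exactly the kind of coupling we'd like to avoid.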

We tried the spark-submit --jars argument, but it didn't seem to have any 
effect. Like this:

spark-submit --jars /path/to/our/custom/jar/file.jar  our-python-script.py

We also tried setting CLASSPATH, but it didn't seem to have any impact either:

export CLASSPATH=/path/to/our/custom/jar/*
spark-submit  our-python-script.py

When using SPARK_CLASSPATH, we got warnings that it is deprecated, and the 
messages also seemed to imply that it affects the same configuration that is 
set by spark.driver.extraClassPath.
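
For completeness, that attempt looked essentially like the CLASSPATH one above:

export SPARK_CLASSPATH=/path/to/our/custom/jar/*
spark-submit our-python-script.py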


So, my question is: is there a clean way to add a custom jar file to the 
driver classpath on EMR without clobbering the paths that EMR itself needs?

Thanks,
Gerhard
