We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar dependencies on the command line with the "--addJars" option. However, those external jars are only available in the driver (the application running in Hadoop), not in the executors (workers).
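For reference, this is roughly how we launch the job (the application jar, main class, and dependency jar names below are placeholders for our actual values):

    SPARK_JAR=<spark-assembly.jar> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar our-app.jar \
      --class com.example.Main \
      --addJars analyticshadoop-eba5cdce1.jar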
After doing some research, we realized that we have to push those jars to the executors from the driver via sc.addJar(fileName). Although the driver's log (see below) shows the jar was successfully added to the HTTP server in the driver, and I've confirmed that it's downloadable from any machine on the network, I still get `java.lang.NoClassDefFoundError` in the executors.

14/05/09 14:51:41 INFO spark.SparkContext: Added JAR analyticshadoop-eba5cdce1.jar at http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with timestamp 1399672301568

I then checked the logs on the executors, and there is no `Fetching <file> with timestamp <timestamp>` entry, which implies something is wrong: the executors are not downloading the external jars. Any suggestions on what we should look at?

After digging into how Spark distributes external jars, I also wonder about the scalability of this approach. What happens when thousands of nodes download the jar from the single HTTP server in the driver? Why don't we push the jars into the HDFS distributed cache by default instead of distributing them via the HTTP server?

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
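P.S. In case it helps, here is a minimal sketch of what we do in the driver (the SparkConf setup is elided; the hdfs:// variant is only an idea we're considering, assuming executors can fetch Hadoop filesystem URIs directly, and the HDFS path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object Main {
      def main(args: Array[String]) {
        // In yarn-cluster mode the master and app name come from the launcher.
        val sc = new SparkContext(new SparkConf())

        // Registers the jar with the HTTP server embedded in the driver;
        // executors should fetch it before running their first task.
        sc.addJar("analyticshadoop-eba5cdce1.jar")

        // Alternative we're considering: serve the jar from HDFS instead of
        // the driver's HTTP server, so the fetch load is spread across the
        // cluster rather than hitting one machine.
        // sc.addJar("hdfs:///user/dbtsai/jars/analyticshadoop-eba5cdce1.jar")

        // ... actual job ...
      }
    }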