We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
dependencies on the command line with the "--addJars" option. However, those
external jars are only available to the driver (the application running in
Hadoop), and not to the executors (workers).

After doing some research, we realized that we have to push those jars to the
executors from the driver via sc.addJar(fileName). Although the driver's log
(see below) shows that the jar is successfully added to the HTTP server in
the driver, and I have confirmed that it is downloadable from any machine in
the network, I still get `java.lang.NoClassDefFoundError` in the executors.

14/05/09 14:51:41 INFO spark.SparkContext: Added JAR
analyticshadoop-eba5cdce1.jar at
http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar
with timestamp 1399672301568
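
For reference, this is roughly what we do in the driver (a simplified sketch;
the actual SparkConf setup and jar path differ slightly in our code):

import org.apache.spark.{SparkConf, SparkContext}

// In yarn-cluster mode the master is set by the YARN client, so we only set the app name here.
val sc = new SparkContext(new SparkConf().setAppName("analytics"))

// Ask Spark to serve the jar from the driver's embedded HTTP server so that
// executors can fetch it before running tasks. The path must exist on the driver node.
sc.addJar("analyticshadoop-eba5cdce1.jar")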

Then I checked the logs in the executors, and I don't see any `Fetching
<file> with timestamp <timestamp>` message, which implies something is wrong;
the executors are not downloading the external jars.
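
To narrow it down, we also run a small check from the driver that tries to
load one of the classes from the external jar inside an executor task
("com.example.Foo" below is just a placeholder for a class that only exists
in the added jar):

val results = sc.parallelize(1 to sc.defaultParallelism)
  .map { _ =>
    // Resolve the class with the task's context class loader, which should see the fetched jars.
    try { Class.forName("com.example.Foo", false, Thread.currentThread.getContextClassLoader); true }
    catch { case _: ClassNotFoundException => false }
  }
  .collect()
println("tasks that can see the class: " + results.count(identity) + " / " + results.length)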

Any suggestions on what we can look at?

After digging into how Spark distributes external jars, I also wonder about
the scalability of this approach. What happens when thousands of nodes
download the jar from the single HTTP server in the driver? Why don't we
push the jars into the HDFS distributed cache by default instead of
distributing them via the HTTP server?
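
For example, if we stage the jar on HDFS ourselves, would it be reasonable to
just hand executors an hdfs:// URI so they fetch it through the Hadoop
FileSystem API instead of the driver's HTTP server? Something like (the path
below is hypothetical):

// Hypothetical HDFS path; the jar would be uploaded to HDFS before the job starts.
sc.addJar("hdfs:///user/dbtsai/jars/analyticshadoop-eba5cdce1.jar")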

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
