Hi Xiangrui, I actually used `yarn-standalone`, sorry for the confusion. I did some debugging over the last couple of days, and everything up to `updateDependencies` in `Executor.scala` works. I also checked the file size and md5sum of the jar in the executors, and they match the driver's copy. I'll do more testing tomorrow.
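(For reference, the checksum comparison can be scripted with just the JDK; this is a minimal sketch, and the helper name `md5Hex` is mine, not anything from Spark:)

```scala
import java.security.MessageDigest

// Hypothetical helper (not Spark code): MD5 hex digest of a byte array,
// for comparing the driver's copy of a jar against the copy an executor
// fetched into its work directory.
object Md5Check {
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString
}
```

Read both files with `java.nio.file.Files.readAllBytes` and compare the two digests; equal digests plus equal sizes rules out a corrupted download.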
Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng <men...@gmail.com> wrote:
> I don't know whether this would fix the problem. In v0.9, you need
> `yarn-standalone` instead of `yarn-cluster`.
>
> See
> https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08
>
> On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in
> > v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui
> >
> > On Mon, May 12, 2014 at 11:14 AM, DB Tsai <dbt...@stanford.edu> wrote:
> >> We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
> >> dependencies on the command line with the "--addJars" option. However,
> >> those external jars are only available in the driver (the application
> >> running in Hadoop), and not in the executors (workers).
> >>
> >> After doing some research, we realized that we have to push those jars
> >> to the executors from the driver via sc.addJar(fileName). Although the
> >> driver's log (see the following) shows the jar is successfully added to
> >> the HTTP server in the driver, and I confirmed that it's downloadable
> >> from any machine on the network, I still get
> >> `java.lang.NoClassDefFoundError` in the executors.
> >>
> >> 14/05/09 14:51:41 INFO spark.SparkContext: Added JAR
> >> analyticshadoop-eba5cdce1.jar at
> >> http://10.0.0.56:42522/jars/analyticshadoop-eba5cdce1.jar with timestamp
> >> 1399672301568
> >>
> >> Then I checked the logs in the executors, and I don't find any line like
> >> `Fetching <file> with timestamp <timestamp>`, which implies something is
> >> wrong: the executors are not downloading the external jars.
> >>
> >> Any suggestions on what we can look at?
> >>
> >> After digging into how Spark distributes external jars, I wonder about
> >> the scalability of this approach. What if there are thousands of nodes
> >> downloading the jar from a single HTTP server in the driver? Why don't
> >> we push the jars into the HDFS distributed cache by default instead of
> >> distributing them via an HTTP server?
> >>
> >> Thanks.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> -------------------------------------------------------
> >> My Blog: https://www.dbtsai.com
> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>
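(To make the missing `Fetching <file> with timestamp <timestamp>` log line more concrete: the executor keeps a map of jar name to last-fetched timestamp and downloads a jar only when the driver advertises a newer one. Below is a minimal self-contained sketch of that idea; the names `JarSync`, `jarsToFetch`, and `markFetched` are illustrative, not Spark's actual API:)

```scala
import scala.collection.mutable

// Illustrative sketch (not Spark's actual code) of timestamp-based
// dependency tracking: a jar is fetched only when the driver advertises
// a timestamp newer than the one last fetched locally.
object JarSync {
  private val currentJars = mutable.Map.empty[String, Long]

  // Jars whose advertised timestamp is newer than our local record.
  def jarsToFetch(advertised: Map[String, Long]): Seq[String] =
    advertised.collect {
      case (name, ts) if currentJars.getOrElse(name, -1L) < ts => name
    }.toSeq

  // Record a successful fetch so the jar is not downloaded again.
  def markFetched(name: String, ts: Long): Unit = currentJars(name) = ts
}
```

Under this scheme, an empty fetch list on an executor that has never seen the jar would point at the advertised file list never reaching the executor, rather than at the HTTP server itself.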