Spark will send the closures to the workers. If your closure has no external
dependencies (it uses only Spark types and the Scala/Java standard libraries),
it will work fine. But now suppose you use classes you have defined in your
own project, or depend on some common libs like joda-time. The workers don't
know about those classes; they must be on their classpath. Thus you need to
tell the spark context which jars must be added to the classpath and shipped
to the workers. Building a fat jar is just easier than maintaining a list of
individual jars.
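
As a rough sketch (the jar path and app name here are made up; the constructor
is the same Spark 0.8/0.9 style as in the BroadcastTest example quoted below):

  import org.apache.spark.SparkContext

  // pass the jar(s) to ship when creating the context
  val sc = new SparkContext(
    "spark://localhost:7077",          // master
    "MyApp",                           // app name
    System.getenv("SPARK_HOME"),
    Seq("target/my-app-fat.jar"))      // jars shipped to the workers

  // or add a jar after the context has been created
  sc.addJar("path/to/joda-time.jar")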

To test it you can try it in the spark shell. Do something like:

  import org.joda.time.DateTime
  sc.makeRDD(Seq(DateTime.now(), DateTime.now())).map(date => date.getMillis -> date).collect

When launching the shell, put joda-time on the driver classpath:

  SPARK_CLASSPATH=path/to/joda-time.jar spark-shell

If you don't also do sc.addJar("path/to/joda-time.jar"), the workers won't have
the classes and you will get ClassNotFoundExceptions.


2014/1/2 Archit Thakur <archit279tha...@gmail.com>

> Eugen, you said spark sends the jar to each worker if we specify it. What
> if we only create a fat jar and do not do the sc.jarOfClass(class)? If we
> have created a fat jar, won't all of its classes be available on the slave
> nodes? What if we access one of them in code that is executed on a slave
> node? E.g. an object Z which is present in the fat jar and is accessed in
> the map function (which is executed in a distributed way). Won't it be
> accessible (since it is there at compile time)? It usually is, isn't it?
>
>
> On Thu, Jan 2, 2014 at 6:02 PM, Archit Thakur 
> <archit279tha...@gmail.com> wrote:
>
>> Aureliano, it doesn't matter actually. All that specifying "local" as your
>> spark master does is run the whole application in a single JVM. Making a
>> cluster and then specifying "spark://localhost:7077" runs it on a set of
>> machines. Running spark in local mode is helpful for debugging purposes,
>> but it will perform much slower than a cluster of 3-4-n machines. If you do
>> not have a set of machines, you can make the same machine a slave and start
>> both the master and a slave on it. Go through the Apache Spark homepage to
>> learn more about starting the various nodes. Thx.
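>>
>> As a rough sketch (the app name, env var, and jar path here are made up;
>> only the master URL changes between the two modes):
>>
>>   import org.apache.spark.SparkContext
>>
>>   // e.g. "local[4]" to run in a single JVM for debugging,
>>   // or "spark://localhost:7077" to use the standalone master (with web UI)
>>   val master = sys.env.getOrElse("SPARK_MASTER", "local[4]")
>>   val sc = new SparkContext(master, "MyApp",
>>     System.getenv("SPARK_HOME"), Seq("target/my-app-fat.jar"))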
>>
>>
>>
>> On Thu, Jan 2, 2014 at 5:21 PM, Aureliano Buendia 
>> <buendia...@gmail.com> wrote:
>>
>>> How about when developing the spark application: do you use "localhost"
>>> or "spark://localhost:7077" as the spark context master during development?
>>>
>>> Using "spark://localhost:7077" is a good way to simulate the production
>>> driver and it provides the web ui. When using "spark://localhost:7077", is
>>> it required to create the fat jar? Wouldn't that significantly slow down
>>> the development cycle?
>>>
>>>
>>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>
>>>> It depends on how you deploy; I don't find it so complicated...
>>>>
>>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>>> sbt).
>>>>
>>>> Inside I have something like this, saying which libs should be included in
>>>> the fat jar (the others won't be present in the final artifact).
>>>>
>>>> <plugin>
>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>   <artifactId>maven-shade-plugin</artifactId>
>>>>   <version>2.1</version>
>>>>   <executions>
>>>>     <execution>
>>>>       <phase>package</phase>
>>>>       <goals>
>>>>         <goal>shade</goal>
>>>>       </goals>
>>>>       <configuration>
>>>>         <minimizeJar>true</minimizeJar>
>>>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>>>         <artifactSet>
>>>>           <includes>
>>>>             <include>org.apache.hbase:*</include>
>>>>             <include>org.apache.hadoop:*</include>
>>>>             <include>com.typesafe:config</include>
>>>>             <include>org.apache.avro:*</include>
>>>>             <include>joda-time:*</include>
>>>>             <include>org.joda:*</include>
>>>>           </includes>
>>>>         </artifactSet>
>>>>         <filters>
>>>>           <filter>
>>>>             <artifact>*:*</artifact>
>>>>             <excludes>
>>>>               <exclude>META-INF/*.SF</exclude>
>>>>               <exclude>META-INF/*.DSA</exclude>
>>>>               <exclude>META-INF/*.RSA</exclude>
>>>>             </excludes>
>>>>           </filter>
>>>>         </filters>
>>>>       </configuration>
>>>>     </execution>
>>>>   </executions>
>>>> </plugin>
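>>>>
>>>> With this in place, running "mvn package" should produce the shaded (fat)
>>>> jar under target/ in place of the regular artifact; minimizeJar strips
>>>> classes of the included dependencies that are never referenced.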
>>>>
>>>>
>>>> 2) The app is the jar you have built, so you ship it to the driver node
>>>> (how exactly depends on how you are planning to use it: debian packaging, a
>>>> plain old scp, etc.). To run it you can do something like:
>>>>
>>>>   SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>>
>>>> where MyJob is the entry point of your job; it defines a main method.
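>>>>
>>>> As a rough sketch of such an entry point (package and class names are made
>>>> up, and this assumes a Spark version where jarOfClass can be passed directly
>>>> as the jars argument):
>>>>
>>>>   package com.myproject
>>>>
>>>>   import org.apache.spark.SparkContext
>>>>
>>>>   object MyJob {
>>>>     def main(args: Array[String]) {
>>>>       // first argument is the master URL, e.g. spark://host:7077
>>>>       val sc = new SparkContext(args(0), "MyJob",
>>>>         System.getenv("SPARK_HOME"),
>>>>         SparkContext.jarOfClass(this.getClass))  // locates the fat jar
>>>>       // ... define and run your RDD transformations here ...
>>>>       sc.stop()
>>>>     }
>>>>   }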
>>>>
>>>> 3) I don't know what the "common way" is, but I am doing things this way:
>>>> build the fat jar, provide some launch scripts, do the debian packaging,
>>>> ship it to a node that plays the role of the driver, and run it over mesos
>>>> using the launch scripts + some conf.
>>>>
>>>>
>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>
>>>>> I wasn't aware of jarOfClass. I wish there was only one good way of
>>>>> deploying in spark, instead of many ambiguous methods. (It seems spark
>>>>> has followed scala in that there is more than one way of accomplishing a
>>>>> job, which makes scala an overcomplicated language.)
>>>>>
>>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>>> should be used? My local sbt or the one at $SPARK_HOME/sbt/sbt? Why is it
>>>>> that spark is shipped with a separate sbt?
>>>>>
>>>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>>>> shipped to the workers. Now how do we deploy the main app which is supposed
>>>>> to be executed on the driver? Make another jar out of it? Does sbt
>>>>> assembly also create that jar?
>>>>>
>>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>>> cannot find any example by googling. What's the most common way that
>>>>> people use?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is the list of the jars you use in your job; the driver will
>>>>>> send all those jars to each worker (otherwise the workers won't have the
>>>>>> classes your job needs). The easy way to go is to build a fat jar
>>>>>> with your code and all the libs you depend on, and then use this utility
>>>>>> to get its path: SparkContext.jarOfClass(YourJob.getClass)
>>>>>>
>>>>>>
>>>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I do not understand why spark context has an option for loading jars
>>>>>>> at runtime.
>>>>>>>
>>>>>>> As an example, consider this:
>>>>>>> https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36
>>>>>>>
>>>>>>> object BroadcastTest {
>>>>>>>   def main(args: Array[String]) {
>>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> This is *the* example, or *the* application, that we want to run. What
>>>>>>> is SPARK_EXAMPLES_JAR supposed to be?
>>>>>>> In this particular case, the BroadcastTest example is self-contained, so
>>>>>>> why would it want to load other unrelated example jars?
>>>>>>>
>>>>>>>
>>>>>>> Finally, how does this help a real world spark application?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
