2014/1/2 Aureliano Buendia <buendia...@gmail.com>

> On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>
>> When developing I am using local[2], which launches a local cluster
>> with 2 workers. In most cases it is fine; I just encountered some
>> strange behaviour with broadcast variables: in local mode no broadcast
>> is done (at least in 0.8).
>
> That's not good. This could hide bugs in production.

That depends on what you want to test... Spark is really easy to unit
test; IMO, when developing you don't need a full cluster.
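For instance, a small job can be checked end to end against local[2] with
something like this (just a sketch; the job logic and the names are made
up):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object WordCountCheck {
      def main(args: Array[String]) {
        // local[2] runs Spark in-process with 2 worker threads,
        // no standalone master and no fat jar needed
        val sc = new SparkContext("local[2]", "WordCountCheck")
        try {
          val counts = sc.parallelize(Seq("a", "b", "a"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") == 2 && counts("b") == 1)
        } finally {
          // stop the context so the next test can create its own
          sc.stop()
        }
      }
    }

The same assertions can of course live in a ScalaTest/JUnit test instead
of a main method.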
>> You also have access to the UI in that case, at localhost:4040.
>
> That server has a short life; it dies when the program exits.

Sure, but you are developing at that moment; you want to write unit tests
and make sure they pass, no?

>> In dev mode I am launching my main class directly from IntelliJ, so no,
>> I don't need to build the fat jar.
>
> Why is it not possible to work with spark://localhost:7077 while
> developing? That would allow monitoring and reviewing the jobs, while
> keeping a record of past jobs.
>
> I've never been able to connect to spark://localhost:7077 in
> development; I get:
>
> WARN cluster.ClusterScheduler: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered
> and have sufficient memory

Try setting spark.executor.memory:
http://spark.incubator.apache.org/docs/latest/configuration.html
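In 0.8 it is a plain system property that has to be set before the
SparkContext is created; roughly like this (a sketch; the 512m value and
the names are just examples):

    import org.apache.spark.SparkContext

    object MemoryExample {
      def main(args: Array[String]) {
        // 0.8-style configuration: system properties are read at
        // context creation time, so set them first
        System.setProperty("spark.executor.memory", "512m")
        val sc = new SparkContext("spark://localhost:7077", "MemoryExample")
        // ... job body ...
        sc.stop()
      }
    }

A mismatch between spark.executor.memory and what the workers can
actually offer is one common cause of that warning.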
> The UI says the workers are alive and they do have plenty of memory. I
> also tried the exact Spark master name given by the UI, with no luck
> (apparently Akka is too fragile and sensitive to this). Turning off the
> firewall on OS X had no effect either.

>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>
>>> How about when developing the Spark application: do you use
>>> "localhost" or "spark://localhost:7077" for the Spark context master
>>> during development?
>>>
>>> Using "spark://localhost:7077" is a good way to simulate the
>>> production driver, and it provides the web UI. When using
>>> "spark://localhost:7077", is it required to create the fat jar?
>>> Wouldn't that significantly slow down the development cycle?
>>>
>>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>
>>>> It depends how you deploy; I don't find it so complicated...
>>>>
>>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>>> sbt). Inside I have something like this, saying which libs should go
>>>> into the fat jar (the others won't be present in the final artifact):
>>>>
>>>> <plugin>
>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>   <artifactId>maven-shade-plugin</artifactId>
>>>>   <version>2.1</version>
>>>>   <executions>
>>>>     <execution>
>>>>       <phase>package</phase>
>>>>       <goals>
>>>>         <goal>shade</goal>
>>>>       </goals>
>>>>       <configuration>
>>>>         <minimizeJar>true</minimizeJar>
>>>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>>>         <artifactSet>
>>>>           <includes>
>>>>             <include>org.apache.hbase:*</include>
>>>>             <include>org.apache.hadoop:*</include>
>>>>             <include>com.typesafe:config</include>
>>>>             <include>org.apache.avro:*</include>
>>>>             <include>joda-time:*</include>
>>>>             <include>org.joda:*</include>
>>>>           </includes>
>>>>         </artifactSet>
>>>>         <filters>
>>>>           <filter>
>>>>             <artifact>*:*</artifact>
>>>>             <excludes>
>>>>               <exclude>META-INF/*.SF</exclude>
>>>>               <exclude>META-INF/*.DSA</exclude>
>>>>               <exclude>META-INF/*.RSA</exclude>
>>>>             </excludes>
>>>>           </filter>
>>>>         </filters>
>>>>       </configuration>
>>>>     </execution>
>>>>   </executions>
>>>> </plugin>
>>>>
>>>> 2) The app is the jar you have built, so you ship it to the driver
>>>> node (how depends a lot on how you are planning to use it: debian
>>>> packaging, a plain old scp, etc.). To run it you can do something
>>>> like:
>>>>
>>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>>
>>>> where MyJob is the entry point to your job; it defines a main method.
>>>>
>>>> 3) I don't know what the "common way" is, but I am doing things this
>>>> way: build the fat jar, provide some launch scripts, make the debian
>>>> packaging, ship it to a node that plays the role of the driver, and
>>>> run it over mesos using the launch scripts + some conf.
>>>>
>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>
>>>>> I wasn't aware of jarOfClass. I wish there were only one good way of
>>>>> deploying in Spark, instead of many ambiguous methods. (It seems
>>>>> Spark has followed Scala in that there is more than one way of
>>>>> accomplishing a job, which makes Scala an overcomplicated language.)
>>>>>
>>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>>> should be used: my local sbt, or $SPARK_HOME/sbt/sbt? Why is Spark
>>>>> shipped with a separate sbt?
>>>>>
>>>>> 2. Let's say we have the dependencies fat jar which is supposed to
>>>>> be shipped to the workers. Now how do we deploy the main app which
>>>>> is supposed to be executed on the driver? Make another jar out of
>>>>> it? Does sbt assembly also create that jar?
>>>>>
>>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>>> cannot find any example by googling. What's the most common way that
>>>>> people use?
>>>>>
>>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is the list of the jars you use in your job; the driver will
>>>>>> send all those jars to each worker (otherwise the workers won't
>>>>>> have the classes your job needs). The easy way to go is to build a
>>>>>> fat jar with your code and all the libs you depend on, and then use
>>>>>> this utility to get the path:
>>>>>> SparkContext.jarOfClass(YourJob.getClass)
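>>>>>> Wired into a job it looks roughly like this (a sketch; YourJob
>>>>>> stands in for your own class, and in 0.8 jarOfClass returns a Seq
>>>>>> of paths):
>>>>>>
>>>>>> import org.apache.spark.SparkContext
>>>>>>
>>>>>> object YourJob {
>>>>>>   def main(args: Array[String]) {
>>>>>>     // Find the jar this class was loaded from, i.e. the fat jar
>>>>>>     // the job was launched with, and ship it to the workers
>>>>>>     val jars = SparkContext.jarOfClass(this.getClass)
>>>>>>     val sc = new SparkContext(args(0), "YourJob",
>>>>>>       System.getenv("SPARK_HOME"), jars)
>>>>>>     // ... job body ...
>>>>>>     sc.stop()
>>>>>>   }
>>>>>> }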
>>>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I do not understand why the Spark context has an option for
>>>>>>> loading jars at runtime.
>>>>>>>
>>>>>>> As an example, consider this:
>>>>>>> https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36
>>>>>>>
>>>>>>> object BroadcastTest {
>>>>>>>   def main(args: Array[String]) {
>>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>>       System.getenv("SPARK_HOME"),
>>>>>>>       Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> This is *the* example, or *the* application, that we want to run;
>>>>>>> what is SPARK_EXAMPLES_JAR supposed to be? In this particular
>>>>>>> case, the BroadcastTest example is self-contained, so why would it
>>>>>>> want to load other, unrelated example jars?
>>>>>>>
>>>>>>> Finally, how does this help a real world Spark application?
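To come back to the broadcast point at the top of this thread: for
reference, this is roughly what a real job does with a broadcast variable
(a sketch; the lookup table and the names are made up):

    import org.apache.spark.SparkContext

    object BroadcastSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "Broadcast Sketch")
        // The table is shipped once per worker instead of once per task
        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
        val resolved = sc.parallelize(Seq("a", "b", "a"))
          .map(key => lookup.value(key))
          .collect()
        println(resolved.mkString(","))
        sc.stop()
      }
    }

It is this shipping step that, per the discussion above, is
short-circuited in local mode in 0.8.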