2014/1/2 Aureliano Buendia <buendia...@gmail.com>

> On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>
>> When developing I am using local[2], which launches a local cluster
>> with 2 workers. In most cases it is fine; I just encountered some
>> strange behaviour with broadcast variables: in local mode no broadcast
>> is done (at least in 0.8).
>
> That's not good. This could hide bugs in production.

That depends on what you want to test... Spark is really easy to unit
test; IMO, when developing you don't need a full cluster.
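For instance, a small job can be checked end to end against local[2] with
something like this (just a sketch; the job logic and the names are made
up):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object WordCountCheck {
      def main(args: Array[String]) {
        // local[2] runs Spark in-process with 2 worker threads,
        // no standalone master and no fat jar needed
        val sc = new SparkContext("local[2]", "WordCountCheck")
        try {
          val counts = sc.parallelize(Seq("a", "b", "a"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") == 2 && counts("b") == 1)
        } finally {
          // stop the context so the next test can create its own
          sc.stop()
        }
      }
    }

The same assertions can of course live in a ScalaTest/JUnit test instead
of a main method.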
>> You also have access to the UI in that case, at localhost:4040.
>
> That server has a short life; it dies when the program exits.

Sure, but you are developing at that moment; you want to write unit tests
and make sure they pass, no?

>> In dev mode I am launching my main class directly from IntelliJ, so no,
>> I don't need to build the fat jar.
>
> Why is it not possible to work with spark://localhost:7077 while
> developing? That would allow monitoring and reviewing the jobs, while
> keeping a record of past jobs.
>
> I've never been able to connect to spark://localhost:7077 in
> development; I get:
>
> WARN cluster.ClusterScheduler: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered
> and have sufficient memory

Try setting spark.executor.memory:
http://spark.incubator.apache.org/docs/latest/configuration.html
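In 0.8 it is a plain system property that has to be set before the
SparkContext is created; roughly like this (a sketch; the 512m value and
the names are just examples):

    import org.apache.spark.SparkContext

    object MemoryExample {
      def main(args: Array[String]) {
        // 0.8-style configuration: system properties are read at
        // context creation time, so set them first
        System.setProperty("spark.executor.memory", "512m")
        val sc = new SparkContext("spark://localhost:7077", "MemoryExample")
        // ... job body ...
        sc.stop()
      }
    }

A mismatch between spark.executor.memory and what the workers can
actually offer is one common cause of that warning.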
> The UI says the workers are alive and they do have plenty of memory. I
> also tried the exact Spark master name given by the UI, with no luck
> (apparently Akka is too fragile and sensitive to this). Turning off the
> firewall on OS X had no effect either.

>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>
>>> How about when developing the Spark application: do you use
>>> "localhost" or "spark://localhost:7077" for the Spark context master
>>> during development?
>>>
>>> Using "spark://localhost:7077" is a good way to simulate the
>>> production driver, and it provides the web UI. When using
>>> "spark://localhost:7077", is it required to create the fat jar?
>>> Wouldn't that significantly slow down the development cycle?
>>>
>>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>
>>>> It depends how you deploy; I don't find it so complicated...
>>>>
>>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>>> sbt). Inside I have something like this, saying which libs should go
>>>> into the fat jar (the others won't be present in the final artifact):
>>>>
>>>> <plugin>
>>>>   <groupId>org.apache.maven.plugins</groupId>
>>>>   <artifactId>maven-shade-plugin</artifactId>
>>>>   <version>2.1</version>
>>>>   <executions>
>>>>     <execution>
>>>>       <phase>package</phase>
>>>>       <goals>
>>>>         <goal>shade</goal>
>>>>       </goals>
>>>>       <configuration>
>>>>         <minimizeJar>true</minimizeJar>
>>>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>>>         <artifactSet>
>>>>           <includes>
>>>>             <include>org.apache.hbase:*</include>
>>>>             <include>org.apache.hadoop:*</include>
>>>>             <include>com.typesafe:config</include>
>>>>             <include>org.apache.avro:*</include>
>>>>             <include>joda-time:*</include>
>>>>             <include>org.joda:*</include>
>>>>           </includes>
>>>>         </artifactSet>
>>>>         <filters>
>>>>           <filter>
>>>>             <artifact>*:*</artifact>
>>>>             <excludes>
>>>>               <exclude>META-INF/*.SF</exclude>
>>>>               <exclude>META-INF/*.DSA</exclude>
>>>>               <exclude>META-INF/*.RSA</exclude>
>>>>             </excludes>
>>>>           </filter>
>>>>         </filters>
>>>>       </configuration>
>>>>     </execution>
>>>>   </executions>
>>>> </plugin>
>>>>
>>>> 2) The app is the jar you have built, so you ship it to the driver
>>>> node (how depends a lot on how you are planning to use it: debian
>>>> packaging, a plain old scp, etc.). To run it you can do something
>>>> like:
>>>>
>>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>>
>>>> where MyJob is the entry point to your job; it defines a main method.
>>>>
>>>> 3) I don't know what the "common way" is, but I am doing things this
>>>> way: build the fat jar, provide some launch scripts, make the debian
>>>> packaging, ship it to a node that plays the role of the driver, and
>>>> run it over mesos using the launch scripts + some conf.
>>>>
>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>
>>>>> I wasn't aware of jarOfClass. I wish there were only one good way of
>>>>> deploying in Spark, instead of many ambiguous methods. (It seems
>>>>> Spark has followed Scala in that there is more than one way of
>>>>> accomplishing a job, which makes Scala an overcomplicated language.)
>>>>>
>>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>>> should be used: my local sbt, or $SPARK_HOME/sbt/sbt? Why is Spark
>>>>> shipped with a separate sbt?
>>>>>
>>>>> 2. Let's say we have the dependencies fat jar which is supposed to
>>>>> be shipped to the workers. Now how do we deploy the main app which
>>>>> is supposed to be executed on the driver? Make another jar out of
>>>>> it? Does sbt assembly also create that jar?
>>>>>
>>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>>> cannot find any example by googling. What's the most common way that
>>>>> people use?
>>>>>
>>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is the list of the jars you use in your job; the driver will
>>>>>> send all those jars to each worker (otherwise the workers won't
>>>>>> have the classes your job needs). The easy way to go is to build a
>>>>>> fat jar with your code and all the libs you depend on, and then use
>>>>>> this utility to get the path:
>>>>>> SparkContext.jarOfClass(YourJob.getClass)
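>>>>>> Wired into a job it looks roughly like this (a sketch; YourJob
>>>>>> stands in for your own class, and in 0.8 jarOfClass returns a Seq
>>>>>> of paths):
>>>>>>
>>>>>> import org.apache.spark.SparkContext
>>>>>>
>>>>>> object YourJob {
>>>>>>   def main(args: Array[String]) {
>>>>>>     // Find the jar this class was loaded from, i.e. the fat jar
>>>>>>     // the job was launched with, and ship it to the workers
>>>>>>     val jars = SparkContext.jarOfClass(this.getClass)
>>>>>>     val sc = new SparkContext(args(0), "YourJob",
>>>>>>       System.getenv("SPARK_HOME"), jars)
>>>>>>     // ... job body ...
>>>>>>     sc.stop()
>>>>>>   }
>>>>>> }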
>>>>>> 2014/1/2 Aureliano Buendia <buendia...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I do not understand why the Spark context has an option for
>>>>>>> loading jars at runtime.
>>>>>>>
>>>>>>> As an example, consider this:
>>>>>>> https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36
>>>>>>>
>>>>>>> object BroadcastTest {
>>>>>>>   def main(args: Array[String]) {
>>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>>       System.getenv("SPARK_HOME"),
>>>>>>>       Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> This is *the* example, or *the* application, that we want to run;
>>>>>>> what is SPARK_EXAMPLES_JAR supposed to be? In this particular
>>>>>>> case, the BroadcastTest example is self-contained, so why would it
>>>>>>> want to load other, unrelated example jars?
>>>>>>>
>>>>>>> Finally, how does this help a real world Spark application?
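To come back to the broadcast point at the top of this thread: for
reference, this is roughly what a real job does with a broadcast variable
(a sketch; the lookup table and the names are made up):

    import org.apache.spark.SparkContext

    object BroadcastSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "Broadcast Sketch")
        // The table is shipped once per worker instead of once per task
        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
        val resolved = sc.parallelize(Seq("a", "b", "a"))
          .map(key => lookup.value(key))
          .collect()
        println(resolved.mkString(","))
        sc.stop()
      }
    }

It is this shipping step that, per the discussion above, is
short-circuited in local mode in 0.8.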