Hello Mark, I am no expert, but I can answer some of your questions.
On Oct 2, 2014 2:15 AM, "Mark Mandel" <mark.man...@gmail.com> wrote:
>
> Hi,
>
> So I'm super confused about how to take my Spark code and actually deploy and run it on a cluster.
>
> Let's assume I'm writing in Java, and we'll take a simple example such as: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaLogQuery.java, and this is a process I want to be running quite regularly (say more than once a minute).
>
> From the documentation (http://spark.apache.org/docs/1.1.0/submitting-applications.html), it reads as if I need to create a jar from the above code, and every time I want to run this code, I use ./bin/spark-submit to upload it to the cluster, which would then run it straight away.
>
> This would mean that every time I want to run my process, I need to have a .jar file travel over the network? Is this correct? (seems like this would be very slow? I should try it however).
>
> Doing some digging around the JavaDocs, I can see that the Java/SparkContext has the option to .addJar(), but I can't see any documentation that actually outlines how this can be used? If someone can point me towards an article or tutorial on how this is meant to work, I'd greatly appreciate it.
>
> It would seem like I could write a simple process that ran, quite probably on the same machine as master, that added a Jar through the SparkContext... but then, how to run the code from that Jar?
>
> Or does the Jar include the code that I would run, which would then create the SparkContext that would addJar itself? (now my head hurts).

On a single-node program this works. The way I think of it (though I might be wrong here) is that you specify the Spark configuration in your driver program: the master URL, the jar containing your code and all its dependencies, and the memory and serialization parameters. When the driver program creates a SparkContext, Spark picks up the jar you specified and ships it to the cluster, so the code you wrote is distributed like any other dependency. If all of the configuration is provided through SparkConf, including setJars, you can invoke your application with something like sbt 'runMain <className> <args>'. There is a rough sketch of such a driver at the bottom of this mail.

> Would Spark also be smart enough to know that the JAR was already uploaded, if addJar was called once it had already been uploaded?
>
> I'm not seeing this shown in the examples either.
>
> I'm really excited by what I see in Spark, but I am totally confused by how to actually get code up on Spark and make it run, and nothing I read seems to explain this aspect very well (at least to my thick head).
>
> I have seen: https://github.com/spark-jobserver/spark-jobserver, but from initial review, it looks like it will only work with Scala (because you need to use the ScalaJob trait), and I have a Java dependency.

You can implement that trait as an interface in Java, and inside your implementation you can wrap the SparkContext you are handed in a JavaSparkContext. The only non-intuitive part is returning a valid job from the validation method; from Java you have to write something like: return SparkJobValid$.MODULE$;. There is a rough sketch of this at the bottom of this mail as well.

> Any help on this aspect would be greatly appreciated!
>
> Mark
>
>
> --
> E: mark.man...@gmail.com
> T: http://www.twitter.com/neurotic
> W: www.compoundtheory.com
>
> 2 Devs from Down Under Podcast
> http://www.2ddu.com/
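Here is the driver sketch I mentioned. It is minimal and untested; the class name, master URL, jar path and memory setting are placeholders for illustration, not something from the Spark docs, so substitute your own values.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogQueryDriver {
    public static void main(String[] args) {
        // All cluster configuration lives in the driver program itself:
        // master URL, the assembled jar to ship, and runtime settings.
        SparkConf conf = new SparkConf()
                .setAppName("JavaLogQuery")
                .setMaster("spark://your-master-host:7077")          // placeholder master URL
                .setJars(new String[] { "/path/to/your-app-assembly.jar" }) // placeholder jar path
                .set("spark.executor.memory", "2g");

        // Creating the context connects to the master and distributes the jar(s) above.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A trivial job just to show the context is usable.
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
        System.out.println("count = " + lines.count());

        sc.stop();
    }
}

With the configuration carried in SparkConf like this, running the main class from the driver machine (for example sbt 'runMain LogQueryDriver', or plain java -cp) should be enough to kick the job off, without going through spark-submit each time.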
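And here is the jobserver sketch. I am going from memory of the spark-jobserver README, so treat the package name (spark.jobserver), the SparkJob method signatures and the SparkJobValid$.MODULE$ reference as assumptions to check against the version you are running; the job body itself is just a made-up placeholder.

import java.util.Arrays;

import com.typesafe.config.Config;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

import spark.jobserver.SparkJob;
import spark.jobserver.SparkJobValid$;
import spark.jobserver.SparkJobValidation;

public class JavaExampleJob implements SparkJob {

    @Override
    public Object runJob(SparkContext sc, Config config) {
        // The jobserver hands you a Scala SparkContext; wrap it so the
        // familiar Java API is available.
        JavaSparkContext jsc = new JavaSparkContext(sc);
        return jsc.parallelize(Arrays.asList(1, 2, 3)).count();
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        // SparkJobValid is a Scala case object; from Java you reach the
        // singleton instance through its MODULE$ field.
        return SparkJobValid$.MODULE$;
    }
}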