Hello Mark, I am no expert, but I can answer some of your questions.
On Oct 2, 2014 2:15 AM, "Mark Mandel" <mark.man...@gmail.com> wrote:
>
> Hi,
>
> So I'm super confused about how to take my Spark code and actually deploy and run it on a cluster.
>
> Let's assume I'm writing in Java, and we'll take a simple example such as: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaLogQuery.java, and this is a process I want to be running quite regularly (say more than once a minute).
>
> From the documentation (http://spark.apache.org/docs/1.1.0/submitting-applications.html), it reads as if I need to create a jar from the above code, and every time I want to run this code, I use ./bin/spark-submit to upload it to the cluster, which would then run it straight away.
>
> This would mean that every time I want to run my process, I need to have a .jar file travel over the network? Is this correct? (seems like this would be very slow? I should try it however).
>
> Doing some digging around the JavaDocs, I can see that the Java/SparkContext has the option to .addJar(), but I can't see any documentation that actually outlines how this can be used? If someone can point me towards an article or tutorial on how this is meant to work, I'd greatly appreciate it.
>
> It would seem like I could write a simple process that ran, quite probably on the same machine as master, that added a Jar through the SparkContext... but then, how to run the code from that Jar?
>
> Or does the Jar include the code that I would run, which would then create the SparkContext that would addJar itself? (now my head hurts).

On a single-node program this works. The way I think of it (though I might be wrong here) is that you specify the Spark configuration in your driver program: the master URL, the jar containing your code and all its dependencies, and the memory and serialization parameters. When the driver program creates a SparkContext, Spark picks up the jar you specified and ships it to the cluster, so the code you wrote is distributed like any other dependency. If all of the configuration is provided through SparkConf, including setJars, you can invoke your application with something like sbt 'runMain <className> <args>'. There is a rough sketch of such a driver at the bottom of this mail.

> Would Spark also be smart enough to know that the JAR was already uploaded, if addJar was called once it had already been uploaded?
>
> I'm not seeing this shown in the examples either.
>
> I'm really excited by what I see in Spark, but I am totally confused by how to actually get code up on Spark and make it run, and nothing I read seems to explain this aspect very well (at least to my thick head).
>
> I have seen: https://github.com/spark-jobserver/spark-jobserver, but from initial review, it looks like it will only work with Scala (because you need to use the ScalaJob trait), and I have a Java dependency.

You can implement that trait as an interface in Java, and inside your implementation you can wrap the SparkContext you are handed in a JavaSparkContext. The only non-intuitive part is returning a valid job from the validation method; from Java you have to write something like: return SparkJobValid$.MODULE$;. There is a rough sketch of this at the bottom of this mail as well.

> Any help on this aspect would be greatly appreciated!
>
> Mark
>
>
> --
> E: mark.man...@gmail.com
> T: http://www.twitter.com/neurotic
> W: www.compoundtheory.com
>
> 2 Devs from Down Under Podcast
> http://www.2ddu.com/
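Here is the driver sketch I mentioned. It is minimal and untested; the class name, master URL, jar path and memory setting are placeholders for illustration, not something from the Spark docs, so substitute your own values.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogQueryDriver {
    public static void main(String[] args) {
        // All cluster configuration lives in the driver program itself:
        // master URL, the assembled jar to ship, and runtime settings.
        SparkConf conf = new SparkConf()
                .setAppName("JavaLogQuery")
                .setMaster("spark://your-master-host:7077")          // placeholder master URL
                .setJars(new String[] { "/path/to/your-app-assembly.jar" }) // placeholder jar path
                .set("spark.executor.memory", "2g");

        // Creating the context connects to the master and distributes the jar(s) above.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A trivial job just to show the context is usable.
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
        System.out.println("count = " + lines.count());

        sc.stop();
    }
}

With the configuration carried in SparkConf like this, running the main class from the driver machine (for example sbt 'runMain LogQueryDriver', or plain java -cp) should be enough to kick the job off, without going through spark-submit each time.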
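And here is the jobserver sketch. I am going from memory of the spark-jobserver README, so treat the package name (spark.jobserver), the SparkJob method signatures and the SparkJobValid$.MODULE$ reference as assumptions to check against the version you are running; the job body itself is just a made-up placeholder.

import java.util.Arrays;

import com.typesafe.config.Config;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

import spark.jobserver.SparkJob;
import spark.jobserver.SparkJobValid$;
import spark.jobserver.SparkJobValidation;

public class JavaExampleJob implements SparkJob {

    @Override
    public Object runJob(SparkContext sc, Config config) {
        // The jobserver hands you a Scala SparkContext; wrap it so the
        // familiar Java API is available.
        JavaSparkContext jsc = new JavaSparkContext(sc);
        return jsc.parallelize(Arrays.asList(1, 2, 3)).count();
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        // SparkJobValid is a Scala case object; from Java you reach the
        // singleton instance through its MODULE$ field.
        return SparkJobValid$.MODULE$;
    }
}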