>
> One part is passing the command line options, like “--master”, from the
> JVM launched by spark-submit to the JVM where SparkContext resides


Since I have full control over both the JVM and the Julia parts, I can pass
whatever options are needed to either side. But what exactly should be
passed? Currently the pipeline looks like this:

spark-submit JVM -> JuliaRunner -> julia process -> new JVM with
SparkContext
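
(This is also where I could inject whatever needs to be passed: a minimal
sketch of the JuliaRunner hop, assuming an env-var hand-off modeled on
PYSPARK_SUBMIT_ARGS; the variable name JULIA_SUBMIT_ARGS is just a
placeholder I made up.)

    import scala.collection.JavaConverters._

    object JuliaRunner {
      def main(args: Array[String]): Unit = {
        val juliaScript     = args(0)
        val passThroughArgs = args.drop(1)  // whatever spark-submit decides to forward
        val pb = new ProcessBuilder(Seq("julia", juliaScript).asJava)
        // hand the options to the julia process (and, transitively, to the JVM it spawns)
        pb.environment().put("JULIA_SUBMIT_ARGS", passThroughArgs.mkString(" "))
        pb.inheritIO()
        sys.exit(pb.start().waitFor())
      }
    }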


I want to make the last JVM's SparkContext understand that it should run on
YARN. Obviously, I can't pass the `--master yarn` option to the JVM itself.
Instead, I can set the system property "spark.master" = "yarn-client", but
this results in an error:

Retrying connect to server: 0.0.0.0/0.0.0.0:8032
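
(For reference, what happens in that last JVM is roughly equivalent to the
sketch below; the app name is made up.)

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("julia-app")              // placeholder
      .set("spark.master", "yarn-client")
    val sc = new SparkContext(conf)         // this is where the 0.0.0.0:8032 retries start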



So setting the master alone is definitely not enough. I tried manually
setting all the system properties that `spark-submit` adds to the JVM
(including "spark-submit=true", "spark.submit.deployMode=client", etc.), but
that didn't help either (a sketch of this attempt is below). The source code
is always a good reference, but for a stranger like me it's a little hard to
grasp the control flow in the SparkSubmit class.
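
(The extended attempt, as a sketch; I copied the property names from what I
observed, so the exact keys and the list itself may well be off.)

    import org.apache.spark.{SparkConf, SparkContext}

    // mimic what spark-submit itself seems to set before the driver starts
    System.setProperty("SPARK_SUBMIT", "true")              // not sure of the exact key
    System.setProperty("spark.master", "yarn-client")
    System.setProperty("spark.submit.deployMode", "client")
    System.setProperty("spark.app.name", "julia-app")       // placeholder
    val sc = new SparkContext(new SparkConf())              // still ends up retrying 0.0.0.0:8032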


> For pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit)


Can you elaborate on this? Does it mean that `spark-submit` creates a new
Python/R process that connects back to that same JVM (via py4j/RBackend) and
creates the SparkContext in it?


On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui <rui....@intel.com> wrote:

> There is much deployment preparation work handling different deployment
> modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize
> it briefly; you had better refer to the source code.
>
>
>
> Supporting running Julia scripts in SparkSubmit is more than implementing
> a ‘JuliaRunner’. One part is passing the command line options, like
> “--master”, from the JVM launched by spark-submit to the JVM where
> SparkContext resides, in the case that the two JVMs are not the same. For
> pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit), so no need to
> pass the command line options around. However, in your case, Julia
> interpreter launches an in-process JVM for SparkContext, which is a
> separate JVM from the one launched by spark-submit. So you need a way,
> typically an environment variable, like “SPARKR_SUBMIT_ARGS”
> for SparkR or “PYSPARK_SUBMIT_ARGS” for pyspark, to pass command line args
> to the in-process JVM in the Julia interpreter so that SparkConf can pick
> up the options.
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Tuesday, April 12, 2016 3:48 AM
> *To:* user <user@spark.apache.org>
> *Subject:* How does spark-submit handle Python scripts (and how to repeat
> it)?
>
>
>
> I'm working on a wrapper [1] around Spark for the Julia programming
> language [2], similar to PySpark. I've got it working with the Spark Standalone
> server by creating a local JVM and setting the master programmatically. However,
> this approach doesn't work with YARN (and probably Mesos), which require
> running via `spark-submit`.
>
>
>
> In the `SparkSubmit` class I see that for Python a special class,
> `PythonRunner`, is launched, so I tried to write a similar `JuliaRunner`,
> which essentially does the following:
>
>
>
> val pb = new ProcessBuilder("julia", juliaScript)
>
> val process = pb.start()
>
> process.waitFor()
>
>
>
> where `juliaScript` itself creates a new JVM and a `SparkContext` inside it
> WITHOUT setting the master URL. I then tried to launch this class using
>
>
>
> spark-submit --master yarn \
>
>                       --class o.a.s.a.j.JuliaRunner \
>
>                       project.jar my_script.jl
>
>
>
> I expected that `spark-submit` would set environment variables or
> something that SparkContext would then read and connect to the appropriate
> master. This didn't happen, however, and the process failed while trying to
> instantiate `SparkContext`, saying that the master is not specified.
>
>
>
> So what am I missing? How can I use `spark-submit` to run the driver in a
> non-JVM language?
>
>
>
>
>
> [1]: https://github.com/dfdx/Sparta.jl
>
> [2]: http://julialang.org/
>
