Thanks Andrew for this awesome explanation :)
On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> wrote:
Let me clarify a few things for everyone:
There are three *cluster managers*: standalone, YARN, and Mesos. Each
cluster manager can run in two *deploy modes*, client or cluster. In
client mode, the driver runs on the machine that submitted the
application (the client). In cluster mode, the driver runs on one of
the worker machines in the cluster.
When I say "standalone cluster mode" I am referring to the standalone
cluster manager running in cluster deploy mode.
Here's how the resources are distributed in each mode (omitting Mesos):
*Standalone / YARN client mode.* The driver runs on the client
machine (i.e. the machine that ran Spark submit), so it should
already have access to the jars. The executors then pull the jars
from an HTTP server started by the driver.
*Standalone cluster mode.* Spark submit does /not/ upload your
jars to the cluster, so all the resources you need must already
be on all of the worker machines. The executors, however, still
pull the jars from the driver as in client mode rather than
finding them on their own local file systems.
*YARN cluster mode.* Spark submit /does/ upload your jars to the
cluster. In particular, it puts the jars in HDFS so your driver
can just read from there. As in the other deployments, the
executors pull the jars from the driver.
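To make that concrete, here are minimal spark-submit sketches for each
case. The master URL, class name, and jar paths are placeholders, not
from any real cluster:

    # Standalone or YARN client mode: the jar only needs to exist on the
    # client machine; executors fetch it from the driver's HTTP server.
    spark-submit --master spark://master:7077 --deploy-mode client \
      --class com.example.MyApp /local/path/myapp.jar

    # Standalone cluster mode: the jar path must be valid on every worker,
    # since the driver may be launched on any of them.
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp /path/on/all/workers/myapp.jar

    # YARN cluster mode: spark-submit uploads the local jar to HDFS for you.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp /local/path/myapp.jar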
When the docs say "If your application is launched through Spark
submit, then the application jar is automatically distributed to all
worker nodes," it is actually saying that your executors get their
jars from the driver. This is true whether you're running in client
mode or cluster mode.
If the docs are unclear (and they seem to be), then we should update
them. I have filed SPARK-12565
<https://issues.apache.org/jira/browse/SPARK-12565> to track this.
Please let me know if there's anything else I can help clarify.
Cheers,
-Andrew
2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Andrew,
Now I see where the confusion lies. Standalone cluster mode, your
link, is nothing but a combination of client mode and standalone
mode, my link, without YARN.
But I'm confused by this paragraph in your link:
If your application is launched through Spark submit, then the
application jar is automatically distributed to all worker nodes.
For any additional jars that your
application depends on, you should specify them through
the |--jars| flag using comma as a delimiter (e.g. |--jars
jar1,jar2|).
That can't be true; this is only the case when Spark runs on top
of YARN. Please correct me if I'm wrong.
Thanks
On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> wrote:
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Greg,
Can you please send me a doc describing the standalone
cluster mode? Honestly, I've never heard of it.
The three different modes I've listed appear in the last
paragraph of this doc: Running Spark Applications
<http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote:
> The confusion here is the expression "standalone cluster
> mode". Either it's stand-alone or it's cluster mode but
> it can't be both.
@Annabel That's not true. There /is/ a standalone cluster
mode, where the driver runs on one of the workers instead of
on the client machine. What you're describing is standalone
client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Greg,
The confusion here is the expression "standalone cluster
mode". Either it's stand-alone or it's cluster mode but
it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the
same machine; use the --packages option to submit a jar
2. YARN cluster mode: client and driver run on
separate machines; additionally, the driver runs as a thread
in the ApplicationMaster; use the --jars option with a
globally visible path to said jar
3. YARN client mode: client and driver run on the
same machine; the driver is *NOT* a thread in the
ApplicationMaster; use --packages to submit a jar
On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote:
Hi Greg,
It's actually intentional for standalone cluster mode not to
upload jars. One of the reasons YARN takes at least 10
seconds before running even a simple application is that
there's a lot of random overhead (e.g. putting jars in
HDFS). If this missing functionality is not documented
somewhere, then we should add that.
Also, the packages problem seems legitimate. Thanks for
reporting it. I have filed
https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:
On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>Hi,
>
>I'm trying to submit a job to a small Spark cluster running in
>standalone mode, however it seems like the jar file I'm submitting
>to the cluster is "not found" by the worker nodes.
>
>I might have understood wrong, but I thought the Driver node would
>send this jar file to the worker nodes, or should I manually send
>this file to each worker node before I submit the job?
Yes, you have misunderstood, but so did I. So the problem is
that --deploy-mode cluster runs the Driver on the cluster as
well, and you don't know which node it's going to run on, so
every node needs access to the JAR. spark-submit does not
pass the JAR along to the Driver, but the Driver will pass
it to the executors. I ended up putting the JAR in HDFS and
passing an hdfs:// path to spark-submit. This is a subtle
difference from Spark on YARN, which does pass the JAR along
to the Driver automatically, and IMO it should probably be
fixed in spark-submit. It's really confusing for newcomers.
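In case it's useful, the workaround looked roughly like this
(the paths and names here are made up for illustration):

    # put the JAR somewhere every node can reach
    hdfs dfs -put myapp.jar /user/someuser/myapp.jar

    # then hand spark-submit the hdfs:// path instead of a local one
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp hdfs:///user/someuser/myapp.jar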
Another problem I ran into, and you might too, is that
--packages doesn't work with --deploy-mode cluster. It
downloads the packages to a temporary location on the node
running spark-submit, then passes those paths to the node
that is running the Driver, but since that isn't the same
machine, it can't find anything and fails. The driver
process *should* be the one doing the downloading, but it
isn't. I ended up having to create a fat JAR with all of the
dependencies to get around that one.
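For the record, the fat JAR route was roughly the following,
using sbt-assembly as an example (your build tool may differ,
and the jar name and Scala version below are placeholders):

    # build one JAR that bundles the app and all of its dependencies
    sbt assembly

    # submit the assembled JAR; no --packages needed since everything
    # is bundled inside it
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp target/scala-2.10/myapp-assembly-1.0.jar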
Greg
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org