Thanks Andrew for this awesome explanation :)
On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> wrote:
Let me clarify a few things for everyone:
There are three *cluster managers*: standalone, YARN, and Mesos. Each
cluster manager can run in two *deploy modes*, client or cluster. In
client mode, the driver runs on the machine that submitted the
application (the client). In cluster mode, the driver runs on one of
the worker machines in the cluster.
When I say "standalone cluster mode" I am referring to the standalone
cluster manager running in cluster deploy mode.
Here's how the resources are distributed in each mode (omitting Mesos):
*Standalone / YARN client mode.* The driver runs on the client
machine (i.e. the machine that ran Spark submit), so it should
already have access to the jars. The executors then pull the jars
from an HTTP server started by the driver.
*Standalone cluster mode.* Spark submit does /not/ upload your
jars to the cluster, so all the resources you need must already
be on all of the worker machines. The executors, however, still
pull the jars from the driver as in client mode rather than
finding them on their own local file systems.
*YARN cluster mode.* Spark submit /does/ upload your jars to the
cluster. In particular, it puts the jars in HDFS so your driver
can just read from there. As in the other deployments, the
executors pull the jars from the driver.
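To make that concrete, here are minimal spark-submit sketches for each
case. The master URL, class name, and jar paths are placeholders, not
from any real cluster:

    # Standalone or YARN client mode: the jar only needs to exist on the
    # client machine; executors fetch it from the driver's HTTP server.
    spark-submit --master spark://master:7077 --deploy-mode client \
      --class com.example.MyApp /local/path/myapp.jar

    # Standalone cluster mode: the jar path must be valid on every worker,
    # since the driver may be launched on any of them.
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp /path/on/all/workers/myapp.jar

    # YARN cluster mode: spark-submit uploads the local jar to HDFS for you.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp /local/path/myapp.jar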
When the docs say "If your application is launched through Spark
submit, then the application jar is automatically distributed to all
worker nodes," it is actually saying that your executors get their
jars from the driver. This is true whether you're running in client
mode or cluster mode.
If the docs are unclear (and they seem to be), then we should update
them. I have filed SPARK-12565
<https://issues.apache.org/jira/browse/SPARK-12565> to track this.
Please let me know if there's anything else I can help clarify.
Cheers,
-Andrew
2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Andrew,
Now I see where the confusion lies. Standalone cluster mode, your
link, is nothing but a combination of client mode and standalone
mode, my link, without YARN.
But I'm confused by this paragraph in your link:
If your application is launched through Spark submit, then the
application jar is automatically distributed to all worker nodes.
For any additional jars that your
application depends on, you should specify them through
the |--jars| flag using comma as a delimiter (e.g. |--jars
jar1,jar2|).
That can't be true; this is only the case when Spark runs on top
of YARN. Please correct me if I'm wrong.
Thanks
On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> wrote:
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Greg,
Can you please send me a doc describing the standalone
cluster mode? Honestly, I've never heard of it.
The three different modes I've listed appear in the last
paragraph of this doc: Running Spark Applications
<http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote:
> The confusion here is the expression "standalone cluster
> mode". Either it's stand-alone or it's cluster mode but
> it can't be both.
@Annabel That's not true. There /is/ a standalone cluster
mode, where the driver runs on one of the workers instead of
on the client machine. What you're describing is standalone
client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
Greg,
The confusion here is the expression "standalone cluster
mode". Either it's stand-alone or it's cluster mode but
it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the
same machine; use the --packages option to submit a jar
2. YARN cluster mode: client and driver run on
separate machines; additionally, the driver runs as a thread
in the ApplicationMaster; use the --jars option with a
globally visible path to said jar
3. YARN client mode: client and driver run on the
same machine; the driver is *NOT* a thread in the
ApplicationMaster; use --packages to submit a jar
On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote:
Hi Greg,
It's actually intentional for standalone cluster mode not to
upload jars. One of the reasons YARN takes at least 10
seconds before running even a simple application is that
there's a lot of random overhead (e.g. putting jars in
HDFS). If this missing functionality is not documented
somewhere, then we should add that.
Also, the packages problem seems legitimate. Thanks for
reporting it. I have filed
https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:
On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>Hi,
>
>I'm trying to submit a job to a small Spark cluster running in
>standalone mode, however it seems like the jar file I'm submitting
>to the cluster is "not found" by the worker nodes.
>
>I might have understood wrong, but I thought the Driver node would
>send this jar file to the worker nodes, or should I manually send
>this file to each worker node before I submit the job?
Yes, you have misunderstood, but so did I. So the problem is
that --deploy-mode cluster runs the Driver on the cluster as
well, and you don't know which node it's going to run on, so
every node needs access to the JAR. spark-submit does not
pass the JAR along to the Driver, but the Driver will pass
it to the executors. I ended up putting the JAR in HDFS and
passing an hdfs:// path to spark-submit. This is a subtle
difference from Spark on YARN, which does pass the JAR along
to the Driver automatically, and IMO it should probably be
fixed in spark-submit. It's really confusing for newcomers.
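In case it's useful, the workaround looked roughly like this
(the paths and names here are made up for illustration):

    # put the JAR somewhere every node can reach
    hdfs dfs -put myapp.jar /user/someuser/myapp.jar

    # then hand spark-submit the hdfs:// path instead of a local one
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp hdfs:///user/someuser/myapp.jar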
Another problem I ran into, and you might too, is that
--packages doesn't work with --deploy-mode cluster. It
downloads the packages to a temporary location on the node
running spark-submit, then passes those paths to the node
that is running the Driver, but since that isn't the same
machine, it can't find anything and fails. The driver
process *should* be the one doing the downloading, but it
isn't. I ended up having to create a fat JAR with all of the
dependencies to get around that one.
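For the record, the fat JAR route was roughly the following,
using sbt-assembly as an example (your build tool may differ,
and the jar name and Scala version below are placeholders):

    # build one JAR that bundles the app and all of its dependencies
    sbt assembly

    # submit the assembled JAR; no --packages needed since everything
    # is bundled inside it
    spark-submit --master spark://master:7077 --deploy-mode cluster \
      --class com.example.MyApp target/scala-2.10/myapp-assembly-1.0.jar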
Greg
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org