Sorry, I need to clarify:

When you say:

   "When the docs say 'If your application is launched through Spark
   submit, then the application jar is automatically distributed to all
   worker nodes,' it is actually saying that your executors get their
   jars from the driver. This is true whether you're running in client
   mode or cluster mode."


Don't you mean the master, not the driver? I thought the whole point of confusion is that people expect the driver to distribute jars, but the jars actually have to be visible on the file system local to the master.

I see a lot of people tripped up by this, and a nice mail from Greg Hill to the list cleared it up for me, but now I am confused again. I am a couple of days away from having a way to test this myself, so I am just "in theory" right now.

   On 12/29/2015 05:18 AM, Greg Hill wrote:
    Yes, you have misunderstood, but so did I. So the problem is that
    --deploy-mode cluster runs the Driver on the cluster as well, and
    you don't know which node it's going to run on, so every node needs
    access to the JAR. spark-submit does not pass the JAR along to the
    Driver, but the Driver will pass it to the executors. I ended up
    putting the JAR in HDFS and passing an hdfs:// path to
    spark-submit. This is a subtle difference from Spark on YARN, which
    does pass the JAR along to the Driver automatically, and IMO should
    probably be fixed in spark-submit. It's really confusing for
    newcomers.


Thanks,

Jim


On 12/29/2015 04:36 PM, Daniel Valdivia wrote:
That makes things more clear! Thanks

Issue resolved


On Dec 29, 2015, at 2:43 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:

Thanks Andrew for this awesome explanation :)


On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> wrote:


Let me clarify a few things for everyone:

There are three *cluster managers*: standalone, YARN, and Mesos. Each cluster manager can run in two *deploy modes*, client or cluster. In client mode, the driver runs on the machine that submitted the application (the client). In cluster mode, the driver runs on one of the worker machines in the cluster.

When I say "standalone cluster mode" I am referring to the standalone cluster manager running in cluster deploy mode.
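
For concreteness, here is a minimal sketch of how a deploy mode is picked at submit time (the master URL, class name, and jar name are made-up placeholders, not from this thread):

    # same application, same cluster manager, two different deploy modes
    spark-submit --master spark://master-host:7077 --deploy-mode client \
      --class com.example.MyApp myapp.jar
    spark-submit --master spark://master-host:7077 --deploy-mode cluster \
      --class com.example.MyApp myapp.jar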

Here's how the resources are distributed in each mode (omitting Mesos):

    *Standalone / YARN client mode.* The driver runs on the client
    machine (i.e. the machine that ran Spark submit), so it should
    already have access to the jars. The executors then pull the jars
    from an HTTP server started in the driver.

    *Standalone cluster mode.* Spark submit does /not/ upload your
    jars to the cluster, so all the resources you need must already
    be on all of the worker machines. The executors, however, still
    pull the jars from the driver as in client mode rather than
    finding them on their own local file systems.

    *YARN cluster mode.* Spark submit /does/ upload your jars to the
    cluster. In particular, it puts the jars in HDFS so your driver
    can just read from there. As in the other deployments, the
    executors pull the jars from the driver.
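
To make those jar-location rules concrete, here is a sketch of each submission (hosts and paths are hypothetical):

    # Client mode (standalone or YARN): a path on the submitting machine
    # is enough; executors fetch the jar from the driver's HTTP server
    spark-submit --master spark://master-host:7077 --deploy-mode client \
      --class com.example.MyApp /home/me/myapp.jar

    # Standalone cluster mode: nothing is uploaded for you, so this path
    # must resolve on every worker machine
    spark-submit --master spark://master-host:7077 --deploy-mode cluster \
      --class com.example.MyApp /opt/jars/myapp.jar

    # YARN cluster mode: spark-submit uploads the local jar to HDFS first
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp /home/me/myapp.jar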


When the docs say "If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes," it is actually saying that your executors get their jars from the driver. This is true whether you're running in client mode or cluster mode.

If the docs are unclear (and they seem to be), then we should update them. I have filed SPARK-12565 <https://issues.apache.org/jira/browse/SPARK-12565> to track this.

Please let me know if there's anything else I can help clarify.

Cheers,
-Andrew




2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

    Andrew,

    Now I see where the confusion lies. Standalone cluster mode (your
    link) is nothing but a combination of client mode and standalone
    mode (my link), without YARN.

    But I'm confused by this paragraph in your link:

    "If your application is launched through Spark submit, then the
    application jar is automatically distributed to all worker nodes.
    For any additional jars that your application depends on, you
    should specify them through the --jars flag using comma as a
    delimiter (e.g. --jars jar1,jar2)."

    That can't be true; this is only the case when Spark runs on top
    of YARN. Please correct me if I'm wrong.

    Thanks


    On Tuesday, December 29, 2015 2:54 PM, Andrew Or
    <and...@databricks.com> wrote:


    
    http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

    2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

        Greg,

        Can you please send me a doc describing the standalone
        cluster mode? Honestly, I never heard about it.

        The three different modes I've listed appear in the last
        paragraph of this doc: Running Spark Applications
        <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>



        On Tuesday, December 29, 2015 2:42 PM, Andrew Or
        <and...@databricks.com> wrote:


            The confusion here is the expression "standalone cluster
            mode". Either it's stand-alone or it's cluster mode but
            it can't be both.

        @Annabel That's not true. There /is/ a standalone cluster
        mode, where the driver runs on one of the workers instead of
        on the client machine. What you're describing is standalone
        client mode.

        2015-12-29 11:32 GMT-08:00 Annabel Melongo
        <melongo_anna...@yahoo.com>:

            Greg,

            The confusion here is the expression "standalone cluster
            mode". Either it's stand-alone or it's cluster mode but
            it can't be both.

             With this in mind, here's how jars are uploaded:

              1. Spark stand-alone mode: client and driver run on the
                 same machine; use the --packages option to submit a jar.
              2. YARN cluster mode: client and driver run on separate
                 machines; additionally, the driver runs as a thread in
                 the ApplicationMaster; use the --jars option with a
                 globally visible path to said jar.
              3. YARN client mode: client and driver run on the same
                 machine; the driver is *NOT* a thread in the
                 ApplicationMaster; use --packages to submit a jar.
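
            For reference, a sketch of how those two flags are passed
            on the command line (coordinates and paths below are
            invented); note that --packages takes Maven coordinates,
            while --jars takes paths:

              spark-submit --packages com.example:somelib:1.0 \
                --class com.example.MyApp myapp.jar
              spark-submit --jars /shared/dep1.jar,/shared/dep2.jar \
                --class com.example.MyApp myapp.jar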


            On Tuesday, December 29, 2015 1:54 PM, Andrew Or
            <and...@databricks.com> wrote:


            Hi Greg,

            It's actually intentional for standalone cluster mode not
            to upload jars. One of the reasons YARN takes at least 10
            seconds before running any simple application is that
            there's a lot of setup overhead (e.g. putting jars in
            HDFS). If this missing functionality is not documented
            somewhere, then we should add that.

            Also, the packages problem seems legitimate. Thanks for
            reporting it. I have filed
            https://issues.apache.org/jira/browse/SPARK-12559.

            -Andrew

            2015-12-29 4:18 GMT-08:00 Greg Hill
            <greg.h...@rackspace.com>:



                On 12/28/15, 5:16 PM, "Daniel Valdivia"
                <h...@danielvaldivia.com> wrote:

                >Hi,
                >
                >I'm trying to submit a job to a small spark cluster
                >running in stand-alone mode, however it seems like the
                >jar file I'm submitting to the cluster is "not found"
                >by the worker nodes.
                >
                >I might have understood wrong, but I thought the
                >Driver node would send this jar file to the worker
                >nodes, or should I manually send this file to each
                >worker node before I submit the job?

                Yes, you have misunderstood, but so did I. So the
                problem is that --deploy-mode cluster runs the Driver
                on the cluster as well, and you don't know which node
                it's going to run on, so every node needs access to
                the JAR. spark-submit does not pass the JAR along to
                the Driver, but the Driver will pass it to the
                executors. I ended up putting the JAR in HDFS and
                passing an hdfs:// path to spark-submit. This is a
                subtle difference from Spark on YARN, which does pass
                the JAR along to the Driver automatically, and IMO
                should probably be fixed in spark-submit. It's really
                confusing for newcomers.
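
                A sketch of that workaround (paths and names are
                invented):

                  # make the jar globally visible, then submit it
                  hdfs dfs -put myapp.jar /apps/myapp.jar
                  spark-submit --master spark://master-host:7077 \
                    --deploy-mode cluster --class com.example.MyApp \
                    hdfs:///apps/myapp.jar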

                Another problem I ran into that you also might is that
                --packages doesn't work with --deploy-mode cluster. It
                downloads the packages to a temporary location on the
                node running spark-submit, then passes those paths to
                the node that is running the Driver, but since that
                isn't the same machine, it can't find anything and
                fails. The driver process *should* be the one doing
                the downloading, but it isn't. I ended up having to
                create a fat JAR with all of the dependencies to get
                around that one.
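
                A sketch of that fat-JAR route, assuming the
                (hypothetical) project uses the sbt-assembly plugin:

                  # bundle the app and all of its dependencies into a
                  # single jar under target/, e.g.
                  # target/scala-2.10/myapp-assembly-1.0.jar
                  sbt assembly

                The one assembled jar can then go into HDFS and be
                submitted as above, with no --packages needed.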

                Greg


                