I am trying to decide what is best for my production-grade Spark
application(s).

YARN
=====

   1. YARN supports security. When Spark runs on YARN, communication
   between processes can use secure authentication through Kerberos.
   2. A Spark standalone cluster can run only Spark jobs and nothing else.
   With YARN you can run different kinds of jobs, such as M/R or Spark.
   3. The YARN scheduler has many features: queues, hierarchical queues
   with pluggable policies, automatic placement of apps into queues, easy
   installation, and ACLs for queues. These are missing from the standalone
   Spark scheduler. With YARN, resources are used more intelligently and
   dynamically.
   4. The Spark standalone scheduler requires each application to run an
   executor on every node in the cluster, whereas with YARN you can run
   executor(s) on a subset of nodes. I haven't tested this.
   5. On YARN, Spark can run the driver on the client machine itself
   (yarn-client), which requires the client application to keep running for
   the lifetime of the application. In yarn-cluster mode, the Spark driver
   runs in the Application Master, so the client program can either exit or
   do something else. This feature might be missing from the standalone
   scheduler.
   6. YARN provides finer control of resources such as CPU cores. The number
   of executors per node is configurable with YARN depending on the number of
   CPUs present on the node; this is missing on Mesos. It might become
   available in future releases.
   7. Standalone mode requires managing daemon services yourself. It also
   requires a ZooKeeper setup if the Spark master node needs to be highly
   available to avoid a single point of failure.
   8. Most existing users of Hadoop clusters have large amounts of data
   (TBs/PBs) residing on the cluster, so Spark applications can make use of
   data locality when running on a YARN cluster. Making that data available
   to a standalone cluster might be a challenge.
   9. I do not think the choice of YARN/Mesos/Standalone cluster affects the
   performance of a Spark application. This might require a test.
   10. Mesos and Spark were both developed at AMPLab, so they might be
   more compatible with each other. However, I do not have any working
   knowledge of Mesos.
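
To make items 5 and 6 concrete, here is a minimal sketch (Python, not run
against a real cluster) that only assembles spark-submit command lines for
the two YARN modes. The helper name, jar, main class, and queue name are
hypothetical placeholders. The yarn-client / yarn-cluster master values are
the ones named above; newer Spark releases spell this as --master yarn with
--deploy-mode client or cluster instead.

```python
import shlex


def spark_submit_args(master, app_jar, main_class,
                      num_executors=None, executor_cores=None,
                      executor_memory=None, queue=None):
    """Hypothetical helper: build a spark-submit argument list.

    It only assembles strings; nothing is launched.
    """
    args = ["spark-submit", "--master", master, "--class", main_class]
    if num_executors is not None:
        # --num-executors applies to YARN deployments
        args += ["--num-executors", str(num_executors)]
    if executor_cores is not None:
        args += ["--executor-cores", str(executor_cores)]
    if executor_memory is not None:
        args += ["--executor-memory", executor_memory]
    if queue is not None:
        # --queue selects the YARN queue the application is submitted to
        args += ["--queue", queue]
    args.append(app_jar)
    return args


# Placeholder values, chosen only for illustration.
common = dict(app_jar="my-app.jar", main_class="com.example.MyApp",
              num_executors=4, executor_cores=2, executor_memory="4g",
              queue="analytics")

# Driver runs on the submitting machine; the client must stay alive.
client_cmd = spark_submit_args("yarn-client", **common)
# Driver runs inside the YARN Application Master; the client may exit.
cluster_cmd = spark_submit_args("yarn-cluster", **common)

for cmd in (client_cmd, cluster_cmd):
    print(" ".join(shlex.quote(a) for a in cmd))
```

The two command lines differ only in the master value, which is the point of
item 5: the resource settings stay the same and only the driver location
changes.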



I was wondering what the advantages and disadvantages are of running Spark
over Mesos and Spark over a Standalone cluster. This will help me (and others
on the verge of using Spark systems) decide which direction to go.

Regards,
Deepak

On Wed, 12 Aug 2015 at 10:28 PM Tim Chen <t...@mesosphere.io> wrote:

> I'm not sure what you're looking for, since you can't really compare
> Standalone with YARN or Mesos, as Standalone is assuming the Spark
> workers/master owns the cluster, and YARN/Mesos is trying to share the
> cluster among different applications/frameworks.
>
> And when you refer to resource utilization, what exactly does it mean to
> you? Is it the ability to maximize the usage of your resources with
> multiple applications in mind, or just how much configuration Spark allows
> you in each mode?
>
> Tim
>
> On Wed, Aug 12, 2015 at 2:16 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
> wrote:
>
>> Do we have any comparisons, in terms of resource utilization and
>> scheduling, of running Spark in the three modes below?
>> 1) Standalone
>> 2) over YARN
>> 3) over Mesos
>>
>> Can someone share resources (thoughts/URLs) in this area?
>>
>>
>> --
>> Deepak
>>
>>
>
