We are now testing precisely what you ask about in our environment, and Sandy's questions are relevant.  The bigger issue, though, is not Spark vs. YARN but "client" vs. "standalone" mode, and where the client sits on the network relative to the cluster.

The "client" options, which locate the client/master remote from the cluster, are useful for interactive queries but suffer considerable network overhead as the master schedules tasks and exchanges data with the worker nodes on the cluster.  The "standalone" options locate the master/client on the cluster.  In yarn-standalone, the master runs inside the YARN ApplicationMaster, on a node in the cluster.  That means far less traffic, and lower latency for scheduling and data communication, because the master is co-located with the worker nodes.

In my comparisons between yarn-client and yarn-standalone (so as not to conflate YARN vs. Spark), yarn-client computation time is at least double that of yarn-standalone!  That holds at least for a job with many stages and a lot of client/worker communication but rather few "collect" actions, so it is mainly scheduling overhead that is relevant here.
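
To make the scheduling point concrete, here is a rough back-of-the-envelope sketch in Python. It is not a measurement and not how Spark accounts for time: it simply assumes each stage costs a fixed number of master/worker round trips before the next stage can start. The function name, the stage count, and both latency figures are hypothetical values chosen only for illustration.

```python
def stage_barrier_overhead_s(num_stages, round_trip_s, round_trips_per_stage=2):
    """Rough lower bound on time spent waiting on master/worker round trips.

    Assumption (not from Spark itself): every stage needs at least
    `round_trips_per_stage` round trips (e.g. launch and completion)
    before the next stage can be scheduled.
    """
    return num_stages * round_trips_per_stage * round_trip_s


# Hypothetical numbers, for illustration only.
NUM_STAGES = 200          # a job with many stages
REMOTE_RTT_S = 0.020      # master outside the cluster (assumed 20 ms round trip)
LOCAL_RTT_S = 0.0005      # master co-located with the workers (assumed 0.5 ms)

remote = stage_barrier_overhead_s(NUM_STAGES, REMOTE_RTT_S)
local = stage_barrier_overhead_s(NUM_STAGES, LOCAL_RTT_S)

print(f"remote master:     {remote:.2f} s of round-trip waiting")
print(f"co-located master: {local:.2f} s of round-trip waiting")
```

Under these assumed figures the overhead scales linearly with both the number of stages and the round-trip latency, which is why a job with many stages is more sensitive to where the master runs than a job with a few long stages.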

I'll be posting more information as I have it available.

Kevin


On 03/03/2014 03:48 PM, Sandy Ryza wrote:
Are you running in yarn-standalone mode or yarn-client mode?  Also, what YARN scheduler and what NodeManager heartbeat interval are you using?


On Sun, Mar 2, 2014 at 9:41 PM, polkosity <polkos...@gmail.com> wrote:
Thanks for the advice Mayur.

I thought I'd report back on the performance difference...  Spark standalone
mode has executors processing at capacity in under a second :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

