We are now testing precisely what you ask about in our environment.
But Sandy's questions are relevant. The bigger issue is not Spark
vs. Yarn but "client" vs. "standalone" and where the client is
located on the network relative to the cluster.

The "client" options, which place the client/master remote from the
cluster, are useful for interactive queries but suffer considerable
network traffic overhead as the master schedules work on, and
transfers data to and from, the worker nodes on the cluster.

The "standalone" options place the master/client on the cluster. In
yarn-standalone, the master runs as a thread of the YARN Application
Master, which itself runs on a node in the cluster. Far less traffic
crosses the network, because the master is co-located with the worker
nodes and its scheduling and data communication has lower latency.

In my comparisons between yarn-client and yarn-standalone (so as not
to conflate YARN vs. Spark), yarn-client computation time is at least
double that of yarn-standalone! That holds at least for a job with
lots of stages and lots of client/worker communication but rather few
"collect" actions, so it is mainly scheduling that is relevant here.

I'll be posting more information as I have it available.

Kevin

On 03/03/2014 03:48 PM, Sandy Ryza wrote: