Hi Everyone

We are trying out the latest AWS EMR release (emr-4.0.0) with Spark, and my
question is about YARN vs. standalone mode.
Our use case is:
- start a 100-150 node cluster every week
- run one heavy Spark job (5-6 hours)
- save the data to S3
- stop the cluster

Officially, emr-4.0.0 comes with Spark on YARN.
It is probably possible to hack EMR by creating a bootstrap script that stops
YARN and starts the Spark master and slave daemons on the machines (to run
Spark in standalone mode).

My questions are:
- Does Spark standalone provide a significant performance / memory
improvement compared to YARN mode?
- Is it worth hacking the official EMR Spark on YARN setup to switch Spark to
standalone mode?


I have already created a comparison table and would like you to check whether
my understanding is correct.

Let's say an r3.2xlarge machine has 52 GB of RAM available for Spark executor
JVMs.

Standalone vs. YARN comparison

                                                  Standalone | YARN
Can an executor allocate up to 52 GB of RAM?      yes        | yes
Will an executor become unresponsive because
  of GC after using all 52 GB of RAM?             yes        | yes
Additional JVMs on a slave besides the executor   Worker     | NodeManager
Are the additional JVMs lightweight?              yes        | yes
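
To make the 52 GB row concrete, here is a rough sizing sketch in Python. It
rests on an assumption I have not confirmed: that on YARN the container
requested for an executor is the heap plus an overhead of max(384 MB, 10% of
the heap), which I believe is the default for
spark.yarn.executor.memoryOverhead in Spark 1.4.x. The function names are made
up for illustration, and YARN's rounding to its allocation increment is
ignored.

```python
# Rough sizing sketch: how much executor heap fits in a YARN container.
# ASSUMPTION: overhead = max(384 MB, 10% of executor memory), believed to be
# the Spark 1.4.x default for spark.yarn.executor.memoryOverhead.
# Check the documentation for the Spark version actually deployed.

MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10


def yarn_container_mb(executor_memory_mb):
    """Memory YARN must grant for an executor with the given heap size."""
    overhead = max(MIN_OVERHEAD_MB, int(OVERHEAD_FACTOR * executor_memory_mb))
    return executor_memory_mb + overhead


def max_executor_memory_mb(available_mb):
    """Largest heap (in MB) whose container still fits in available_mb."""
    heap = available_mb
    # Walk downward until heap plus overhead fits in the available memory.
    while heap > 0 and yarn_container_mb(heap) > available_mb:
        heap -= 1
    return heap


if __name__ == "__main__":
    available = 52 * 1024  # the 52 GB figure used in the table above
    heap = max_executor_memory_mb(available)
    print("max --executor-memory on YARN: %d MB" % heap)
    print("resulting container size:      %d MB" % yarn_container_mb(heap))
```

If that assumption holds, the "yes | yes" in the first row needs a footnote:
in YARN mode the heap would have to be set somewhat below 52 GB so that heap
plus overhead fits in the container, whereas in standalone mode I am not aware
of an equivalent overhead reservation.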


Thank you

Alex
