Hi everyone,

We are trying the latest AWS emr-4.0.0 with Spark, and my question is about YARN vs. standalone mode.

Our use case is:
- start a 100-150 node cluster every week
- run one heavy Spark job (5-6 hours)
- save the data to S3
- stop the cluster

Officially, AWS emr-4.0.0 comes with Spark on YARN. It is probably possible to hack EMR by creating a bootstrap script that stops YARN and starts the Spark master and slaves on each machine (to run Spark in standalone mode).

My questions are:
- Does Spark standalone provide a significant performance / memory improvement compared to YARN mode?
- Is it worth hacking the official EMR Spark on YARN setup to switch Spark to standalone mode?

I have already created a comparison table and would like you to check whether my understanding is correct.

Let's say an r3.2xlarge machine has 52GB of RAM available for Spark executor JVMs.

Standalone vs. YARN comparison:

                                                                 | STDLN  | YARN
can an executor allocate up to 52GB of RAM                       | yes    | yes
will an executor become unresponsive after using all 52GB (GC)   | yes    | yes
additional JVMs on a slave besides the Spark executor            | worker | node manager
are the additional JVMs lightweight                              | yes    | yes

Thank you,
Alex
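
A note on the 52GB row: under YARN the executor heap and the container size may not be the same number, because YARN sizes the container as heap plus an off-heap overhead (controlled by spark.yarn.executor.memoryOverhead). The rough sketch below works through that arithmetic. It assumes the overhead is max(384 MB, 10% of the executor heap), which is my understanding of the default in the Spark 1.4 line and should be checked against the version EMR actually ships. The function names are made up for illustration and are not part of any Spark API.

```python
# Rough arithmetic only; the 10% / 384 MB defaults are an assumption
# about spark.yarn.executor.memoryOverhead, not a verified fact.

def yarn_container_mb(executor_memory_mb, overhead_percent=10, overhead_min_mb=384):
    """Container size YARN would be asked for: heap plus off-heap overhead."""
    overhead_mb = max(overhead_min_mb, executor_memory_mb * overhead_percent // 100)
    return executor_memory_mb + overhead_mb


def max_heap_for_container_mb(container_mb, overhead_percent=10, overhead_min_mb=384):
    """Largest executor heap whose heap + overhead still fits in container_mb."""
    # Start from the closed-form estimate, then adjust for integer rounding
    # and for the minimum-overhead floor.
    heap_mb = container_mb * 100 // (100 + overhead_percent)
    while heap_mb > 0 and yarn_container_mb(heap_mb, overhead_percent, overhead_min_mb) > container_mb:
        heap_mb -= 1
    while yarn_container_mb(heap_mb + 1, overhead_percent, overhead_min_mb) <= container_mb:
        heap_mb += 1
    return heap_mb


if __name__ == "__main__":
    available_mb = 52 * 1024  # the 52GB from the table above
    print("container needed for a 52GB heap (MB):", yarn_container_mb(available_mb))
    print("largest heap fitting in a 52GB container (MB):", max_heap_for_container_mb(available_mb))
```

If that assumption holds, a full 52GB heap under YARN would need a container larger than 52GB, so the "yes" in the YARN column depends on whether the 52GB figure refers to the heap or to the whole container. In standalone mode there is no such container accounting, so the executor memory setting maps directly to the heap.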