And btw, if you suspect this is a "YARN issue", you can always launch and use Spark in Standalone mode, which uses its own embedded cluster resource manager. This is possible even when Spark has been deployed on CDH under YARN by the pre-canned CDH install scripts.

To achieve that:

1. Launch Spark in Standalone mode using its shell scripts. You may get some script errors initially because of the mess left in the scripts by the pre-canned CDH YARN install; you can fix these by editing the Spark standalone scripts, and the error messages will guide you.

2. Submit a Spark job to the standalone Spark master rather than to YARN, and that is it (see the example commands below the list).

3. Measure and compare the performance under YARN, Spark Standalone on the cluster, and Spark Standalone on a single machine.

Bear in mind that running Spark in Standalone mode while using YARN for all other apps would not be appropriate in production, because the two resource managers would compete for cluster resources - but you can use this setup for performance tests.
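A minimal sketch of steps 1 and 2, assuming a typical CDH parcel layout (the Spark scripts usually live under /opt/cloudera/parcels/CDH/lib/spark), that conf/slaves lists your worker hosts, and that "master-host", com.example.MyApp and my-app.jar are placeholders for your own values:

    # on the master node: start the standalone master (listens on port 7077 by default)
    ./sbin/start-master.sh

    # start a worker on every host listed in conf/slaves, pointed at the master
    ./sbin/start-slaves.sh

    # step 2: submit the job to the standalone master instead of YARN
    # (com.example.MyApp and my-app.jar are placeholders for your own app)
    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --class com.example.MyApp \
      my-app.jar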
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8'; 'user@spark.apache.org'
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Essentially, to change the performance yield of a software cluster infrastructure platform like Spark, you play with different permutations of:

- the number of CPU cores used by the Spark executors on every cluster node
- the amount of RAM allocated to each executor

How disk and network IO is used also plays a role, but that is influenced more by the algorithmic aspects of the app than by the YARN / Spark cluster config (except rack awareness etc.). When Spark runs under the management of YARN, the above is controlled / allocated by YARN:

https://spark.apache.org/docs/latest/running-on-yarn.html
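As a sketch of how those two knobs are set when submitting to YARN (the values are illustrative only, sized for the 6-node cluster of 4-core / 16 GB machines described below; com.example.MyApp and my-app.jar are again placeholders), note that YARN will cap executor sizes at its container limits (yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in yarn-site.xml):

    # illustrative sizing: one executor per node on a 6-node cluster of
    # 4-core / 16 GB machines, leaving headroom for the OS, the YARN
    # daemons and the executor memory overhead
    ./bin/spark-submit \
      --master yarn-cluster \
      --num-executors 6 \
      --executor-cores 3 \
      --executor-memory 10g \
      --class com.example.MyApp \
      my-app.jar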
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Thanks Evo. Yes, my concern is only about the infrastructure configurations. Basically, configuring YARN (NodeManager) + Spark is a must, and the default settings never work. What really happens is that we make changes as and when an issue is faced because of one of the numerous default configuration settings, and every time we have to google a lot to decide on the right values :)

Again, my issue is very specific to running Spark on YARN in a CDH5 environment. If you know of a link that covers optimum configuration settings for running Spark on YARN (CDH5), please share it.

Thanks,
Manish

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Well, there are a number of performance tuning guidelines in dedicated sections of the Spark documentation - have you read and applied them?

Secondly, any performance problem within a distributed cluster environment has two aspects:

1. Infrastructure
2. App algorithms

You seem to be focusing only on 1, but what you said about the performance difference between a single laptop and the cluster points to potential algorithmic inefficiency in your app when, e.g., distributing data and performing parallel processing. On a single laptop, data moves instantly between workers because all worker instances run in the memory of a single machine.

Regards,
Evo Eftimov

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

Hi,

Is there a document/link that describes the general configuration settings to achieve maximum Spark performance while running on CDH5? In our environment, we made a lot of changes (and are still making them) to get decent performance; otherwise our 6-node dev cluster with default configurations lags behind a single laptop running Spark. Having a standard checklist (taking a base node size of 4 CPUs, 16 GB RAM) would be really great. Any pointers in this regard will be really helpful. We are running Spark 1.2.0 on CDH 5.3.0.

Thanks,
Manish Gupta
Specialist | Sapient Global Markets
Green Boulevard (Tower C), 3rd & 4th Floor, Plot No. B-9A, Sector 62, Noida 201 301, Uttar Pradesh, India
Tel: +91 (120) 479 5000
Fax: +91 (120) 479 5001
Email: mgupt...@sapient.com
sapientglobalmarkets.com