And btw, if you suspect this is a "YARN issue", you can always launch and use Spark in Standalone mode, which uses its own embedded cluster resource manager. This is possible even when Spark has been deployed on CDH under YARN by the pre-canned CDH install scripts.

To achieve that:

1. Launch Spark in Standalone mode using its shell scripts. You may get some script errors initially because of the mess left in the scripts by the pre-canned CDH YARN install; you can fix these by editing the Spark standalone scripts, and the error messages will guide you.

2. Submit a Spark job to the standalone Spark master rather than to YARN, and that is it (see the example commands below the list).

3. Measure and compare the performance under YARN, Spark Standalone on the cluster, and Spark Standalone on a single machine.

Bear in mind that running Spark in Standalone mode while using YARN for all other apps would not be appropriate in production, because the two resource managers would compete for cluster resources - but you can use this setup for performance tests.
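A minimal sketch of steps 1 and 2, assuming a typical CDH parcel layout (the Spark scripts usually live under /opt/cloudera/parcels/CDH/lib/spark), that conf/slaves lists your worker hosts, and that "master-host", com.example.MyApp and my-app.jar are placeholders for your own values:

    # on the master node: start the standalone master (listens on port 7077 by default)
    ./sbin/start-master.sh

    # start a worker on every host listed in conf/slaves, pointed at the master
    ./sbin/start-slaves.sh

    # step 2: submit the job to the standalone master instead of YARN
    # (com.example.MyApp and my-app.jar are placeholders for your own app)
    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --class com.example.MyApp \
      my-app.jar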
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8'; 'user@spark.apache.org'
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Essentially, to change the performance yield of a software cluster infrastructure platform like Spark, you play with different permutations of:

- the number of CPU cores used by the Spark executors on every cluster node
- the amount of RAM allocated to each executor

How disk and network IO is used also plays a role, but that is influenced more by the algorithmic aspects of the app than by the YARN / Spark cluster config (except rack awareness etc.). When Spark runs under the management of YARN, the above is controlled / allocated by YARN:

https://spark.apache.org/docs/latest/running-on-yarn.html
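As a sketch of how those two knobs are set when submitting to YARN (the values are illustrative only, sized for the 6-node cluster of 4-core / 16 GB machines described below; com.example.MyApp and my-app.jar are again placeholders), note that YARN will cap executor sizes at its container limits (yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in yarn-site.xml):

    # illustrative sizing: one executor per node on a 6-node cluster of
    # 4-core / 16 GB machines, leaving headroom for the OS, the YARN
    # daemons and the executor memory overhead
    ./bin/spark-submit \
      --master yarn-cluster \
      --num-executors 6 \
      --executor-cores 3 \
      --executor-memory 10g \
      --class com.example.MyApp \
      my-app.jar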
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Thanks Evo. Yes, my concern is only about the infrastructure configurations. Basically, configuring YARN (NodeManager) + Spark is a must, and the default settings never work. What really happens is that we make changes as and when an issue is faced because of one of the numerous default configuration settings, and every time we have to google a lot to decide on the right values :)

Again, my issue is very specific to running Spark on YARN in a CDH5 environment. If you know of a link that covers optimum configuration settings for running Spark on YARN (CDH5), please share it.

Thanks,
Manish

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Well, there are a number of performance tuning guidelines in dedicated sections of the Spark documentation - have you read and applied them?

Secondly, any performance problem within a distributed cluster environment has two aspects:

1. Infrastructure
2. App algorithms

You seem to be focusing only on 1, but what you said about the performance difference between a single laptop and the cluster points to potential algorithmic inefficiency in your app when, e.g., distributing data and performing parallel processing. On a single laptop, data moves instantly between workers because all worker instances run in the memory of a single machine.

Regards,
Evo Eftimov

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

Hi,

Is there a document/link that describes the general configuration settings to achieve maximum Spark performance while running on CDH5? In our environment, we made a lot of changes (and are still making them) to get decent performance; otherwise our 6-node dev cluster with default configurations lags behind a single laptop running Spark. Having a standard checklist (taking a base node size of 4 CPUs, 16 GB RAM) would be really great. Any pointers in this regard will be really helpful. We are running Spark 1.2.0 on CDH 5.3.0.

Thanks,
Manish Gupta
Specialist | Sapient Global Markets
Green Boulevard (Tower C), 3rd & 4th Floor, Plot No. B-9A, Sector 62, Noida 201 301, Uttar Pradesh, India
Tel: +91 (120) 479 5000
Fax: +91 (120) 479 5001
Email: mgupt...@sapient.com
sapientglobalmarkets.com