Christian:
I have actually been looking for a more general Giraph benchmark
and would love to test/play with
what you have.
To answer your questions, we first need to assume a dedicated cluster where your test is the only job running.
For the number of mappers, we will assume that your cluster is configured in the pseudo-standard way of one mapper per core (i.e., the maximum number of mappers on each node equals the number of cores on that node). Because Giraph is CPU-centric, it is pretty important that you not oversubscribe the cores.
So for the number of mappers you should use <total cluster
mappers> - 1. This is because Giraph needs
one mapper for the master node.
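As a concrete illustration, here is a minimal sketch of that arithmetic; the node and core counts are made-up assumptions, and -w is the worker-count option taken by Giraph's GiraphRunner:

    public class MapperSizing {
        public static void main(String[] args) {
            int nodes = 10;        // assumption: a 10-node dedicated cluster
            int coresPerNode = 8;  // assumption: 8 cores per node
            int totalClusterMappers = nodes * coresPerNode; // one mapper per core
            int workers = totalClusterMappers - 1;          // one mapper reserved for the master
            System.out.println("Pass -w " + workers + " to GiraphRunner"); // 79 here
        }
    }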
HEAP_SIZE and mapred.map.child.java.opts are basically equivalent (though I prefer the latter). In any case, part of the answer depends on what besides Hadoop is running on each node. Generally, you want each mapper to have as much heap space as possible. The goal is to avoid swapping, leave enough memory free for the buffer cache, and give each task enough heap that it does not need to spend a ton of time in garbage collection.
I like to look at an idle node and see what the base overhead of used memory is. Then, depending on the IO requirements of my job (especially read IO), I reserve a portion of the remaining memory for the buffer cache and divide the remainder by the number of mappers.
That is, roughly, the top-down approach. A bottom-up approach would look at the size of the objects being managed in a mapper and compute upwards from there.
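To make the top-down arithmetic concrete, here is a minimal sketch; every number in it is an illustrative assumption for a hypothetical 64 GB, 8-core node:

    public class TopDownHeapSizing {
        public static void main(String[] args) {
            // All values are illustrative assumptions, not measurements.
            long totalRamMb     = 64 * 1024; // physical memory on the node
            long idleOverheadMb = 4 * 1024;  // used memory observed on an idle node
            long bufferCacheMb  = 8 * 1024;  // reserved for buffer cache (read-heavy job)
            int  mappersPerNode = 8;         // one mapper per core
            long heapPerMapperMb =
                (totalRamMb - idleOverheadMb - bufferCacheMb) / mappersPerNode;
            System.out.println("-Xmx" + heapPerMapperMb + "m"); // -Xmx6656m, about 6.5 GB
        }
    }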
That said, -Xmx4g would be the low end of what I would specify. You may also want to set the options which change how Java does garbage collection.
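For example, one way to set both on pre-YARN Hadoop looks like the sketch below; the heap value and the CMS collector flags are only common choices for long-lived, large-heap tasks, not a prescription:

    import org.apache.hadoop.conf.Configuration;

    public class ChildJvmOpts {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Heap size plus concurrent-collector flags; the values are
            // illustrative, so tune them for your own nodes and job.
            conf.set("mapred.map.child.java.opts",
                     "-Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC");
        }
    }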
Hope this helps.
On 6/27/2013 12:20 PM, Christian Krause wrote:
Hi,
I implemented a benchmark that allows me to generate an arbitrarily
large graph (depending on the number of iterations). Now I would like
to configure Giraph so that I can make the best use of my hardware for
this benchmark. Given the number of nodes in my cluster, their amount of main memory, and their number of cores, I am wondering how to determine the optimal parameters for Giraph / Hadoop, specifically:
- the number of mappers to use
- the HEAP_SIZE environment variable
- the memory specified in the mapred.map.child.java.opts property
(any other relevant parameters?)
Also, I was wondering how well Giraph can handle computations which start with a very small graph and mutate it into a very large one. For example, if I understand correctly, the number of mappers is not dynamically adjusted.
Any hints (or links to documentation) are highly appreciated.
Cheers,
Christian
--
========= mailto:db...@data-tactics.com ============
David W. Boyd
Director, Engineering
7901 Jones Branch, Suite 700
Mclean, VA 22102
office: +1-571-279-2122
fax: +1-703-506-6703
cell: +1-703-402-7908
============== http://www.data-tactics.com/ ============
First Robotic Mentor - FRC, FTC - www.iliterobotics.org
President - USSTEM Foundation - www.usstem.org