AFAIK, hadoop.tmp.dir: used by the NameNode and DataNodes as the base for their local storage directories and metadata (I don't have much info on this).
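For reference, this is how it is typically declared in hadoop-site.xml; the /tmp/hadoop-${user.name} value below is the convention Stephen mentions (and the shipped default), shown purely for illustration:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>

One caveat on blowing it away at reboot: if I remember right, dfs.name.dir and dfs.data.dir default to subdirectories of hadoop.tmp.dir, so on NN/DN machines you want those pointed somewhere persistent before treating hadoop.tmp.dir as disposable.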
java.opts & ulimit : ulimit defines the maximum limit of virtual mem for task launched. java.opts is the amount of memory reserved for a task. When setting you need to account for memory set aside for hadoop daemons like tasktracker etc. mapred.map.tasks and mapred.reduce.tasks : these are job wide configurations and not per-task configurations for a node. Acts as a hint to the hadoop framework and explicitly setting them might not be always recommended, unless you want to run a no-reducer job. mapred.tasktracker.(map | reduce )tasks.maximum : Limit on concurrent tasks running on a machine, typically set according to cores & memory each map/reduce task will be using. Also, typically client and datanodes will be the same. Thanks, Amogh -----Original Message----- From: stephen mulcahy [mailto:stephen.mulc...@deri.org] Sent: Thursday, August 20, 2009 3:22 PM To: common-user@hadoop.apache.org Subject: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?) Hi folks, Sorry to cut across this discussion but I'm experiencing some similar confusion about where to change some parameters. In particular, I'm not entirely clear on how the following should be used - clarification welcome (I'm happy to pull some of this together on a blog once I get some clarity). In hadoop/conf/hadoop-site.xml hadoop.tmp.dir - when submitting a job from a client (not one of the hadoop cluster machines), does this specify a directory local to the client in which hadoop creates temporary files or is it a directory that on each hadoop machine on which the job runs? I notice that the cloudera configurator specifies this as /tmp/hadoop-${user.name} - this seems like a nice approach to use, is it safe for this tmp.dir to be blown away when a machine is rebooted? mapred.child.java.opts (-Xmx) and mapred.child.ulimit presumably these should be set totally differently on the namenode, data nodes and client machine (assuming they are different?). In the case of the namenode and data nodes, I assume they should be set quite large. In the case of the client, should they be set so that the number of tasks * allocated memory is roughly equal to the amount of memory free on each data node? mapred.map.tasks and mapred.reduce.tasks My understanding on the namenode and data nodes is that these should be set to less than the number of cores or less. Is that correct? For the client, should these be bumped closer to the total number of cores that are available in the overall cluster? mapred.tasktracker.tasks.maximum Does this work as a cap on mapred.map.tasks and mapred.reduce.tasks? Is it neccesary to use this as well as mapred.map.tasks and mapred.reduce.tasks? Finally, in hadoop/conf/hadoop-env.sh export HADOOP_HEAPSIZE=xxxx Should this be changed normally? If so, how large should it normally be? 50% of total system memory? Thanks for any input, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com