AFAIK, hadoop.tmp.dir: used by the NameNode and DataNodes as the base for their local storage directories and metadata (I don't have much info on this).
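For reference, this is how it is typically declared in hadoop-site.xml; the /tmp/hadoop-${user.name} value below is the convention Stephen mentions (and the shipped default), shown purely for illustration:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>

One caveat on blowing it away at reboot: if I remember right, dfs.name.dir and dfs.data.dir default to subdirectories of hadoop.tmp.dir, so on NN/DN machines you want those pointed somewhere persistent before treating hadoop.tmp.dir as disposable.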
java.opts & ulimit : ulimit defines the maximum limit of virtual mem for task launched. java.opts is the amount of memory reserved for a task. When setting you need to account for memory set aside for hadoop daemons like tasktracker etc. mapred.map.tasks and mapred.reduce.tasks : these are job wide configurations and not per-task configurations for a node. Acts as a hint to the hadoop framework and explicitly setting them might not be always recommended, unless you want to run a no-reducer job. mapred.tasktracker.(map | reduce )tasks.maximum : Limit on concurrent tasks running on a machine, typically set according to cores & memory each map/reduce task will be using. Also, typically client and datanodes will be the same. Thanks, Amogh -----Original Message----- From: stephen mulcahy [mailto:stephen.mulc...@deri.org] Sent: Thursday, August 20, 2009 3:22 PM To: common-user@hadoop.apache.org Subject: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?) Hi folks, Sorry to cut across this discussion but I'm experiencing some similar confusion about where to change some parameters. In particular, I'm not entirely clear on how the following should be used - clarification welcome (I'm happy to pull some of this together on a blog once I get some clarity). In hadoop/conf/hadoop-site.xml hadoop.tmp.dir - when submitting a job from a client (not one of the hadoop cluster machines), does this specify a directory local to the client in which hadoop creates temporary files or is it a directory that on each hadoop machine on which the job runs? I notice that the cloudera configurator specifies this as /tmp/hadoop-${user.name} - this seems like a nice approach to use, is it safe for this tmp.dir to be blown away when a machine is rebooted? mapred.child.java.opts (-Xmx) and mapred.child.ulimit presumably these should be set totally differently on the namenode, data nodes and client machine (assuming they are different?). In the case of the namenode and data nodes, I assume they should be set quite large. In the case of the client, should they be set so that the number of tasks * allocated memory is roughly equal to the amount of memory free on each data node? mapred.map.tasks and mapred.reduce.tasks My understanding on the namenode and data nodes is that these should be set to less than the number of cores or less. Is that correct? For the client, should these be bumped closer to the total number of cores that are available in the overall cluster? mapred.tasktracker.tasks.maximum Does this work as a cap on mapred.map.tasks and mapred.reduce.tasks? Is it neccesary to use this as well as mapred.map.tasks and mapred.reduce.tasks? Finally, in hadoop/conf/hadoop-env.sh export HADOOP_HEAPSIZE=xxxx Should this be changed normally? If so, how large should it normally be? 50% of total system memory? Thanks for any input, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com