mapred.map.tasks is a suggestion to the engine, and there is really no reason to set it: the actual count is driven by the block-level partitioning of your input files (e.g. a file that spans 30 blocks will by default spawn 30 map tasks). As for mapred.reduce.tasks, just set it to whatever you set mapred.tasktracker.reduce.tasks.maximum to (the reasoning being that you are running all of this on a single tasktracker, so the two should essentially line up).
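As a concrete sketch of that alignment, the relevant mapred-site.xml entries would look like the following (the value 15 is illustrative, taken from the 2:1 split discussed elsewhere in this thread, not prescriptive):

```xml
<!-- mapred-site.xml (illustrative values for a single-tasktracker node) -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>15</value> <!-- upper bound on concurrent reduce child JVMs -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>15</value> <!-- default per-job reduce count, matched to the maximum -->
</property>
```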
By now I should be able to answer whether those are JT-level vs TT-level parameters, but I have heard one thing and personally experienced another, so I will leave that answer to someone who can confirm it 100%. Either way, I would recommend that your JT and TT site files not deviate from each other, for clarity, but you can change mapred.reduce.tasks at the application level: if you have a job that needs a global sort order, you can invoke it with mapred.reduce.tasks=1 in the job-level conf.

Matt

From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations

Thanks Matt,

Assuming, therefore, I run a single tasktracker and have 48 cores available, then based on your recommendation of 2:1 mappers to reducers I will be assigning:

mapred.tasktracker.map.tasks.maximum=30
mapred.tasktracker.reduce.tasks.maximum=15

This brings me to my question: "Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of tasktrackers. In my architecture I potentially have only 1, if a single tasktracker can only be configured on each host. What should I set these values to, therefore, considering the box spec?"

I have read:

mapred.map.tasks = 10x the number of tasktrackers
mapred.reduce.tasks = 2x the number of tasktrackers

Given I have a single tasktracker with multiple concurrent processes, does this equate to:

mapred.map.tasks = 300?
mapred.reduce.tasks = 30?

Some reasoning behind these values would be appreciated. I appreciate this is a little simplified and we will need to profile; I am just looking for a sensible starting position.

Thanks
Dale

On 15/12/2011 16:43, GOEKE, MATTHEW (AG/1000) wrote:

Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that machine: Namenode, Jobtracker, Datanode and Tasktracker.
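The job-level override mentioned above (mapred.reduce.tasks=1 for a global sort order) would typically be passed as a generic option on the command line, assuming the job's driver uses Hadoop's GenericOptionsParser/ToolRunner; the jar, class and path names here are placeholders:

```
# Single reducer => one globally sorted output file (placeholder jar/class/paths)
hadoop jar my-job.jar com.example.MyDriver -D mapred.reduce.tasks=1 /input /output
```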
There is no reason to run multiples of any of them, as the tasktracker will spawn multiple child JVMs, which is where you get your task parallelism. When you set your mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum configurations you limit the upper bound of child JVM creation, but this needs to be tuned based on the job profile (I don't know much about Mahout, but traditionally I set up clusters at 2:1 mappers to reducers until the profile proves otherwise). If you look at blogs / archives you will see that you can assign 1 child task per *logical* core (e.g. hyper-threaded core), and to be safe you will want 1 daemon per *physical* core, so you can divvy it up based on that recommendation.

To summarize the above: if you are sharing the same IO pipe / box, then there is no reason to have multiple daemons running, because you are not really gaining anything from that level of granularity. Others might disagree based on virtualization, but in your case I would say save yourself the headache and keep it simple.

Matt

-----Original Message-----
From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all,

New to the community and to Hadoop, and I was looking for some advice on optimal configurations for very large servers. I have a single server with 48 cores and 512GB of RAM, and am looking to perform an LDA analysis using Mahout across approx 180 million documents. I have configured my namenode and jobtracker; my questions are primarily around the optimal number of tasktrackers and datanodes.

I have had no issues configuring multiple datanodes, each of which could potentially utilise its own disk location (the underlying disk is SAN - solid state). However, from my reading, the typical architecture for Hadoop is a larger number of smaller nodes with a single tasktracker on each host.
Could someone please clarify the following:

1. Can multiple tasktrackers be run on a single host? If so, how is this configured, as it doesn't seem possible to control the host:port.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of tasktrackers. In my architecture I potentially have only 1, if a single tasktracker can only be configured on each host. What should I set these values to, therefore, considering the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum - do these control the number of JVM processes spawned to handle the respective steps? Is one tasktracker configured with a value of 48 equivalent to 48 tasktrackers each configured with a value of 1?

4. What are the benefits of a large number of datanodes on a single large server? I can see value where the host has multiple IO interfaces and disk sets, to avoid IO contention. In my case, however, a SAN negates this. Are there still benefits to multiple datanodes outside of resiliency and a potential increase in data transfer (i.e. assuming a single datanode is limited and single-threaded)?

5. Any other thoughts/recommended settings?

Thanks
Dale
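Pulling the sizing guidance from this thread together for the 48-core box, the slot arithmetic can be sketched as follows. This is an illustration of the 2:1 heuristic only, not a Hadoop tool; the assumption of reserving one physical core per daemon (Namenode, Jobtracker, Datanode, Tasktracker) comes from the discussion above, and real values should come from profiling the job:

```shell
# Rough 2:1 map/reduce slot split for a 48-core box (illustrative only)
CORES=48
DAEMONS=4                                   # reserve 1 physical core per daemon
USABLE=$((CORES - DAEMONS))                 # 44 cores left for child JVMs
REDUCE_SLOTS=$((USABLE / 3))                # 1 share of 3 goes to reducers -> 14
MAP_SLOTS=$((USABLE - REDUCE_SLOTS))        # remaining 2 shares to mappers -> 30
echo "map=$MAP_SLOTS reduce=$REDUCE_SLOTS"  # prints: map=30 reduce=14
```

This lands close to the 30/15 figures proposed earlier in the thread; the integer division rounds the reducer count down to 14.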