mapred.map.tasks is a suggestion to the engine, and there is really no reason to set it: the actual count is driven by the block-level partitioning of your input files (e.g. a file that spans 30 blocks will by default spawn 30 map tasks). As for mapred.reduce.tasks, just set it to whatever you set mapred.tasktracker.reduce.tasks.maximum to (the reasoning being that you are running all of this on a single tasktracker, so the two should essentially line up).
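As a concrete sketch of that alignment, the relevant mapred-site.xml entries would look like the following (the value 15 is illustrative, taken from the 2:1 split discussed elsewhere in this thread, not prescriptive):

```xml
<!-- mapred-site.xml (illustrative values for a single-tasktracker node) -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>15</value> <!-- upper bound on concurrent reduce child JVMs -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>15</value> <!-- default per-job reduce count, matched to the maximum -->
</property>
```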
By now I should be able to answer whether those are JT-level vs TT-level parameters, but I have heard one thing and personally experienced another, so I will leave that answer to someone who can confirm it 100%. Either way, I would recommend that your JT and TT site files not deviate from each other, for clarity, but you can change mapred.reduce.tasks at the application level: if you have a job that needs a global sort order, you can invoke it with mapred.reduce.tasks=1 in the job-level conf.

Matt

From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations

Thanks Matt,

Assuming, therefore, I run a single tasktracker and have 48 cores available, then based on your recommendation of 2:1 mappers to reducers I will be assigning:

mapred.tasktracker.map.tasks.maximum=30
mapred.tasktracker.reduce.tasks.maximum=15

This brings me to my question: "Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of tasktrackers. In my architecture I potentially have only 1, if a single tasktracker can only be configured on each host. What should I set these values to, therefore, considering the box spec?"

I have read:

mapred.map.tasks = 10x the number of tasktrackers
mapred.reduce.tasks = 2x the number of tasktrackers

Given I have a single tasktracker with multiple concurrent processes, does this equate to:

mapred.map.tasks = 300?
mapred.reduce.tasks = 30?

Some reasoning behind these values would be appreciated. I appreciate this is a little simplified and we will need to profile; I am just looking for a sensible starting position.

Thanks
Dale

On 15/12/2011 16:43, GOEKE, MATTHEW (AG/1000) wrote:

Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that machine: Namenode, Jobtracker, Datanode and Tasktracker.
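The job-level override mentioned above (mapred.reduce.tasks=1 for a global sort order) would typically be passed as a generic option on the command line, assuming the job's driver uses Hadoop's GenericOptionsParser/ToolRunner; the jar, class and path names here are placeholders:

```
# Single reducer => one globally sorted output file (placeholder jar/class/paths)
hadoop jar my-job.jar com.example.MyDriver -D mapred.reduce.tasks=1 /input /output
```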
There is no reason to run multiples of any of them, as the tasktracker will spawn multiple child JVMs, which is where you get your task parallelism. When you set your mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum configurations you limit the upper bound of child JVM creation, but this needs to be tuned based on the job profile (I don't know much about Mahout, but traditionally I set up clusters at 2:1 mappers to reducers until the profile proves otherwise). If you look at blogs / archives you will see that you can assign 1 child task per *logical* core (e.g. hyper-threaded core), and to be safe you will want 1 daemon per *physical* core, so you can divvy it up based on that recommendation.

To summarize the above: if you are sharing the same IO pipe / box, then there is no reason to have multiple daemons running, because you are not really gaining anything from that level of granularity. Others might disagree based on virtualization, but in your case I would say save yourself the headache and keep it simple.

Matt

-----Original Message-----
From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all,

New to the community and to Hadoop, and I was looking for some advice on optimal configurations for very large servers. I have a single server with 48 cores and 512GB of RAM, and am looking to perform an LDA analysis using Mahout across approx 180 million documents. I have configured my namenode and jobtracker; my questions are primarily around the optimal number of tasktrackers and datanodes.

I have had no issues configuring multiple datanodes, each of which could potentially utilise its own disk location (the underlying disk is SAN - solid state). However, from my reading, the typical architecture for Hadoop is a larger number of smaller nodes with a single tasktracker on each host.
Could someone please clarify the following:

1. Can multiple tasktrackers be run on a single host? If so, how is this configured, as it doesn't seem possible to control the host:port.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of tasktrackers. In my architecture I potentially have only 1, if a single tasktracker can only be configured on each host. What should I set these values to, therefore, considering the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum - do these control the number of JVM processes spawned to handle the respective steps? Is one tasktracker configured with a value of 48 equivalent to 48 tasktrackers each configured with a value of 1?

4. What are the benefits of a large number of datanodes on a single large server? I can see value where the host has multiple IO interfaces and disk sets, to avoid IO contention. In my case, however, a SAN negates this. Are there still benefits to multiple datanodes outside of resiliency and a potential increase in data transfer (i.e. assuming a single datanode is limited and single-threaded)?

5. Any other thoughts/recommended settings?

Thanks
Dale
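Pulling the sizing guidance from this thread together for the 48-core box, the slot arithmetic can be sketched as follows. This is an illustration of the 2:1 heuristic only, not a Hadoop tool; the assumption of reserving one physical core per daemon (Namenode, Jobtracker, Datanode, Tasktracker) comes from the discussion above, and real values should come from profiling the job:

```shell
# Rough 2:1 map/reduce slot split for a 48-core box (illustrative only)
CORES=48
DAEMONS=4                                   # reserve 1 physical core per daemon
USABLE=$((CORES - DAEMONS))                 # 44 cores left for child JVMs
REDUCE_SLOTS=$((USABLE / 3))                # 1 share of 3 goes to reducers -> 14
MAP_SLOTS=$((USABLE - REDUCE_SLOTS))        # remaining 2 shares to mappers -> 30
echo "map=$MAP_SLOTS reduce=$REDUCE_SLOTS"  # prints: map=30 reduce=14
```

This lands close to the 30/15 figures proposed earlier in the thread; the integer division rounds the reducer count down to 14.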