Hi all
I'm new to the community and to Hadoop, and I'm looking for advice on optimal
configuration for a very large server. I have a single server with 48 cores
and 512GB of RAM, and I want to run an LDA analysis with Mahout across
approximately 180 million documents. I have configured my NameNode and
JobTracker. My questions are primarily around the optimal number of
tasktrackers and datanodes. I have had no issues configuring multiple
datanodes, each of which can potentially use its own disk location (the
underlying storage is a solid-state SAN).
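For reference, each additional datanode instance is started from its own
configuration directory with overrides along the lines below; the paths and
port numbers here are just placeholders for what I actually use:

<!-- hdfs-site.xml overrides for one extra datanode instance -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/san/hdfs/dn2/data</value>      <!-- dedicated directory for this instance -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50110</value>           <!-- data transfer port, shifted to avoid clashes -->
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50175</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50120</value>
  </property>
</configuration>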
However, from my reading, the typical architecture for Hadoop is a larger
number of smaller nodes with a single tasktracker on each host. Could
someone please clarify the following:
1. Can multiple tasktrackers be run on a single host? If so, how is this
configured? It doesn't seem possible to control the host:port that each
instance binds to.
2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker
parameters? The recommendations for these settings seem to relate to the
number of tasktrackers. In my architecture I potentially have only one, if
only a single tasktracker can be configured per host. Given the box spec,
what should I set these values to? (My current guesses for these and the
question-3 maximums are in the mapred-site.xml snippet after this list.)
3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum - do these control the number of
JVM processes spawned for the respective phases? Is one tasktracker
configured with a value of 48 equivalent to 48 tasktrackers each configured
with a value of 1?
4. What are the benefits of a large number of datanodes on a single large
server? I can see value where the host has multiple IO interfaces and disk
sets, to avoid IO contention. In my case, however, the SAN negates this.
Are there still benefits to multiple datanodes beyond resiliency and a
potential increase in data transfer throughput, i.e. assuming a single
datanode is limited and single-threaded?
5. Any other thoughts/recommended settings?
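To make questions 1-3 concrete, this is roughly what I have in
mapred-site.xml on the host at the moment; the values are my own guesses,
which is exactly what I'm hoping someone can correct:

<configuration>
  <!-- question 2: per-job hints for the number of map/reduce tasks -->
  <property>
    <name>mapred.map.tasks</name>
    <value>96</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>24</value>
  </property>
  <!-- question 3: concurrent task slots on this tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>40</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>
  <!-- heap per task JVM; with 512GB there is plenty of headroom -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>
  <!-- question 1: the only tasktracker address setting I have found to override -->
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:50060</value>
  </property>
</configuration>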
Thanks
Dale