Hi all,
I'm new to the community and to Hadoop, and I'm looking for advice on optimal configuration for a very large server. I have a single server with 48 cores and 512GB of RAM, and I'm looking to perform an LDA analysis using Mahout across approximately 180 million documents. I have configured my NameNode and JobTracker; my questions are primarily around the optimal number of TaskTrackers and DataNodes. I have had no issues configuring multiple DataNodes, each of which can use its own disk location (the underlying disk is a SAN, solid state).
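For context, each additional DataNode instance runs from its own conf directory, with an hdfs-site.xml roughly like the one below (the property names are the Hadoop 1.x ones; the paths and ports are just examples from my setup, bumped from the defaults to avoid clashes):

  <property>
    <name>dfs.data.dir</name>
    <value>/san/dn2/data</value>  <!-- dedicated disk location for this instance -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50011</value>  <!-- default is 50010 -->
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50076</value>  <!-- default is 50075 -->
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50021</value>  <!-- default is 50020 -->
  </property>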

However, from my reading, the typical Hadoop architecture is a larger number of smaller nodes, with a single TaskTracker on each host. Could someone please clarify the following:

1. Can multiple TaskTrackers be run on a single host? If so, how is this configured? It doesn't seem possible to control the host:port that each one binds to.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendations for these settings seem to relate to the number of TaskTrackers, and in my architecture I potentially have only one, if only a single TaskTracker can be configured per host. Given the box spec, what should I set these values to? My current guesses are in the mapred-site.xml excerpt after these questions.

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum - do these control the number of JVM processes spawned to handle the respective steps? Is one TaskTracker configured with 48 equivalent to 48 TaskTrackers each configured with 1 for these values?

4. Are there benefits to a large number of DataNodes on a single large server? I can see value where the host has multiple IO interfaces and disk sets, to avoid IO contention; in my case, however, the SAN negates this. Are there still benefits to multiple DataNodes beyond resiliency and a potential increase in data transfer rate (i.e. assuming a single DataNode is limited and single-threaded)?

5. Any other thoughts/recommended settings?
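For reference, this is roughly what my mapred-site.xml contains at the moment (Hadoop 1.x property names; the values are untested guesses sized against 48 cores and 512GB of RAM, not settings I'm recommending):

  <!-- mapred-site.xml excerpt; all values are guesses for a 48-core / 512GB box -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>36</value>    <!-- concurrent map slots on this TaskTracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>12</value>    <!-- concurrent reduce slots; 36 + 12 = 48 cores -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>96</value>    <!-- job-level hint only, as I understand it -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>24</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>    <!-- per-task JVM heap; 48 x 4GB stays well under 512GB -->
  </property>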

Thanks
Dale
