Hi all
I'm new to the community and to Hadoop, and I'm looking for advice on optimal
configuration for a very large server. I have a single server with 48 cores
and 512GB of RAM, and I want to run an LDA analysis with Mahout across
approximately 180 million documents. I have configured my NameNode and
JobTracker. My questions are primarily around the optimal number of
tasktrackers and datanodes. I have had no issues configuring multiple
datanodes, each of which can potentially use its own disk location (the
underlying storage is a solid-state SAN).
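For reference, each additional datanode instance is started from its own
configuration directory with overrides along the lines below; the paths and
port numbers here are just placeholders for what I actually use:

<!-- hdfs-site.xml overrides for one extra datanode instance -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/san/hdfs/dn2/data</value>      <!-- dedicated directory for this instance -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50110</value>           <!-- data transfer port, shifted to avoid clashes -->
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50175</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50120</value>
  </property>
</configuration>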
However, from my reading, the typical architecture for Hadoop is a larger
number of smaller nodes with a single tasktracker on each host. Could
someone please clarify the following:
1. Can multiple tasktrackers be run on a single host? If so, how is this
configured? It doesn't seem possible to control the host:port that each
instance binds to.
2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker
parameters? The recommendations for these settings seem to relate to the
number of tasktrackers. In my architecture I potentially have only one, if
only a single tasktracker can be configured per host. Given the box spec,
what should I set these values to? (My current guesses for these and the
question-3 maximums are in the mapred-site.xml snippet after this list.)
3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum - do these control the number of
JVM processes spawned for the respective phases? Is one tasktracker
configured with a value of 48 equivalent to 48 tasktrackers each configured
with a value of 1?
4. What are the benefits of a large number of datanodes on a single large
server? I can see value where the host has multiple IO interfaces and disk
sets, to avoid IO contention. In my case, however, the SAN negates this.
Are there still benefits to multiple datanodes beyond resiliency and a
potential increase in data transfer throughput, i.e. assuming a single
datanode is limited and single-threaded?
5. Any other thoughts/recommended settings?
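To make questions 1-3 concrete, this is roughly what I have in
mapred-site.xml on the host at the moment; the values are my own guesses,
which is exactly what I'm hoping someone can correct:

<configuration>
  <!-- question 2: per-job hints for the number of map/reduce tasks -->
  <property>
    <name>mapred.map.tasks</name>
    <value>96</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>24</value>
  </property>
  <!-- question 3: concurrent task slots on this tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>40</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>
  <!-- heap per task JVM; with 512GB there is plenty of headroom -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>
  <!-- question 1: the only tasktracker address setting I have found to override -->
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:50060</value>
  </property>
</configuration>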
Thanks
Dale