Thanks for the description, Chris. Now that I understand the basic
model, I'm starting to see how the configuration is passed to the
slaves using the -d option of ec2-run-instances.

One config question: on our cluster (hadoop 0.17 with
INSTANCE_TYPE="m1.small") the conf/hadoop-default.xml has
mapred.reduce.tasks set to 1, and mapred.map.tasks set to 2.

From experimenting and reading the FAQ, it looks like those numbers
should be higher, unless you have a single-machine cluster. Maybe
there's something I'm missing, but by upping mapred.map.tasks and
mapred.reduce.tasks to 5 and 15 (in our job jar) we're getting much
better performance. Is there a reason hadoop-init doesn't build a
hadoop-site.xml file with higher or configurable values for these
fields?


Configuration values should be set in conf/hadoop-site.xml. The particular values you mention should probably be set per job; they generally have little to do with instance size and more to do with cluster size and the job being run.
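
As a rough sketch of a per-job override, the fragment below sets the two properties from the question in a hadoop-site.xml that ships with the job. The values 5 and 15 are simply the ones mentioned above, not recommendations, and whether your job picks up a site file from its jar is an assumption to verify against your setup.

```xml
<?xml version="1.0"?>
<!-- Illustrative per-job overrides; values taken from the question, not tuned advice. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>5</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>15</value>
  </property>
</configuration>
```

Note that mapred.map.tasks is only a hint to the framework; the actual number of maps also depends on how the input is split.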

Different instance sizes have mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum set accordingly (see hadoop-init), but again these may need to be tuned to your application (CPU-bound or I/O-bound).
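
For illustration only, the per-node limits would look something like this in conf/hadoop-site.xml on each slave. The value 2 is a placeholder assumption, not what hadoop-init actually writes for any given instance type; check the script for the real numbers.

```xml
<?xml version="1.0"?>
<!-- Placeholder values: maximum concurrent map and reduce tasks per tasktracker. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

These cap how many tasks run at once on a node, whereas mapred.map.tasks and mapred.reduce.tasks describe how many tasks a single job asks for across the cluster.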

ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



