Anton Potehin wrote:
We have a question about this property. Is it really preferable to set
this parameter to several times the number of available hosts? We do
not understand why that should be so.

It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that all of the task slots are used. More tasks make recovery faster when a task fails, since less work needs to be redone.

Our spider is distributed among 3 machines. What value would be best
for this parameter in our case? What other factors might affect the
best value of this parameter?

When fetching, the total number of web hosts you're fetching from can also be a factor, since fetch tasks are hostwise-disjoint: all URLs from a given host go to the same task. If you're only fetching from a few hosts, then a large value for mapred.map.tasks will produce a few big fetch tasks and a bunch of empty ones. This could be a problem if the big ones are not allocated evenly among your nodes.
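
To illustrate the hostwise-disjoint effect, here is a minimal sketch, not the actual fetcher code. It assumes URLs are assigned to tasks by hashing the host name; the CRC32 hash, the function names and the example.* URLs are all hypothetical choices made for the illustration.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def task_for_url(url, num_tasks):
    """Pick a task index from the URL's host, so one host maps to one task.

    Assumption: a CRC32 hash of the host name stands in for whatever
    partitioning the real fetcher uses.
    """
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_tasks


def task_sizes(urls, num_tasks):
    """Return the number of URLs assigned to each of num_tasks tasks."""
    counts = Counter(task_for_url(u, num_tasks) for u in urls)
    return [counts.get(i, 0) for i in range(num_tasks)]


# Hypothetical crawl: 90 URLs, but only two distinct hosts.
urls = [f"http://a.example/page{i}" for i in range(50)]
urls += [f"http://b.example/page{i}" for i in range(40)]

sizes = task_sizes(urls, 30)
non_empty = sum(1 for s in sizes if s > 0)
print(sizes)
print("non-empty tasks:", non_empty)
```

With only two hosts, at most two of the 30 tasks receive any URLs, however large the task count is.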

I generally use 5*numHosts*mapred.tasktracker.tasks.maximum.
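
As a worked example for the 3-machine case, the arithmetic can be sketched as follows. The value 2 for mapred.tasktracker.tasks.maximum is only an assumption for illustration (use whatever your configuration sets), and the function names are hypothetical.

```python
def min_map_tasks(num_hosts, tasks_maximum):
    """Lower bound: one map task per available task slot."""
    return num_hosts * tasks_maximum


def suggested_map_tasks(num_hosts, tasks_maximum, factor=5):
    """Rule of thumb from this message: factor * numHosts * tasks.maximum."""
    return factor * min_map_tasks(num_hosts, tasks_maximum)


num_hosts = 3      # machines in the cluster, from the question
tasks_maximum = 2  # assumed value of mapred.tasktracker.tasks.maximum

print(min_map_tasks(num_hosts, tasks_maximum))        # 6
print(suggested_map_tasks(num_hosts, tasks_maximum))  # 30
```

Under that assumption, mapred.map.tasks should be at least 6, and the rule of thumb gives 30.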

Doug
