Jeremy Bensley (sent by Nabble.com) wrote:
I have been experimenting with MapReduce to perform some distributed tasks 
aside from the normal fetch/index routine of Nutch, and overall have had much 
success.

I'm glad to hear this!

Today I have been experimenting with running extended duration tasks, but have 
run into issues with the tasks timing out. I attempted to override the 
mapred.task.timeout option both in mapred-default.xml and in the actual code 
for my Mapper class, but my timeout durations remained steady at the default 
10 minutes.

I looked at TaskTracker and I see that it assigns some of the configuration options to static variables and then uses those variables for comparison. I have also seen that TaskTracker parses the configuration XML files each time a new task is assigned; I assume this is so that the TaskTracker options can be updated without restarting the process.

Code examples (from TaskTracker.java):

private static final int MAX_CURRENT_TASKS = NutchConf.get().getInt("mapred.tasktracker.tasks.maximum", 2);

static final long TASK_TIMEOUT = NutchConf.get().getInt("mapred.task.timeout", 10* 60 * 1000);

It seems to me that these parameters should be fetched each time instead of 
being stored static and loaded only once. I am just getting my feet wet with 
the whole MapReduce thing, so if this is the intended operation then I 
apologise.

For the task timeout, I agree, this would be a good idea. It would require some changes to the TaskTracker, so that a separate timeout could be kept for each running task.
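
To make the idea concrete, here is a rough sketch of keeping a separate timeout for each running task. It is not the actual TaskTracker code: the class PerTaskTimeouts, its methods, and the use of java.util.Properties as a stand-in for a job's configuration are all assumptions made for illustration. Only the property name mapred.task.timeout and the 10 minute default come from the code quoted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch: tracks a separate timeout for each running task. */
class PerTaskTimeouts {
    static final String TIMEOUT_KEY = "mapred.task.timeout";
    static final long DEFAULT_TIMEOUT_MS = 10 * 60 * 1000L;

    /** State kept per running task instead of in one static field. */
    static final class RunningTask {
        final long timeoutMs;
        long lastProgressMs;

        RunningTask(long timeoutMs, long startMs) {
            this.timeoutMs = timeoutMs;
            this.lastProgressMs = startMs;
        }
    }

    private final Map<String, RunningTask> running = new HashMap<>();

    /** Reads the timeout from the task's own job configuration at launch time. */
    void launch(String taskId, Properties jobConf, long nowMs) {
        String value = jobConf.getProperty(TIMEOUT_KEY);
        long timeoutMs = (value == null)
                ? DEFAULT_TIMEOUT_MS
                : Long.parseLong(value.trim());
        running.put(taskId, new RunningTask(timeoutMs, nowMs));
    }

    /** Records that the task reported progress at the given time. */
    void reportProgress(String taskId, long nowMs) {
        RunningTask task = running.get(taskId);
        if (task != null) {
            task.lastProgressMs = nowMs;
        }
    }

    /** Returns the ids of tasks that have been silent longer than their own timeout. */
    List<String> findTimedOut(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, RunningTask> entry : running.entrySet()) {
            RunningTask task = entry.getValue();
            if (nowMs - task.lastProgressMs > task.timeoutMs) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

With something along these lines, a job that sets a one hour timeout and a job that keeps the default could run on the same node, and each task would be compared against its own limit.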

I'm not so sure about the tasks per task tracker. The best value is probably node-specific (typically something a bit larger than the number of processors). Even if it were job-specific, a TaskTracker can, in theory, be running tasks from different jobs at the same time. Unless we want to prohibit that, a single limit on the number of tasks to run concurrently is required. How would you vary this with job?
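
As an illustration of what I mean by a single node-wide limit, here is a minimal sketch in which every task, whatever job it belongs to, takes a slot from one shared pool. The class NodeTaskSlots and its methods are hypothetical, and the "processors plus one" default is only one reading of "a bit larger than the number of processors".

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: one concurrency limit per node, shared by all jobs. */
class NodeTaskSlots {
    private final Semaphore slots;

    NodeTaskSlots(int maxTasks) {
        this.slots = new Semaphore(maxTasks);
    }

    /** Assumed default: a bit larger than the number of processors. */
    static int defaultMaxTasks() {
        return Runtime.getRuntime().availableProcessors() + 1;
    }

    /** Claims a slot for a new task; returns false if the node is full. */
    boolean tryStart() {
        return slots.tryAcquire();
    }

    /** Frees the slot held by a task that has finished. */
    void finish() {
        slots.release();
    }

    /** Number of slots currently free on this node. */
    int free() {
        return slots.availablePermits();
    }
}
```

Because the pool does not know which job a task came from, a per-job value for the maximum would have nothing to attach to unless tasks from different jobs were kept off the same node.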

Also, is this the proper place to report (possible) bugs, or should I just go 
directly to the bug reporting system, even if it's not a verified issue?

This is a fine place. Typically one should first check the bug database, then, if nothing is found, either file a bug or send an inquiry to the list. The best way to get a bug fixed is to submit a patch that fixes it.

Cheers,

Doug
