Re: bot-traps and refetching

2005-08-30 Thread Michael Ji
Hi Kelvin,

I believe my previous email about controlled crawling
confused you a bit; my thinking there was not yet
mature. Still, I believe that controlled crawling is
very important for an efficient vertical crawling
application in general.

After reviewing our previous discussion, I think the
solutions for bot-traps and refetching in OC might be
combined into one.

1) Refetching will look at the FetcherOutput of the
last run and queue the URLs by domain name (for the
HTTP 1.1 protocol), as your FetcherThread does.

2) We might simply count the number of URLs within the
same domain (on the fly, as they are queued?). If that
count exceeds a certain threshold, we stop adding new
URLs for that domain; this is equivalent to controlled
crawling, but limiting breadth rather than depth. A
rough sketch follows.
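
Rough sketch (DomainCappedQueue is a made-up name for
illustration, not an existing OC class):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class DomainCappedQueue {
  private final int maxUrlsPerDomain;
  private final Map domainCounts = new HashMap();

  public DomainCappedQueue(int maxUrlsPerDomain) {
    this.maxUrlsPerDomain = maxUrlsPerDomain;
  }

  /** Returns false once a domain has hit the cap (bot-trap guard). */
  public synchronized boolean offer(URL url) {
    String domain = url.getHost();
    Integer count = (Integer) domainCounts.get(domain);
    int c = (count == null) ? 0 : count.intValue();
    if (c >= maxUrlsPerDomain) {
      return false; // stop adding new URLs for this domain
    }
    domainCounts.put(domain, new Integer(c + 1));
    // ... hand the URL to the real per-domain fetch queue here ...
    return true;
  }
}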

Will it work as proposed?

thanks, 

Michael Ji,


--- Kelvin Tan [EMAIL PROTECTED] wrote:

 Michael,
 
 On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji
 wrote:
  Hi Kelvin:
 
  2) refetching
 
  If OC's fetchlist is online (memory-resident), the next time we
  refetch we have to restart from seeds.txt once again. Is that right?
 
 
 Maybe with the current implementation. But if you
 implement a CrawlSeedSource that reads in the
 FetcherOutput directory in the Nutch segment, then
 you can seed a crawl using what's already been
 fetched.
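
A rough sketch of this idea (the CrawlSeedSource
interface shown here is guessed for illustration; check
the actual OC source for the real signature):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: seeds a crawl from a previous run's
 *  FetcherOutput instead of seeds.txt. */
public class FetcherOutputSeedSource /* implements CrawlSeedSource */ {
  private final File segmentDir;

  public FetcherOutputSeedSource(File segmentDir) {
    this.segmentDir = segmentDir;
  }

  /** Hypothetical method: return previously fetched URLs as new seeds. */
  public List getSeedUrls() {
    List seeds = new ArrayList();
    // Open the FetcherOutput directory under segmentDir, iterate over
    // its entries, and add each entry's URL string to seeds.
    // (Reading code omitted; it depends on the Nutch segment format.)
    return seeds;
  }
}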


Re: [mapred] Possible bug, static primitives holding config values?

2005-08-30 Thread Doug Cutting

Jeremy Bensley (sent by Nabble.com) wrote:

I have been experimenting with MapReduce to perform some distributed tasks 
aside from the normal fetch/index routine of Nutch, and overall have had much 
success.


I'm glad to hear this!


Today I have been experimenting with running extended-duration tasks, but have 
run into issues with the tasks timing out. I attempted to override the 
mapred.tasks.timeout option both in mapred-default.xml and in the actual code 
for my Mapper class, but my timeout durations remained steady at the default 10 
minutes.

I looked at TaskTracker and I see that it assigns some of the configuration 
options to static variables and then uses those variables for comparison. I 
have also seen that TaskTracker parses the configuration XML files each time a 
new task is assigned; I assume this is so that the TaskTracker options can be 
updated without restarting the process.


Code examples (from TaskTracker.java):

private static final int MAX_CURRENT_TASKS =
  NutchConf.get().getInt("mapred.tasktracker.tasks.maximum", 2);

static final long TASK_TIMEOUT =
  NutchConf.get().getInt("mapred.task.timeout", 10 * 60 * 1000);


It seems to me that these parameters should be fetched each time instead of 
being stored statically and loaded only once. I am just getting my feet wet 
with the whole MapReduce thing, so if this is the intended operation then I 
apologise.
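
For example (sketch only; lastProgressTime is a stand-in for whatever the 
TaskTracker actually tracks):

// Read the value at the point of use instead of caching it in a
// static final at class-load time, so a freshly parsed NutchConf
// takes effect:
long taskTimeout = NutchConf.get().getInt("mapred.task.timeout", 10 * 60 * 1000);
if (System.currentTimeMillis() - lastProgressTime > taskTimeout) {
  // time the task out
}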


For the task timeout, I agree, this would be a good idea.  It would 
require some changes to the TaskTracker, so that a separate timeout 
could be kept for each running task.
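
A sketch of what that might look like (invented names, not actual 
TaskTracker code; one entry per running task, each with its own timeout 
taken from that task's job configuration):

import java.util.HashMap;
import java.util.Map;

public class TaskTimeoutMonitor {
  private final Map taskTimeouts = new HashMap(); // taskId -> timeout (ms)
  private final Map lastProgress = new HashMap(); // taskId -> last progress time

  /** Called when a task starts; timeoutMs comes from that task's job config. */
  public synchronized void register(String taskId, long timeoutMs) {
    taskTimeouts.put(taskId, new Long(timeoutMs));
    lastProgress.put(taskId, new Long(System.currentTimeMillis()));
  }

  /** Called whenever the task reports progress. */
  public synchronized void progress(String taskId) {
    lastProgress.put(taskId, new Long(System.currentTimeMillis()));
  }

  /** True if this task has exceeded its own timeout. */
  public synchronized boolean timedOut(String taskId) {
    Long timeout = (Long) taskTimeouts.get(taskId);
    Long last = (Long) lastProgress.get(taskId);
    if (timeout == null || last == null) return false;
    return System.currentTimeMillis() - last.longValue() > timeout.longValue();
  }
}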


I'm not so sure about the tasks per task tracker.  The best value is 
probably node-specific (typically something a bit larger than the number 
of processors).  Even if it were job-specific, a TaskTracker can, in 
theory, be running tasks from different jobs at the same time.  Unless 
we want to prohibit that, a single limit on the number of tasks to run 
concurrently is required.  How would you vary this with job?



Also, is this the proper place to report (possible) bugs, or should I just go 
directly to the bug reporting system, even if it's not a verified issue?


This is a fine place.  Typically one should first check the bug 
database, then, if nothing is found, either file a bug or send an 
inquiry to the list.  The best way to get a bug fixed is to submit a 
patch that fixes it.


Cheers,

Doug