> >>Three, you could be maxing out your bandwidth and only 1/10th of the
> >>urls are actually getting through before timeout, or the site could be
> >>blocking most of the urls you are trying to fetch via robots.txt.
> >>Look at the JobTracker admin screen for the fetch job and see how many
> >>errors are in each fetch task.
> >
> >We work with the site, and robots.txt is allowing us
> >through.  It is definitely getting different pages
> >each time.  We have 100000 urls in the crawldb.
> >It is only getting about 3% new pages each generate-
> >fetch-update cycle.
> >
> >The most recent completed run had 97 map tasks and
> >17 reduce tasks, all completed fine, with 0 failures.
> 
> Check the number of errors in the fetcher tasks themselves.  I
> understand the task will complete, but the fetcher screen should show
> the number of fetching errors.  My guess is that this is high.

I am going to the JobTracker URL, at the default port 50030.
I find the most recent fetch job, which is listed as

  fetch /var/nutch/crawl/segments/20080121075010

I click on the job link (job_0183).
It sends me to the jobdetails.jsp page, which is what I
reported on.

It seems to me you are referring to another interface.
Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?
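
In the meantime, I have been tallying failures straight out of the
fetcher log with a quick script.  This is just a rough sketch on my
end: it assumes the default logs/hadoop.log location and the
"fetch of <url> failed with:" lines the Fetcher writes, both of which
may differ between installs.

  #!/usr/bin/env python
  # Rough tally of fetch failures from the Nutch log.
  # Assumes failed fetches show up as lines of the form
  #   "fetch of <url> failed with: <reason>"
  # Adjust LOG_PATH and FAIL_RE if your install logs differently.

  import re
  import sys
  from collections import Counter

  LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "logs/hadoop.log"
  FAIL_RE = re.compile(r"fetch of (\S+) failed with: (.*)")

  failures = 0
  reasons = Counter()

  with open(LOG_PATH) as log:
      for line in log:
          m = FAIL_RE.search(line)
          if m:
              failures += 1
              # group failures by the exception/reason text
              reasons[m.group(2).strip()] += 1

  print("total fetch failures: %d" % failures)
  for reason, count in reasons.most_common(10):
      print("%6d  %s" % (count, reason))

I will compare whatever that turns up against the counts on the
fetcher screen once I know where to find it.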

Thanks!

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
