> >> Three, you could be maxing out your bandwidth and only 1/10th of urls
> >> are actually getting through before timeout, or the site is blocking most
> >> of the urls you are trying to fetch through robots.txt. Look at the
> >> JobTracker admin screen for the fetch job and see how many errors are in
> >> each fetch task.
> >
> > We work with the site, and robots.txt is allowing us through. It is
> > definitely getting different pages each time. We have 100000 urls in
> > the crawldb. It is only getting about 3% new pages each
> > generate-fetch-update cycle.
> >
> > The most recent completed run had 97 map tasks and 17 reduce tasks,
> > all completed fine, with 0 failures.
>
> Check the number of errors in the fetcher tasks themselves. I
> understand the task will complete, but the fetcher screen should show
> the number of fetching errors. My guess is that this is high.
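In the meantime, as a rough sanity check, I figure I could tally failed
fetches straight from the task logs. The sketch below is only a guess at
how to do that; it assumes the default logs/hadoop.log location and the
"fetch of <url> failed with: ..." message I believe the Fetcher writes,
both of which may not match our setup:

  #!/usr/bin/env python
  # Rough tally of fetch errors from the Nutch/Hadoop logs.
  # Assumptions (may not match this setup): task logging ends up in
  # logs/hadoop.log, and failed fetches show up as lines containing
  # "fetch of <url> failed with: <reason>".
  import re
  import sys

  log_path = sys.argv[1] if len(sys.argv) > 1 else "logs/hadoop.log"
  fail_re = re.compile(r"fetch of (\S+) failed with: (.*)")

  total = 0
  reasons = {}
  for line in open(log_path):
      m = fail_re.search(line)
      if m:
          total += 1
          # group by exception type, e.g. java.net.SocketTimeoutException
          reason = m.group(2).split(":")[0]
          reasons[reason] = reasons.get(reason, 0) + 1

  print("total fetch errors: %d" % total)
  for reason, n in sorted(reasons.items(), key=lambda kv: -kv[1]):
      print("  %6d  %s" % (n, reason))

If that log message format is right, running it on the node that ran the
fetch tasks should give a count and a breakdown by exception type. But I
would still like to find the screen you are describing: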
I am going to the jobtracker url, at the default port 50030. I find the
most recent fetch job, which is listed as

  fetch /var/nutch/crawl/segments/20080121075010

I click on the job link (job_0183). It sends me to the jobdetails.jsp
page, which is what I reported on. It seems to me you are referring to
another interface. Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?

Thanks!

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services