You would need to click on the map link on jobdetails.jsp and each task will say something like this:

11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,
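
If you want a quick total across all of the map tasks, a rough
sketch like the one below would do it. This is just an illustration,
not anything Nutch-specific: it reads pasted per-task status lines
on stdin and sums the pages and errors.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Paste the per-task status lines (one per line) on stdin and this
// totals the pages and errors, e.g. lines of the form
// "11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,"
public class FetchStatusTotals {
    private static final Pattern STATUS =
        Pattern.compile("(\\d+) pages, (\\d+) errors");

    public static void main(String[] args) throws IOException {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in));
        long pages = 0, errors = 0;
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = STATUS.matcher(line);
            if (m.find()) {
                pages += Long.parseLong(m.group(1));
                errors += Long.parseLong(m.group(2));
            }
        }
        System.out.printf("total pages=%d, errors=%d (%.1f%% errors)%n",
            pages, errors, pages > 0 ? 100.0 * errors / pages : 0.0);
    }
}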

Dennis Kubes

John Mendenhall wrote:
Three, you could be maxing out your bandwidth, so that only a tenth of the urls actually get through before timing out, or the site could be blocking most of the urls you are trying to fetch through robots.txt. Look at the JobTracker admin screen for the fetch job and see how many errors are in each fetch task.
We work with the site, and robots.txt is allowing us
through.  It is definitely getting different pages
each time.  We have 100000 urls in the crawldb.
It is only getting about 3% new pages each generate-
fetch-update cycle.
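
For illustration, a rough way to spot-check that by hand would be
something like the sketch below. The example.com host is a
placeholder, and this is not a real robots.txt parser; it just
fetches robots.txt and prints the User-agent/Disallow/Allow lines
so they can be compared against the urls being generated.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough spot check: dump the rule lines from a site's robots.txt.
// No user-agent group handling or wildcard support.
public class RobotsSpotCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "http://example.com";
        URL robots = new URL(host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                // Only show the lines that matter for allow/deny decisions.
                if (t.regionMatches(true, 0, "User-agent:", 0, 11)
                        || t.regionMatches(true, 0, "Disallow:", 0, 9)
                        || t.regionMatches(true, 0, "Allow:", 0, 6)) {
                    System.out.println(t);
                }
            }
        }
    }
}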

The most recent run had 97 map tasks and 17 reduce
tasks, all of which completed fine, with 0 failures.
Check the number of errors in the fetcher tasks themselves. I understand the tasks will complete, but the fetcher screen should show the number of fetch errors. My guess is that this number is high.

I am going to the jobtracker url, at default port 50030.
I find the most recent fetch job, which is listed as

  fetch /var/nutch/crawl/segments/20080121075010

I click on the job link (job_0183).
It takes me to the jobdetails.jsp page, which is what
I reported on.

It seems to me you are referring to another interface.
Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?

Thanks!

JohnM
