You would need to click on the map link on jobdetails.jsp and each task will say something like this:

11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,
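
If you want a quick total across all of the map tasks, a rough
sketch like the one below would do it. This is just an illustration,
not anything Nutch-specific: it reads pasted per-task status lines
on stdin and sums the pages and errors.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Paste the per-task status lines (one per line) on stdin and this
// totals the pages and errors, e.g. lines of the form
// "11337 pages, 3748 errors, 1.7 pages/s, 329 kb/s,"
public class FetchStatusTotals {
    private static final Pattern STATUS =
        Pattern.compile("(\\d+) pages, (\\d+) errors");

    public static void main(String[] args) throws IOException {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in));
        long pages = 0, errors = 0;
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = STATUS.matcher(line);
            if (m.find()) {
                pages += Long.parseLong(m.group(1));
                errors += Long.parseLong(m.group(2));
            }
        }
        System.out.printf("total pages=%d, errors=%d (%.1f%% errors)%n",
            pages, errors, pages > 0 ? 100.0 * errors / pages : 0.0);
    }
}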

Dennis Kubes

John Mendenhall wrote:
Three, you could be maxing out your bandwidth, so that only a tenth of the urls actually get through before timing out, or the site could be blocking most of the urls you are trying to fetch through robots.txt. Look at the JobTracker admin screen for the fetch job and see how many errors are in each fetch task.
We work with the site, and robots.txt is allowing us
through.  It is definitely getting different pages
each time.  We have 100000 urls in the crawldb.
It is only getting about 3% new pages each generate-
fetch-update cycle.
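
For illustration, a rough way to spot-check that by hand would be
something like the sketch below. The example.com host is a
placeholder, and this is not a real robots.txt parser; it just
fetches robots.txt and prints the User-agent/Disallow/Allow lines
so they can be compared against the urls being generated.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough spot check: dump the rule lines from a site's robots.txt.
// No user-agent group handling or wildcard support.
public class RobotsSpotCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "http://example.com";
        URL robots = new URL(host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                // Only show the lines that matter for allow/deny decisions.
                if (t.regionMatches(true, 0, "User-agent:", 0, 11)
                        || t.regionMatches(true, 0, "Disallow:", 0, 9)
                        || t.regionMatches(true, 0, "Allow:", 0, 6)) {
                    System.out.println(t);
                }
            }
        }
    }
}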

The most recent run had 97 map tasks and 17 reduce
tasks, all of which completed fine, with 0 failures.
Check the number of errors in the fetcher tasks themselves. I understand the tasks will complete, but the fetcher screen should show the number of fetch errors. My guess is that this number is high.

I am going to the jobtracker url, at default port 50030.
I find the most recent fetch job, which is listed as

  fetch /var/nutch/crawl/segments/20080121075010

I click on the job link (job_0183).
It takes me to the jobdetails.jsp page, which is what
I reported on.

It seems to me you are referring to another interface.
Can you please let me know where I should be looking
for the errors in the fetcher tasks themselves?

Thanks!

JohnM
