Hi,

I'm customizing Nutch 2.1 to crawl blogs from several authors. Each 
author's blog has a list page and article pages.

Say I want to crawl the articles in 50 article lists (each containing 30 
articles). I add the article-list links to feed.txt and specify '-depth 2' 
and '-topN 2000'. My expectation is that each time I run Nutch, it will 
crawl all the list pages and the articles in each list. In practice, 
however, the number of URLs that Nutch crawls keeps growing, and each run 
takes longer and longer (3 hours -> more than 24 hours).
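
Concretely, what I run each round is roughly the following ('urls' is just 
an example name for my seed directory; it holds feed.txt with the 50 
list-page URLs):

    # roughly the command I run each round ('urls' is an example seed
    # directory name containing feed.txt with the 50 list-page URLs)
    bin/nutch crawl urls -depth 2 -topN 2000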

Could someone explain what is happening? Does Nutch 2.1 always start 
crawling from the seed folder and respect the 'depth' parameter? What 
should I do to meet my requirement?
Thanks.

Regards,
Rui
