Hi, I'm customizing Nutch 2.1 to crawl blogs from several authors. Each author's blog has a list page and article pages.
Say I want to crawl the articles in 50 article lists (each containing 30 articles). I add the article-list links to feed.txt and specify '-depth 2' and '-topN 2000'; a rough sketch of my invocation is below. My expectation was that each time I run Nutch, it would crawl all the list pages and then the articles in each list. In practice, though, the set of URLs Nutch crawls keeps growing, and each run takes longer and longer (3 hours -> more than 24 hours). Could someone explain what is happening? Does Nutch 2.1 always start crawling from the seed folder and follow the 'depth' parameter? What should I do to meet my requirement?
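For reference, the run looks roughly like this (a sketch only: the 'urls' seed-directory name and the example blog URLs are placeholders; -depth 2 and -topN 2000 are my real settings):

    # feed.txt (inside the seed directory) lists the article-list pages, e.g.
    #   http://blog.example.com/author1/articles
    #   http://blog.example.com/author2/articles

    # one-step crawl from the seed directory, 2 levels deep,
    # fetching at most 2000 top-scoring URLs per generate/fetch round
    bin/nutch crawl urls -depth 2 -topN 2000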
Thanks.

Regards,
Rui
