Hi,

did I understand you correctly?
- feed.txt is placed in the seed URL folder and
- contains the URLs of the 50 article lists

If yes: -depth 2 will crawl these 50 URLs and, for each article list, its 30
outlinks, in total 50 + 50*30 = 1550 documents.
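For reference, one crawl cycle (one unit of -depth) corresponds roughly to the
following sequence of jobs in 2.x. This is only a sketch, reusing the seed
folder, -topN, -threads and Solr URL from your command below; the exact options
can differ between versions, so please check the usage output of each job:

  bin/nutch inject urls                  # first cycle only: add the seed URLs
  bin/nutch generate -topN 1000          # select URLs due for fetching into a new batch
  bin/nutch fetch -all -threads 10       # fetch the generated batch(es)
  bin/nutch parse -all                   # parse the fetched pages and extract outlinks
  bin/nutch updatedb                     # add the outlinks to the db as candidates
                                         # for the next cycle
  bin/nutch solrindex http://localhost:8080/solr/collection2 -all

Running this sequence twice is what -depth 2 does; updatedb is the step that
turns the article outlinks into fetch candidates for the next round.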
If you continue crawling, Nutch fetches the outlinks of the 1500 docs fetched
in the second cycle, then the links found in those, and so on: it will continue
to crawl the whole web.

To limit the crawl to exactly the 1550 docs, either remove all previously
crawled data to start again from scratch, or have a look at the plugin
"scoring-depth" (it's new and, unfortunately, not yet adapted to 2.x, see
https://issues.apache.org/jira/browse/NUTCH-1331 and
https://issues.apache.org/jira/browse/NUTCH-1508).

The option -depth does not limit the crawl to a certain link depth (that's
what "scoring-depth" does) but sets the number of crawl cycles or rounds. If a
crawl is started from scratch, both give identical results in most cases.

Sebastian

On 01/15/2013 06:53 PM, 高睿 wrote:
> I'm not quite sure about your question here. I'm using the Nutch 2.1 default
> configuration and run the command: bin/nutch crawl urls -solr
> http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
> The 'urls' folder includes the blog index pages (each index page includes a
> list of article pages).
> I think the plugins 'parse-html' and 'parse-tika' are currently responsible
> for parsing the links from the HTML. Should I clean the outlinks in an
> additional parse plugin in order to prevent Nutch from crawling the outlinks
> in the article pages?
>
>
> At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
>> I take it you are updating the database with the crawl data? This will mark
>> all links extracted during the parse phase (depending upon your config) as
>> due for fetching. When you generate, these links will be populated within
>> the batchIds and Nutch will attempt to fetch them.
>> Please also search our list archives for the definition of the depth
>> parameter.
>> Lewis
>>
>> On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm customizing Nutch 2.1 to crawl blogs from several authors. Each
>>> author's blog has a list page and article pages.
>>>
>>> Say I want to crawl the articles in 50 article lists (each with 30
>>> articles). I add the article list links to feed.txt and specify
>>> '-depth 2' and '-topN 2000'. My expectation is that each time I run
>>> Nutch, it will crawl all the list pages and the articles in each list.
>>> But actually, the set of URLs Nutch crawls seems to grow larger and
>>> larger, and the crawl takes more and more time (3 hours -> more than
>>> 24 hours).
>>>
>>> Could someone explain to me what happens? Does Nutch 2.1 always start
>>> crawling from the seed folder and follow the 'depth' parameter? What
>>> should I do to meet my requirement?
>>> Thanks.
>>>
>>> Regards,
>>> Rui
>>
>> --
>> *Lewis*

