Hi,

did I understand you correctly?
- feed.txt is placed in the seed URL folder and
- contains the URLs of the 50 article lists

If yes: -depth 2 will crawl these 50 URLs and, for each article list, its 30
outlinks, in total 50 + 50*30 = 1550 documents.
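For reference, one crawl cycle (one unit of -depth) corresponds roughly to the
following sequence of jobs in 2.x. This is only a sketch, reusing the seed
folder, -topN, -threads and Solr URL from your command below; the exact options
can differ between versions, so please check the usage output of each job:

  bin/nutch inject urls                  # first cycle only: add the seed URLs
  bin/nutch generate -topN 1000          # select URLs due for fetching into a new batch
  bin/nutch fetch -all -threads 10       # fetch the generated batch(es)
  bin/nutch parse -all                   # parse the fetched pages and extract outlinks
  bin/nutch updatedb                     # add the outlinks to the db as candidates
                                         # for the next cycle
  bin/nutch solrindex http://localhost:8080/solr/collection2 -all

Running this sequence twice is what -depth 2 does; updatedb is the step that
turns the article outlinks into fetch candidates for the next round.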
If you continue crawling, Nutch fetches the outlinks of the 1500 docs fetched
in the second cycle, then the links found in those, and so on: it will continue
to crawl the whole web.

To limit the crawl to exactly the 1550 docs, either remove all previously
crawled data to start again from scratch, or have a look at the plugin
"scoring-depth" (it's new and, unfortunately, not yet adapted to 2.x, see
https://issues.apache.org/jira/browse/NUTCH-1331 and
https://issues.apache.org/jira/browse/NUTCH-1508).

The option -depth does not limit the crawl to a certain link depth (that's
what "scoring-depth" does) but sets the number of crawl cycles or rounds. If a
crawl is started from scratch, both give identical results in most cases.

Sebastian

On 01/15/2013 06:53 PM, 高睿 wrote:
> I'm not quite sure about your question here. I'm using the Nutch 2.1 default
> configuration and run the command: bin/nutch crawl urls -solr
> http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
> The 'urls' folder includes the blog index pages (each index page includes a
> list of article pages).
> I think the plugins 'parse-html' and 'parse-tika' are currently responsible
> for parsing the links from the HTML. Should I clean the outlinks in an
> additional parse plugin in order to prevent Nutch from crawling the outlinks
> in the article pages?
>
>
> At 2013-01-15 13:31:11, "Lewis John Mcgibbney" <[email protected]> wrote:
>> I take it you are updating the database with the crawl data? This will mark
>> all links extracted during the parse phase (depending upon your config) as
>> due for fetching. When you generate, these links will be populated within
>> the batchIds and Nutch will attempt to fetch them.
>> Please also search our list archives for the definition of the depth
>> parameter.
>> Lewis
>>
>> On Monday, January 14, 2013, 高睿 <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm customizing Nutch 2.1 to crawl blogs from several authors. Each
>>> author's blog has a list page and article pages.
>>>
>>> Say I want to crawl the articles in 50 article lists (each with 30
>>> articles). I add the article list links to feed.txt and specify
>>> '-depth 2' and '-topN 2000'. My expectation is that each time I run
>>> Nutch, it will crawl all the list pages and the articles in each list.
>>> But actually, the set of URLs Nutch crawls seems to grow larger and
>>> larger, and the crawl takes more and more time (3 hours -> more than
>>> 24 hours).
>>>
>>> Could someone explain to me what happens? Does Nutch 2.1 always start
>>> crawling from the seed folder and follow the 'depth' parameter? What
>>> should I do to meet my requirement?
>>> Thanks.
>>>
>>> Regards,
>>> Rui
>>
>> --
>> *Lewis*

