RE: Crawling a specific site only

Vangelis karv Tue, 17 Dec 2013 03:03:24 -0800

Markus and Wang, thank you very much for your fast responses. I forgot to 
mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and 
ignore.external.links ideas are awesome! What really bothers me is that dreaded 
"-topN". I really want to live without it! :) I hate it when I open my database 
and see that I have, for example, 2000 links unfetched, which means they are 
not parsed and therefore useless, and only 2000 fetched.
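
For reference, here is a rough sketch of how the setting Wang suggests below 
would look as a complete property entry. I am assuming it goes in 
conf/nutch-site.xml, the usual place for local overrides; the description text 
is my own wording, not copied from nutch-default.xml.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- Description paraphrased: outlinks pointing to other hosts are ignored,
       so the crawl stays on the hosts that were injected as seeds. -->
  <description>Ignore outlinks that lead to external hosts.</description>
</property>
```

This only keeps the crawl on the seeded host; it does not by itself change how 
many URLs each generate step selects.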


> Subject: Re: Crawling a specific site only
> From: wangyi1...@gmail.com
> To: user@nutch.apache.org
> Date: Tue, 17 Dec 2013 18:53:55 +0800
> 
> Hi,
> Just set 
>         <name>db.ignore.external.links</name>
>         <value>true</value>
> and run the crawl script several times; the default number of pages to
> be added is 50,000.
> 
> Is it right?
> Wang
> 
> 
> -----Original Message-----
> From: Vangelis karv <karvouni...@hotmail.com>
> Reply-to: user@nutch.apache.org
> To: user@nutch.apache.org <user@nutch.apache.org>
> Subject: Crawling a specific site only
> Date: Tue, 17 Dec 2013 12:15:00 +0200
> 
> Hi again! My goal is to crawl a specific site. I want to crawl all the links 
> that exist under that site. For example, if I decide to crawl 
> http://www.uefa.com/, I want to parse all its inlinks (photos, videos, HTML 
> pages, etc.) and not only the best-scoring URLs for this site (topN). So, my 
> question here is: how can we tell Nutch to crawl everything in a site and not 
> only the pages that have the best score?
