Thanks for the support, guys! I'll crawl again with generate.count.mode=host and generate.max.count=-1. Although, if I don't set -topN in the nutch script, it won't let me run GeneratorJob.
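For reference, this is roughly what the relevant part of my nutch-site.xml looks like now (a sketch based on the suggestions in this thread; property names are the standard ones from nutch-default.xml, values are just the ones discussed above):

```xml
<!-- Count generated URLs per host rather than per domain or IP -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<!-- -1 means no per-host limit on URLs selected in a generate cycle -->
<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
```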
> Subject: RE: Crawling a specific site only
> From: markus.jel...@openindex.io
> To: user@nutch.apache.org
> Date: Wed, 18 Dec 2013 09:38:04 +0000
>
> Increase it to a reasonably high value or don't set it at all; it will then
> attempt to crawl as much as it can. Also check generate.count.mode and
> generate.max.count.
>
> -----Original message-----
> > From: Vangelis karv <karvouni...@hotmail.com>
> > Sent: Wednesday 18th December 2013 9:56
> > To: user@nutch.apache.org
> > Subject: RE: Crawling a specific site only
> >
> > Can you be a little more specific about that, Tejas?
> >
> > > Date: Tue, 17 Dec 2013 23:32:46 -0800
> > > Subject: Re: Crawling a specific site only
> > > From: tejas.patil...@gmail.com
> > > To: user@nutch.apache.org
> > >
> > > You should bump up the value of topN instead of setting it to 2000. That
> > > would make a lot of the URLs eligible for fetching.
> > >
> > > Thanks,
> > > Tejas
> > >
> > > On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv <karvouni...@hotmail.com> wrote:
> > >
> > > > Markus and Wang, thank you very much for your fast responses. I forgot
> > > > to mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and
> > > > ignore.external.links ideas are awesome! What really bothers me is that
> > > > dreaded "-topN". I really want to live without it! :) I hate it when I
> > > > open my database and see that I have, for example, 2000 links unfetched,
> > > > which means they are not parsed (useless), and only 2000 fetched.
> > > >
> > > > > Subject: Re: Crawling a specific site only
> > > > > From: wangyi1...@gmail.com
> > > > > To: user@nutch.apache.org
> > > > > Date: Tue, 17 Dec 2013 18:53:55 +0800
> > > > >
> > > > > Hi,
> > > > > Just set
> > > > > <name>db.ignore.external.links</name>
> > > > > <value>true</value>
> > > > > and run the crawl script several times; the default number of pages
> > > > > to be added is 50,000.
> > > > >
> > > > > Is that right?
> > > > > Wang
> > > > >
> > > > > -----Original Message-----
> > > > > From: Vangelis karv <karvouni...@hotmail.com>
> > > > > Reply-to: user@nutch.apache.org
> > > > > To: user@nutch.apache.org <user@nutch.apache.org>
> > > > > Subject: Crawling a specific site only
> > > > > Date: Tue, 17 Dec 2013 12:15:00 +0200
> > > > >
> > > > > Hi again! My goal is to crawl a specific site. I want to crawl all
> > > > > the links that exist under that site. For example, if I decide to
> > > > > crawl http://www.uefa.com/, I want to parse all the links under it
> > > > > (photos, videos, HTML pages, etc.) and not only the best-scoring
> > > > > URLs for this site (= topN). So my question here is: how can we tell
> > > > > Nutch to crawl everything in a site and not only the pages that have
> > > > > the best score?
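For anyone landing on this thread later: combining the suggestions above, a minimal single-site setup might look like the sketch below (uefa.com is just the example domain from the original question; the regex-urlfilter.txt approach is one common way to do the domain restriction Markus alluded to, not the only one):

```xml
<!-- nutch-site.xml: never follow links pointing off the seed host -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```

and, in conf/regex-urlfilter.txt, accept only the target site and reject everything else:

```
# Accept URLs on the example site (http or https, with or without www)
+^https?://(www\.)?uefa\.com/
# Reject everything else
-.
```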