You should bump the value of topN instead of setting it to 2000. That would
make a lot more of the URLs eligible for fetching.
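A rough sketch of what that looks like on the command line (the crawl ID and the value 100000 here are placeholders, assuming a Nutch 2.x setup):

```shell
# Hypothetical invocation: "mycrawl" and 100000 are example values, not
# taken from the thread. In Nutch 2.x the generate step accepts -topN,
# which caps how many URLs go into the next fetch list:
bin/nutch generate -topN 100000 -crawlId mycrawl

# If you drive everything through the bundled crawl script instead,
# the fetch-list size is set by a variable inside that script, which
# you can raise the same way.
```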

Thanks,
Tejas


On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv <karvouni...@hotmail.com> wrote:

> Markus and Wang, thank you very much for your fast responses. I forgot to
> mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and
> ignore.external.links ideas are awesome! What really bothers me is that
> dreaded "-topN". I really want to live without it! :) I hate it when I open
> my database and see that I have, for example, 2000 links unfetched, which
> means they are not parsed (and therefore useless), and only 2000 fetched.
>
> > Subject: Re: Crawling a specific site only
> > From: wangyi1...@gmail.com
> > To: user@nutch.apache.org
> > Date: Tue, 17 Dec 2013 18:53:55 +0800
> >
> > Hi,
> > Just set
> >         <property>
> >           <name>db.ignore.external.links</name>
> >           <value>true</value>
> >         </property>
> > in nutch-site.xml and run the crawl script several times; the default
> > number of pages to be added per round is 50,000.
> >
> > Is that right?
> > Wang
> >
> >
> > -----Original Message-----
> > From: Vangelis karv <karvouni...@hotmail.com>
> > Reply-to: user@nutch.apache.org
> > To: user@nutch.apache.org <user@nutch.apache.org>
> > Subject: Crawling a specific site only
> > Date: Tue, 17 Dec 2013 12:15:00 +0200
> >
> > Hi again! My goal is to crawl a specific site. I want to crawl all the
> > links that exist under that site. For example, if I decide to crawl
> > http://www.uefa.com/, I want to parse all of its links (photos, videos,
> > HTML pages, etc.) and not only the best-scoring URLs for this site (= topN).
> > So, my question here is: how can we tell Nutch to crawl everything in a
> > site and not only the pages that have the best score?
> >
> >
> >
>
>
