You should bump the value of topN instead of setting it to 2000. That would
make a lot more of the URLs eligible for fetching.
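A rough sketch of what that looks like on the command line (the crawl ID and the value 100000 here are placeholders, assuming a Nutch 2.x setup):

```shell
# Hypothetical invocation: "mycrawl" and 100000 are example values, not
# taken from the thread. In Nutch 2.x the generate step accepts -topN,
# which caps how many URLs go into the next fetch list:
bin/nutch generate -topN 100000 -crawlId mycrawl

# If you drive everything through the bundled crawl script instead,
# the fetch-list size is set by a variable inside that script, which
# you can raise the same way.
```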

Thanks,
Tejas


On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv <karvouni...@hotmail.com> wrote:

> Markus and Wang, thank you very much for your fast responses. I forgot to
> mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and
> ignore.external.links ideas are awesome! What really bothers me is that
> dreaded "-topN". I really want to live without it! :) I hate it when I open
> my database and see that I have, for example, 2000 links unfetched, which
> means they are not parsed (and therefore useless), and only 2000 fetched.
>
> > Subject: Re: Crawling a specific site only
> > From: wangyi1...@gmail.com
> > To: user@nutch.apache.org
> > Date: Tue, 17 Dec 2013 18:53:55 +0800
> >
> > Hi,
> > Just set
> >         <property>
> >           <name>db.ignore.external.links</name>
> >           <value>true</value>
> >         </property>
> > in nutch-site.xml and run the crawl script several times; the default
> > number of pages to be added per round is 50,000.
> >
> > Is that right?
> > Wang
> >
> >
> > -----Original Message-----
> > From: Vangelis karv <karvouni...@hotmail.com>
> > Reply-to: user@nutch.apache.org
> > To: user@nutch.apache.org <user@nutch.apache.org>
> > Subject: Crawling a specific site only
> > Date: Tue, 17 Dec 2013 12:15:00 +0200
> >
> > Hi again! My goal is to crawl a specific site. I want to crawl all the
> > links that exist under that site. For example, if I decide to crawl
> > http://www.uefa.com/, I want to parse all of its links (photos, videos,
> > HTML pages, etc.) and not only the best-scoring URLs for this site (= topN).
> > So, my question here is: how can we tell Nutch to crawl everything in a
> > site and not only the pages that have the best score?
> >
> >
> >
>
>
