Can you be a little more specific about that, Tejas?

> Date: Tue, 17 Dec 2013 23:32:46 -0800
> Subject: Re: Crawling a specific site only
> From: tejas.patil...@gmail.com
> To: user@nutch.apache.org
> 
> You should bump up the value of topN instead of setting it to 2000. That
> would make a lot more of the URLs eligible for fetching.
> 
> Thanks,
> Tejas
> 
> 
> On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv <karvouni...@hotmail.com> wrote:
> 
> > Markus and Wang, thank you very much for your fast responses. I forgot to
> > mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and
> > ignore.external.links ideas are awesome! What really bothers me is that
> > dreaded "-topN". I really want to live without it! :) I hate it when I open
> > my database and see that I have, for example, 2000 links unfetched, which
> > means they are not parsed and therefore useless, and only 2000 fetched.
> >
> > > Subject: Re: Crawling a specific site only
> > > From: wangyi1...@gmail.com
> > > To: user@nutch.apache.org
> > > Date: Tue, 17 Dec 2013 18:53:55 +0800
> > >
> > > > Hi,
> > > Just set
> > > >     <property>
> > > >         <name>db.ignore.external.links</name>
> > > >         <value>true</value>
> > > >     </property>
> > > > and run the crawl script several times; the default number of pages to
> > > > be added is 50,000.
> > >
> > > Is it right?
> > > Wang
> > >
> > >
> > > -----Original Message-----
> > > From: Vangelis karv <karvouni...@hotmail.com>
> > > Reply-to: user@nutch.apache.org
> > > To: user@nutch.apache.org <user@nutch.apache.org>
> > > Subject: Crawling a specific site only
> > > Date: Tue, 17 Dec 2013 12:15:00 +0200
> > >
> > > > Hi again! My goal is to crawl a specific site. I want to crawl all the
> > > > links that exist under that site. For example, if I decide to crawl
> > > > http://www.uefa.com/, I want to parse all its inlinks (photos, videos,
> > > > HTML pages, etc.) and not only the best-scoring URLs for this site
> > > > (topN). So, my question here is: how can we tell Nutch to crawl
> > > > everything in a site and not only the URLs that have the best score?
> > >
> > >
> > >
> >
> >
                                          
