Hi David,

Maybe what you want is an adaptive re-fetch algorithm; see [0].

 [0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
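The adaptive schedule is switched on in conf/nutch-site.xml; a minimal
sketch (property names as shipped in nutch-default.xml; the rate values
here are illustrative, not from the thread):

```xml
<!-- switch from the default fixed schedule to the adaptive one -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<!-- shrink the interval when a page changed, grow it when it did not
     (illustrative values; see the db.fetch.schedule.adaptive.* entries
     in nutch-default.xml for the full set and defaults) -->
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
```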


On Wed, Feb 27, 2013 at 1:20 PM, David Philip
<davidphilipshe...@gmail.com> wrote:

> Hi all,
>
>   Thank you very much for the replies. Very useful information to
> understand how incremental crawling can be achieved.
>
> Dear Markus:
> Could you please tell me how to override this fetch interval, in case I
> need to fetch a page before the interval has passed?
>
>
>
> Thanks very much
> - David
>
>
>
>
>
> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
>
> > If you want records to be fetched at a fixed interval, it's easier to
> > inject them with a fixed fetch interval.
> >
> > nutch.fixedFetchInterval=86400
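> > For example, a seed file carrying that metadata could look like this
> > (the URL and paths are placeholders, not from this thread):

```shell
# create a seed directory whose file carries the metadata, tab-separated
# after the URL (example.com and the paths are placeholders)
mkdir -p seeds
printf 'http://example.com/\tnutch.fixedFetchInterval=86400\n' > seeds/seeds.txt
# then inject as usual (requires a Nutch installation):
# bin/nutch inject crawl/crawldb seeds/
```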
> >
> >
> >
> > -----Original message-----
> > > From:kemical <mickael.lume...@gmail.com>
> > > Sent: Thu 14-Feb-2013 10:15
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi David,
> > >
> > > You can also consider setting a shorter fetch interval with nutch
> > > inject. This way you'll set a higher score (so the URL is always taken
> > > with priority when you generate a segment) and a fetch interval of one
> > > day.
> > >
> > > If you have a case similar to mine, you'll often want some homepages
> > > fetched each day but not their inlinks. What you can do is inject all
> > > your seed URLs again (assuming those URLs are only homepages).
> > >
> > > # change the Nutch option so existing URLs can be injected again, in
> > > # conf/nutch-default.xml or conf/nutch-site.xml:
> > > db.injector.update=true
> > >
> > > # Add metadata to update the score / fetch interval.
> > > # The following line appends the new score and new interval to each
> > > # line of your seed URL files:
> > > perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=80000\n/'
> > > [your_seed_url_dir]/*
> > >
> > > #run command
> > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > >
> > > Now, the following crawls will take your URLs at top priority and
> > > fetch them once a day. I've used my own situation to illustrate the
> > > concept, but I guess you can tweak the parameters to fit your needs.
> > >
> > > This approach is useful when you want some URLs fetched regularly; if
> > > it only happens rarely, I guess freegen is the right choice.
> > >
> > > Best,
> > > Mike
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> >
>



-- 
Don't Grow Old, Grow Up... :-)
