Hi David,

Maybe what you want is an adaptive re-fetch algorithm; see [0].
[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

On Wed, Feb 27, 2013 at 1:20 PM, David Philip <davidphilipshe...@gmail.com> wrote:
> Hi all,
>
> Thank you very much for the replies. Very useful information for
> understanding how incremental crawling can be achieved.
>
> Dear Markus:
> Can you please tell me how I can override this fetch interval, in case I
> need to fetch a page before the interval has passed?
>
> Thanks very much,
> - David
>
> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> >
> > If you want records to be fetched at a fixed interval, it's easier to
> > inject them with a fixed fetch interval:
> >
> > nutch.fixedFetchInterval=86400
> >
> > -----Original message-----
> > > From: kemical <mickael.lume...@gmail.com>
> > > Sent: Thu 14-Feb-2013 10:15
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch Incremental Crawl
> > >
> > > Hi David,
> > >
> > > You can also consider setting a shorter fetch interval with nutch
> > > inject. This way you set a higher score (so the URL is always taken
> > > in priority when you generate a segment) and a fetch interval of
> > > 1 day.
> > >
> > > If you have a case similar to mine, you'll often want some homepages
> > > fetched each day but not their inlinks. What you can do is inject
> > > all your seed URLs again (assuming those URLs are only homepages).
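For reference, the adaptive re-fetch approach described in [0] amounts to switching the fetch schedule class in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.x property names; the interval values here are illustrative, not recommendations:

```xml
<!-- conf/nutch-site.xml: enable the adaptive fetch schedule -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <!-- lower bound: never re-fetch more often than once a day -->
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value>
</property>
<property>
  <!-- upper bound: re-fetch at least every 30 days -->
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value>
</property>
```

With this schedule, pages that change between fetches have their interval shortened and stable pages have it lengthened, within the min/max bounds above.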
> > > # Change this Nutch option so existing URLs can be injected again,
> > > # in conf/nutch-default.xml or conf/nutch-site.xml:
> > > db.injector.update=true
> > >
> > > # Add metadata to update the score / fetch interval. The following
> > > # command appends the new score and new interval to each line of
> > > # your seed URL files:
> > > perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' \
> > >     [your_seed_url_dir]/*
> > >
> > > # Run the inject command:
> > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > >
> > > The following crawls will now take your URLs in top priority and
> > > fetch them once a day. I've used my situation to illustrate the
> > > concept, but I guess you can tweak the parameters to fit your needs.
> > >
> > > This way is useful when you want a regular fetch of some URLs; if it
> > > only occurs rarely, I guess freegen is the right choice.
> > >
> > > Best,
> > > Mike
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)