Hi all,

  Thank you very much for the replies. They were very useful for
understanding how incremental crawling can be achieved.

Dear Markus:
Can you please tell me how I can override this fetch interval, in case I
need to fetch a page before the interval has passed?



Thanks very much
- David





On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> If you want records to be fetched at a fixed interval, it's easier to inject
> them with a fixed fetch interval.
>
> nutch.fixedFetchInterval=86400
>
>
>
> -----Original message-----
> > From:kemical <mickael.lume...@gmail.com>
> > Sent: Thu 14-Feb-2013 10:15
> > To: user@nutch.apache.org
> > Subject: Re: Nutch Incremental Crawl
> >
> > Hi David,
> >
> > You can also consider setting a shorter fetch interval with nutch inject.
> > This way you'll set a higher score (so the URL is always taken first
> > when you generate a segment) and a fetch interval of 1 day.
> >
> > If your case is similar to mine, you'll often want some homepages fetched
> > each day but not their inlinks. What you can do is inject all your seed
> > URLs again (assuming those URLs are only homepages).
> >
> > # Change this Nutch option in conf/nutch-default.xml or
> > # conf/nutch-site.xml so existing URLs can be injected again
> > db.injector.update=true
> >
> > # Add metadata to update score/fetch interval.
> > # The following command appends the new score and new interval to each
> > # line of your seed URL files.
> > perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' \
> > [your_seed_url_dir]/*
> >
> > #run command
> > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> >
> > Now the following crawls will take your URLs with top priority and crawl
> > them once a day. I've used my situation to illustrate the concept, but I
> > guess you can tweak the parameters to fit your needs.
> >
> > This approach is useful when you want a regular fetch of some URLs; if it
> > happens only rarely, I guess freegen is the right choice.
> >
> > Best,
> > Mike
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
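
For reference, here is a minimal sketch of the seed-annotation step Mike describes above, assuming a POSIX shell with awk. The function name add_seed_metadata is hypothetical and not part of Nutch; the metadata keys nutch.score and nutch.fetchInterval are the ones used in his perl one-liner. Unlike an in-place edit, it skips lines that already carry a fetch interval, so running it twice should not append the metadata twice.

```shell
# Hypothetical helper (not part of Nutch): append inject metadata to seed URLs.
# Reads seed lines on stdin, writes annotated lines to stdout.
#   $1 = score, $2 = fetch interval in seconds
add_seed_metadata() {
  awk -v score="$1" -v interval="$2" '
    # Pass blank lines and comment lines through unchanged.
    /^[ \t]*$/ || /^#/ { print; next }
    # Leave lines that already have a fetch interval untouched.
    /\tnutch\.fetchInterval=/ { print; next }
    # Append tab-separated metadata to every other line.
    { printf "%s\tnutch.score=%s\tnutch.fetchInterval=%s\n", $0, score, interval }
  '
}
```

It could be used as `add_seed_metadata 100 86400 < seeds.txt > seeds.annotated.txt` (both file names are placeholders), after which the annotated file would be injected with bin/nutch inject as in the quoted message.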
