Hi Markus,

I tried the db.injector.update setting that you mentioned; please see my observations below.

Settings: I set db.injector.update to true and db.fetch.interval.default to 1 hour.

Observation:
On the first crawl, 14 URLs were successfully crawled and indexed into Solr.

Case 1: Among those 14 URLs, I modified the content and title of one URL (say A-url) and re-executed the crawl after one hour. The log shows that this URL was re-fetched, but at the Solr level the content and title fields for it did not get updated. Why? Do I need any configuration to make the Solr index get updated?

Case 2: I added a new URL to the crawled site. The URL got indexed; this worked.

So I am interested to know why case 1 failed. What configuration needs to be made?

Thanks - David

PS: Apologies that I am still asking questions on the same topic. I have not been able to find a good way to do incremental crawls, so I am trying different approaches. Once I am clear, I will blog about this and share it. Thanks a lot for the replies on the list.

On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> You can simply reinject the records. You can overwrite and/or update the
> current record. See the db.injector.update and overwrite settings.
>
> -----Original message-----
> > From: David Philip <davidphilipshe...@gmail.com>
> > Sent: Wed 27-Feb-2013 11:23
> > To: user@nutch.apache.org
> > Subject: Re: Nutch Incremental Crawl
> >
> > Hi Markus, I meant overriding the injected interval. How do I override
> > the injected fetch interval? While crawling, the fetch interval was set
> > to 30 days (the default). Now I want to re-fetch the same site (that is,
> > to force a re-fetch) and not wait for the 30-day fetch interval. How can
> > we do that?
> >
> > Feng Lu: Thank you for the reference link.
> >
> > Thanks - David
> >
> > On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > The default or the injected interval? The default interval can be set
> > > in the config (see nutch-default for example).
> > > Per-URL intervals can be set using the injector:
> > >
> > > <URL>\tnutch.fixedFetchInterval=86400
> > >
> > > -----Original message-----
> > > > From: David Philip <davidphilipshe...@gmail.com>
> > > > Sent: Wed 27-Feb-2013 06:21
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Nutch Incremental Crawl
> > > >
> > > > Hi all,
> > > >
> > > > Thank you very much for the replies. Very useful information for
> > > > understanding how incremental crawling can be achieved.
> > > >
> > > > Dear Markus:
> > > > Can you please tell me how I can override this fetch interval, in
> > > > case I need to fetch a page before the interval has passed?
> > > >
> > > > Thanks very much
> > > > - David
> > > >
> > > > On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > >
> > > > > If you want records to be fetched at a fixed interval, it is
> > > > > easier to inject them with a fixed fetch interval:
> > > > >
> > > > > nutch.fixedFetchInterval=86400
> > > > >
> > > > > -----Original message-----
> > > > > > From: kemical <mickael.lume...@gmail.com>
> > > > > > Sent: Thu 14-Feb-2013 10:15
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Re: Nutch Incremental Crawl
> > > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > You can also consider setting a shorter fetch interval with
> > > > > > nutch inject. This way you set a higher score (so the URL is
> > > > > > always taken with priority when you generate a segment) and a
> > > > > > fetch interval of 1 day.
> > > > > >
> > > > > > If your case is similar to mine, you will often want some
> > > > > > homepages fetched each day but not their inlinks. What you can
> > > > > > do is inject all your seed URLs again (assuming those URLs are
> > > > > > only homepages).
> > > > > >
> > > > > > # Change the Nutch option so existing URLs can be injected again,
> > > > > > # in conf/nutch-default.xml or conf/nutch-site.xml:
> > > > > > db.injector.update=true
> > > > > >
> > > > > > # Add metadata to update the score / fetch interval. The following
> > > > > > # line appends the new score and new interval to each line of your
> > > > > > # seed URL files:
> > > > > > perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
> > > > > >
> > > > > > # Run the inject command:
> > > > > > bin/nutch inject crawl/crawldb [your_seed_url_dir]
> > > > > >
> > > > > > Now the following crawls will take your URLs with top priority and
> > > > > > crawl them once a day. I have used my own situation to illustrate
> > > > > > the concept, but I guess you can tweak the params to fit your needs.
> > > > > >
> > > > > > This approach is useful when you want a regular fetch of some URLs;
> > > > > > if it is only needed rarely, I guess freegen is the right choice.
> > > > > >
> > > > > > Best,
> > > > > > Mike
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
> > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.