Hi David, yes, it's a Tomcat web service cache.
On Linux you can open the dump file with the "less" command, or you can run
"bin/nutch readseg -get segments/20130121115214/ http://www.cnbeta.com/" to
dump the information for a specific URL.

On Tue, Mar 5, 2013 at 3:02 PM, feng lu <amuseme...@gmail.com> wrote:
>
> On Tue, Mar 5, 2013 at 2:49 PM, David Philip <davidphilipshe...@gmail.com> wrote:
>> Hi,
>>
>> By web server cache, do you mean /tomcat/work/, where Solr is running?
>> Is that the cache you meant?
>>
>> I tried the command {bin/nutch readseg -dump crawltest/segments/20130304185844/ crawltest/test}
>> and it produced a dump file whose format is GMC link (application/x-gmc-link).
>> I am not able to open it. How do I open this file?
>>
>> However, when I ran: bin/nutch readseg -list crawltest/segments/20130304185844/
>> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
>> 20130304185844  1          2013-03-04T18:58:53  2013-03-04T18:58:53  1        1
>>
>> - David
>>
>> On Tue, Mar 5, 2013 at 11:25 AM, feng lu <amuseme...@gmail.com> wrote:
>>> Hi David
>>>
>>> Did you clear the web server cache? Maybe the refetch is also crawling
>>> the old page.
>>>
>>> Maybe you can dump the URL content to check the modification, using the
>>> bin/nutch readseg command.
>>>
>>> Thanks
>>>
>>> On Tue, Mar 5, 2013 at 1:28 PM, David Philip <davidphilipshe...@gmail.com> wrote:
>>>> Hi Markus,
>>>>
>>>> I was trying the *db.injector.update* point that you mentioned;
>>>> please see my observations below.
>>>> Settings: I set *db.injector.update* to *true* and
>>>> *db.fetch.interval.default* to *1 hour*.
>>>>
>>>> *Observation:*
>>>>
>>>> On the first crawl[1], 14 URLs were successfully crawled and indexed
>>>> to Solr.
>>>> Case 1:
>>>> Among those 14 URLs, I modified the content and title of one URL (say
>>>> Aurl) and re-executed the crawl after one hour.
>>>> I see that this URL (Aurl) was re-fetched (it shows in the log), but
>>>> at the Solr level the content and title fields for that URL did not
>>>> get updated. Why? Do I need any configuration to make the Solr index
>>>> get updated?
>>>>
>>>> Case 2:
>>>> I added a new URL to the crawled site, and that URL got indexed - a
>>>> success. So I am interested to know why the case above failed. What
>>>> configuration needs to be made?
>>>>
>>>> Thanks - David
>>>>
>>>> *PS:*
>>>> Apologies that I am still asking questions on the same topic. I have
>>>> not been able to find a good way to do incremental crawls, so I am
>>>> trying different approaches. Once I am clear, I will blog this and
>>>> share it. Thanks a lot for the replies on the list.
>>>>
>>>> On Wed, Feb 27, 2013 at 4:06 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>> You can simply re-inject the records; you can overwrite and/or
>>>>> update the current record. See the db.injector.update and overwrite
>>>>> settings.
>>>>>
>>>>> -----Original message-----
>>>>>> From: David Philip <davidphilipshe...@gmail.com>
>>>>>> Sent: Wed 27-Feb-2013 11:23
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>
>>>>>> Hi Markus, I meant overriding the injected interval. How do I
>>>>>> override the injected fetch interval? While crawling, the fetch
>>>>>> interval was set to 30 days (the default). Now I want to re-fetch
>>>>>> the same site (that is, to force a re-fetch) without waiting for
>>>>>> the 30-day fetch interval. How can we do that?
>>>>>>
>>>>>> Feng Lu: Thank you for the reference link.
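[Editor's note] The force-refetch question above can be sketched as a small shell session. This is a minimal illustration, not the thread's own commands: the seed directory `refetch_seeds` and the URL are made-up names, and the final inject call is commented out because it needs a real Nutch installation and CrawlDb.

```shell
# Hypothetical seed directory; the URL is only an example.
mkdir -p refetch_seeds

# Tab-separated metadata after the URL sets a fixed fetch interval of
# one day (86400 s) for this record instead of the 30-day default.
printf 'http://www.example.com/\tnutch.fixedFetchInterval=86400\n' \
  > refetch_seeds/urls.txt

cat refetch_seeds/urls.txt

# With db.injector.update=true set in conf/nutch-site.xml, re-injecting
# updates the existing CrawlDb entry rather than skipping it:
# bin/nutch inject crawl/crawldb refetch_seeds
```

The same metadata mechanism carries `nutch.score` as well, as the rest of the thread shows.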
>>>>>> Thanks - David
>>>>>>
>>>>>> On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>> The default or the injected interval? The default interval can be
>>>>>>> set in the config (see nutch-default for an example). Per-URL
>>>>>>> intervals can be set using the injector:
>>>>>>> <URL>\tnutch.fixedFetchInterval=86400
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From: David Philip <davidphilipshe...@gmail.com>
>>>>>>>> Sent: Wed 27-Feb-2013 06:21
>>>>>>>> To: user@nutch.apache.org
>>>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Thank you very much for the replies - very useful information for
>>>>>>>> understanding how incremental crawling can be achieved.
>>>>>>>>
>>>>>>>> Dear Markus:
>>>>>>>> Can you please tell me how I can override this fetch interval, in
>>>>>>>> case I need to fetch a page before the time interval has passed?
>>>>>>>>
>>>>>>>> Thanks very much
>>>>>>>> - David
>>>>>>>>
>>>>>>>> On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>> If you want records to be fetched at a fixed interval, it is
>>>>>>>>> easier to inject them with a fixed fetch interval.
>>>>>>>>> nutch.fixedFetchInterval=86400
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From: kemical <mickael.lume...@gmail.com>
>>>>>>>>>> Sent: Thu 14-Feb-2013 10:15
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: Re: Nutch Incremental Crawl
>>>>>>>>>>
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> You can also consider setting a shorter fetch interval with
>>>>>>>>>> nutch inject. This way you set a higher score (so the URL is
>>>>>>>>>> always taken with priority when you generate a segment) and a
>>>>>>>>>> fetch interval of one day.
>>>>>>>>>>
>>>>>>>>>> If your case is similar to mine, you will often want certain
>>>>>>>>>> homepages fetched every day but not their inlinks. What you can
>>>>>>>>>> do is inject all your seed URLs again (assuming those URLs are
>>>>>>>>>> only homepages).
>>>>>>>>>> # Change the Nutch option so existing URLs can be injected
>>>>>>>>>> # again, in conf/nutch-default.xml or conf/nutch-site.xml:
>>>>>>>>>> db.injector.update=true
>>>>>>>>>>
>>>>>>>>>> # Add metadata to update the score / fetch interval. The
>>>>>>>>>> # following line appends the new score and new interval to each
>>>>>>>>>> # line of your seed URL files:
>>>>>>>>>> perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' [your_seed_url_dir]/*
>>>>>>>>>>
>>>>>>>>>> # Run the inject command:
>>>>>>>>>> bin/nutch inject crawl/crawldb [your_seed_url_dir]
>>>>>>>>>>
>>>>>>>>>> Now the following crawls will take your URLs with top priority
>>>>>>>>>> and crawl them once a day. I have used my own situation to
>>>>>>>>>> illustrate the concept, but I guess you can tweak the
>>>>>>>>>> parameters to fit your needs.
>>>>>>>>>>
>>>>>>>>>> This approach is useful when you want a regular fetch of some
>>>>>>>>>> URLs; if that is needed only rarely, I guess freegen is the
>>>>>>>>>> right choice.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Mike
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
>>>>>>>>>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
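[Editor's note] The seed-rewriting step quoted above can be exercised on a throwaway file. This is a minimal sketch: `demo_seeds/` and the example URLs are made-up names, and the inject call is commented out because it requires a Nutch installation and an existing CrawlDb. Note the substitution keeps its closing `/` delimiter and re-emits the trailing newline.

```shell
# Hypothetical throwaway seed directory, for illustration only.
mkdir -p demo_seeds
printf 'http://www.example.com/\nhttp://www.example.org/\n' > demo_seeds/urls.txt

# Append score and fetch-interval metadata to every seed line,
# tab-separated, in the form the Nutch injector reads.
perl -pi -e 's/^(.*)\n$/$1\tnutch.score=100\tnutch.fetchInterval=80000\n/' demo_seeds/*

cat demo_seeds/urls.txt

# The rewritten seeds would then be injected with something like:
# bin/nutch inject crawl/crawldb demo_seeds
```

After the one-liner, each line reads `URL<TAB>nutch.score=100<TAB>nutch.fetchInterval=80000`, which is what makes the re-injected records rank high in generation and become refetchable daily.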