I have never used or tested 0.9. I have looked into the code; it is quite different from 1.0 with regard to CrawlDbReducer and scheduling. I propose to change the method

  public void setNextFetchTime() {
    fetchTime += (long)(MILLISECONDS_PER_DAY*fetchInterval);
  }

in CrawlDatum.java: check there for a fetchInterval with value 0.0 and, if you find one, set it to the default interval.

Sunnyvale Fl wrote:
> Looks like the right patch for my problem; unfortunately we are still on
> Nutch 0.9. The patches are for FetchSchedule which doesn't exist in 0.9...
> Any idea? Is there an older patch?
> Thanks!
>
> On Thu, Jan 21, 2010 at 6:35 PM, reinhard schwab
> <reinhard.sch...@aon.at> wrote:
>
>> some time ago i have had the same with nutch 1.0 and i have discovered
>> one bug.
>>
>> https://issues.apache.org/jira/browse/NUTCH-774
>> https://issues.apache.org/jira/browse/NUTCH-773
>>
>> you will find patches there.
>>
>> Sunnyvale Fl wrote:
>>
>>> You know you are right. I dump db for another url and the retry interval is
>>> 0.0. For the same crawl, some url's retry interval is 7.0. Why is that? I
>>> have db.default.fetch.interval set to 7.0 in nutch-site.xml. Thanks!
>>>
>>> Version: 5
>>> Status: 2 (db_fetched)
>>> Fetch time: Thu Jan 21 08:55:24 PST 2010
>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>> Retries since fetch: 0
>>> Retry interval: 0.0 days
>>> Score: 0.0
>>> Signature: 09854146546e5e7fe5def1e1add23037
>>> Metadata: _pst_:success(1), lastModified=0
>>>
>>> On Thu, Jan 21, 2010 at 5:50 PM, reinhard schwab <reinhard.sch...@aon.at>
>>> wrote:
>>>
>>>> yes, i mean that.
>>>> in the java classes, it is called fetch interval, see CrawlDatum class.
>>>> do you use the adddays option when generating the segment?
>>>> if the value is higher than the fetch interval, then it can also happen
>>>> that you crawl again and again a page.
>>>>
>>>> the fetch time in your entry is Nov 06 2009.
>>>> the last time it has been fetched is before this date.
>>>> it has not been refetched since that time.
>>>>
>>>> Sunnyvale Fl wrote:
>>>>
>>>>> You mean the retry interval? It is 7 days from readdb -
>>>>>
>>>>> Version: 5
>>>>> Status: 2 (db_fetched)
>>>>> Fetch time: Fri Nov 06 07:48:54 PST 2009
>>>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 7.0 days
>>>>> Score: 0.0
>>>>> Signature: 5ec8dc313a9ae4d61c6e8c9d9c18ea26
>>>>> Metadata: _pst_:success(1), lastModified=0
>>>>>
>>>>> On Thu, Jan 21, 2010 at 5:00 PM, reinhard schwab <reinhard.sch...@aon.at>
>>>>> wrote:
>>>>>
>>>>>> using "nutch readdb" you can dump the entry of the page.
>>>>>> i believe that the fetch interval of this page is zero.
>>>>>>
>>>>>> Sunnyvale Fl wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am using Nutch 0.9.1 and I am having this weird problem - it will
>>>>>>> repeatedly fetch the same page without error. So if I let it run to 10
>>>>>>> levels deep, the same page will be fetched 10 times. What's wrong?
>>>>>>>
>>>>>> Thanks!
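
A rough sketch of the change proposed at the top of this message, in case it helps. It is a standalone model and not the actual Nutch 0.9 CrawlDatum class: the class name FetchIntervalGuard, the helper effectiveInterval and the defaultInterval parameter are made up for illustration. In a real patch the default would presumably come from db.default.fetch.interval, and I have not tested this against 0.9.

```java
// Sketch only: models the proposed guard outside of Nutch.
// FetchIntervalGuard, effectiveInterval and defaultInterval are hypothetical names.
final class FetchIntervalGuard {
    static final long MILLISECONDS_PER_DAY = 24L * 60 * 60 * 1000;

    // Treat an interval of exactly 0.0 days as "unset" and fall back to the default.
    static float effectiveInterval(float fetchInterval, float defaultInterval) {
        return fetchInterval == 0.0f ? defaultInterval : fetchInterval;
    }

    // Same arithmetic as setNextFetchTime(), but using the guarded interval,
    // so a zero interval no longer leaves the fetch time unchanged.
    static long nextFetchTime(long fetchTime, float fetchInterval, float defaultInterval) {
        float interval = effectiveInterval(fetchInterval, defaultInterval);
        return fetchTime + (long) (MILLISECONDS_PER_DAY * interval);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        float defaultInterval = 7.0f; // days, as configured in the thread above

        // An entry with "Retry interval: 0.0 days" is pushed out by the default.
        System.out.println(nextFetchTime(now, 0.0f, defaultInterval) - now);
        // An entry with a non-zero interval keeps its own interval.
        System.out.println(nextFetchTime(now, 1.0f, defaultInterval) - now);
    }
}
```

Without the guard, an entry with a 0.0 interval gets the same fetch time again on every update, which would match the page being selected again at each level of the crawl.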