Re: repeat fetch of same page without error

reinhard schwab Tue, 02 Feb 2010 17:00:32 -0800

i have never used or tested 0.9.
i have looked into the code; it is quite different from 1.0 with regard to
CrawlDbReducer and scheduling.
i propose changing the method


  public void setNextFetchTime() {
    fetchTime += (long)(MILLISECONDS_PER_DAY*fetchInterval);
  }

in CrawlDatum.java:
check there for a fetchInterval of 0.0 and, if found, set it to the
default interval before the next fetch time is computed.
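
A minimal sketch of that check, written as a standalone class rather than a
patch against Nutch 0.9 (which has not been tested here). The class name
FetchIntervalSketch, the method names, and the constant DEFAULT_INTERVAL_DAYS
are hypothetical; the 7.0 is only the db.default.fetch.interval value
mentioned further down in this thread. In the real CrawlDatum the default
would have to come from the configuration.

```java
// Standalone illustration only; this is NOT the Nutch CrawlDatum class.
final class FetchIntervalSketch {
    static final long MILLISECONDS_PER_DAY = 24L * 60 * 60 * 1000;

    // Hypothetical fallback, assumed to mirror db.default.fetch.interval (days).
    static final float DEFAULT_INTERVAL_DAYS = 7.0f;

    long fetchTime;       // next fetch time in milliseconds since the epoch
    float fetchInterval;  // interval between fetches, in days

    FetchIntervalSketch(long fetchTime, float fetchInterval) {
        this.fetchTime = fetchTime;
        this.fetchInterval = fetchInterval;
    }

    // The method as quoted above: an interval of 0.0 adds nothing,
    // so the fetch time never moves and the page stays due.
    void setNextFetchTimeOriginal() {
        fetchTime += (long) (MILLISECONDS_PER_DAY * fetchInterval);
    }

    // Proposed variant: replace a zero (or negative) interval with the
    // default before advancing the fetch time.
    void setNextFetchTimeGuarded() {
        if (fetchInterval <= 0.0f) {
            fetchInterval = DEFAULT_INTERVAL_DAYS;
        }
        fetchTime += (long) (MILLISECONDS_PER_DAY * fetchInterval);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        FetchIntervalSketch broken = new FetchIntervalSketch(now, 0.0f);
        broken.setNextFetchTimeOriginal();
        System.out.println("original advanced by ms: " + (broken.fetchTime - now));

        FetchIntervalSketch fixed = new FetchIntervalSketch(now, 0.0f);
        fixed.setNextFetchTimeGuarded();
        System.out.println("guarded advanced by ms:  " + (fixed.fetchTime - now));
    }
}
```

With an interval of 0.0 the original method leaves fetchTime where it was,
which would explain why the same page is selected again on every round.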


Sunnyvale Fl wrote:
> Looks like the right patch for my problem; unfortunately we are still on
> Nutch 0.9.  The patches are for FetchSchedule which doesn't exist in 0.9...
> Any idea?  Is there an older patch?
> Thanks!
>
> On Thu, Jan 21, 2010 at 6:35 PM, reinhard schwab
> <reinhard.sch...@aon.at> wrote:
>
>> some time ago i had the same problem with nutch 1.0 and i discovered
>> one bug.
>>
>> https://issues.apache.org/jira/browse/NUTCH-774
>> https://issues.apache.org/jira/browse/NUTCH-773
>>
>> you will find patches there.
>>
>> Sunnyvale Fl wrote:
>>> You know you are right.  I dumped the db for another url and the retry
>>> interval is 0.0.  For the same crawl, some urls' retry interval is 7.0.
>>> Why is that?  I have db.default.fetch.interval set to 7.0 in
>>> nutch-site.xml.  Thanks!
>>>
>>> Version: 5
>>> Status: 2 (db_fetched)
>>> Fetch time: Thu Jan 21 08:55:24 PST 2010
>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>> Retries since fetch: 0
>>> Retry interval: 0.0 days
>>> Score: 0.0
>>> Signature: 09854146546e5e7fe5def1e1add23037
>>> Metadata: _pst_:success(1), lastModified=0
>>>
>>>
>>> On Thu, Jan 21, 2010 at 5:50 PM, reinhard schwab <reinhard.sch...@aon.at>
>>> wrote:
>>>
>>>> yes, i mean that.
>>>> in the java classes, it is called fetch interval, see CrawlDatum class.
>>>> do you use the adddays option when generating the segment?
>>>> if the value is higher than the fetch interval, then it can also happen
>>>> that you crawl a page again and again.
>>>>
>>>> the fetch time in your entry is Nov 06 2009.
>>>> the last time it has been fetched is before this date.
>>>> it has not been refetched since that time.
>>>>
>>>>
>>>> Sunnyvale Fl wrote:
>>>>
>>>>> You mean the retry interval?  It is 7 days from readdb -
>>>>>
>>>>> Version: 5
>>>>> Status: 2 (db_fetched)
>>>>> Fetch time: Fri Nov 06 07:48:54 PST 2009
>>>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 7.0 days
>>>>> Score: 0.0
>>>>> Signature: 5ec8dc313a9ae4d61c6e8c9d9c18ea26
>>>>> Metadata: _pst_:success(1), lastModified=0
>>>>>
>>>>>
>>>>> On Thu, Jan 21, 2010 at 5:00 PM, reinhard schwab <reinhard.sch...@aon.at>
>>>>> wrote:
>>>>>
>>>>>> using "nutch readdb" you can dump the entry of the page.
>>>>>> i believe that the fetch interval of this page is zero.
>>>>>>
>>>>>> Sunnyvale Fl wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am using Nutch 0.9.1 and I am having this weird problem - it will
>>>>>>> repeatedly fetch the same page without error.  So if I let it run to 10
>>>>>>> levels deep, the same page will be fetched 10 times.  What's wrong?
>>>>>>> Thanks!
