Re: Nutch indexes less pages, then it fetches

reinhard schwab Wed, 28 Oct 2009 07:26:41 -0700

is your problem solved now?

this can be ok.
newly discovered urls are added to a segment when the fetched documents
are parsed, provided these urls pass the url filters.
they will not have a Crawl Generate datum because they are unknown until
they are extracted.
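
to illustrate the check caezar quoted further down, here is a minimal sketch. it is not the Nutch source: the class IndexSkipSketch, the enum Kind and the method wouldIndex are made-up names, and the real reduce method works on CrawlDatum, ParseText and ParseData objects rather than on an enum. it only shows that a url arriving with fetch and parse records but no db record is skipped.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the indexer's reduce step; NOT the actual Nutch code.
class IndexSkipSketch {

    // Hypothetical record kinds that can arrive for one URL.
    enum Kind { DB_DATUM, FETCH_DATUM, PARSE_TEXT, PARSE_DATA, INLINKS }

    // Mirrors the quoted check: all four parts must be present to index.
    static boolean wouldIndex(List<Kind> recordsForUrl) {
        boolean dbDatum = false, fetchDatum = false;
        boolean parseText = false, parseData = false;
        for (Kind k : recordsForUrl) {
            switch (k) {
                case DB_DATUM:    dbDatum = true;    break;
                case FETCH_DATUM: fetchDatum = true; break;
                case PARSE_TEXT:  parseText = true;  break;
                case PARSE_DATA:  parseData = true;  break;
                default:          break; // inlinks alone never trigger indexing
            }
        }
        return dbDatum && fetchDatum && parseText && parseData;
    }

    public static void main(String[] args) {
        // URL for which a db record is available.
        List<Kind> known = Arrays.asList(
            Kind.DB_DATUM, Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);
        // URL discovered during the fetch: no db record yet.
        List<Kind> discovered = Arrays.asList(
            Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);
        System.out.println("known url indexed:      " + wouldIndex(known));
        System.out.println("discovered url indexed: " + wouldIndex(discovered));
    }
}
```

in this model the second url is only indexed once a db record exists for it, which matches Andrzej's advice to run updatedb before indexing.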


regards

caezar wrote:
> I've compared the segment data of the URL which has no redirect and was
> indexed correctly with that of this "bad" URL, and there really is a difference.
> The first one has a db record in the segment:
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 28 16:01:05 EET 2009
> Modified time: Thu Jan 01 02:00:00 EET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1256738472613
>  
> But the second one has no such record, which seems reasonable: it was not
> added to the segment at the generate stage, it was added at the fetch stage. Is
> this a bug in Nutch? Or am I missing some configuration option?
>
> caezar wrote:
>   
>> I'm pretty sure that I ran both commands before indexing
>>
>> Andrzej Bialecki wrote:
>>     
>>> caezar wrote:
>>>       
>>>> Some more information. Debugging the reduce method, I've noticed that before
>>>> the code
>>>>     if (fetchDatum == null || dbDatum == null
>>>>         || parseText == null || parseData == null) {
>>>>       return;                                     // only have inlinks
>>>>     }
>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>>> null. That's why it's skipped :)
>>>> Any ideas about the reason?
>>>>         
>>> Yes - you should run updatedb with this segment, and also run 
>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>> db status won't be updated properly.
>>>
>>>
>>> -- 
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>   ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>>
>>>       
>>     
>
>
