Re: Nutch indexes less pages, then it fetches

caezar Wed, 28 Oct 2009 07:34:47 -0700

No, the problem is not solved. Everything happens as you described, but the
page is not indexed because of this condition:
    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }
in the IndexerMapReduce code. For this page dbDatum is null, so it is not
indexed!
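
To make the failure mode concrete, here is a minimal, self-contained sketch
of that guard. It is not the Nutch source: the class and method names
(IndexGuardSketch, missingParts, wouldBeIndexed) are made up for
illustration, and plain Objects stand in for the real CrawlDatum, ParseText
and ParseData values. It only shows that a single null input, here dbDatum,
is enough for the record to be skipped.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for the guard quoted from IndexerMapReduce.
 * Plain Objects replace Nutch's own datum and parse types (an assumption
 * made so that the sketch runs without Nutch on the classpath).
 */
class IndexGuardSketch {

    /** Returns the names of the inputs that are null for one URL. */
    public static List<String> missingParts(Object fetchDatum, Object dbDatum,
                                            Object parseText, Object parseData) {
        List<String> missing = new ArrayList<>();
        if (fetchDatum == null) missing.add("fetchDatum");
        if (dbDatum == null) missing.add("dbDatum");
        if (parseText == null) missing.add("parseText");
        if (parseData == null) missing.add("parseData");
        return missing;
    }

    /** Mirrors the quoted condition: index only if none of the four is null. */
    public static boolean wouldBeIndexed(Object fetchDatum, Object dbDatum,
                                         Object parseText, Object parseData) {
        return missingParts(fetchDatum, dbDatum, parseText, parseData).isEmpty();
    }

    public static void main(String[] args) {
        Object present = new Object();
        // The "bad" URL from this thread: everything present except dbDatum.
        System.out.println(missingParts(present, null, present, present));   // prints [dbDatum]
        System.out.println(wouldBeIndexed(present, null, present, present)); // prints false
    }
}
```

Logging the missing part names in this way, rather than returning silently,
would have pointed straight at dbDatum without a debugger.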


reinhard schwab wrote:
> 
> is your problem solved now???
> 
> this can be ok.
> newly discovered urls will be added to a segment when fetched documents
> are parsed, if these urls pass the filters.
> they will not have a crawl datum Generate because they are unknown until
> they are extracted.
> 
> regards
> 
> caezar schrieb:
>> I've compared the segment data of a URL which has no redirect and was
>> indexed correctly with this "bad" URL, and there really is a difference.
>> The first one has a db record in the segment:
>> Crawl Generate::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>> Modified time: Thu Jan 01 02:00:00 EET 1970
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1256738472613
>>  
>> But the second one has no such record, which seems reasonable: it was not
>> added to the segment at the generate stage, it was added at the fetch
>> stage. Is this a bug in Nutch? Or am I missing some configuration option?
>>
>> caezar wrote:
>>   
>>> I'm pretty sure that I ran both commands before indexing
>>>
>>> Andrzej Bialecki wrote:
>>>     
>>>> caezar wrote:
>>>>       
>>>>> Some more information. While debugging the reduce method, I noticed
>>>>> that just before this code
>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>         || parseText == null || parseData == null) {
>>>>>       return;                                     // only have inlinks
>>>>>     }
>>>>> my page has non-null fetchDatum, parseText and parseData, but dbDatum
>>>>> is null. That's why it's skipped :)
>>>>> Any ideas about the reason?
>>>>>         
>>>> Yes - you should run updatedb with this segment, and also run 
>>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>>> db status won't be updated properly.
>>>>
>>>>
>>>> -- 
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>>
>>>>
>>>>       
>>>     
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095761.html
Sent from the Nutch - User mailing list archive at Nabble.com.
