Re: Nutch indexes less pages, then it fetches

caezar Wed, 28 Oct 2009 04:45:49 -0700

Thanks, I checked, and it was parsed. Still no answer as to why it was not indexed.

reinhard schwab wrote:
> 
> yes, it's permanently redirected.
> you can also check the segment status of this url.
> here is an example:
> 
> reinh...@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20";
> 
> it will show you whether it is parsed and the extracted outlinks.
> it will show any data related to this url stored in the segment.
> 
> regards
> 
> caezar schrieb:
>> Thanks, that was really helpful. I've moved forward but still haven't found
>> the solution.
>> So the status of the initial URL
>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>> is:
>> Status: 5 (db_redir_perm)
>> Metadata: _pst_: moved(12), lastModified=0:
>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>
>> So that answers the question of why the initial page was not indexed: it
>> was redirected.
>> Now checking the status of the redirect target:
>> Status: 2 (db_fetched)
>>
>> So it was successfully fetched. But, according to the indexing log, it still
>> was not sent to the indexer!
>>
>>
>>
>> reinhard schwab wrote:
>>   
>>> what is the db status of this url in your crawl db?
>>> if it is STATUS_DB_NOTMODIFIED,
>>> then it may be the reason.
>>> you can check it by looking the url up in your crawl db with:
>>> reinh...@thord:>bin/nutch readdb  <crawldb> -url <url>
>>>
>>> a url has this status if it is recrawled and its signature has not
>>> changed.
>>> the signature is an MD5 hash of the content.
>>>
>>> another reason may be that you have some indexing filters.
>>> i don't believe that's the reason here.
>>>
>>> regards
>>>
>>>
>>> kevin chen schrieb:
>>>     
>>>> I have had a similar experience.
>>>>
>>>> Reinhard Schwab responded with a possible fix. See the mail in this group
>>>> from Reinhard Schwab at
>>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>>
>>>> I haven't had a chance to try it out yet.
>>>>  
>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>   
>>>>       
>>>>> Hi All,
>>>>>
>>>>> I've got a strange problem: Nutch indexes far fewer URLs than it
>>>>> fetches. For example, take this URL:
>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>> I assume that it was fetched successfully because the fetch logs mention
>>>>> it only once:
>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>
>>>>> But it was not sent to the indexer during the indexing phase (I'm using a
>>>>> custom NutchIndexWriter that logs every page for which its write method
>>>>> is executed). What could be the reason? Is there a way to browse the
>>>>> crawldb to ensure that the page was really fetched? What else could I check?
>>>>>
>>>>> Thanks
>>>>>     
>>>>>         
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
>
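
For reference, the checks suggested in the quoted messages (a crawl db lookup with readdb, then a segment lookup with readseg) could be wrapped in a small script along these lines. This is only a sketch: the crawl/crawldb path and the helper names run and inspect_url are assumptions for illustration, and the segment path is the one from the example above. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the two lookups discussed in this thread.
# Assumed defaults (adjust to your own crawl layout):
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"                  # hypothetical crawl db path
SEGMENT="${SEGMENT:-crawl/segments/20091028122455}"  # segment from the example above

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Look a URL up in the crawl db (status such as db_fetched or db_redir_perm),
# then show everything the segment stores for it (parse data, outlinks).
inspect_url() {
  run "$NUTCH" readdb "$CRAWLDB" -url "$1"
  run "$NUTCH" readseg -get "$SEGMENT" "$1"
}
```

If the first lookup reports db_redir_perm, calling inspect_url again with the redirect target shown in the metadata repeats the same two checks for that page.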


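To illustrate the signature point from the quoted reply (an MD5 hash of the content that stays the same when the page has not changed), here is a rough sketch. It is not Nutch's own code; the signature helper is a made-up name, and it assumes GNU md5sum is installed.

```shell
#!/bin/sh
# Sketch: an MD5 "signature" of some page content, computed with md5sum.
# The helper name `signature` is hypothetical; Nutch computes its own
# signature internally.
signature() {
  printf '%s' "$1" | md5sum | cut -d' ' -f1
}

old_sig="$(signature '<html>same page</html>')"
new_sig="$(signature '<html>same page</html>')"

# Identical content gives an identical signature, which is the situation
# in which a recrawled page could end up as not modified.
if [ "$old_sig" = "$new_sig" ]; then
  echo "signature unchanged"
else
  echo "signature changed"
fi
```
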
-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093230.html
Sent from the Nutch - User mailing list archive at Nabble.com.
