No, the problem is not solved. Everything happens as you described, but the page is still not indexed because of this condition in the IndexerMapReduce code:

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

For this page dbDatum is null, so it is not indexed!

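To show the effect of that check in isolation, here is a minimal Java sketch. It is not Nutch code: Record, passesGuard and indexable are hypothetical names, and plain Objects stand in for the real CrawlDatum, ParseText and ParseData values. It only demonstrates that a page with a null dbDatum is dropped even when the other three values are present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexGuardSketch {

    /** Hypothetical stand-in for the four values gathered per URL in reduce(). */
    public static final class Record {
        final String url;
        final Object fetchDatum;
        final Object dbDatum;
        final Object parseText;
        final Object parseData;

        public Record(String url, Object fetchDatum, Object dbDatum,
                      Object parseText, Object parseData) {
            this.url = url;
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
        }
    }

    /** Same condition as quoted above: any missing value means "skip". */
    public static boolean passesGuard(Record r) {
        if (r.fetchDatum == null || r.dbDatum == null
                || r.parseText == null || r.parseData == null) {
            return false; // the quoted code returns here without indexing
        }
        return true;
    }

    /** Returns the URLs that would get past the check. */
    public static List<String> indexable(List<Record> records) {
        List<String> urls = new ArrayList<>();
        for (Record r : records) {
            if (passesGuard(r)) {
                urls.add(r.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Example URLs are made up; the second one models my "bad" page.
        List<Record> records = Arrays.asList(
            new Record("http://example.com/generated", "fetch", "db", "text", "data"),
            new Record("http://example.com/redirect-target", "fetch", null, "text", "data"));
        System.out.println(indexable(records));
    }
}
```

Only the first URL gets through; the second is skipped purely because its dbDatum is missing, which matches what I see in the debugger.
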
reinhard schwab wrote:
>
> is your problem solved now???
>
> this can be ok.
> new discovered urls will be added to a segment when fetched documents
> are parsed and if these urls pass the filters.
> they will not have a crawl datum Generate because they are unknown until
> they are extracted.
>
> regards
>
> caezar schrieb:
>> I've compared the segments data of the URL which have no redirect and was
>> indexed correctly, with this "bad" URL, and there is really a difference.
>> First one have db record in the segment:
>> Crawl Generate::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>> Modified time: Thu Jan 01 02:00:00 EET 1970
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1256738472613
>>
>> But second one have no such record, which seems pretty fine: it was not
>> added to the segment on generate stage, it was added on the fetch stage.
>> Is this a bug in Nutch? Or I'm missing some configuration option?
>>
>> caezar wrote:
>>
>>> I'm pretty sure that I ran both commands before indexing
>>>
>>> Andrzej Bialecki wrote:
>>>
>>>> caezar wrote:
>>>>
>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>> before code
>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>         || parseText == null || parseData == null) {
>>>>>       return;                                     // only have inlinks
>>>>>     }
>>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum
>>>>> is null. Thats why it's skipped :)
>>>>> Any ideas about the reason?
>>>>>
>>>> Yes - you should run updatedb with this segment, and also run
>>>> invertlinks with this segment, _before_ trying to index. Otherwise the
>>>> db status won't be updated properly.
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>
>>
>

--
View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095761.html
Sent from the Nutch - User mailing list archive at Nabble.com.