Is your problem solved now? This can be OK: newly discovered URLs are added to a segment when fetched documents are parsed, provided those URLs pass the filters. They will not have a Generate crawl datum, because they are unknown until they are extracted.
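
To make the point concrete, here is a minimal sketch of the situation in plain Java. It does not use the Nutch API: the class IndexCheckSketch and its methods inject, fetch, updateDb and isIndexable are hypothetical names made up for illustration, and the crawl db is modelled as a simple set of known URLs. It only mirrors the null check quoted below, where a page is skipped at indexing time if it has no db datum.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model, NOT the Nutch API: it only mimics which URLs
// have a db record and which have fetch/parse output in a segment.
public class IndexCheckSketch {

    // URLs the crawl db knows about (these would yield a non-null dbDatum).
    private final Set<String> crawlDb = new HashSet<>();

    // URLs with a fetch datum and parse output in the segment.
    private final Set<String> fetchedAndParsed = new HashSet<>();

    // A URL known before generate, e.g. a seed URL.
    void inject(String url) {
        crawlDb.add(url);
    }

    // A URL fetched and parsed in the segment. It may be unknown to the
    // crawl db, e.g. a redirect target discovered at fetch time.
    void fetch(String url) {
        fetchedAndParsed.add(url);
    }

    // Assumed effect of running updatedb with the segment: every URL seen
    // in the segment becomes known to the crawl db.
    void updateDb() {
        crawlDb.addAll(fetchedAndParsed);
    }

    // Mirrors the quoted check: skip the page unless both the db record
    // and the fetch/parse records are present.
    boolean isIndexable(String url) {
        return crawlDb.contains(url) && fetchedAndParsed.contains(url);
    }

    public static void main(String[] args) {
        IndexCheckSketch sketch = new IndexCheckSketch();
        sketch.inject("http://example.com/a");
        sketch.fetch("http://example.com/a");
        sketch.fetch("http://example.com/b"); // discovered only at fetch time

        System.out.println("a indexable: " + sketch.isIndexable("http://example.com/a"));
        System.out.println("b before updateDb: " + sketch.isIndexable("http://example.com/b"));

        sketch.updateDb();
        System.out.println("b after updateDb: " + sketch.isIndexable("http://example.com/b"));
    }
}
```

In this toy model the URL discovered at fetch time only becomes indexable once the crawl db has been updated with the segment, which is why a missing Generate datum on its own is not a problem.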

Regards

caezar wrote:
> I've compared the segment data of a URL which has no redirect and was
> indexed correctly with this "bad" URL, and there really is a difference.
> The first one has a db record in the segment:
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 28 16:01:05 EET 2009
> Modified time: Thu Jan 01 02:00:00 EET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1256738472613
>
> But the second one has no such record, which seems pretty fine: it was not
> added to the segment at the generate stage, it was added at the fetch stage. Is
> this a bug in Nutch? Or am I missing some configuration option?
>
> caezar wrote:
>
>> I'm pretty sure that I ran both commands before indexing
>>
>> Andrzej Bialecki wrote:
>>
>>> caezar wrote:
>>>
>>>> Some more information. Debugging the reduce method, I've noticed that before
>>>> the code
>>>>     if (fetchDatum == null || dbDatum == null
>>>>         || parseText == null || parseData == null) {
>>>>       return; // only have inlinks
>>>>     }
>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>>> null. That's why it's skipped :)
>>>> Any ideas about the reason?
>>>>
>>> Yes - you should run updatedb with this segment, and also run
>>> invertlinks with this segment, _before_ trying to index. Otherwise the
>>> db status won't be updated properly.
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com