Re: [Nutch-general] Filtering URLs in CrawlDB

Dennis Kubes Tue, 09 Jan 2007 12:17:44 -0800

My stupid mistake.  I am using an older version, a customized 0.8 branch, 
which didn't have normalization.  I added normalization to it, but in the 
process I wasn't updating the key with the normalized URL during mergesegs 
filtering.
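
For anyone who hits the same thing, here is a rough, self-contained sketch 
of the kind of mismatch I mean. It is not the actual Nutch code: the class 
name, the normalize() helper (built on java.net.URI) and the plain maps 
standing in for crawldb and segment records are all made up for 
illustration. The point is only that if records are normalized but still 
emitted under the old key, they no longer line up with entries keyed by the 
normalized URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: none of these names come from Nutch.
class KeyNormalizationSketch {

    // Hypothetical stand-in for a URL normalizer: lowercases scheme and host
    // and resolves "." and ".." path segments. Returns null if unparsable.
    static String normalize(String url) {
        try {
            URI u = new URI(url).normalize();
            String scheme = u.getScheme() == null ? null
                    : u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost() == null ? null
                    : u.getHost().toLowerCase(Locale.ROOT);
            return new URI(scheme, u.getUserInfo(), host, u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // The mistake: the URL is normalized, but the record is still written
    // under its original (un-normalized) key.
    static Map<String, String> mergeKeepingOldKey(Map<String, String> records) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : records.entrySet()) {
            String norm = normalize(e.getKey());
            if (norm == null) continue;          // treated as filtered out
            out.put(e.getKey(), e.getValue());   // old key kept
        }
        return out;
    }

    // The fix: write the record under the normalized URL.
    static Map<String, String> mergeWithNormalizedKey(Map<String, String> records) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : records.entrySet()) {
            String norm = normalize(e.getKey());
            if (norm == null) continue;
            out.put(norm, e.getValue());         // key updated
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> segment = new LinkedHashMap<>();
        segment.put("HTTP://Example.COM/a/./b/../c.html", "parsed content");

        // Key under which the (already normalized) db would hold this URL.
        String dbKey = normalize("HTTP://Example.COM/a/./b/../c.html");

        System.out.println("old key kept, found: "
                + mergeKeepingOldKey(segment).containsKey(dbKey));
        System.out.println("key updated, found:  "
                + mergeWithNormalizedKey(segment).containsKey(dbKey));
    }
}
```

In the first variant the lookup by normalized URL misses the record, and in 
the second it finds it.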


Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> If I wrote a new normalizer and added some regex filters to filter out 
>> urls in crawldb and then I ran mergedb with a single db to filter and 
>> then ran mergesegs with a single segment to filter does anyone know if 
>> I would then be required to run through a re-parse?
> 
> Re-parse - no; re-index - yes.
> 
>>
>> Reason I am asking is because I went through this process without a 
>> re-parse and upon indexing I get blank index files.  So what I was 
>> thinking is that urls weren't matching up because they were now 
>> normalized.
> 
> Most likely your index is out of sync with your merged segment. Indexes 
> contain segment names and document id-s inside, so if you have 
> merged/sliced your segments you have to rebuild the index too.
> 

