That's quite interesting. I am currently involved in a small crawling project. We only crawl a very limited number of news pages, some of them several times per day. We found that there are often tiny changes on these pages (spelling corrections, banner changes) which we would like to ignore
(classify as duplicates) while we want to recognize bigger changes. For such a setting MD5 keys are
not very helpful. How do you detect duplicates in Nutch?
Nutch currently does only MD5-based duplicate elimination, so only exact duplicates are eliminated.
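To illustrate why MD5 keys miss near-duplicates: a single-character edit (the page texts below are made-up examples) produces a completely different digest, so hash comparison gives no similarity signal at all.

```python
import hashlib

# Two page bodies differing only by a one-character spelling fix.
page_a = b"The council approved the new budget on Tuesday."
page_b = b"The council aproved the new budget on Tuesday."

digest_a = hashlib.md5(page_a).hexdigest()
digest_b = hashlib.md5(page_b).hexdigest()

# The digests are entirely unrelated: exact-match dedup treats
# these near-identical pages as two distinct documents.
print(digest_a)
print(digest_b)
print(digest_a == digest_b)  # False
```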
There's been a fair amount of work on better methods. For example, there was Broder et al.'s "Syntactic Clustering" work (http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/).
However, I've never seen anyone demonstrate that such methods can be applied efficiently to huge collections. Perhaps they can, but it's not obvious to me. I've also not followed this literature closely.
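The core idea of that syntactic-clustering work can be sketched in a few lines: break each document into w-word "shingles" and compare documents by the Jaccard similarity ("resemblance") of their shingle sets. This is a minimal illustration of the technique, not Nutch code; the example texts and the window size w=4 are arbitrary choices.

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

doc   = "The council approved the new budget on Tuesday after a long debate."
fixed = "The council aproved the new budget on Tuesday after a long debate."
other = "Storm warnings were issued for the coast late on Monday evening."

# A spelling fix disturbs only the shingles overlapping the changed word,
# so the near-duplicate keeps a high resemblance; an unrelated page
# shares essentially no shingles and scores near zero.
print(resemblance(doc, fixed))
print(resemblance(doc, other))
```

The scalability question is exactly where this gets hard: comparing shingle sets pairwise is quadratic in the collection size, which is why Broder's paper goes on to sample shingles via min-wise hashing rather than keeping full sets.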
Doug
