I am just curious. What is IndexReader.undeleteAll needed for?
In Nutch we have a rotating set of indexes. For example, we might create a new index every day. Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g., every day merge (or search w/o merging) the most recent 30 indexes. So far so good. But many pages are clones of other pages: different urls with the same content. So, each time we deploy a new set of indexes we need to first perform duplicate detection to make sure that, for each unique content, only a single url is present: the one with the highest link analysis score. I implement this by first calling undeleteAll(), then performing global duplicate detection and deleting the duplicates from their indexes. Does this make sense? Each day duplicate detection must be repeated when a new index is added, but first all of the previously detected duplicates must be cleared.
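In Lucene terms, the daily pass looks roughly like the sketch below. This is simplified: the field names ("digest" for the content hash, "score" for the link analysis score) are placeholders rather than the real schema, and the reader methods assume the plain IndexReader API (open/undeleteAll/deleteDocument/close).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    import java.util.HashMap;
    import java.util.Map;

    // Rough sketch of the daily dedup pass over a rotating set of indexes.
    public class DedupSketch {

      static class Best {            // current winner for one content digest
        IndexReader reader;
        int doc;
        float score;
      }

      public static void dedup(String[] indexDirs) throws Exception {
        IndexReader[] readers = new IndexReader[indexDirs.length];

        // 1. Clear the deletions left over from the previous dedup pass.
        for (int i = 0; i < indexDirs.length; i++) {
          readers[i] = IndexReader.open(indexDirs[i]);
          readers[i].undeleteAll();
        }

        // 2. For each content digest keep only the url with the highest
        //    link analysis score; delete every other copy.
        Map best = new HashMap();    // digest -> Best
        for (int i = 0; i < readers.length; i++) {
          IndexReader r = readers[i];
          for (int d = 0; d < r.maxDoc(); d++) {
            Document doc = r.document(d);
            String digest = doc.get("digest");                 // placeholder field
            float score = Float.parseFloat(doc.get("score"));  // placeholder field

            Best b = (Best) best.get(digest);
            if (b == null) {
              b = new Best();
              b.reader = r; b.doc = d; b.score = score;
              best.put(digest, b);
            } else if (score > b.score) {
              b.reader.deleteDocument(b.doc);  // previous winner is now a duplicate
              b.reader = r; b.doc = d; b.score = score;
            } else {
              r.deleteDocument(d);             // this copy is a duplicate
            }
          }
        }

        // 3. Closing the readers commits the deletions to disk.
        for (int i = 0; i < readers.length; i++) {
          readers[i].close();
        }
      }
    }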
That's quite interesting. I am currently involved in a small crawling project. We only crawl a very limited number of news pages, some of them several times per day. We found that there are often tiny changes on these pages (spelling corrections, banner changes) which we would like to ignore (classify as duplicates), while we want to recognize bigger changes. For such a setting, MD5 keys are not very helpful. How do you detect duplicates in Nutch?
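What I have in mind is something along the lines of a shingle-based similarity test, roughly like the sketch below: two page versions count as duplicates when their word-shingle sets overlap above a threshold, so a fixed typo or swapped banner is ignored while a larger rewrite is not. This is only an illustration of the idea, not a suggestion that Nutch works this way.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: near-duplicate detection via k-word shingles and Jaccard similarity.
    public class ShingleSimilarity {

      // Collect all k-word shingles of a page's text.
      static Set shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\s+");
        Set result = new HashSet();
        for (int i = 0; i + k <= words.length; i++) {
          StringBuffer sb = new StringBuffer();
          for (int j = 0; j < k; j++) {
            sb.append(words[i + j]).append(' ');
          }
          result.add(sb.toString());
        }
        return result;
      }

      // Jaccard similarity: size of intersection divided by size of union.
      static double similarity(String a, String b, int k) {
        Set sa = shingles(a, k);
        Set sb = shingles(b, k);
        Set intersection = new HashSet(sa);
        intersection.retainAll(sb);
        Set union = new HashSet(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
      }

      public static void main(String[] args) {
        String v1 = "the quick brown fox jumps over the lazy dog near the river";
        String v2 = "the quick brown fox jumps over the lazy cat near the river";
        // A one-word change keeps the similarity high, so with a threshold
        // like 0.7 the new version would still be classified as a duplicate.
        System.out.println(similarity(v1, v2, 3));
      }
    }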
Christoph
