Re: [Nutch-general] near duplicates

Isabel Drost Wed, 18 Oct 2006 02:06:04 -0700

Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.


As an alternative you could also have a look at the paper "Detecting 
Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark 
Manasse, Marc Najork. 


> One major problem with this is the structure of the document is 
> no longer important. Are there any obvious pitfalls? For example: Document
> A being a subset of Document B but in no particular order.

I think this case is pretty unlikely. However, I am not sure that near 
duplicates can be detected by comparing term-document vectors alone. 
Documents in which words were slightly changed or replaced with synonyms 
might cause problems.
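
To illustrate both points, here is a minimal sketch in plain Python. It is 
not Nutch or Lucene code, and the helper names term_vector and cosine are 
made up for this example. It compares documents by the cosine of their 
term-frequency vectors.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (hypothetical helper, word order is discarded)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


original = "the quick brown fox jumps over the lazy dog"
reordered = "the lazy dog jumps over the quick brown fox"
synonyms = "the fast brown fox leaps over the lazy dog"

# Same words in a different order: the term vectors are identical.
print(cosine(term_vector(original), term_vector(reordered)))
# Two words replaced by synonyms: the score drops although the meaning is the same.
print(cosine(term_vector(original), term_vector(synonyms)))
```

The reordered pair is indistinguishable from an exact duplicate, while the 
synonym pair scores lower. How much lower depends on document length, so any 
cut-off for "near duplicate" would have to be tuned on real data.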

However, if you want to keep some information on the word order, you might 
consider comparing n-gram document vectors. That is, each dimension in the 
vector represents not a single word but a sequence of 2, 3, 4, 5... 
consecutive words.
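
A rough sketch of the n-gram idea, again in plain Python rather than Nutch 
code. The names ngram_vector and cosine and the default n=2 are assumptions 
chosen for illustration. Each dimension of the vector is a tuple of n 
consecutive words.

```python
import math
import re
from collections import Counter


def ngram_vector(text, n=2):
    """Count vector over word n-grams (hypothetical helper, n=2 is an arbitrary default)."""
    words = re.findall(r"\w+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


original = "the quick brown fox jumps over the lazy dog"
phrases_swapped = "the lazy dog jumps over the quick brown fox"
words_reversed = "dog lazy the over jumps fox brown quick the"

# Whole phrases moved around: most bigrams survive, so the score stays high but below 1.
print(cosine(ngram_vector(original), ngram_vector(phrases_swapped)))
# Every word in reverse order: no bigram survives, although the word set is unchanged.
print(cosine(ngram_vector(original), ngram_vector(words_reversed)))
```

Larger n keeps more of the word order but makes the vectors sparser, so a 
single changed word removes up to n dimensions at once.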

Cheers,
Isabel

-- 
QOTD: Knucklehead: "Knock, knock" Pee Wee: "Who's there?" Knucklehead: "Little 
ol' lady." Pee Wee: "Liddle ol' lady who?" Knucklehead: "I didn't know you 
could yodel" 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_   VoIP:    <sip://[EMAIL PROTECTED]>
 |,4-  ) )-,_..;\ (  `'-'  Jabber: <xmpp://[EMAIL PROTECTED]>
'---''(_/--'  `-'\_) (fL)  Kein ToFu:  <http://learn.to/quote>



