Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.

As an alternative, you could also have a look at the paper "Detecting Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark Manasse, and Marc Najork.

> One major problem with this is the structure of the document is
> no longer important. Are there any obvious pitfalls? For example: Document
> A being a subset of Document B but in no particular order.

I think this case is pretty unlikely. But I am not sure whether you can detect near duplicates by comparing only term-document vectors. There might be problems with documents in which words were slightly changed or replaced with synonyms.

However, if you want to keep some information about the word order, you might consider comparing n-gram document vectors. That is, each dimension in the vector represents not just one word but a sequence of 2, 3, 4, 5... words.

Cheers,
Isabel

--
Web: <http://www.isabel-drost.de>
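
A minimal sketch of the n-gram document vector idea mentioned above, in Python. It is not taken from Nutch or Lucene: the function names `ngram_vector` and `cosine`, the whitespace tokenization, and the choice of cosine similarity are all assumptions made for illustration. The two sample documents contain the same words in reversed order, so they show the case where word order matters.

```python
import math
from collections import Counter


def ngram_vector(text, n):
    """Count the word n-grams in text (hypothetical helper, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Same words, reversed order (illustrative data only).
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "dog lazy the over jumps fox brown quick the"

unigram_sim = cosine(ngram_vector(doc_a, 1), ngram_vector(doc_b, 1))
bigram_sim = cosine(ngram_vector(doc_a, 2), ngram_vector(doc_b, 2))
print(unigram_sim, bigram_sim)
```

With single words as dimensions the two documents look identical (similarity of about 1.0), while with 2-word sequences they share no dimension at all. Longer n-grams keep more of the word order but also make the comparison more sensitive to small edits, so the value of n and any duplicate threshold would need tuning on real data.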
_______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general
