John Casey wrote:
> On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:
>>
>> Find Me wrote:
>> > How to eliminate near duplicates from the index? Someone suggested that I
>> > could look at the TermVectors and do a comparison to remove the
>> > duplicates.
>>
>> As an alternative you could also have a look at the paper "Detecting
>> Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
>> Manasse, Marc Najork.
>
>
> Another good reference would be Soumen Chakrabarti's reference book,
> "Mining the Web - Discovering Knowledge from Hypertext Data", 2003, and the
> section on shingling and the elimination of near duplicates. Of course I
> think this works at the document level rather than at the term vector level,
> but it might be useful to prevent duplicate documents from being indexed
> altogether.
>
>> > One major problem with this is the structure of the document is
>> > no longer important. Are there any obvious pitfalls? For example:
>> > Document A being a subset of Document B but in no particular order.
>>
>> I think this case is pretty unlikely. But I am not sure whether you can
>> detect near duplicates by only comparing term-document vectors. There might
>> be problems with documents with slightly changed words, words that were
>> replaced with synonyms...
>>
>> However, if you want to keep some information on the word order, you might
>> consider comparing n-gram document vectors. That is, each dimension in the
>> vector does not only represent one word but a sequence of 2, 3, 4, 5...
>> words.
>
> Would this involve something like a window of 2-5 words around a particular
> term in a document?
>
>> Cheers,
>> Isabel
>>
>
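
On the n-gram question quoted above: a rough sketch of what comparing
word n-grams (shingles) could look like. This is only an illustration
under my own assumptions, not code from Nutch or from the references
mentioned; the function names, the choice of n=3 and the use of Jaccard
similarity are all hypothetical. The "window" here is simply n
consecutive words slid across the text, rather than a window centred on
one particular term.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"   # one word changed
doc_c = "dog lazy the over jumps fox brown quick the"   # same words, reversed

sim_ab = jaccard(shingles(doc_a), shingles(doc_b))
sim_ac = jaccard(shingles(doc_a), shingles(doc_c))
same_bag_of_words = sorted(doc_a.split()) == sorted(doc_c.split())

print(sim_ab, sim_ac, same_bag_of_words)
```

doc_a and doc_c contain exactly the same words, so a plain term vector
cannot tell them apart, while their 3-word shingles do not overlap at
all. A single changed word (doc_b) only affects the shingles that
contain it.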

DeleteDuplicates removes documents that have the same digest or the same
URL. If you use TextProfileSignature instead of MD5Signature, it will
also remove near-duplicate documents. The MD5Signature class sets the
digest to the MD5 of the entire content, whereas TextProfileSignature
sets the digest to the MD5 of the significant terms. You should check
the class for implementation details. Also look at SignatureFactory for
how to change the configuration.
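
To illustrate the difference between the two kinds of digest, here is a
minimal sketch in Python. It is not the Nutch implementation (check the
TextProfileSignature source for that); the function names, the
min_count threshold and the tokenization rule are my own assumptions,
chosen only to show why a digest over "significant terms" tolerates
small edits while a digest over the full content does not.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text):
    """MD5 over the full content: any change at all gives a new digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text, min_count=2, min_token_len=2):
    """MD5 over a simplified 'profile' of frequent terms.

    Hypothetical simplification: a term counts as significant when it
    occurs at least min_count times and has at least min_token_len characters.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    # Sort so the digest does not depend on the order terms appear in.
    significant = sorted(t for t, c in counts.items() if c >= min_count)
    profile = "\n".join(significant)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


page_a = "nutch crawls the web and nutch indexes the web"
page_b = "Nutch crawls the web, and Nutch indexes the web. (mirror copy)"
page_c = "Lucene scores the hits and Lucene sorts the hits"

print(exact_signature(page_a) == exact_signature(page_b))      # exact digests
print(profile_signature(page_a) == profile_signature(page_b))  # profile digests
print(profile_signature(page_a) == profile_signature(page_c))
```

page_a and page_b differ in case, punctuation and a short trailer, so
their full-content digests differ, but their frequent terms are the
same and the profile digests match. page_c has different frequent terms
and gets a different profile digest.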


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
