Re: Implementing near duplicate detection algorithm using IDF statistics

2010-03-31 Thread Chris Hostetter
: Is it possible to access collection statistics - especially IDF values
: for all non-discarded terms in the current document - from within an
: implementation of the Signature class?

The Signature API just lets you compute a unique value from a pile of Strings, but you could extend the Signatu

Re: Implementing near duplicate detection algorithm using IDF statistics

2010-03-24 Thread Ted Dunning
For reference, you can get a rental copy of this article for less than the cost of the full PDF download here: http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd (joining the ACM is also a good thing to do) (and

Implementing near duplicate detection algorithm using IDF statistics

2010-03-24 Thread Thomas Heigl
Hello,

For my current project I need to implement an index-time mechanism to detect (near) duplicate documents. The TextProfileSignature available out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright but does not use global collection statistics in deciding which terms will be
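
For readers skimming this thread, the general idea being asked about (choosing signature terms by their collection-wide IDF rather than by per-document statistics alone) might look roughly like the sketch below. It is a standalone Python illustration, not Solr code and not the Signature API: the function names, the whitespace-style tokenizer, the `log(N / df)` IDF formula, and the `min_idf` / `max_idf` thresholds are all assumptions made for the example.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(documents):
    """Return {term: log(N / df)} for a list of raw document strings."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def idf_signature(text, idf, min_idf, max_idf):
    """Hash the sorted set of terms whose IDF lies in [min_idf, max_idf].

    Terms missing from the collection statistics are discarded, as are
    very common terms (low IDF) and very rare terms (high IDF).
    """
    kept = sorted(
        t for t in set(tokenize(text))
        if t in idf and min_idf <= idf[t] <= max_idf
    )
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

Two documents that differ only in discarded terms (for example a rare inflected form, or stop words) end up with the same signature, which is the behaviour one would want from a near-duplicate detector. The thresholds would need tuning against the real collection, and in a live index the IDF values shift as documents are added, so signatures computed at different times may not be directly comparable.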