: Is it possible to access collection statistics - especially IDF values
: for all non-discarded terms in the current document - from within an
: implementation of the Signature class?
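
To make the question concrete, here is a rough sketch of what an IDF-aware signature could look like. It is only an illustration under stated assumptions, not Solr code: IdfFilteredSignature is a hypothetical standalone class that merely imitates the add()/getSignature() shape of a Signature implementation. The docFreq map, numDocs, and the minIdf cutoff are assumed to be supplied from outside. How to obtain them from within Solr is exactly the open question here. The IDF formula is one common smoothed variant, and MD5 is an arbitrary choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical class: does NOT extend Solr's Signature, it only mimics
// the add()/getSignature() usage pattern for illustration.
class IdfFilteredSignature {
    private final Map<String, Integer> docFreq; // assumed input: term -> document frequency
    private final int numDocs;                  // assumed input: documents in the collection
    private final double minIdf;                // assumed cutoff: terms below it are discarded
    private final TreeSet<String> kept = new TreeSet<>(); // sorted, de-duplicated terms

    IdfFilteredSignature(Map<String, Integer> docFreq, int numDocs, double minIdf) {
        this.docFreq = docFreq;
        this.numDocs = numDocs;
        this.minIdf = minIdf;
    }

    // One common smoothed IDF variant: log(N / (df + 1)).
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return Math.log((double) numDocs / (df + 1));
    }

    // Tokenize on anything that is not a letter or digit and keep
    // only the terms whose IDF reaches the cutoff.
    void add(String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty() && idf(term) >= minIdf) {
                kept.add(term);
            }
        }
    }

    // Hash the retained terms in sorted order, so word order and
    // repetition in the document do not affect the result.
    byte[] getSignature() {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String term : kept) {
                md5.update(term.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator between terms
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With such a class, two documents that differ only in very common (low-IDF) terms get the same signature, while a change in a rare term produces a different one. The cutoff value would need tuning against the actual collection.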
The Signature API just lets you compute a unique value from a pile of
Strings, but you could extend the Signatu
For reference, you can get a rental copy of this article for less than the
cost of the full PDF download here:
http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd
(joining the ACM is also a good thing to do)
(and
Hello,
For my current project I need to implement an index-time mechanism to
detect (near) duplicate documents. The TextProfileSignature available
out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright
but does not use global collection statistics in deciding which terms
will be