On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
> I'm interested in near-dupe removal as mentioned (briefly) here:
>
> http://wiki.apache.org/solr/Deduplication
>
> However the link for TextProfileSignature hasn't been filled in yet.
>
> Does anyone have an example of using TextProfileSignature that demonstrates
> the tunable parameters mentioned in the wiki?

There are some comments in the source code*, but they weren't made class-level. I'm fixing that and committing it now, but here's the comment:

/**
 * <p>This implementation is copied from Apache Nutch. </p>
 * <p>An implementation of a page signature. It calculates an MD5 hash
 * of a plain text "profile" of a page.</p>
 * <p>The algorithm to calculate a page "profile" takes the plain text version of
 * a page and performs the following steps:
 * <ul>
 * <li>remove all characters except letters and digits, and bring all characters
 * to lower case,</li>
 * <li>split the text into tokens (all consecutive non-whitespace characters),</li>
 * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),</li>
 * <li>sort the list of tokens by decreasing frequency,</li>
 * <li>round down the counts of tokens to the nearest multiple of QUANT
 * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code> is 0.01f
 * by default, and <code>maxFreq</code> is the maximum token frequency). If
 * <code>maxFreq</code> is higher than 1, then QUANT is always higher than 2 (which
 * means that tokens with frequency 1 are always discarded).</li>
 * <li>tokens, which frequency after quantization falls below QUANT, are discarded.</li>
 * <li>create a list of tokens and their quantized frequency, separated by spaces,
 * in the order of decreasing frequency.</li>
 * </ul>
 * This list is then submitted to an MD5 hash calculation.
 */

There are two parameters this implementation takes:

    quantRate = params.getFloat("quantRate", 0.01f);
    minTokenLen = params.getInt("minTokenLen", 2);
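
To make the steps in the comment and the effect of the two parameters more concrete, here is a minimal sketch that follows the description above. It is not the Solr source: the class name ProfileSketch and its methods are hypothetical, and several details are assumptions on my part (non-alphanumeric characters act as token separators, QUANT is floored at 2 when maxFreq is above 1, ties are broken alphabetically, and each "token count" pair goes on its own line before hashing).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not the Solr TextProfileSignature source.
class ProfileSketch {
    private final float quantRate;  // plays the role of the quantRate parameter
    private final int minTokenLen;  // plays the role of the minTokenLen parameter

    ProfileSketch(float quantRate, int minTokenLen) {
        this.quantRate = quantRate;
        this.minTokenLen = minTokenLen;
    }

    /** Builds the plain-text profile: one "token count" pair per line. */
    String profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        // Assumption: every non-letter, non-digit character separates tokens.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (tok.isEmpty() || tok.length() <= minTokenLen) {
                continue; // discard tokens equal to or shorter than minTokenLen
            }
            int c = counts.merge(tok, 1, Integer::sum);
            maxFreq = Math.max(maxFreq, c);
        }
        // QUANT = QUANT_RATE * maxFreq; assumed floor of 2 when maxFreq > 1.
        int quant = Math.round(quantRate * maxFreq);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int rounded = (e.getValue() / quant) * quant; // round down to a multiple of QUANT
            if (rounded >= quant) {                       // drop anything below QUANT
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), rounded));
            }
        }
        // Decreasing frequency; the alphabetical tie-break is an assumption for determinism.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(e.getKey()).append(' ').append(e.getValue());
        }
        return sb.toString();
    }

    /** MD5 of the profile, as lowercase hex. */
    String signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(profile(text).getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ProfileSketch s = new ProfileSketch(0.01f, 2);
        // Two texts that differ only in rare words, case and punctuation.
        String a = "the cat sat on the mat, the cat ran";
        String b = "The cat sat on the mat. The cat slept!";
        System.out.println(s.profile(a));
        System.out.println(s.signature(a).equals(s.signature(b)));
    }
}
```

In this sketch a larger quantRate gives a larger QUANT, so more low-frequency tokens drop out of the profile and more documents collapse onto the same signature. A larger minTokenLen ignores more short words before counting starts.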

Hope that helps.

        Erik



* http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
