Thanks Erik, but I'm still a little confused as to exactly where in the Solr
config I set these parameters.

The example on the wiki page uses Lookup3Signature which (presumably) takes
no parameters, so there's no indication in the XML examples of where you
would set them. Unless I'm missing something.

Thanks again,

Andrew.


Erik Hatcher-4 wrote:
> 
> 
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
>> I'm interested in near-dupe removal as mentioned (briefly) here:
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> However the link for TextProfileSignature hasn't been filled in yet.
>>
>> Does anyone have an example of using TextProfileSignature that
>> demonstrates the tunable parameters mentioned in the wiki?
> 
> There are some comments in the source code*, but they weren't made
> class-level. I'm fixing that and committing it now, but here's the
> comment:
> 
> /**
>  * <p>This implementation is copied from Apache Nutch.</p>
>  * <p>An implementation of a page signature. It calculates an MD5 hash
>  * of a plain text "profile" of a page.</p>
>  * <p>The algorithm to calculate a page "profile" takes the plain text
>  * version of a page and performs the following steps:
>  * <ul>
>  * <li>remove all characters except letters and digits, and bring all
>  * characters to lower case,</li>
>  * <li>split the text into tokens (all consecutive non-whitespace
>  * characters),</li>
>  * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2
>  * characters),</li>
>  * <li>sort the list of tokens by decreasing frequency,</li>
>  * <li>round down the counts of tokens to the nearest multiple of QUANT
>  * (<code>QUANT = QUANT_RATE * maxFreq</code>, where
>  * <code>QUANT_RATE</code> is 0.01f by default, and <code>maxFreq</code>
>  * is the maximum token frequency). If <code>maxFreq</code> is higher
>  * than 1, then QUANT is always higher than 2 (which means that tokens
>  * with frequency 1 are always discarded).</li>
>  * <li>tokens, which frequency after quantization falls below QUANT,
>  * are discarded.</li>
>  * <li>create a list of tokens and their quantized frequency, separated
>  * by spaces, in the order of decreasing frequency.</li>
>  * </ul>
>  * This list is then submitted to an MD5 hash calculation.
>  */
> 
> There are two parameters this implementation takes:
> 
>      quantRate = params.getFloat("quantRate", 0.01f);
>      minTokenLen = params.getInt("minTokenLen", 2);
> 
> Hope that helps.
> 
>       Erik
> 
> 
> 
> *
> http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
> 
> 
> 
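
For anyone following the thread, here is a rough Java sketch of the steps
listed in the quoted Javadoc, to show what quantRate and minTokenLen
actually influence. It is not the Solr or Nutch source. The class and
method names (TextProfileSketch, profile, signature) are made up for
illustration, and three details are assumptions on my part: QUANT is
rounded and then floored at 2 whenever maxFreq is above 1, ties between
equal frequencies are broken alphabetically, and the profile is encoded
as UTF-8 before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the profile steps; not the real class. */
public class TextProfileSketch {

    /** Builds the plain-text "profile": tokens with quantized counts. */
    public static String profile(String text, float quantRate, int minTokenLen) {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder token = new StringBuilder();
        int maxFreq = 0;
        // Walk one position past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                token.append(Character.toLowerCase(c));
            } else if (Character.isWhitespace(c) && token.length() > 0) {
                // Keep only tokens strictly longer than minTokenLen.
                if (token.length() > minTokenLen) {
                    int n = counts.merge(token.toString(), 1, Integer::sum);
                    maxFreq = Math.max(maxFreq, n);
                }
                token.setLength(0);
            }
            // Any other character (punctuation etc.) is simply dropped.
        }

        // Assumption: QUANT is rounded, then floored at 2 if maxFreq > 1.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }

        // Round counts down to a multiple of QUANT; drop those below QUANT.
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), q));
            }
        }

        // Decreasing frequency; alphabetical tie-break is an assumption.
        kept.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(' ').append(e.getValue());
        }
        return sb.toString();
    }

    /** Hex-encoded MD5 of the profile. */
    public static String signature(String text, float quantRate, int minTokenLen) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(profile(text, quantRate, minTokenLen)
                            .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With the defaults (0.01f and 2), two texts that differ only in rare words
and punctuation end up with the same profile, and so the same signature.
Raising minTokenLen drops more short tokens, and raising quantRate makes
the frequency buckets coarser, so both make the match looser.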

-- 
View this message in context: 
http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27128173.html
Sent from the Solr - User mailing list archive at Nabble.com.
