Re: Filtering near-duplicates using TextProfileSignature
Here's my config for the updateProcessor. It now uses another signature method, but I've used TextProfileSignature as well and it works - sort of.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Of course, you must define the updateProcessor in your requestHandler; it's commented out in mine at the moment.

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <!--
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
  -->
</requestHandler>

Also, I see you define minTokenLen = 3. Where does that come from? I haven't seen anything on the wiki specifying such a parameter.

On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
>
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from solrconfig.
>
> I have currently something like this, but not sure if I am doing it right:
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">true</bool>
>     <str name="fields">title,author,abstract</str>
>     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>     <str name="minTokenLen">3</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> --
> Thanks in advance,
> -Ali

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Filtering near-duplicates using TextProfileSignature
Well, it got me too! KMail didn't properly order this thread. Can't seem to find Hatcher's reply anywhere. ??!!?

On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLength in the wiki page for
> > deduplication, is this a recent addition that's not documented yet?
>
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
Re: Filtering near-duplicates using TextProfileSignature
Markus Jelsma wrote:
> Well, it got me too! KMail didn't properly order this thread. Can't seem to
> find Hatcher's reply anywhere. ??!!?

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html

--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Thanks guys. I will try this with some test documents, fingers crossed.

And by the way, I got the minTokenLen parameter from one of the thread replies (from Erik).

Cheerz,
Ali
Re: Filtering near-duplicates using TextProfileSignature
Hey Andrew,

Just wondering if you ever managed to run TextProfileSignature based deduplication. I would appreciate it if you could send me the code fragment for it from solrconfig.

I have currently something like this, but not sure if I am doing it right:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,author,abstract</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <str name="minTokenLen">3</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

--
Thanks in advance,
-Ali
Re: Filtering near-duplicates using TextProfileSignature
Andrew Clegg wrote:
> Re. your config, I don't see a minTokenLength in the wiki page for
> deduplication, is this a recent addition that's not documented yet?

Sorry about this -- stupid question -- I should have read back through the thread and refreshed my memory.
Re: Filtering near-duplicates using TextProfileSignature
On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
> I'm interested in near-dupe removal as mentioned (briefly) here:
>
> http://wiki.apache.org/solr/Deduplication
>
> However the link for TextProfileSignature hasn't been filled in yet.
>
> Does anyone have an example of using TextProfileSignature that demonstrates
> the tunable parameters mentioned in the wiki?

There are some comments in the source code*, but they weren't made class-level. I'm fixing that and committing it now, but here's the comment:

/**
 * <p>This implementation is copied from Apache Nutch. </p>
 * <p>An implementation of a page signature. It calculates an MD5 hash
 * of a plain text profile of a page.</p>
 * <p>The algorithm to calculate a page profile takes the plain text version of
 * a page and performs the following steps:
 * <ul>
 * <li>remove all characters except letters and digits, and bring all characters
 * to lower case,</li>
 * <li>split the text into tokens (all consecutive non-whitespace characters),</li>
 * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),</li>
 * <li>sort the list of tokens by decreasing frequency,</li>
 * <li>round down the counts of tokens to the nearest multiple of QUANT
 * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code> is 0.01f
 * by default, and <code>maxFreq</code> is the maximum token frequency). If
 * <code>maxFreq</code> is higher than 1, then QUANT is always higher than 2 (which
 * means that tokens with frequency 1 are always discarded).</li>
 * <li>tokens, which frequency after quantization falls below QUANT, are discarded.</li>
 * <li>create a list of tokens and their quantized frequency, separated by spaces,
 * in the order of decreasing frequency.</li>
 * </ul>
 * This list is then submitted to an MD5 hash calculation.
 */

There are two parameters this implementation takes:

    quantRate = params.getFloat("quantRate", 0.01f);
    minTokenLen = params.getInt("minTokenLen", 2);

Hope that helps.

    Erik

* http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
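
For readers following along, the steps in the comment above can be sketched in plain Java. This is only an illustration written from the comment's description, not the Solr or Nutch source. The class name TextProfileSketch and its methods are hypothetical. It also assumes that any character that is not a letter or digit acts as a token separator, that QUANT is clamped to at least 2 whenever maxFreq is above 1, that profile entries are joined with newlines, and that ties in frequency are broken alphabetically so the output is deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class; not part of Solr or Nutch.
class TextProfileSketch {

    /** Builds the quantized token profile; one "token count" entry per line. */
    static String profile(String text, float quantRate, int minTokenLen) {
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        // Keep letters and digits (lower-cased); anything else ends the current token.
        // The loop runs one step past the end so the last token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                // Discard tokens whose length is equal to or shorter than minTokenLen.
                if (cur.length() > minTokenLen) {
                    int n = counts.merge(cur.toString(), 1, Integer::sum);
                    maxFreq = Math.max(maxFreq, n);
                }
                cur.setLength(0);
            }
        }
        // QUANT = QUANT_RATE * maxFreq, clamped (assumption) to 2, or to 1 if maxFreq <= 1.
        int quant = Math.round(quantRate * maxFreq);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }
        // Round counts down to a multiple of QUANT and drop those that fall below it.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }
        // Decreasing frequency; the alphabetical tie-break is an addition for determinism.
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byCount = Integer.compare(quantized.get(b), quantized.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (out.length() > 0) out.append('\n');
            out.append(t).append(' ').append(quantized.get(t));
        }
        return out.toString();
    }

    /** MD5 of the profile text, as lower-case hex. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "Solr dedupe test. Solr dedupe test! The quick fox.";
        String b = "SOLR dedupe test; solr Dedupe test. A lazy dog.";
        System.out.println(profile(a, 0.01f, 2));
        // Both texts share a profile, so their signatures match: prints true
        System.out.println(md5Hex(profile(a, 0.01f, 2)).equals(md5Hex(profile(b, 0.01f, 2))));
    }
}
```

Both sample strings reduce to the same three-token profile because every token that occurs only once is dropped by the quantization step, which is what makes them near-duplicates under this scheme. Raising minTokenLen or quantRate makes the profile coarser, so more documents collapse to the same signature.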
Re: Filtering near-duplicates using TextProfileSignature
Thanks Erik, but I'm still a little confused as to exactly where in the Solr config I set these parameters.

The example on the wiki page uses Lookup3Signature which (presumably) takes no parameters, so there's no indication in the XML examples of where you would set them. Unless I'm missing something.

Thanks again,

Andrew.

Erik Hatcher-4 wrote:
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
> > I'm interested in near-dupe removal as mentioned (briefly) here:
> >
> > http://wiki.apache.org/solr/Deduplication
> >
> > However the link for TextProfileSignature hasn't been filled in yet.
> >
> > Does anyone have an example of using TextProfileSignature that
> > demonstrates the tunable parameters mentioned in the wiki?
>
> There are some comments in the source code*, but they weren't made
> class-level. I'm fixing that and committing it now, but here's the
> comment:
>
> /**
>  * <p>This implementation is copied from Apache Nutch. </p>
>  * <p>An implementation of a page signature. It calculates an MD5 hash
>  * of a plain text profile of a page.</p>
>  * <p>The algorithm to calculate a page profile takes the plain text version of
>  * a page and performs the following steps:
>  * <ul>
>  * <li>remove all characters except letters and digits, and bring all characters
>  * to lower case,</li>
>  * <li>split the text into tokens (all consecutive non-whitespace characters),</li>
>  * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),</li>
>  * <li>sort the list of tokens by decreasing frequency,</li>
>  * <li>round down the counts of tokens to the nearest multiple of QUANT
>  * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code> is 0.01f
>  * by default, and <code>maxFreq</code> is the maximum token frequency). If
>  * <code>maxFreq</code> is higher than 1, then QUANT is always higher than 2 (which
>  * means that tokens with frequency 1 are always discarded).</li>
>  * <li>tokens, which frequency after quantization falls below QUANT, are discarded.</li>
>  * <li>create a list of tokens and their quantized frequency, separated by spaces,
>  * in the order of decreasing frequency.</li>
>  * </ul>
>  * This list is then submitted to an MD5 hash calculation.
>  */
>
> There are two parameters this implementation takes:
>
>     quantRate = params.getFloat("quantRate", 0.01f);
>     minTokenLen = params.getInt("minTokenLen", 2);
>
> Hope that helps.
>
>     Erik
>
> * http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
Re: Filtering near-duplicates using TextProfileSignature
On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
> Thanks Erik, but I'm still a little confused as to exactly where in the Solr
> config I set these parameters.

You'd configure them within the <processor> element, something like this:

    <str name="minTokenLen">5</str>

> The example on the wiki page uses Lookup3Signature which (presumably) takes
> no parameters, so there's no indication in the XML examples of where you
> would set them.

Right, looking at the source code, Lookup3Signature takes no parameters.

Perhaps you could update the wiki with an example once you get it working? I'm flying a little blind here, just going off the source code, not trying it out for real.

    Erik
Re: Filtering near-duplicates using TextProfileSignature
Erik Hatcher-4 wrote:
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
> > Thanks Erik, but I'm still a little confused as to exactly where in the
> > Solr config I set these parameters.
>
> You'd configure them within the <processor> element, something like this:
>
>     <str name="minTokenLen">5</str>

OK, thanks. (Should that really be str though, and not int or something?)

Erik Hatcher-4 wrote:
> Perhaps you could update the wiki with an example once you get it working?
> I'm flying a little blind here, just going off the source code, not trying
> it out for real.

Sure -- it won't be til next week at the earliest though.

Cheers,

Andrew.