Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Here's my config for the updateProcessor. It now uses another signature method,
but I've used TextProfileSignature as well and it works - sort of.


  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


Of course, you must reference the updateProcessor chain in your requestHandler;
it's commented out in mine at the moment.


  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <!--
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
    -->
  </requestHandler>


Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.


On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
>
> Just wondering if you ever managed to run TextProfileSignature-based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from solrconfig.
>
> I currently have something like this, but I'm not sure if I am doing it right:
>
>   <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <str name="signatureField">signature</str>
>       <bool name="overwriteDupes">true</bool>
>       <str name="fields">title,author,abstract</str>
>       <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>       <str name="minTokenLen">3</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
>
> --
>
> Thanks in advance,
> -Ali

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Well, it got me too! KMail didn't properly order this thread. Can't seem to 
find Hatcher's reply anywhere. ??!!?


On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLength in the wiki page for
> > deduplication, is this a recent addition that's not documented yet?
>
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
 




Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Andrew Clegg


Markus Jelsma wrote:
>
> Well, it got me too! KMail didn't properly order this thread. Can't seem to
> find Hatcher's reply anywhere. ??!!?
>

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Neeb

Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).

Cheerz,
Ali




Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Neeb

Hey Andrew,

Just wondering if you ever managed to run TextProfileSignature-based
deduplication. I would appreciate it if you could send me the code fragment
for it from solrconfig.

I currently have something like this, but I'm not sure if I am doing it right:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,author,abstract</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="minTokenLen">3</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

--

Thanks in advance,
-Ali


Re: Filtering near-duplicates using TextProfileSignature

2010-06-08 Thread Andrew Clegg


Andrew Clegg wrote:
>
> Re. your config, I don't see a minTokenLength in the wiki page for
> deduplication, is this a recent addition that's not documented yet?
>

Sorry about this -- stupid question -- I should have read back through the
thread and refreshed my memory.


Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher


On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
> I'm interested in near-dupe removal as mentioned (briefly) here:
>
> http://wiki.apache.org/solr/Deduplication
>
> However the link for TextProfileSignature hasn't been filled in yet.
>
> Does anyone have an example of using TextProfileSignature that demonstrates
> the tunable parameters mentioned in the wiki?


There are some comments in the source code*, but they weren't made
class-level. I'm fixing that and committing it now, but here's the comment:


/**
 * <p>This implementation is copied from Apache Nutch. </p>
 * <p>An implementation of a page signature. It calculates an MD5 hash
 * of a plain text profile of a page.</p>
 * <p>The algorithm to calculate a page profile takes the plain text version of
 * a page and performs the following steps:
 * <ul>
 * <li>remove all characters except letters and digits, and bring all characters
 * to lower case,</li>
 * <li>split the text into tokens (all consecutive non-whitespace characters),</li>
 * <li>discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),</li>
 * <li>sort the list of tokens by decreasing frequency,</li>
 * <li>round down the counts of tokens to the nearest multiple of QUANT
 * (<code>QUANT = QUANT_RATE * maxFreq</code>, where <code>QUANT_RATE</code> is 0.01f
 * by default, and <code>maxFreq</code> is the maximum token frequency). If
 * <code>maxFreq</code> is higher than 1, then QUANT is always higher than 2 (which
 * means that tokens with frequency 1 are always discarded).</li>
 * <li>tokens, which frequency after quantization falls below QUANT, are discarded.</li>
 * <li>create a list of tokens and their quantized frequency, separated by spaces,
 * in the order of decreasing frequency.</li>
 * </ul>
 * This list is then submitted to an MD5 hash calculation.*/

There are two parameters this implementation takes:

    quantRate = params.getFloat("quantRate", 0.01f);
    minTokenLen = params.getInt("minTokenLen", 2);
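
To make the steps in that comment concrete, here is a rough, self-contained sketch of the profile-then-hash idea. It is written from the comment alone and is not the Solr or Nutch source: the class and method names (TextProfileSketch, profile, signature) are made up for illustration. It also assumes that any non-letter/digit character ends a token, that QUANT is floored at 2 once any token repeats, that ties are broken alphabetically, and that each token and its count go on their own line.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the steps in the comment above; NOT the Solr/Nutch source.
final class TextProfileSketch {

    // Returns one "token count" line per surviving token, most frequent first.
    static String profile(String text, float quantRate, int minTokenLen) {
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // A trailing space acts as a sentinel so the last token is flushed.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
                continue;
            }
            // Assumption: any other character ends the current token.
            if (cur.length() > minTokenLen) { // drop tokens of length <= minTokenLen
                maxFreq = Math.max(maxFreq, counts.merge(cur.toString(), 1, Integer::sum));
            }
            cur.setLength(0);
        }

        // QUANT = QUANT_RATE * maxFreq; assumed floor of 2 once any token repeats.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round counts down to a multiple of QUANT and drop those below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Decreasing frequency; the alphabetical tie-break is this sketch's own choice.
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((x, y) -> {
            int byFreq = Integer.compare(quantized.get(y), quantized.get(x));
            return byFreq != 0 ? byFreq : x.compareTo(y);
        });

        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (out.length() > 0) out.append('\n');
            out.append(t).append(' ').append(quantized.get(t));
        }
        return out.toString();
    }

    // MD5 of the profile, as a lower-case hex string.
    static String signature(String text, float quantRate, int minTokenLen) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    profile(text, quantRate, minTokenLen).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "the cat and the dog and the bird";
        String b = "The cat and the dog and the mouse!";
        System.out.println(profile(a, 0.01f, 2)); // prints "and 2" then "the 2"
        System.out.println(signature(a, 0.01f, 2).equals(signature(b, 0.01f, 2))); // true
    }
}
```

In this toy run the two sentences differ only in words that occur once, which the quantization step discards, so both collapse to the same profile and therefore the same hash. That is the sense in which the signature is "near-duplicate" rather than exact.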

Hope that helps.

Erik



* http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java



Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg


Thanks Erik, but I'm still a little confused as to exactly where in the Solr
config I set these parameters.

The example on the wiki page uses Lookup3Signature which (presumably) takes
no parameters, so there's no indication in the XML examples of where you
would set them. Unless I'm missing something.

Thanks again,

Andrew.






Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Erik Hatcher


On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
> Thanks Erik, but I'm still a little confused as to exactly where in the Solr
> config I set these parameters.

You'd configure them within the processor element, something like this:

   <str name="minTokenLen">5</str>
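
In context, that would presumably look something like the sketch below: the dedupe chain from the wiki with TextProfileSignature swapped in and both parameters from the source code added. This is untested. The signature field name, the field list and the parameter values are placeholders, and using <str> for the numeric parameters (rather than a typed element) is an assumption.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- placeholder field names; use whatever your schema defines -->
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,author,abstract</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <!-- TextProfileSignature parameters; per the source, defaults are 0.01 and 2 -->
    <str name="quantRate">0.01</str>
    <str name="minTokenLen">5</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```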


> The example on the wiki page uses Lookup3Signature which (presumably) takes
> no parameters, so there's no indication in the XML examples of where you
> would set them.

Right, looking at the source code, Lookup3Signature takes no parameters.

Perhaps you could update the wiki with an example once you get it working?

I'm flying a little blind here, just going off the source code, not
trying it out for real.


Erik



Re: Filtering near-duplicates using TextProfileSignature

2010-01-12 Thread Andrew Clegg


Erik Hatcher-4 wrote:
>
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
> > Thanks Erik, but I'm still a little confused as to exactly where in the Solr
> > config I set these parameters.
>
> You'd configure them within the processor element, something like this:
>
>    <str name="minTokenLen">5</str>
>
 

OK, thanks. (Should that really be str though, and not int or something?)


Erik Hatcher-4 wrote:
>
> Perhaps you could update the wiki with an example once you get it working?
>
> I'm flying a little blind here, just going off the source code, not
> trying it out for real.
>
 

Sure -- it won't be until next week at the earliest though.

Cheers,

Andrew.
