A "fuzzy signature" system will not work here. You are right, you want to try MLT instead.
Lance

On Wed, Apr 6, 2011 at 9:47 AM, Frederico Azeiteiro
<frederico.azeite...@cision.com> wrote:
> Yes, I had already checked the code for it and used it to build a C# method
> that returns the same signature.
>
> But I have a strange issue:
> For instance, using minTokenLen=2 and the default QUANT_RATE, passing the
> text "frederico" (simple text, no big deal here):
>
> 1. My C# app returns "8b92e01d67591dfc60adf9576f76a055".
> 2. Using SOLR, passing a doc with HeadLine "frederico", I get
> "8d9a5c35812ba75b8383d4538b91080f" on my signature field.
> 3. I created a Java app (I'm not a Java expert...), using the code from the SOLR
> SignatureUpdateProcessorFactory class (please check the code below), and I get
> "8b92e01d67591dfc60adf9576f76a055".
>
> Java app code:
> TextProfileSignature textProfileSignature = new TextProfileSignature();
> NamedList<String> params = new NamedList<String>();
> params.add("", "");
> SolrParams solrParams = SolrParams.toSolrParams(params);
> textProfileSignature.init(solrParams);
> textProfileSignature.add("frederico");
>
> byte[] signature = textProfileSignature.getSignature();
> char[] arr = new char[signature.length << 1];
> for (int i = 0; i < signature.length; i++) {
>   int b = signature[i];
>   int idx = i << 1;
>   arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
>   arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
> }
> String sigString = new String(arr);
> System.out.println(sigString);
>
> Here are my processor configs:
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">sig</str>
>     <bool name="overwriteDupes">false</bool>
>     <str name="fields">HeadLine</str>
>     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>     <str name="minTokenLen">2</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> So both my apps (Java and C#) return the same signature but SOLR returns a
> different one.
> Can anyone understand what I am doing wrong?
>
> Thank you once again.
>
> Frederico
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, April 5, 2011 15:20
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
>
> If you check the code for TextProfileSignature [1] you'll notice the init
> method reading params. You can set those params as you did. Reading the Javadoc
> [2] might help as well. But what's not documented in the Javadoc is how QUANT
> is computed; it rounds.
>
> [1]:
> http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
> [2]:
> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>
> On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
>> Thank you, I'll try to create a C# method that produces the same sig as SOLR,
>> and then compare both sigs before indexing the doc. This way I can avoid
>> indexing existing docs.
>>
>> If anyone needs to use this parameter (as this info is not on the wiki),
>> you can add the option
>>
>> <str name="minTokenLen">5</str>
>>
>> on the processor tag.
>>
>> Best regards,
>> Frederico
>>
>>
>> -----Original Message-----
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: Tuesday, April 5, 2011 12:01
>> To: solr-user@lucene.apache.org
>> Cc: Frederico Azeiteiro
>> Subject: Re: Using MLT feature
>>
>> On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
>> > Sorry, the reply I made yesterday was directed to Markus and not the
>> > list...
>> >
>> > Here are my thoughts on this. At this point I'm a little confused about
>> > whether SOLR is a good option to find near-duplicate docs.
>> >
>> > >> Yes there is, try setting overwriteDupes to true and documents yielding
>> > >> the same signature will be overwritten
>> >
>> > The problem is that I don't want to overwrite the doc, I need to
>> > maintain the original version (because the doc has other fields I need
>> > to maintain).
>> >
>> > >> If you need both fuzzy and exact matching then add a second
>> > >> update processor inside the chain and create another signature field.
>> >
>> > I just need the fuzzy search, but the quick tests I made return
>> > different signatures for what I consider duplicate docs.
>> > "Army deploys as clan war kills 11 in Philippine south"
>> > "Army deploys as clan war kills 11 in Philippine south."
>> >
>> > Same sig for the above 2 strings, that's ok.
>> >
>> > But a different sig was created for:
>> > "Army deploys as clan war kills 11 in Philippine south the."
>> >
>> > Is there a way to set up the TextProfileSignature parameters to adjust
>> > the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
>> >
>> > Do you think that these parameters can help create the same sig for
>> > the above example?
>>
>> You can only fix this by increasing minTokenLen to 4 to prevent `the` from
>> being added to the list of tokens but this may affect other signatures.
>> Possibly more documents will then get the same signature. Messing around
>> with quantRate won't do much good because all your tokens have the same
>> frequency (1) so quant will always be 1 in this short text. That's why
>> TextProfileSignature works less well for short texts.
>>
>> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
>>
>> > Is anyone using the TextProfileSignature with success?
>> >
>> > Thank you,
>> > Frederico
>> >
>> >
>> > -----Original Message-----
>> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> > Sent: Monday, April 4, 2011 16:47
>> > To: solr-user@lucene.apache.org
>> > Cc: Frederico Azeiteiro
>> > Subject: Re: Using MLT feature
>> >
>> > > Hi again,
>> > > I guess I was wrong in my earlier post... There's no automated way to
>> > > avoid indexing the duplicate doc.
>> >
>> > Yes there is, try setting overwriteDupes to true and documents yielding the
>> > same signature will be overwritten. If you need both fuzzy and exact
>> > matching then add a second update processor inside the chain and create
>> > another signature field.
>> >
>> > > I guess I have 2 options:
>> > >
>> > > 1. Create a temp index with signatures and then have an app that for each
>> > > new doc verifies if the sig exists on my primary index. If not, add the
>> > > article.
>> > >
>> > > 2. Before adding the doc, create a signature (using the same algorithm that
>> > > SOLR uses) in my indexing app and then verify if the signature exists
>> > > before adding.
>> > >
>> > > Am I thinking the right way here? :)
>> > >
>> > > Thank you,
>> > > Frederico
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
>> > > Sent: Monday, April 4, 2011 11:59
>> > > To: solr-user@lucene.apache.org
>> > > Subject: RE: Using MLT feature
>> > >
>> > > Thank you Markus, it looks great.
>> > >
>> > > But the wiki is not very detailed on this.
>> > > Do you mean if I:
>> > >
>> > > 1. Create:
>> > > <updateRequestProcessorChain name="dedupe">
>> > >   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>> > >     <bool name="enabled">true</bool>
>> > >     <bool name="overwriteDupes">false</bool>
>> > >     <str name="signatureField">signature</str>
>> > >     <str name="fields">headline,body,medianame</str>
>> > >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>> > >   </processor>
>> > >   <processor class="solr.LogUpdateProcessorFactory" />
>> > >   <processor class="solr.RunUpdateProcessorFactory" />
>> > > </updateRequestProcessorChain>
>> > >
>> > > 2. Add the request as the default update request
>> > > 3. Add a "signature" indexed field to my schema.
>> > >
>> > > Then,
>> > > when adding a new doc to my index, it is only added if not considered a
>> > > duplicate using a Lookup3Signature on the fields defined? All duplicates
>> > > are ignored and not added to my index?
>> > > Is it as simple as that?
>> > >
>> > > Does it work even if the medianame should be an exact match (not a
>> > > similar match as the headline and bodytext are)?
>> > >
>> > > Thank you for your help,
>> > >
>> > > ____________________________________________
>> > > Frederico Azeiteiro
>> > > Developer
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> > > Sent: Monday, April 4, 2011 10:48
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Using MLT feature
>> > >
>> > > http://wiki.apache.org/solr/Deduplication
>> > >
>> > > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
>> > > > Hi,
>> > > >
>> > > > The idea is to not index if something similar (headline+bodytext)
>> > > > exists for the same exact medianame.
>> > > >
>> > > > Do you mean I would need to index the doc first (maybe in a temp index)
>> > > > and then use the MLT feature to find similar docs before adding to the
>> > > > final index?
>> > > >
>> > > > Thanks,
>> > > > Frederico
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
>> > > > Sent: Monday, April 4, 2011 10:22
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: Re: Using MLT feature
>> > > >
>> > > > Do you want to not index if something is similar? Or not index if it is
>> > > > exact? I would look into a hash code of the document if you don't want
>> > > > to index exact duplicates. Similar, though, I think has to be based off
>> > > > a document in the index.
>> > > >
>> > > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
>> > > > <frederico.azeite...@cision.com> wrote:
>> > > > > Hi,
>> > > > >
>> > > > > I would like to hear your opinion about the MLT feature and if it's a
>> > > > > good solution to what I need to implement.
>> > > > >
>> > > > > My index has fields like: headline, body and medianame.
>> > > > >
>> > > > > What I need to do is, before adding a new doc, verify if a similar doc
>> > > > > exists for this media.
>> > > > >
>> > > > > My idea is to use the MoreLikeThisHandler
>> > > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
>> > > > > For each new doc, perform a MLT search with q=medianame and
>> > > > > stream.body=headline+bodytext.
>> > > > >
>> > > > > If no similar docs are found then I can safely add the doc.
>> > > > >
>> > > > > Is this feasible using the MLT handler? Is it a good approach? Is
>> > > > > there a better way to perform this comparison?
>> > > > >
>> > > > > Thank you for your help.
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > ____________________________________________
>> > > > >
>> > > > > Frederico Azeiteiro
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

--
Lance Norskog
goks...@gmail.com
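
For anyone following the minTokenLen / quant discussion above, here is a minimal stand-alone sketch of how a text-profile signature of this kind can behave on the "Army deploys..." examples. It is not Solr's TextProfileSignature class, only a simplified approximation written for illustration. The class name ProfileSketch, the strict "longer than minTokenLen" comparison, the profile text format and the 0.01 quant rate are all assumptions, so the hashes it prints should not be expected to match the ones Solr produces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical approximation of a text-profile signature; NOT Solr's class.
final class ProfileSketch {

    static String signature(String text, int minTokenLen, float quantRate) {
        // 1. Count lower-cased alphanumeric tokens that are longer than
        //    minTokenLen (assumption: strict '>' comparison).
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last token.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else {
                if (cur.length() > minTokenLen) {
                    int n = counts.merge(cur.toString(), 1, Integer::sum);
                    maxFreq = Math.max(maxFreq, n);
                }
                cur.setLength(0);
            }
        }
        // 2. Quantization step: rounded, and never below 1 (2 if a token repeats).
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }
        // 3. Round each count down to a multiple of quant; drop tokens below quant.
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                kept.put(e.getKey(), q);
            }
        }
        // 4. Sort by descending count, then by token, so the order is deterministic.
        List<String> tokens = new ArrayList<>(kept.keySet());
        tokens.sort((x, y) -> {
            int byCount = Integer.compare(kept.get(y), kept.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        StringBuilder profile = new StringBuilder();
        for (String t : tokens) {
            profile.append(t).append(' ').append(kept.get(t)).append('\n');
        }
        return md5Hex(profile.toString());
    }

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every Java platform
        }
    }

    public static void main(String[] args) {
        String a = "Army deploys as clan war kills 11 in Philippine south";
        String b = a + ".";
        String c = a + " the.";
        float rate = 0.01f; // assumed quant rate, for illustration only
        // Trailing punctuation is not part of any token.
        System.out.println(signature(a, 2, rate).equals(signature(b, 2, rate)));
        // With minTokenLen=2 the extra token "the" enters the profile.
        System.out.println(signature(a, 2, rate).equals(signature(c, 2, rate)));
        // With minTokenLen=4 "the" is filtered out again.
        System.out.println(signature(a, 4, rate).equals(signature(c, 4, rate)));
    }
}
```

Under these assumptions the three checks print true, false and true. Every token in such a short headline occurs once, so quant stays at 1 whatever the rate is, and only the token-length filter decides whether the extra "the" changes the hash. Raising minTokenLen also drops other short tokens, which is why more documents may end up sharing a signature.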