Re: Using MLT feature

Lance Norskog Wed, 06 Apr 2011 20:09:57 -0700

A "fuzzy signature" system will not work here. You are right, you want
to try MLT instead.


Lance

On Wed, Apr 6, 2011 at 9:47 AM, Frederico Azeiteiro
<frederico.azeite...@cision.com> wrote:
> Yes, I had already checked the code and used it to write a C# method
> that returns the same signature.
>
> But I have a strange issue:
> For instance, using minTokenLen=2 and the default QUANT_RATE, passing the
> text "frederico" (simple text, no big deal here):
>
> 1. Using my C# app returns "8b92e01d67591dfc60adf9576f76a055".
> 2. Using SOLR, passing a doc with HeadLine "frederico", I get
> "8d9a5c35812ba75b8383d4538b91080f" in my signature field.
> 3. I created a Java app (I'm not a Java expert), using the code from the SOLR
> SignatureUpdateProcessorFactory class (please check the code below), and I get
> "8b92e01d67591dfc60adf9576f76a055".
>
> Java app code:
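>                import org.apache.solr.common.params.SolrParams;
>                import org.apache.solr.common.util.NamedList;
>                import org.apache.solr.common.util.StrUtils;
>                import org.apache.solr.update.processor.TextProfileSignature;
>
>                // No signature parameters are set here (only a dummy entry),
>                // so TextProfileSignature falls back to its defaults.
>                TextProfileSignature textProfileSignature = new TextProfileSignature();
>                NamedList<String> params = new NamedList<String>();
>                params.add("", "");
>                SolrParams solrParams = SolrParams.toSolrParams(params);
>                textProfileSignature.init(solrParams);
>                textProfileSignature.add("frederico");
>
>                // Hex-encode the signature bytes, two characters per byte.
>                byte[] signature = textProfileSignature.getSignature();
>                char[] arr = new char[signature.length << 1];
>                for (int i = 0; i < signature.length; i++) {
>                        int b = signature[i];
>                        int idx = i << 1;
>                        arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
>                        arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
>                }
>                String sigString = new String(arr);
>                System.out.println(sigString);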
>
>
>
>
> Here's my processor configs:
>
> <updateRequestProcessorChain name="dedupe">
>     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>       <bool name="enabled">true</bool>
>       <str name="signatureField">sig</str>
>       <bool name="overwriteDupes">false</bool>
>       <str name="fields">HeadLine</str>
>       <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>       <str name="minTokenLen">2</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
>
> So both my apps (Java and C#) return the same signature, but SOLR returns a
> different one.
> Can anyone see what I am doing wrong?
>
> Thank you once again.
>
> Frederico
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, April 5, 2011 15:20
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
>
> If you check the code for TextProfileSignature [1] you'll notice the init
> method reading params. You can set those params as you did. Reading the Javadoc
> [2] might help as well. But what's not documented in the Javadoc is how QUANT
> is computed; it rounds.
>
> [1]:
> http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
> [2]:
> http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
>
> On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
>> Thank you, I'll try to create a C# method that produces the same sig as SOLR,
>> and then compare both sigs before indexing the doc. This way I can avoid
>> indexing existing docs.
>>
>> If anyone needs to use this parameter (as this info is not on the wiki),
>> you can add the option
>>
>> <str name="minTokenLen">5</str>
>>
>> to the processor tag.
>>
>> Best regards,
>> Frederico
>>
>>
>> -----Original Message-----
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: Tuesday, April 5, 2011 12:01
>> To: solr-user@lucene.apache.org
>> Cc: Frederico Azeiteiro
>> Subject: Re: Using MLT feature
>>
>> On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
>> > Sorry, the reply I made yesterday was directed to Markus and not the
>> > list...
>> >
>> > Here are my thoughts on this. At this point I'm a little unsure whether SOLR
>> > is a good option for finding near-duplicate docs.
>> >
>> > >> Yes there is: try setting overwriteDupes to true, and documents yielding
>> > >> the same signature will be overwritten
>> >
>> > The problem is that I don't want to overwrite the doc; I need to
>> > keep the original version (because the doc has other fields I need
>> > to maintain).
>> >
>> > >> If you need both fuzzy and exact matching, then add a second
>> > >> update processor inside the chain and create another signature field.
>> >
>> > I just need the fuzzy search, but the quick tests I made return
>> > different signatures for what I consider duplicate docs.
>> > "Army deploys as clan war kills 11 in Philippine south"
>> > "Army deploys as clan war kills 11 in Philippine south."
>> >
>> > Same sig for the above 2 strings, that's ok.
>> >
>> > But a different sig was created for:
>> > "Army deploys as clan war kills 11 in Philippine south the."
>> >
>> > Is there a way to set the TextProfileSignature parameters to adjust
>> > the "sensitivity" in SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
>> >
>> > Do you think that these parameters can help create the same sig for
>> > the above example?
>>
>> You can only fix this by increasing minTokenLen to 4 to prevent `the` from
>> being added to the list of tokens but this may affect other signatures.
>> Possibly more documents will then get the same signature. Messing around
>> with quantRate won't do much good because all your tokens have the same
>> frequency (1) so quant will always be 1 in this short text. That's why
>> TextProfileSignature works less well for short texts.
>>
>> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
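
To illustrate the minTokenLen point above, here is a minimal sketch in Java.
It is not the actual TextProfileSignature code: it assumes tokens are lowercase
runs of letters and digits, and it leaves out the frequency quantization and
the MD5 step entirely. The class name MinTokenLenSketch and the tokens() helper
are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class MinTokenLenSketch {

    // Hypothetical helper: split text into lowercase letter/digit tokens
    // and keep only tokens that are at least minTokenLen characters long.
    static List<String> tokens(String text, int minTokenLen) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        // Iterate one position past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else {
                if (cur.length() >= minTokenLen) {
                    out.add(cur.toString());
                }
                cur.setLength(0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "Army deploys as clan war kills 11 in Philippine south.";
        String b = "Army deploys as clan war kills 11 in Philippine south the.";
        // Compare the surviving token lists for two minTokenLen settings.
        System.out.println(tokens(a, 2).equals(tokens(b, 2)));
        System.out.println(tokens(a, 4).equals(tokens(b, 4)));
    }
}
```

With minTokenLen=2 the second headline keeps the extra token "the", so the
token lists differ. With minTokenLen=4 both headlines reduce to the same six
tokens, but "as", "war", "11" and "in" are dropped too, which is the side
effect on other signatures mentioned above.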
>>
>> > Is anyone using the TextProfileSignature with success?
>> >
>> > Thank you,
>> > Frederico
>> >
>> >
>> > -----Original Message-----
>> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> > Sent: Monday, April 4, 2011 16:47
>> > To: solr-user@lucene.apache.org
>> > Cc: Frederico Azeiteiro
>> > Subject: Re: Using MLT feature
>> >
>> > > Hi again,
>> > > I guess I was wrong in my earlier post... There's no automated way to
>> > > avoid indexing the duplicate doc.
>> >
>> > Yes there is: try setting overwriteDupes to true, and documents yielding the
>> > same signature will be overwritten. If you need both fuzzy and exact
>> > matching, then add a second update processor inside the chain and create
>> > another signature field.
>> >
>> > > I guess I have 2 options:
>> > >
>> > > 1. Create a temp index with signatures and then have an app that, for
>> > > each new doc, verifies whether the sig exists in my primary index. If not,
>> > > add the article.
>> > >
>> > > 2. Before adding the doc, create a signature (using the same algorithm
>> > > that SOLR uses) in my indexing app and then verify whether the signature
>> > > exists before adding.
>> > >
>> > > Am I thinking the right way here? :)
>> > >
>> > > Thank you,
>> > > Frederico
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
>> > > Sent: Monday, April 4, 2011 11:59
>> > > To: solr-user@lucene.apache.org
>> > > Subject: RE: Using MLT feature
>> > >
>> > > Thank you Markus it looks great.
>> > >
>> > > But the wiki is not very detailed on this.
>> > > Do you mean if I:
>> > >
>> > > 1. Create:
>> > > <updateRequestProcessorChain name="dedupe">
>> > >     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>> > >       <bool name="enabled">true</bool>
>> > >       <bool name="overwriteDupes">false</bool>
>> > >       <str name="signatureField">signature</str>
>> > >       <str name="fields">headline,body,medianame</str>
>> > >       <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>> > >     </processor>
>> > >     <processor class="solr.LogUpdateProcessorFactory" />
>> > >     <processor class="solr.RunUpdateProcessorFactory" />
>> > > </updateRequestProcessorChain>
>> > >
>> > > 2. Add the request as the default update request
>> > > 3. Add a "signature" indexed field to my schema.
>> > >
>> > > Then,
>> > > when adding a new doc to my index, it is only added if it is not considered
>> > > a duplicate using a Lookup3Signature on the fields defined? All duplicates
>> > > are ignored and not added to my index?
>> > > Is it as simple as that?
>> > >
>> > > Does it work even if the medianame should be an exact match (not a similar
>> > > match, as the headline and bodytext are)?
>> > >
>> > > Thank you for your help,
>> > >
>> > > ____________________________________________
>> > > Frederico Azeiteiro
>> > > Developer
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> > > Sent: Monday, April 4, 2011 10:48
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Using MLT feature
>> > >
>> > > http://wiki.apache.org/solr/Deduplication
>> > >
>> > > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
>> > > > Hi,
>> > > >
>> > > > The idea is to not index a doc if something similar (headline+bodytext)
>> > > > exists for the exact same medianame.
>> > > >
>> > > > Do you mean I would need to index the doc first (maybe in a temp index)
>> > > > and then use the MLT feature to find similar docs before adding it to
>> > > > the final index?
>> > > >
>> > > > Thanks,
>> > > > Frederico
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
>> > > > Sent: Monday, April 4, 2011 10:22
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: Re: Using MLT feature
>> > > >
>> > > > Do you want to not index if something similar exists? Or not index if
>> > > > exact? I would look into a hash code of the document if you don't want
>> > > > to index exact duplicates. Similar, though, I think has to be based on
>> > > > a document in the index.
>> > > >
>> > > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
>> > > > <frederico.azeite...@cision.com> wrote:
>> > > > > Hi,
>> > > > >
>> > > > >
>> > > > >
>> > > > > I would like to hear your opinion about the MLT feature and whether
>> > > > > it's a good solution to what I need to implement.
>> > > > >
>> > > > > My index has fields like: headline, body and medianame.
>> > > > >
>> > > > > What I need to do is, before adding a new doc, verify whether a similar
>> > > > > doc exists for this media.
>> > > > >
>> > > > > My idea is to use the MoreLikeThisHandler
>> > > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
>> > > > >
>> > > > > For each new doc, perform a MLT search with q=medianame and
>> > > > > stream.body=headline+bodytext.
>> > > > >
>> > > > > If no similar docs are found then I can safely add the doc.
>> > > > >
>> > > > > Is this feasible using the MLT handler? Is it a good approach? Is
>> > > > > there a better way to perform this comparison?
>> > > > >
>> > > > >
>> > > > >
>> > > > > Thank you for your help.
>> > > > >
>> > > > >
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > ____________________________________________
>> > > > >
>> > > > > Frederico Azeiteiro
>> > >
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Lance Norskog
goks...@gmail.com
