Re: Minimum Match with filters that add tokens

Jack Krupansky Sat, 23 Aug 2014 08:12:05 -0700

Use a percentage rather than an absolute token number, like 50% or 25% or maybe 33%. You can also specify different percentages based on different ranges of term counts.

Be aware that although it is tempting to think of MM from the user perspective of how many terms are written in the original query, the implementation (BooleanQuery) uses the terms generated by the analysis process, which can break up source terms into multiple terms and generate extra terms as well. Any MM number or percentage will count the terms output by analysis, not the source terms.
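
To make the arithmetic concrete, here is a rough Python sketch of how an mm spec could be resolved against the number of clauses produced by analysis. It is modeled on the documented mm semantics (plain integers, percentages rounded down, negative values meaning "this many may be missing", and conditional `n<spec` ranges); it is not Solr's code, and the function name `min_should_match` is made up for illustration.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Resolve an mm-style spec against the number of optional clauses.

    clause_count is the number of terms AFTER analysis, not the number
    of words the user typed. Illustrative model only, not Solr's code.
    """
    spec = spec.strip()
    result = clause_count

    if "<" in spec:
        # Conditional form, e.g. "2<-25% 9<-3": each "n<sub" part applies
        # its sub-spec only when there are more than n clauses.
        for part in spec.split():
            bound, sub = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, sub)
        return result

    if spec.endswith("%"):
        # Percentages are truncated toward zero; a negative percentage
        # is the share of clauses that may be missing.
        calc = clause_count * int(spec[:-1]) / 100.0
        result = clause_count + int(calc) if calc < 0 else int(calc)
    else:
        n = int(spec)
        result = clause_count + n if n < 0 else n

    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(clause_count, result))


# A single source term that analysis expands to 3 tokens:
print(min_should_match(3, "2"))    # absolute mm
print(min_should_match(3, "50%"))  # percentage mm
```

Under this model, a term that analysis expands to three tokens needs two matches with mm=2 but only one with mm=50%, which is why a percentage tends to be more forgiving when filters add tokens.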


-- Jack Krupansky

-----Original Message-----
From: Schmidt, Matthew
Sent: Thursday, August 21, 2014 3:59 PM
To: solr-user@lucene.apache.org
Subject: Minimum Match with filters that add tokens

Is there a good way of handling a minimum match value greater than 1 with token filters that add tokens to the stream?


Say you have a field with the DoubleMetaphone filter for phonetic matching:

<filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>

This would add two tokens to the stream, one for the primary phonetic code, one for the secondary. If I have the min match set to 2 (mm=2) and my query only has a single token in it, then I only get results where at least 2 of the tokens match. This means that documents that only match on a phonetic token aren't included.


Example:

Field:
<fieldType name="name" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Document:

{ id: 1, lastName: "meneghini" } (This generates {meneghini, MNKN} for the index token stream for the lastName field)

Searching (using edismax) with q=meneghini&mm=2 returns document 1, as expected, but searching q=menegini&mm=2 does not. However, q=menegini&mm=1 does. The reason the first query worked as expected is that after the phonetic filter the query token stream has 2 tokens {meneghini, MNKN}, and both of them match the index tokens, satisfying the mm parameter. With the phonetic misspelling (menegini, which yields {menegini, MNJN, MNKN}), only one of the 3 tokens matches, so it is below the mm threshold. The third query only needs one match, which it gets on the phonetic code MNKN.
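
The three cases above can be reproduced with a toy model that simply counts how many query-side tokens appear in the index-side token set. This is only an illustration of the counting, using the token values quoted above; it does not call Solr, and the helper name `satisfies_mm` is hypothetical.

```python
# Index-side tokens for document 1, as quoted above.
INDEX_TOKENS = {"meneghini", "MNKN"}

# Query-side tokens after analysis, as quoted above.
QUERY_TOKENS = {
    "meneghini": ["meneghini", "MNKN"],
    "menegini": ["menegini", "MNJN", "MNKN"],
}


def satisfies_mm(query_tokens, index_tokens, mm):
    """Return True if at least mm of the analyzed query tokens match."""
    hits = sum(1 for token in query_tokens if token in index_tokens)
    return hits >= mm


for term, mm in [("meneghini", 2), ("menegini", 2), ("menegini", 1)]:
    matched = satisfies_mm(QUERY_TOKENS[term], INDEX_TOKENS, mm)
    print(f"q={term}&mm={mm} -> {'match' if matched else 'no match'}")
```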

This seems like counter-intuitive behavior for mm (at least for my use case), since I'm only interested in the original query terms being subject to the mm limitation, not the expanded token set. I would imagine this would be an issue with synonym expansion and any other filter that might add tokens at query time as well.


Possible solutions I've thought of:

- Just use the regular PhoneticFilterFactory with inject="false" in a separate copy field since it will only emit one token per input token. :(

- Subclass the DoubleMetaphoneFilterFactory to add a parameter to specify if only the primary or secondary token should be emitted. Then have a separate field type and copy field for each and search the original field, the primary phonetic token field, and the secondary token field with each query. This only solves for this specific case with the double metaphone filter, since it will add at most 2 tokens. Other filters like BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary number.

- Change {lots of things} to allow filters to set a flag on a token that the query parser can use to determine that it should not count it against the minimum match requirement.


- ?
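
For the first option, the request might look roughly like the sketch below. The field name `lastName_phonetic` is hypothetical (a copy field analyzed with a phonetic filter that emits a single code), and the host, port and core name are placeholders. The assumption, which is worth verifying on your Solr version, is that with several qf fields edismax builds one clause per query term across the fields and applies mm to those clauses, so fields that emit at most one token per input token keep the count aligned with what the user typed.

```python
from urllib.parse import urlencode


def build_select_url(base_url, user_query, fields, mm):
    """Build an edismax /select URL that searches several fields at once."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": " ".join(fields),  # original field plus the phonetic copy field
        "mm": mm,
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


# Placeholder base URL and hypothetical copy-field name.
url = build_select_url(
    "http://localhost:8983/solr/collection1",
    "menegini",
    ["lastName", "lastName_phonetic"],
    "2",
)
print(url)
```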

Any thoughts?

Matt
