Use a percentage rather than an absolute token count, such as 50%, 25%, or maybe 33%. You can also specify different percentages for different ranges of term counts.

Be aware that although it is tempting to think of MM from the user perspective of how many terms are written in the original query, the implementation (BooleanQuery) uses the terms generated by the analysis process, which can break up source terms into multiple terms and generate extra terms as well. Any MM number or percentage will count the terms output by analysis, not the source terms.
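
To make the arithmetic concrete, here is a minimal sketch of how a positive mm percentage might translate into a required match count once it is applied to the analyzed token count. The function name `required_matches` is hypothetical, and the code is a toy model rather than Solr's implementation. It assumes the computed value is rounded down, as the Solr documentation describes, and that a query made up only of optional clauses still needs at least one of them to match.

```python
def required_matches(mm_percent: int, clause_count: int) -> int:
    """Number of optional clauses that must match for a positive mm percentage."""
    # Integer division truncates, modelling the documented round-down.
    computed = clause_count * mm_percent // 100
    # Assumption: a query of only optional clauses still needs one match.
    return max(1, computed)


# Analyzed token counts taken from this thread: the correctly spelled
# name yields 2 query tokens, the misspelled one yields 3.
for tokens in (2, 3):
    for pct in (25, 33, 50):
        needed = required_matches(pct, tokens)
        print(f"{tokens} analyzed tokens, mm={pct}% -> need {needed}")
```

Under these assumptions, each of the percentages above works out to a single required match for both token counts, so a match on the phonetic code alone would be enough.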

-- Jack Krupansky

-----Original Message-----
From: Schmidt, Matthew
Sent: Thursday, August 21, 2014 3:59 PM
To: solr-user@lucene.apache.org
Subject: Minimum Match with filters that add tokens

Is there a good way of handling a minimum match value greater than 1 with token filters that add tokens to the stream?

Say you have a field with the DoubleMetaphone filter for phonetic matching:

<filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>

This would add two tokens to the stream, one for the primary phonetic code, one for the secondary. If I have the min match set to 2 (mm=2) and my query only has a single token in it, then I only get results where at least 2 of the tokens match. This means that documents that only match on a phonetic token aren't included.

Example:

Field:
<fieldType name="name" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="6"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

Document:
{ id: 1, lastName: "meneghini" } (This generates {meneghini, MNKN} for the index token stream for the lastName field)

Searching (using edismax) with q=meneghini&mm=2 returns document 1, as expected, but searching q=menegini&mm=2 does not. However, q=menegini&mm=1 does. The first query works as expected because, after the phonetic filter, the query token stream has 2 tokens (meneghini, MNKN), and both of them match the index tokens, satisfying the mm parameter. With the phonetic misspelling menegini, the query token stream is {menegini, MNJN, MNKN}, and only one of the 3 tokens matches, so it is below the mm threshold. The third query only needs one match, which it gets on the phonetic code MNKN.
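
The three cases above can be reproduced with a small toy model. This is only a sketch of the counting behavior described here, not Solr or Lucene code: the helper `satisfies_mm` is hypothetical, and the token sets are copied from the example rather than produced by a real analyzer.

```python
def satisfies_mm(query_tokens, index_tokens, mm):
    """True if at least mm of the analyzed query tokens occur in the index tokens."""
    matched = sum(1 for token in query_tokens if token in index_tokens)
    return matched >= mm


# Index-time tokens for lastName "meneghini", as listed in the example.
indexed = {"meneghini", "MNKN"}

# Query-time token streams after analysis, as listed in the example.
exact = ["meneghini", "MNKN"]
misspelled = ["menegini", "MNJN", "MNKN"]

print(satisfies_mm(exact, indexed, mm=2))       # True: both tokens match
print(satisfies_mm(misspelled, indexed, mm=2))  # False: only MNKN matches
print(satisfies_mm(misspelled, indexed, mm=1))  # True: MNKN alone is enough
```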

This seems like counter-intuitive behavior for mm (at least for my use case), since I'm only interested in the original query terms being subject to the mm limitation, not the expanded token set. I would imagine this would be an issue with synonym expansion and any other filter that might add tokens at query time as well.

Possible solutions I've thought of:


- Just use the regular PhoneticFilterFactory with inject="false" in a separate copy field since it will only emit one token per input token. :(

- Subclass the DoubleMetaphoneFilterFactory to add a parameter to specify if only the primary or secondary token should be emitted. Then have a separate field type and copy field for each and search the original field, the primary phonetic token field, and the secondary token field with each query. This only solves for this specific case with the double metaphone filter, since it will add at most 2 tokens. Other filters like BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary number.

- Change {lots of things} to allow filters to set a flag on a token that the query parser can use to determine that it should not count it against the minimum match requirement.

- ?

Any thoughts?

Matt
