Use a percentage for mm rather than an absolute token count, such as 50%,
25%, or maybe 33%. You can also specify different percentages for different
ranges of term counts.
Be aware that although it is tempting to think of MM from the user
perspective of how many terms are written in the original query, the
implementation (BooleanQuery) uses the terms generated by the analysis
process, which can break up source terms into multiple terms and generate
extra terms as well. Any MM number or percentage will count the terms output
by analysis, not the source terms.
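To make the arithmetic concrete, here is a minimal sketch of how an mm value could be turned into a required match count, assuming the documented mm semantics (plain integers, percentages truncated toward zero, negative values meaning "all but this many", and conditional "N<value" expressions). The function name required_matches is hypothetical and is not part of Solr; the count passed in is the number of terms produced by analysis, not the number of words the user typed.

```python
import re


def required_matches(optional_clauses: int, spec: str) -> int:
    """Return how many optional clauses must match for a given mm spec.

    Hypothetical helper sketching the documented mm semantics; it is not
    Solr's own code. optional_clauses is the number of analyzed terms.
    """
    spec = spec.strip()
    if "<" in spec:
        result = optional_clauses
        # Conditional specs are space separated, e.g. "2<-25% 9<-3".
        for part in re.sub(r"\s*<\s*", "<", spec).split():
            bound, value = part.split("<", 1)
            if optional_clauses <= int(bound):
                return result
            result = required_matches(optional_clauses, value)
        return result
    if spec.endswith("%"):
        raw = optional_clauses * int(spec[:-1]) / 100
    else:
        raw = int(spec)
    truncated = int(raw)  # int() truncates toward zero
    # A negative value means "this many clauses may be missing".
    result = optional_clauses + truncated if raw < 0 else truncated
    return max(0, result)


# Two analyzed terms (meneghini, MNKN) versus three (menegini, MNJN, MNKN):
print(required_matches(2, "50%"))  # 1
print(required_matches(3, "50%"))  # 1
print(required_matches(3, "2"))    # 2
```

Under these assumptions, mm=50% requires only one match for either query in the example quoted below, whereas mm=2 requires two regardless of how many terms analysis added.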
-- Jack Krupansky
-----Original Message-----
From: Schmidt, Matthew
Sent: Thursday, August 21, 2014 3:59 PM
To: solr-user@lucene.apache.org
Subject: Minimum Match with filters that add tokens
Is there a good way of handling a minimum match value greater than 1 with
token filters that add tokens to the stream?
Say you have a field with the DoubleMetaphone filter for phonetic matching:
<filter class="solr.DoubleMetaphoneFilterFactory" inject="true"
maxCodeLength="6"/>
This would add two tokens to the stream, one for the primary phonetic code,
one for the secondary. If I have the min match set to 2 (mm=2) and my query
only has a single token in it, then I only get results where at least 2 of
the tokens match. This means that documents that only match on a phonetic
token aren't included.
Example:
Field:
<fieldType name="name" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="true"
maxCodeLength="6"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Document:
{ id: 1, lastName: "meneghini" } (This generates {meneghini, MNKN} for the
index token stream for the lastName field)
Searching (using edismax) with q=meneghini&mm=2 returns document 1, as
expected, but searching q=menegini&mm=2 does not. However q=menegini&mm=1
does. The reason the first query worked as expected is that after the
phonetic filter the query token stream has 2 tokens (meneghini, MNKN), and
both of them match the index tokens, satisfying the mm parameter. With the
phonetic misspelling, the query token stream has 3 tokens (menegini, MNJN,
MNKN), and only one of them matches, which is below the mm threshold. The
third query only needs one match, which it gets on the phonetic code MNKN.
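The three outcomes described above can be reproduced with a small simulation. This is only a sketch: the token sets are copied by hand from the example rather than produced by a real analyzer, and the matches function is a hypothetical stand-in for the minimum-match check, assuming every analyzed query token counts as one optional clause.

```python
def matches(query_tokens: set, index_tokens: set, mm: int) -> bool:
    """True when at least mm analyzed query tokens occur in the indexed tokens."""
    return len(query_tokens & index_tokens) >= mm


# Token sets copied from the example, not generated by DoubleMetaphone here.
INDEXED = {"meneghini", "MNKN"}
QUERY_EXACT = {"meneghini", "MNKN"}
QUERY_MISSPELLED = {"menegini", "MNJN", "MNKN"}

print(matches(QUERY_EXACT, INDEXED, 2))       # True:  q=meneghini&mm=2
print(matches(QUERY_MISSPELLED, INDEXED, 2))  # False: q=menegini&mm=2
print(matches(QUERY_MISSPELLED, INDEXED, 1))  # True:  q=menegini&mm=1
```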
This seems like counter-intuitive behavior for mm (at least for my use
case), since I'm only interested in the original query terms being subject
to the mm limitation, not the expanded token set. I would imagine this
would be an issue with synonym expansion and any other filter that might add
tokens at query time as well.
Possible solutions I've thought of:
- Just use the regular PhoneticFilterFactory with inject="false" in
  a separate copy field since it will only emit one token per input token. :(
- Subclass the DoubleMetaphoneFilterFactory to add a parameter to
  specify if only the primary or secondary token should be emitted. Then have
  a separate field type and copy field for each and search the original field,
  the primary phonetic token field, and the secondary token field with each
  query. This only solves for this specific case with the double metaphone
  filter, since it will add at most 2 tokens. Other filters like
  BeiderMorseFilterFactory or SynonymFilterFactory might add an arbitrary
  number.
- Change {lots of things} to allow filters to set a flag on a token
  that the query parser can use to determine that it should not count it
  against the minimum match requirement.
- ?
Any thoughts?
Matt