Hi,
I have a question about phonetic search and matching in Solr.
In our application, the entire content of an article is written to a
full-text search field, which applies stemming and a phonetic filter
(Cologne phonetics for German).
This is the relevant part of the configuration for the index analyzer
(search is analogous):
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory"
        encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
Unfortunately, this sometimes results in strange, though explainable,
matches.
For example:
The content field indexes the following string: Donnerstag von 13 bis 17 Uhr.
A search for "puf" matches this document, because the phonetic filter
encodes "puf" as 13, which is identical to the literal token 13 in the content.
(As a consequence, the 13 is also highlighted.)
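To make the collision easier to reproduce outside Solr, here is a minimal Python sketch. It is a simplified re-implementation of the Cologne phonetics rules, not the encoder Solr actually uses, and the helper names `cologne` and `analyze` are made up for this illustration. Stemming and the word delimiter filter are left out, and `analyze` only mimics what I understand inject="true" to do (keep the original token and add its phonetic code).

```python
import re

UMLAUTS = str.maketrans({"Ä": "A", "Ö": "O", "Ü": "U"})


def cologne(word):
    """Simplified Cologne phonetics encoder (illustration only)."""
    # Only the letters A-Z are encoded, so a token like "13" yields "".
    w = "".join(c for c in word.upper().translate(UMLAUTS) if "A" <= c <= "Z")
    codes = []
    for i, ch in enumerate(w):
        prev = w[i - 1] if i > 0 else ""
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if ch in "AEIJOUY":
            code = "0"
        elif ch == "H":
            code = ""  # H is ignored
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in "DT":
            code = "8" if nxt and nxt in "CSZ" else "2"
        elif ch in "FVW":
            code = "3"
        elif ch in "GKQ":
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt and nxt in "AHKLOQRUX" else "8"
            elif prev in "SZ":
                code = "8"
            else:
                code = "4" if nxt and nxt in "AHKOQUX" else "8"
        elif ch == "X":
            code = "8" if prev and prev in "CKQ" else "48"
        elif ch == "L":
            code = "5"
        elif ch in "MN":
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # S, Z
            code = "8"
        codes.append(code)
    # Collapse runs of the same digit, then drop every 0 except a leading one.
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(codes))
    return collapsed[:1] + collapsed[1:].replace("0", "")


def analyze(text):
    """Rough stand-in for the analyzer chain with inject="true"."""
    terms = set()
    for token in re.findall(r"\w+", text.lower()):
        terms.add(token)        # original token is kept
        code = cologne(token)
        if code:
            terms.add(code)     # phonetic code is added next to it
    return terms


index_terms = analyze("Donnerstag von 13 bis 17 Uhr")
query_terms = analyze("puf")
print(cologne("puf"))                      # 13
print(sorted(index_terms & query_terms))   # ['13']
```

The only term shared between the query and the document is 13: on the query side it is the phonetic code of "puf", on the index side it is the literal number from the text.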
Does anyone have an idea how to handle this in a reasonable way, so that a
search for "puf" does not match 13 in the content?
Thanks in advance!
Dirk