Re: Weird Problem (possible bug?) with german stemming and wildcard search

Markus Jelsma Tue, 07 Oct 2014 05:52:39 -0700

Hi - you should not use wildcards for autocompletion; Lucene has far better 
tools for building good autocompletion. Also, because a wildcard query is a 
multi-term query, it is not passed through your configured query-time analyzer, 
so the wildcard term is not stemmed the way the indexed terms are.
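
To make the mismatch concrete with the analysis output quoted below: the index 
holds the stemmed term "radbremszylind", while the wildcard query still looks 
for terms starting with "radbremszylinder", so it can only hit longer unstemmed 
forms. One common way to get prefix matching without wildcards is an edge 
n-gram field. This is only a sketch and not necessarily the tool meant above. 
The names "description_autocomplete" and "text_autocomplete" and the gram sizes 
are assumptions, not part of the original schema.

```xml
<!-- Hypothetical field and type names; only "description" comes from the original schema. -->
<field name="description_autocomplete" type="text_autocomplete"
       indexed="true" stored="false" multiValued="false"/>
<copyField source="description" dest="description_autocomplete"/>

<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index the leading prefixes of each token. No stemming here, so prefixes stay literal.
         The gram sizes are illustrative values. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query side is not n-grammed: the typed prefix is matched as a single term. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field like this, the autosuggest component would query 
description_autocomplete with the typed text as a plain term, without 
appending an asterisk.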


Some other comments:
- You use a Porter stemmer, which is designed for English; you should use one 
of the German-specific stem filters instead.
- You don't have an index-time tokenizer defined. This should not be possible, 
and the behaviour is undefined as far as I know.
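
As a rough sketch of both points, the index analyzer could look like the 
fragment below. The choice of solr.WhitespaceTokenizerFactory (mirroring the 
query analyzer) and of solr.GermanLightStemFilterFactory is an assumption; 
other German stem filters exist, and which one fits best depends on the data.

```xml
<fieldType name="text_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <!-- Added: an explicit index-time tokenizer (assumed to mirror the query analyzer). -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Changed: a German stem filter in place of solr.PorterStemFilterFactory (assumed choice). -->
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
  <!-- The query analyzer needs the same stem filter swap so both sides produce the same stems. -->
</fieldType>
```

After changing the index-time analysis, existing documents have to be 
reindexed before the new terms show up in the index.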


On Tuesday 07 October 2014 14:25:27 Thomas Michael Engelke wrote:
> I have a problem with a stemmed German field. The field definition:
> 
> <field name="description" type="text_splitting" indexed="true"
> stored="true" required="false" multiValued="false"/>
> ...
> <fieldType name="text_splitting" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>    <analyzer type="index">
>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>      <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>      <filter class="solr.PorterStemFilterFactory"/>
>    </analyzer>
>    <analyzer type="query">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>      <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>      <filter class="solr.PorterStemFilterFactory"/>
>    </analyzer>
> </fieldType>
> 
> When we search for a word from an autosuggest kind of component, we
> always add an asterisk to a word, so when somebody enters something like
> "Radbremszylinder" and waits for some milliseconds, the autosuggest list
> is filled with the results of searching for "Radbremszylinder*". This
> seemed to work quite well. Today we got a bug report from a customer for
> that exact word.
> 
> So I made an analysis for the word as "Field value (index)" and "Field
> value (query)", and it looked like this:
> 
> ST   Radbremszylinder                WT   Radbremszylinder*
> SF   Radbremszylinder                SF   Radbremszylinder*
> WDF  Radbremszylinder                SF   Radbremszylinder*
> LCF  radbremszylinder                WDF  Radbremszylinder
> SKMF radbremszylinder                LCF  radbremszylinder
> PSF  radbremszylind                  SKMF radbremszylinder
> 
> As you can see, the end result looks very much alike. However, records
> containing that word in their "description" field aren't reported as
> results. Strangely enough, records containing "Radbremszylindern"
> (plural) are reported as results. Removing the asterisk from the end
> reports all records with "Radbremszylinder", just as we would expect. So
> the culprit is the asterisk at the end. As far as we can read from the
> docs, an asterisk is just 0 or more characters, which means that the
> literal word in front of the asterisk should match the query.
> 
> Searching further we tried some variations, and it seems that searching
> for "Radbremszylind*" works. All records with any variation
> ("Radbremszylinder", "Radbremszylindern") are reported. So maybe there's
> a weird interaction with stemming?
> 
> Any ideas?
