Problems with TokenFilter, but only in wildcard queries

Björn Keil Wed, 16 Oct 2019 02:06:16 -0700

Hello,

I am having a problem with a primitive self-written TokenFilter, namely the
GermanUmlautFilter in the example below. It's being used for both queries
and indexing.
It works perfectly most of the time: it replaces ä with ae, ö with oe and so
forth, before ICUFoldingFilter replaces the remaining non-ASCII symbols.


However, it causes odd behaviour in wildcard queries. E.g.:
The query title:todesmä* matches todesmarsch, which it should not, because
an ä is supposed to be replaced with ae. It also matches todesmärchen, as
it should.
The query title:todesmär still matches todesmarsch, but not todesmärchen.

That is odd. It looks as though the replacement does not take place while
performing a wildcard query, even though it worked during indexing. In
other circumstances it works, however. E.g.:
The query title:härte correctly does not match harte, but it does match
härte.
The query title:haerte is equivalent to the query title:härte.
The query title:harte correctly does not match haerte, but it does match
harte.
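
The wildcard symptoms above fit one hypothesis: the umlaut replacement is
skipped for wildcard terms and only the folding step runs. The sketch below
simulates that in plain Java, without Lucene or Solr. Everything in it is an
assumption made for illustration: the class and method names are made up,
fold() is only a crude stand-in for ICUFoldingFilter, and whether Solr really
skips the filter here is not confirmed by anything in this message.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class WildcardSketch {

    // Stand-in for the umlaut filter: expands German umlauts and sharp s.
    static String expandUmlauts(String s) {
        return s.replace("\u00e4", "ae").replace("\u00f6", "oe")
                .replace("\u00fc", "ue").replace("\u00c4", "AE")
                .replace("\u00d6", "OE").replace("\u00dc", "UE")
                .replace("\u00df", "ss");
    }

    // Crude stand-in for ICUFoldingFilter (assumption): decompose,
    // drop combining marks, lower-case.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Index-time chain: umlaut expansion first, then folding.
    static String indexTerm(String s) {
        return fold(expandUmlauts(s));
    }

    // Hypothetical wildcard chain: the umlaut step is skipped.
    static String wildcardPrefix(String s) {
        return fold(s);
    }

    // Returns the titles whose indexed form starts with the wildcard prefix.
    static List<String> prefixMatches(String prefix, List<String> titles) {
        String p = wildcardPrefix(prefix);
        return titles.stream()
                .filter(t -> indexTerm(t).startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "todesmarsch" and "todesmaerchen" written with a-umlaut
        List<String> titles = List.of("todesmarsch", "todesm\u00e4rchen");
        System.out.println(prefixMatches("todesm\u00e4", titles));
        System.out.println(prefixMatches("todesm\u00e4r", titles));
    }
}
```

Under these assumptions the prefix for todesmä* folds to todesma, which is a
prefix of both todesmarsch and todesmaerchen, while the prefix for todesmär
folds to todesmar, which only fits todesmarsch. That is the same pattern as
reported above, but it is only a simulation, not a diagnosis.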

While debugging the GermanUmlautFilter, I did not find any obvious mistake.
The only thing that is a bit strange is that the endOffset of the
CharTermAttribute (implemented by PackedTokenAttributeImpl) does not appear
to change. However, if it is supposed to indicate the last character's
offset in bytes, that would be the expected result: the filter replaces a
single two-byte character with two one-byte characters in the examples above.
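
The byte arithmetic in the paragraph above can be checked in plain Java. This
is a minimal sketch: the class name UmlautLengths is made up, and UTF-8 is an
assumption about which encoding is meant. It compares the UTF-8 byte count
with the number of Java chars, which is the unit the filter code below uses
for its buffer lengths. It says nothing about which unit the offsets
themselves are measured in.

```java
import java.nio.charset.StandardCharsets;

public class UmlautLengths {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4";   // a-umlaut
        String replacement = "ae";

        // UTF-8 byte counts: 2 for the umlaut, 2 for the replacement.
        System.out.println(utf8Bytes(umlaut) + " -> " + utf8Bytes(replacement));

        // Java char counts: 1 for the umlaut, 2 for the replacement.
        System.out.println(umlaut.length() + " -> " + replacement.length());
    }
}
```

So the replacement leaves the UTF-8 byte length unchanged but makes the term
one char longer, which is why the filter allocates origLength plus one char
per replacement.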

Does anybody have an idea what's going on here? What's so different about
wildcard queries?

From the schema.xml:
<fieldType name="text_search" class="solr.TextField">
<analyzer>
  <!-- Based on the StandardTokenizer, the ExampleTokenizer uses slightly
       modified jflex code. -->
<tokenizer class="de.example.analysis.ExampleTokenizerFactory"/>

<!-- The LengthFilter is non-standard: it cuts off after 30 characters rather
     than discarding the token. -->
<filter class="de.example.analysis.LengthFilterFactory" maxTokenLength="30"/>

<!-- Yes, I realise that SynonymFilters are deprecated. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>

<!-- This is the filter that causes problems. -->
<filter class="de.example.analysis.GermanUmlautFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>

GermanUmlautFilter code:

package de.example.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * This TokenFilter replaces German umlauts and the character ß with a
 * normalized form in ASCII characters.
 *
 * <ul><li>ü => ue</li>
 * <li>ß => ss</li>
 * <li>etc.</li></ul>
 *
 * This enables a sort order according to DIN 5007, variant 2, the so
 * called "phone book" sort order.
 *
 * @see org.apache.lucene.analysis.TokenStream
 *
 */
public class GermanUmlautFilter extends TokenFilter {
        
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        /**
         * @see org.apache.lucene.analysis.TokenFilter#TokenFilter()
         * @param input TokenStream with the tokens to filter
         */
        public GermanUmlautFilter(TokenStream input) {
                super(input);
        }

        /**
         * Performs the actual filtering upon request by the consumer.
         *
         * @see org.apache.lucene.analysis.TokenStream#incrementToken()
         * @return true if a token was produced, false at the end of the stream
         */
        @Override
        public final boolean incrementToken() throws IOException {
                if (input.incrementToken()) {
                        int countReplacements = 0;
                        char[] origBuffer = termAtt.buffer();
                        int origLength = termAtt.length();
                        // Figure out how many replacements we need to get the
                        // size of the new buffer
                        for (int i = 0; i < origLength; i++) {
                                if (origBuffer[i] == 'ü'
                                        || origBuffer[i] == 'ä'
                                        || origBuffer[i] == 'ö'
                                        || origBuffer[i] == 'ß'
                                        || origBuffer[i] == 'Ä'
                                        || origBuffer[i] == 'Ö'
                                        || origBuffer[i] == 'Ü'
                                ) {
                                        countReplacements++;
                                }
                        }
                        
                        // If there is a replacement, create a new buffer of the
                        // appropriate length...
                        if (countReplacements != 0) {
                                int newLength = origLength + countReplacements;
                                char[] target = new char[newLength];
                                int j = 0;
                                // ... perform the replacement ...
                                for (int i = 0; i < origLength; i++) {
                                        switch (origBuffer[i]) {
                                        case 'ä':
                                                target[j++] = 'a';
                                                target[j++] = 'e';
                                                break;
                                        case 'ö':
                                                target[j++] = 'o';
                                                target[j++] = 'e';
                                                break;
                                        case 'ü':
                                                target[j++] = 'u';
                                                target[j++] = 'e';
                                                break;
                                        case 'Ä':
                                                target[j++] = 'A';
                                                target[j++] = 'E';
                                                break;
                                        case 'Ö':
                                                target[j++] = 'O';
                                                target[j++] = 'E';
                                                break;
                                        case 'Ü':
                                                target[j++] = 'U';
                                                target[j++] = 'E';
                                                break;
                                        case 'ß':
                                                target[j++] = 's';
                                                target[j++] = 's';
                                                break;
                                        default:
                                                target[j++] = origBuffer[i];
                                        }
                                }
                                // ... and copy the new buffer into the attribute.
                                // copyBuffer() grows the attribute's buffer if
                                // necessary and sets the new length itself.
                                termAtt.copyBuffer(target, 0, newLength);
                        }
                        return true;
                } else {
                        return false;
                }
        }

}
