Hello, I am having a problem with a primitive self-written TokenFilter, namely the GermanUmlautFilter in the example below. It is used for both queries and indexing. It works perfectly most of the time: it replaces ä with ae, ö with oe and so forth, before the ICUFoldingFilter replaces the remaining non-ASCII symbols.
However, it does cause odd behaviour in wildcard queries. For example, the query title:todesmä* matches todesmarsch, which it should not, because an ä is supposed to be replaced with ae; however, it also matches todesmärchen, as it should. The query title:todesmär still matches todesmarsch, but not todesmärchen. It is as though the replacement did not take place while performing a wildcard query, even though it did work during indexing.

In other circumstances it works, however. For example, the query title:härte correctly does not match harte, but it does match härte. The query title:haerte is equivalent to the query title:härte. The query title:harte correctly does not match haerte, but it does match harte.

While debugging the GermanUmlautFilter, I did not find any obvious mistake. The only thing that is a bit strange is that the endOffset of the CharTermAttribute (implemented by PackedTokenAttributeImpl) does not appear to change. However, if it is supposed to indicate the last character's offset in bytes, that would be the expected result: the filter replaces a single two-byte character with two one-byte characters in the examples above.

Does anybody have an idea what's going on here? What's so different about wildcard queries?

From the schema.xml:

    <fieldType name="text_search" class="solr.TextField">
      <analyzer>
        <!-- Based on the StandardTokenizer, the ExampleTokenizer uses slightly modified JFlex code. -->
        <tokenizer class="de.example.analysis.ExampleTokenizerFactory"/>
        <!-- The LengthFilter is non-standard; it cuts off after 30 characters rather than discarding the token. -->
        <filter class="de.example.analysis.LengthFilterFactory" maxTokenLength="30" />
        <!-- Yes, I realise that SynonymFilters are deprecated. -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
        <!-- This is the filter that causes problems.
        -->
        <filter class="de.example.analysis.GermanUmlautFilterFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

GermanUmlautFilter code (note: I have corrected the misspelled class name "GermanUmaultFilter" here):

    package de.example.analysis;

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * This TokenFilter replaces German umlauts and the character ß with a
     * normalized form in ASCII characters.
     *
     * <ul><li>ü => ue</li>
     * <li>ß => ss</li>
     * <li>etc.</li></ul>
     *
     * This enables a sort order according to DIN 5007, variant 2, the
     * so-called "phone book" sort order.
     *
     * @see org.apache.lucene.analysis.TokenStream
     */
    public class GermanUmlautFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        /**
         * @see org.apache.lucene.analysis.TokenFilter#TokenFilter(TokenStream)
         * @param input TokenStream with the tokens to filter
         */
        public GermanUmlautFilter(TokenStream input) {
            super(input);
        }

        /**
         * Performs the actual filtering upon request by the consumer.
         *
         * @see org.apache.lucene.analysis.TokenStream#incrementToken()
         * @return true if a token was produced, false if the stream is exhausted
         */
        @Override
        public boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                int countReplacements = 0;
                char[] origBuffer = termAtt.buffer();
                int origLength = termAtt.length();

                // Figure out how many replacements we need to get the size of the new buffer.
                for (int i = 0; i < origLength; i++) {
                    if (origBuffer[i] == 'ü' || origBuffer[i] == 'ä' || origBuffer[i] == 'ö'
                            || origBuffer[i] == 'ß' || origBuffer[i] == 'Ä' || origBuffer[i] == 'Ö'
                            || origBuffer[i] == 'Ü') {
                        countReplacements++;
                    }
                }

                // If there is a replacement, create a new buffer of the appropriate length ...
                if (countReplacements != 0) {
                    int newLength = origLength + countReplacements;
                    char[] target = new char[newLength];
                    int j = 0;

                    // ... perform the replacement ...
                    for (int i = 0; i < origLength; i++) {
                        switch (origBuffer[i]) {
                            case 'ä': target[j++] = 'a'; target[j++] = 'e'; break;
                            case 'ö': target[j++] = 'o'; target[j++] = 'e'; break;
                            case 'ü': target[j++] = 'u'; target[j++] = 'e'; break;
                            case 'Ä': target[j++] = 'A'; target[j++] = 'E'; break;
                            case 'Ö': target[j++] = 'O'; target[j++] = 'E'; break;
                            case 'Ü': target[j++] = 'U'; target[j++] = 'E'; break;
                            case 'ß': target[j++] = 's'; target[j++] = 's'; break;
                            default:  target[j++] = origBuffer[i];
                        }
                    }

                    // ... make sure the attribute's buffer is large enough, copy the
                    // new buffer, and set the length.
                    termAtt.resizeBuffer(newLength);
                    termAtt.copyBuffer(target, 0, newLength);
                    termAtt.setLength(newLength);
                }
                return true;
            } else {
                return false;
            }
        }
    }
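For what it's worth, I have also exercised the replacement logic in isolation. Below is a minimal, Lucene-free sketch of the same character mapping (the class and method names UmlautExpansionDemo/expandUmlauts are made up for this demo; only the mapping mirrors the switch statement in the filter), which behaves as expected on the terms from my examples:

```java
// Standalone sketch of the umlaut expansion performed by the filter.
// Class and method names are hypothetical; only the character mapping
// is taken from the GermanUmlautFilter switch statement above.
public class UmlautExpansionDemo {

    static String expandUmlauts(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 4);
        for (char c : input.toCharArray()) {
            switch (c) {
                case 'ä': sb.append("ae"); break;
                case 'ö': sb.append("oe"); break;
                case 'ü': sb.append("ue"); break;
                case 'Ä': sb.append("AE"); break;
                case 'Ö': sb.append("OE"); break;
                case 'Ü': sb.append("UE"); break;
                case 'ß': sb.append("ss"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandUmlauts("todesmärchen")); // todesmaerchen
        System.out.println(expandUmlauts("härte"));        // haerte
        System.out.println(expandUmlauts("todesmarsch"));  // todesmarsch (unchanged)
    }
}
```

So the mapping itself seems fine in isolation, which makes me suspect the difference lies in how the analysis chain is (or is not) applied to wildcard query terms, rather than in the replacement logic.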