Hello, I am having a bit of a problem with wildcard queries and I don't know how to pin it down yet. I have a suspect, but I can't find an error in it: one of the filters in the analysis chain of the respective search field.
The problem is that when I run a wildcard query like title:todesmä*, it does return results, but it also returns results that would only match title:todesma*. It is not supposed to do that because, due to the filter, the query should be equivalent to title:todesmae*. The real problem is that if I search for title:todesmär*, it does not find anything at all anymore, although there are titles in the index that would match, such as "todesmärsche" and "todesmärchen".

I have stepped through the filter in a debugger, but I could not find anything wrong with it. It is supposed to replace "ä" with "ae", which it does; it calls termAtt.resizeBuffer() before the replacement and termAtt.setLength() afterwards. The result looks perfectly alright. What it does not change is the endOffset attribute of the CharTermAttribute object. That is probably because it is counting bytes, not characters: I replaced a single two-byte char with two one-byte chars, so the endOffset stays the same.

Could anybody tell me whether there is anything wrong with the filter in the attachment?
package de.example.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * This TokenFilter replaces German umlauts and the character ß with a
 * normalized form in ASCII characters:
 *
 * <ul>
 * <li>ü =&gt; ue</li>
 * <li>ß =&gt; ss</li>
 * <li>etc.</li>
 * </ul>
 *
 * This enables a sort order according to DIN 5007, variant 2, the so-called
 * "phone book" sort order.
 *
 * @see org.apache.lucene.analysis.TokenStream
 */
public class GermanUmaultFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    /**
     * @param input TokenStream with the tokens to filter
     * @see org.apache.lucene.analysis.TokenFilter
     */
    public GermanUmaultFilter(TokenStream input) {
        super(input);
    }

    /**
     * Performs the actual filtering upon request by the consumer.
     *
     * @return true if a token was produced, false if the stream is exhausted
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            int countReplacements = 0;
            char[] origBuffer = termAtt.buffer();
            int origLength = termAtt.length();

            // Figure out how many replacements we need to get the size of the new buffer
            for (int i = 0; i < origLength; i++) {
                if (origBuffer[i] == 'ü' || origBuffer[i] == 'ä'
                        || origBuffer[i] == 'ö' || origBuffer[i] == 'ß'
                        || origBuffer[i] == 'Ä' || origBuffer[i] == 'Ö'
                        || origBuffer[i] == 'Ü') {
                    countReplacements++;
                }
            }

            // If there is a replacement, create a new buffer of the appropriate length ...
            if (countReplacements != 0) {
                int newLength = origLength + countReplacements;
                char[] target = new char[newLength];
                int j = 0;

                // ... perform the replacements ...
                for (int i = 0; i < origLength; i++) {
                    switch (origBuffer[i]) {
                        case 'ä': target[j++] = 'a'; target[j++] = 'e'; break;
                        case 'ö': target[j++] = 'o'; target[j++] = 'e'; break;
                        case 'ü': target[j++] = 'u'; target[j++] = 'e'; break;
                        case 'Ä': target[j++] = 'A'; target[j++] = 'E'; break;
                        case 'Ö': target[j++] = 'O'; target[j++] = 'E'; break;
                        case 'Ü': target[j++] = 'U'; target[j++] = 'E'; break;
                        case 'ß': target[j++] = 's'; target[j++] = 's'; break;
                        default:  target[j++] = origBuffer[i];
                    }
                }

                // ... make sure the attribute's buffer is large enough, copy the
                // new buffer, and set the length.
                termAtt.resizeBuffer(newLength);
                termAtt.copyBuffer(target, 0, newLength);
                termAtt.setLength(newLength);
            }
            return true;
        } else {
            return false;
        }
    }
}
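In case it helps anyone reproduce this without the Lucene plumbing: the replacement logic the filter applies per token boils down to the following standalone sketch (class and method names are mine, not part of the attachment; the character mapping is the same as in the filter above).

```java
// Standalone sketch of the character mapping performed by GermanUmaultFilter,
// so the replacement itself can be tried in isolation.
public class UmlautNormalizer {

    // Applies the same per-character replacements as the filter's switch block.
    public static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            switch (s.charAt(i)) {
                case 'ä': sb.append("ae"); break;
                case 'ö': sb.append("oe"); break;
                case 'ü': sb.append("ue"); break;
                case 'Ä': sb.append("AE"); break;
                case 'Ö': sb.append("OE"); break;
                case 'Ü': sb.append("UE"); break;
                case 'ß': sb.append("ss"); break;
                default:  sb.append(s.charAt(i));
            }
        }
        return sb.toString();
    }
}
```

Running this on the titles in question gives exactly the terms I would expect to be in the index, e.g. normalize("todesmärsche") yields "todesmaersche", which is why I expect title:todesmae* (and hence title:todesmä*) to match.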