Hi,

Lucene indexes documents in three different languages here (English, German and French). I want to normalize certain characters, e.g. umlauts: ä -> ae. I did it in the following way.

New analyzer:

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SpecialCharsAnalyzer extends StandardAnalyzer {

    public SpecialCharsAnalyzer() {
    }

    public SpecialCharsAnalyzer(Set stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(String[] stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(File stopwords) throws IOException {
        super(stopwords);
    }

    public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
        super(stopwords);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // wrap the standard token stream in the normalizing filter
        TokenStream ts = super.tokenStream(fieldName, reader);
        ts = new SpecialCharacterFilter(ts);
        return ts;
    }
}

Is SpecialCharsAnalyzer::tokenStream implemented correctly?

New filter:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SpecialCharacterFilter extends TokenFilter {

    public SpecialCharacterFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;
        String str = t.termText();
        if (str.indexOf("ä") != -1) {
            // replace the umlaut and widen the end offset by one,
            // since "ae" is one character longer than "ä"
            str = str.replaceAll("ä", "ae");
            t = new Token(str, t.startOffset(), t.endOffset() + 1);
        }
        return t;
    }
}

Is SpecialCharacterFilter::next implemented correctly in the case of "ä"? Is this the correct way to do normalisation?

Thanks!
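P.S. To check what the analyzer actually produces, this is roughly the snippet I use to print the tokens (a quick sketch against the 2.x TokenStream API; the field name and sample text are made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        SpecialCharsAnalyzer analyzer = new SpecialCharsAnalyzer();
        // "contents" is an arbitrary field name; StandardAnalyzer ignores it
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("Die Bären mögen Honig"));
        Token t;
        while ((t = ts.next()) != null) {
            // print each term with its start/end offsets into the original text
            System.out.println(t.termText()
                    + " [" + t.startOffset() + "-" + t.endOffset() + "]");
        }
    }
}

For a word like "Bären" this prints the normalized term ("baeren" after lowercasing) together with its offsets, which makes it easy to see whether the offset arithmetic in the filter comes out right.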