Hi,

Did you take a look at ISOLatin1AccentFilter?
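
One thing to check first: as far as I know, that filter folds ä to a plain a rather than expanding it to ae, so it may not give the German-style transliteration described below. Here is a minimal sketch of the expansion as a plain string mapping, assuming nothing beyond the JDK. UmlautNormalizer and its mapping table are hypothetical names for illustration, not Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): expands German umlauts and
// sharp s into their conventional ASCII transliterations.
final class UmlautNormalizer {
    private static final Map<Character, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put('\u00E4', "ae"); // a-umlaut
        EXPANSIONS.put('\u00F6', "oe"); // o-umlaut
        EXPANSIONS.put('\u00FC', "ue"); // u-umlaut
        EXPANSIONS.put('\u00DF', "ss"); // sharp s
        EXPANSIONS.put('\u00C4', "Ae"); // capital A-umlaut
        EXPANSIONS.put('\u00D6', "Oe"); // capital O-umlaut
        EXPANSIONS.put('\u00DC', "Ue"); // capital U-umlaut
    }

    private UmlautNormalizer() {
    }

    // Returns the term with every mapped character replaced by its expansion;
    // characters without a mapping are copied through unchanged.
    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String replacement = EXPANSIONS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Inside a TokenFilter you would apply normalize() to each token's term text and keep the token's original start and end offsets, since those refer to positions in the source text.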

Patrick

On 11/6/06, hans meiser <[EMAIL PROTECTED]> wrote:

Hi,

We use Lucene here to index documents in three languages
(English, German and French). I want to normalize some
characters such as umlauts, e.g. ä -> ae.
I did it in the following way:

New Analyzer:
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SpecialCharsAnalyzer extends StandardAnalyzer {
    public SpecialCharsAnalyzer() {
    }

    public SpecialCharsAnalyzer(Set stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(String[] stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(File stopwords) throws IOException {
        super(stopwords);
    }

    public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
        super(stopwords);
    }

    // Wrap the standard token stream with the character-normalizing filter.
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = super.tokenStream(fieldName, reader);
        ts = new SpecialCharacterFilter(ts);
        return ts;
    }
}
  Is the SpecialCharsAnalyzer::tokenStream implemented correctly?

New Filter:
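import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SpecialCharacterFilter extends TokenFilter {
    public SpecialCharacterFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String str = t.termText();
        if (str.indexOf('ä') != -1) {
            // Literal replacement; no regular expression is needed.
            str = str.replace("ä", "ae");
            // Offsets refer to the original text, so they stay unchanged
            // even though the term got longer. Type and position increment
            // are carried over from the original token.
            Token normalized = new Token(str, t.startOffset(), t.endOffset(), t.type());
            normalized.setPositionIncrement(t.getPositionIncrement());
            t = normalized;
        }
        return t;
    }
}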
  Is SpecialCharacterFilter::next implemented correctly
for the "ä" case?

Is this the correct way to do normalisation?
  Thanks

