Hi,

Lucene indexes documents in three different languages here (English, German and French). I want to normalize certain characters, e.g. umlauts: ä -> ae. I did it in the following way.

New analyzer:

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SpecialCharsAnalyzer extends StandardAnalyzer {

    public SpecialCharsAnalyzer() {
    }

    public SpecialCharsAnalyzer(Set stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(String[] stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(File stopwords) throws IOException {
        super(stopwords);
    }

    public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
        super(stopwords);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // wrap the standard token stream in the normalizing filter
        TokenStream ts = super.tokenStream(fieldName, reader);
        ts = new SpecialCharacterFilter(ts);
        return ts;
    }
}

Is SpecialCharsAnalyzer::tokenStream implemented correctly?

New filter:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SpecialCharacterFilter extends TokenFilter {

    public SpecialCharacterFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;
        String str = t.termText();
        if (str.indexOf("ä") != -1) {
            // replace the umlaut and widen the end offset by one,
            // since "ae" is one character longer than "ä"
            str = str.replaceAll("ä", "ae");
            t = new Token(str, t.startOffset(), t.endOffset() + 1);
        }
        return t;
    }
}

Is SpecialCharacterFilter::next implemented correctly in the case of "ä"? Is this the correct way to do normalisation?

Thanks!
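P.S. To check what the analyzer actually produces, this is roughly the snippet I use to print the tokens (a quick sketch against the 2.x TokenStream API; the field name and sample text are made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class PrintTokens {
    public static void main(String[] args) throws Exception {
        SpecialCharsAnalyzer analyzer = new SpecialCharsAnalyzer();
        // "contents" is an arbitrary field name; StandardAnalyzer ignores it
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("Die Bären mögen Honig"));
        Token t;
        while ((t = ts.next()) != null) {
            // print each term with its start/end offsets into the original text
            System.out.println(t.termText()
                    + " [" + t.startOffset() + "-" + t.endOffset() + "]");
        }
    }
}

For a word like "Bären" this prints the normalized term ("baeren" after lowercasing) together with its offsets, which makes it easy to see whether the offset arithmetic in the filter comes out right.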