I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman and uses Mac (CR) line separators, I run a small script across the tab file to clean it up:
    tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This converts the Mac \r's to \n's, replaces FileMaker's vertical tabs (which stand in for returns embedded within a field) with spaces, and then runs a character converter to turn the MacRoman into UTF-8 for Java to use. The result looks fine in jEdit and BBEdit, both of which understand UTF-8.


BUT -- when I look at the index in Luke, I see unprintable characters in the terms! Writing programs to dump the terms (using Writer subclasses that handle Unicode correctly) confirms it: the dumped terms contain odd characters when viewed with jEdit and BBEdit.
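
For reference, the dump program is roughly the following (a simplified sketch; the class name is arbitrary and the index path comes from the command line). Note the explicit "UTF-8" on the OutputStreamWriter, so the dump itself can't be mangled by the platform default charset:

    import java.io.OutputStreamWriter;
    import java.io.Writer;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class DumpTerms {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        // Write through an explicit UTF-8 Writer so the dump itself
        // can't be corrupted by the platform default encoding.
        Writer out = new OutputStreamWriter(System.out, "UTF-8");
        TermEnum terms = reader.terms();
        while (terms.next()) {
          Term t = terms.term();
          out.write(t.field() + "\t" + t.text() + "\n");
        }
        terms.close();
        out.flush();
        reader.close();
      }
    }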

The analyzer used to build the index looks like:
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class RedfishAnalyser extends Analyzer {
      String[] stopwords;

      public RedfishAnalyser(String[] stopwords) {
        this.stopwords = stopwords;
      }

      public RedfishAnalyser() {
        this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, normalize (StandardFilter strips possessives and
        // acronym dots), lowercase, drop stopwords, then Porter-stem.
        return new PorterStemFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                stopwords));
      }
    }
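
For context, here's roughly how the analyzer gets wired up at index time. This is a sketch rather than the real indexing code (the index path, file name, and field name are placeholders, and the real tab file is of course split into records and fields); the detail that matters is the explicit "UTF-8" on the InputStreamReader, since a plain FileReader decodes with the platform default charset:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BuildIndex {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new RedfishAnalyser(), true);
        // Decode the cleaned-up export explicitly as UTF-8; FileReader
        // would silently use the platform default encoding instead.
        Reader in = new BufferedReader(new InputStreamReader(
            new FileInputStream("export.tab"), "UTF-8"));
        Document doc = new Document();
        doc.add(Field.Text("contents", in));  // tokenized, indexed, not stored
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
      }
    }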

Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening.
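
In case it helps narrow things down, the analyzer can be exercised on its own with something like this (a sketch; the field name and test string are arbitrary) to see whether accented characters survive tokenization:

    import java.io.OutputStreamWriter;
    import java.io.StringReader;
    import java.io.Writer;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class AnalyzerCheck {
      public static void main(String[] args) throws Exception {
        // "Café Noël naïve" written with \u escapes, to sidestep any
        // source-file encoding issues in the test itself.
        TokenStream ts = new RedfishAnalyser().tokenStream(
            "contents", new StringReader("Caf\u00e9 No\u00ebl na\u00efve"));
        Writer out = new OutputStreamWriter(System.out, "UTF-8");
        Token tok;
        while ((tok = ts.next()) != null) {
          // One token per line, again via an explicit UTF-8 Writer
          // so the test output itself is trustworthy.
          out.write(tok.termText() + "\n");
        }
        out.flush();
      }
    }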

Thanks for any pointers,

Owen

