tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to produce UTF-8 data for Java to use. The result looks fine in jEdit and BBEdit, both of which understand UTF-8.
BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable characters! Writing programs to dump the terms (using Writer subclasses which handle Unicode correctly) confirms it: the dumped files contain odd characters when viewed with jEdit and BBEdit.
The analyzer used to build the index looks like:

    public class RedfishAnalyser extends Analyzer {
        String[] stopwords;

        public RedfishAnalyser(String[] stopwords) {
            this.stopwords = stopwords;
        }

        public RedfishAnalyser() {
            this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    stopwords));
        }
    }
Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening.
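A minimal sketch of a check that could separate the analyzer from the way the file is read, assuming the text reaches the analyzer through a Reader. EncodingCheck, readAll and the sample word are hypothetical names for illustration and are not part of the indexing code above; it uses only the JDK (StandardCharsets needs Java 7 or later). It decodes the same UTF-8 bytes once with an explicit UTF-8 charset and once with a deliberately wrong one, to show the kind of odd characters a mismatched Reader charset produces:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original indexing code.
public class EncodingCheck {

    // Read all characters from the bytes through a Reader that uses
    // an explicitly named charset (no reliance on the platform default).
    public static String readAll(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample text with one non-ASCII character (e with acute accent).
        String original = "caf\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset round-trips the text.
        String right = readAll(utf8, StandardCharsets.UTF_8);
        // Decoding with a mismatched charset turns the two UTF-8 bytes
        // of the accented letter into two separate characters.
        String wrong = readAll(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("matches original: " + right.equals(original));
        System.out.println("mismatched length: " + wrong.length()
                + " vs original length: " + original.length());
    }
}
```

If the text going into the analyzer already looks like the second result, the charset of the Reader rather than the analyzer would be the likelier suspect. That is an assumption to test, not a confirmed diagnosis.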
Thanks for any pointers,
Owen