InputStreamReader( new FileInputStream(filename), "UTF-8");
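A minimal sketch of why the charset argument matters, on the assumption (not confirmed anywhere in the original message) that the indexing code opens the tab file with the platform default encoding. The class name Utf8ReaderSketch and the helper readAll are made up for illustration, and an in-memory byte array stands in for the real file so the example is self-contained. For a file, the Reader would be built as in the line above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical demo class, not part of Lucene or of the original code.
public class Utf8ReaderSketch {

    // Decode the whole stream using an explicitly named charset
    // rather than whatever the platform default happens to be.
    static String readAll(InputStream in, String charsetName) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, charsetName));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as UTF-8: the é is the two bytes 0xC3 0xA9.
        byte[] utf8 = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };

        // Correct charset: the two bytes decode to one character.
        String right = readAll(new ByteArrayInputStream(utf8), "UTF-8");
        // Wrong charset: each byte becomes its own (odd-looking) character.
        String wrong = readAll(new ByteArrayInputStream(utf8), "ISO-8859-1");

        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5
    }
}
```

If the Reader handed to the analyzer is built the second way, the stray characters would end up in the indexed terms, which would match what Luke shows.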
On Wed, 9 Feb 2005 22:32:38 -0700, Owen Densmore <[EMAIL PROTECTED]> wrote:
I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman, and uses Mac line separators, I run a script across the tab file to clean it up:
tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build UTF-8 data for Java to use. Looks fine in jEdit and BBEdit, both of which understand UTF-8.
BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer subclasses which handle unicode correctly) shows that indeed the files now have odd characters when viewed w/ jEdit and BBEdit.
The analyzer used to build the index looks like:

    public class RedfishAnalyser extends Analyzer {
        String[] stopwords;

        public RedfishAnalyser(String[] stopwords) {
            this.stopwords = stopwords;
        }

        public RedfishAnalyser() {
            this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    stopwords));
        }
    }
Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening.
Thanks for any pointers,
Owen