Bingo! I used the InputStreamReader and that fixed the index. Boy,
it's tough to catch all the holes through which Unicode leaks!
Owen
From: aurora <[EMAIL PROTECTED]>
Date: February 9, 2005 11:04:35 PM MST
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene Unicode Usage
Owen Densmore wrote:
I'm building an index from a FileMaker database by dumping the data to a
tab-separated file. Because the FileMaker output is encoded in
MacRoman, and uses Mac line separators, I run a script across the tab
file to clean it up:
tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the
So you have a UTF-8 encoded text file. But how do you read the file into
Java? Java's default encoding depends on the platform and is likely to be
something other than UTF-8. Make sure you specify the encoding explicitly, like:
new InputStreamReader(new FileInputStream(filename), "UTF-8");
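For what it's worth, here is a minimal sketch of how the whole read might look with the encoding pinned down. The class name Utf8TabFileReader and the method readRows are made up for illustration, and the tab-splitting is only an assumption about how the cleaned-up dump would be consumed. It uses StandardCharsets.UTF_8 and try-with-resources, which need Java 7 or later; on an older JVM, pass the string "UTF-8" as in the line above and close the reader by hand.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: reads a tab-separated file as UTF-8.
public class Utf8TabFileReader {

    // Returns one String[] per line, split on tabs.
    public static List<String[]> readRows(String filename) throws IOException {
        List<String[]> rows = new ArrayList<>();
        // Naming the charset here keeps the platform default encoding out of the picture.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The -1 limit keeps trailing empty fields instead of dropping them.
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        for (String[] row : readRows(args[0])) {
            System.out.println(row.length + " fields: " + String.join(" | ", row));
        }
    }
}
```

The strings that come back are ordinary Java strings, so they can be handed to whatever builds the index documents. Note that System.out still encodes with the platform default, so non-ASCII text may look wrong on the console even when the strings themselves are correct.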