Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Dave Kor
http://java.sun.com/docs/books/tutorial/i18n/text/stream.html Yes, it's confusing. Sun calls its own encoding format "Unicode", and the above webpage talks about how to convert between Java's Unicode format and the UTF-8 format. It's just a matter of specifying "UTF-8" when creating output strea
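
As a rough illustration of the point about naming "UTF-8" when the stream is created, here is a minimal sketch. The class name Utf8WriterDemo and the helper encode are made up for this example and do not come from the thread or from Lucene; the in-memory sink just stands in for whatever stream is really being written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical demo class, not part of Lucene or of the original thread.
public class Utf8WriterDemo {

    // Encode text by wrapping an output stream in a Writer that is
    // explicitly told to use the "UTF-8" charset.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, "UTF-8")) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is four characters; the last one needs two bytes in UTF-8.
        System.out.println(encode("caf\u00e9").length); // prints 5
    }
}
```

Leaving the charset argument out makes OutputStreamWriter fall back to the platform default, which is presumably why naming the encoding explicitly is the advice here.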

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Bill Janssen
Thanks for pointing this out, Marvin. I wish Sun (or someone) would document and register this particular character set encoding with IANA, so that it could be used outside of Java. As it stands now, it's essentially a bastard encoding, good for nothing, and one of the warts of Java. Lucene prob

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Ken Krugler
I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses "standard UTF-8" -- or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8. Unfortunately this is how Sun documents t

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
On Aug 26, 2005, at 10:14 PM, jian chen wrote: Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? It has been suggested that this discussion should move to the developer's list, s

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? Cheers, Jian On 8/26/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > Greets, > > [crossposted to java-user@lucene.apache.org and [E

Lucene does NOT use UTF-8.

2005-08-26 Thread Marvin Humphrey
Greets, [crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED] I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses "standard UTF-8" -- or even UTF-8 at all, since "
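
For readers who want to see the difference concretely, the sketch below compares standard UTF-8 with the "Modified UTF-8" that Java's DataOutputStream.writeUTF produces. It does not touch Lucene's own storage code; writeUTF is used here only as a readily available source of Modified UTF-8, and the class name ModifiedUtf8Demo and its helper methods are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Lucene or of the original thread.
public class ModifiedUtf8Demo {

    // Bytes of the string in standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes of the string in Java's Modified UTF-8, as written by
    // DataOutputStream.writeUTF, with the 2-byte length prefix removed.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated uppercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";                                        // U+0000
        String clef = new String(Character.toChars(0x1D11E));     // outside the BMP

        System.out.println(hex(standardUtf8(nul)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));   // C0 80
        System.out.println(hex(standardUtf8(clef)));  // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
    }
}
```

The two-byte form of U+0000 is an overlong encoding, and the six-byte form of U+1D11E encodes each UTF-16 surrogate separately. A strict UTF-8 decoder rejects both, which is the sense in which Modified UTF-8 is not legal UTF-8.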