Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? Cheers, Jian On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote: Greets, [crossposted to java-user@lucene.apache.org and [EMAIL

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
On Aug 26, 2005, at 10:14 PM, jian chen wrote: Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? It has been suggested that this discussion should move to the developer's list,

Re: read past EOF

2005-08-27 Thread jian chen
Hi, It seems this problem only happens when the index files get really large. Could it be because Java has trouble handling very large files on Windows machines (I guess there is a maximum file size on Windows)? In Lucene, I think there is a maxDoc kind of parameter that you can use to specify, when

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Ken Krugler
I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8. Unfortunately this is how Sun documents the
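For readers who want to see the difference the thread is arguing about, here is a minimal sketch. It compares standard UTF-8 with the Modified UTF-8 that Java documents for java.io.DataOutputStream.writeUTF. It does not touch Lucene's own storage code, and the class and helper names (ModifiedUtf8Demo, modifiedUtf8, standardUtf8, hex) are made up for illustration. The two encodings diverge in exactly two places: U+0000, and characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
public class ModifiedUtf8Demo {

    // Encode with Java's Modified UTF-8 via DataOutputStream.writeUTF,
    // then drop the 2-byte length prefix that writeUTF prepends.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";                                    // U+0000
        String clef = new String(Character.toChars(0x1D11E)); // U+1D11E, outside the BMP

        // Standard UTF-8 uses one byte for U+0000; Modified UTF-8 uses C0 80.
        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));
        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));

        // Standard UTF-8 uses one 4-byte sequence; Modified UTF-8 encodes
        // each UTF-16 surrogate separately as 3 bytes, 6 bytes in total.
        System.out.println("U+1D11E standard: " + hex(standardUtf8(clef)));
        System.out.println("U+1D11E modified: " + hex(modifiedUtf8(clef)));
    }
}
```

A strict UTF-8 decoder should reject both modified forms: C0 80 is an overlong encoding, and the 3-byte sequences are encoded surrogates. That is the sense in which Modified UTF-8 is not legal UTF-8.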

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Bill Janssen
Thanks for pointing this out, Marvin. I wish Sun (or someone) would document and register this particular character set encoding with IANA, so that it could be used outside of Java. As it stands now, it's essentially a bastard encoding, good for nothing, and one of the warts of Java. Lucene