Hi,
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
Cheers,
Jian
On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote:
Greets,
[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]
On Aug 26, 2005, at 10:14 PM, jian chen wrote:
Hi,
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
It has been suggested that this discussion should move to the developer's list,
Hi,
It seems this problem only happens when the index files get really large.
Could it be because Java has trouble handling very large files on Windows
machines (I guess there is a maximum file size on Windows)?
In Lucene, I think there is a maxDoc kind of parameter that you can use to
specify, when
I've delved into the matter of Lucene and UTF-8 a little further,
and I am discouraged by what I believe I've uncovered.
Lucene should not be advertising that it uses standard UTF-8 -- or
even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.
Unfortunately this is how Sun documents the
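To make the incompatibility concrete, here is a small sketch of how Java's Modified UTF-8 (the format `DataOutputStream.writeUTF` produces) diverges from standard UTF-8. The class and method names are just for illustration; the two divergences shown, overlong encoding of U+0000 and per-surrogate encoding of supplementary characters, are exactly the forms that real UTF-8 forbids.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Standard UTF-8, as defined by the Unicode standard / RFC 3629.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, as produced by DataOutputStream.writeUTF
    // (the two-byte length prefix writeUTF emits is stripped here).
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] withLength = bos.toByteArray();
            return Arrays.copyOfRange(withLength, 2, withLength.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+0000: one byte (0x00) in real UTF-8, but the overlong pair
        // C0 80 in Modified UTF-8 -- overlong forms are illegal UTF-8.
        System.out.println(Arrays.toString(modifiedUtf8("\u0000")));

        // U+1F600: four bytes (F0 9F 98 80) in real UTF-8, but Modified
        // UTF-8 encodes each UTF-16 surrogate separately as three bytes
        // (ED A0 BD ED B8 80) -- surrogate byte sequences are also
        // illegal in real UTF-8.
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(standardUtf8(smiley).length); // 4
        System.out.println(modifiedUtf8(smiley).length); // 6
    }
}
```

A standard-conforming UTF-8 decoder is required to reject both forms, which is why an index written this way cannot safely be read by non-Java code that expects real UTF-8.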
Thanks for pointing this out, Marvin. I wish Sun (or someone) would
document and register this particular character set encoding with
IANA, so that it could be used outside of Java. As it stands now,
it's essentially a bastard encoding, good for nothing, and one of the
warts of Java.