I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered.

Lucene should not be advertising that it uses "standard UTF-8" -- or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.

Unfortunately this is how Sun documents the format they use for serialized strings.

The two distinguishing characteristics of "Modified UTF-8" are the treatment of codepoints above the BMP (which are written as UTF-8-encoded surrogate pairs rather than a single four-byte sequence), and the encoding of the null character as the two-byte sequence 1100 0000 1000 0000 (0xC0 0x80) rather than a single 0000 0000 byte. Both of these became illegal as of Unicode 3.1 (IIRC), because they are not shortest-form, and non-shortest-form UTF-8 presents a security risk.
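
To make the two cases concrete, here is a quick sketch of mine (core Encode module only) that prints the legal shortest-form bytes next to the bytes "Modified UTF-8" uses for a codepoint above the BMP; U+10400 is an arbitrary example:

use strict;
use warnings;
use Encode qw(encode);

# Legal shortest-form UTF-8 for U+10400 (above the BMP): one 4-byte sequence.
my $legal = encode('UTF-8', chr(0x10400));            # F0 90 90 80
# "Modified UTF-8" instead writes the UTF-16 surrogate pair D801 DC00 as two
# 3-byte sequences, and writes U+0000 as the overlong pair C0 80.
my $modified = "\xED\xA0\x81\xED\xB0\x80";

printf "legal:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $legal;
printf "modified: %s\n", join ' ', map { sprintf '%02X', ord } split //, $modified;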

For UTF-8 these were always invalid, but the standard wasn't very clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs encouraged some sloppy implementations.

The documentation should really state that Lucene stores strings in a Java-only adulteration of UTF-8,

Yes, good point. I don't know who's in charge of that page, but it should be fixed.

unsuitable for interchange.

Other than as an internal representation for Java serialization.

Since Perl uses true shortest-form UTF-8 as its native encoding, Plucene would have to jump through two efficiency-killing hoops in order to write files that would not choke Lucene: instead of writing out its true, legal UTF-8 directly, it would be necessary to first translate to UTF-16, then duplicate the Lucene encoding algorithm from OutputStream. In theory.
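
To illustrate the second hoop, here is a rough sketch of mine (not Plucene code) of what duplicating that encoding would amount to, with the intermediate UTF-16 step collapsed into the surrogate arithmetic:

use strict;
use warnings;
use Encode qw(encode);

# Encode one 16-bit value as a three-byte UTF-8-style sequence.
sub three_byte {
    my ($v) = @_;
    return chr(0xE0 |  ($v >> 12))
         . chr(0x80 | (($v >>  6) & 0x3F))
         . chr(0x80 |  ($v        & 0x3F));
}

# Produce the bytes Lucene expects: U+0000 becomes the overlong pair C0 80,
# and codepoints above the BMP become two 3-byte sequences (the UTF-16
# surrogate pair) instead of one legal 4-byte sequence.
sub as_modified_utf8 {
    my ($text) = @_;
    my $bytes = '';
    for my $cp (map { ord } split //, $text) {
        if ($cp == 0) {
            $bytes .= "\xC0\x80";
        }
        elsif ($cp > 0xFFFF) {
            my $offset = $cp - 0x10000;
            $bytes .= three_byte(0xD800 + ($offset >> 10))
                    . three_byte(0xDC00 + ($offset & 0x3FF));
        }
        else {
            $bytes .= encode('UTF-8', chr($cp));
        }
    }
    return $bytes;
}

Walking every string character by character like that, instead of writing Perl's native UTF-8 straight out, is the sort of overhead I mean by "efficiency-killing."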

Actually I don't think it would be all that bad. Since a null in the middle of a string is rare, as is a character outside of the BMP, a quick scan of the text should be sufficient to determine if it can be written as-is.
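
Something like this minimal check (a sketch; the name is made up) before falling back to any slow path:

# Only strings containing a null or a character outside the BMP need the
# special-case re-encoding; everything else is byte-for-byte identical anyway.
sub can_write_as_is {
    my ($text) = @_;
    return $text !~ /[\x00\x{10000}-\x{10FFFF}]/;
}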

The ICU project has C code that can be used to quickly walk a UTF-8 string. I believe those routines would find and report such invalid sequences, if you use the safe (versus faster, unsafe) versions.

Below you will find a simple Perl script which illustrates what happens when Perl encounters malformed UTF-8. Run it (you need Perl 5.8 or higher) and you will see why, even if I thought it was a good idea to emulate the Java hack for encoding "Modified UTF-8", trying to make it work in practice would be a nightmare.

If Plucene were to write legal UTF-8 strings to its index files, Java Lucene would misbehave and possibly blow up any time a string contained either a 4-byte character or a null byte. On the flip side, Perl will spew warnings like crazy and possibly blow up whenever it encounters a Lucene-encoded null or surrogate pair. The potential blowups are due to the fact that Lucene and Plucene will not agree on how many characters a string contains, resulting in overruns or underruns.
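
For example (my illustration, not something from the file format docs): one character above the BMP counts as a single character in Perl but as two on the Java side, so any stored character count will be interpreted differently by the other implementation.

my $supplementary = chr(0x10400);      # one codepoint above the BMP
print length($supplementary), "\n";    # prints 1 -- Perl counts codepoints
# Java's String.length() for the same text reports 2 (a surrogate pair),
# which is where the overruns and underruns come from.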

I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself.

I'd need to look at the code more, but using something other than the Java serialized format would probably incur a performance penalty for the Java implementation. Or at least make it harder to handle the strings using the standard Java serialization support. So I doubt this would be a slam-dunk in the Lucene community.

-- Ken


#----------------------------------------

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";   # \xC0\x80 is the "Modified UTF-8" (overlong) null

# Read the bytes back through an in-memory filehandle opened with the :utf8
# layer; Perl warns about the malformed, non-shortest-form sequence.
open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;

--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
