Hi,

It seems to me that, in theory, the Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that modified UTF-8 is used?
Cheers,
Jian

On 8/26/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> Greets,
>
> [crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]
>
> I've delved into the matter of Lucene and UTF-8 a little further, and
> I am discouraged by what I believe I've uncovered.
>
> Lucene should not be advertising that it uses "standard UTF-8" -- or
> even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8. The
> two distinguishing characteristics of "Modified UTF-8" are the
> treatment of codepoints above the BMP (which are written as surrogate
> pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather
> than 0000 0000. Both of these became illegal as of Unicode 3.1
> (IIRC), because they are not shortest-form, and non-shortest-form
> UTF-8 presents a security risk.
>
> The documentation should really state that Lucene stores strings in a
> Java-only adulteration of UTF-8, unsuitable for interchange. Since
> Perl uses true shortest-form UTF-8 as its native encoding, Plucene
> would have to jump through two efficiency-killing hoops in order to
> write files that would not choke Lucene: instead of writing out its
> true, legal UTF-8 directly, it would be necessary to first translate
> to UTF-16, then duplicate the Lucene encoding algorithm from
> OutputStream. In theory.
>
> Below you will find a simple Perl script which illustrates what
> happens when Perl encounters malformed UTF-8. Run it (you need Perl
> 5.8 or higher) and you will see why, even if I thought it was a good
> idea to emulate the Java hack for encoding "Modified UTF-8", trying
> to make it work in practice would be a nightmare.
>
> If Plucene were to write legal UTF-8 strings to its index files, Java
> Lucene would misbehave and possibly blow up any time a string
> contained either a 4-byte character or a null byte. On the flip
> side, Perl will spew warnings like crazy and possibly blow up
> whenever it encounters a Lucene-encoded null or surrogate pair. The
> potential blowups are due to the fact that Lucene and Plucene will
> not agree on how many characters a string contains, resulting in
> overruns or underruns.
>
> I am hoping that the answer to this will be a fix to the encoding
> mechanism in Lucene so that it really does use legal UTF-8. The most
> efficient way to go about this has not yet presented itself.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> #----------------------------------------
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> # illegal_null.plx -- Perl complains about non-shortest-form null.
>
> my $data = "foo\xC0\x80\n";
>
> open (my $virtual_filehandle, "+<:utf8", \$data);
> print <$virtual_filehandle>;
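For anyone who wants to see the two differences concretely, below is a small standalone Java sketch. It is not taken from Lucene's OutputStream; it uses java.io.DataOutputStream.writeUTF, which applies the same "Modified UTF-8" rules Marvin describes, and the class name and sample string are purely illustrative. It prints the bytes each encoding produces for a string containing a null character and a character above the BMP.

//----------------------------------------

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class ModifiedUtf8Demo {

    public static void main(String[] args) throws IOException {
        // An ASCII letter, a null character, and U+1D11E (MUSICAL SYMBOL
        // G CLEF), which lies above the BMP and is represented in Java
        // as the surrogate pair D834 DD1E.
        String s = "A\u0000\uD834\uDD1E";

        // Legal, shortest-form UTF-8:
        //   41 00 F0 9D 84 9E
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 as written by DataOutputStream.writeUTF
        // (the first two bytes are a length prefix, 00 09):
        //   00 09 41 C0 80 ED A0 B4 ED B4 9E
        // The null becomes the two-byte sequence C0 80, and the
        // supplementary character is encoded as two separate 3-byte
        // sequences (one per surrogate) instead of one 4-byte sequence.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);
        byte[] modified = buf.toByteArray();

        System.out.println("standard UTF-8 : " + hex(standard));
        System.out.println("modified UTF-8 : " + hex(modified));
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }
}

The two hex dumps differ exactly at the points discussed above: the null and the above-BMP character, which is why an index written with one scheme cannot be read safely by a consumer expecting the other.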