I'm not familiar enough with UTF-8 to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
Otis
--- Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> Ken Krugler sent a reply to the user list. In an effort to keep all
> the developers informed, I'm sending my reply to the developer list
> and including his entire original post below my sig.
>
> Ken writes...
>
> > Since a null in the middle of a string is rare, as is a character
> > outside of the BMP, a quick scan of the text should be sufficient
> > to determine if it can be written as-is.
>
> Let's see. I think we are looking at two scans (one index(), one
> regex), or a regex that uses alternation. I strongly suspect two
> scans are faster.
>
> if ( (index($string, "\xC0\x80") != -1)
>     or ($string =~ /[\xF0-\xF7]/)  # only exists in 4-byte UTF-8
> ) {
>     # Process string...
> }
>
> That would tell us whether the string needed to be specially encoded
> for Java's sake on output. Yes, I suspect that's considerably more
> efficient than always converting first to UTF-16 and then to
> "Modified UTF-8".
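
A minimal Java sketch of the "quick scan" Ken describes, assuming the
input is already legal UTF-8 held in a byte array. The class and method
names are hypothetical and nothing here comes from Lucene or Plucene. It
returns false as soon as it sees a 0x00 byte (which Java would want as
C0 80) or a lead byte in F0..F7 (which starts a 4-byte sequence that
Java would want as a surrogate pair).

```java
final class Utf8Scan {
    /**
     * Returns true if legal UTF-8 bytes could be copied unchanged into a
     * "Modified UTF-8" stream: no embedded null and no 4-byte sequence.
     * Assumes the input is well-formed UTF-8; no validation is done.
     */
    static boolean canWriteAsIs(byte[] utf8) {
        for (byte raw : utf8) {
            int b = raw & 0xFF;           // treat the byte as unsigned
            if (b == 0x00) {
                return false;             // null must become C0 80
            }
            if (b >= 0xF0 && b <= 0xF7) {
                return false;             // 4-byte lead byte: outside the BMP
            }
        }
        return true;
    }
}
```

This is a single pass rather than the two scans suggested above; which
is faster in Perl or XS is a separate question that this sketch does not
answer.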
>
> It's also completely unnecessary, as you'll see from the patch below,
> so I'm going to press ahead and make these XS ports of InputStream
> and OutputStream work with legal UTF-8.
>
> It would actually make a lot more sense for Plucene if the integer at
> the head of a string measured *bytes* instead of either Unicode code
> points or Java chars. Then it's just a straight-up copy! No scanning
> OR decoding required.
>
> (Hmm... I wonder if there's a way to make Lucene work quickly if the
> VInt were redefined to be "length in bytes"...)
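
To make the "length in bytes" idea concrete, here is a small Java
sketch of a hypothetical string format: a VInt holding the byte count,
followed by the raw UTF-8 bytes. This is not Lucene's current format,
and the class and method names are made up for illustration. The VInt
layout assumed is seven data bits per byte, low-order group first, with
the high bit set when more bytes follow.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ByteLengthStrings {
    /** Writes a non-negative int as a VInt: 7 bits per byte, low-order first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    /** Hypothetical format: VInt byte count, then the UTF-8 bytes as-is. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);      // straight copy, no scanning
    }

    /** Convenience wrapper returning the encoded bytes. */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, s);
        return out.toByteArray();
    }
}
```

A reader of this format can skip a string by advancing the stream by the
byte count without decoding anything, but the Java side would still have
to decode the bytes to get a char[].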
>
> Speaking of which, the Lucene file formats document also says this...
>
> "Lucene writes strings as a VInt representing the length, followed
> by the character data."
>
> The ambiguity of the word "length" in this sentence left me
> scratching my head. Length in bytes or length in UTF-8 characters?
> Of course the real answer is... neither. :\
>
> It's length in Java chars, or, if you prefer to further Sun's
> disinformation campaign, ;) "Modified UTF-8 characters". If the
> Lucene docs had stated "Java chars" explicitly, I would have had a
> better idea about why the value of that VInt is what it is -- a
> Java-specific quirk at odds with a widely-accepted standard -- and
> about what it was going to take to adhere to the spec.
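
The three candidate meanings of "length" can be compared directly in
Java. The sketch below uses only standard JDK calls; the class and
method names are illustrative. For a string holding "a" followed by
U+1D50A (a character outside the BMP), the three counts all differ.

```java
import java.nio.charset.StandardCharsets;

final class StringLengths {
    /** Length in Java chars (UTF-16 code units): what String.length() reports. */
    static int javaChars(String s) {
        return s.length();
    }

    /** Length in Unicode code points: a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Length in bytes of the standard UTF-8 encoding. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "a" + U+1D50A, javaChars is 3 (one char plus a surrogate pair),
codePoints is 2, and utf8Bytes is 5 (one byte plus a 4-byte sequence).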
>
> > I'd need to look at the code more, but using something other than
> > the Java serialized format would probably incur a performance
> > penalty for the Java implementation. Or at least make it harder to
> > handle the strings using the standard Java serialization support.
>
> I believe that the following true-UTF-8 replacement for the
> readChars function is at least as fast as the current implementation,
> unless your text contains characters outside the BMP. It's
> incomplete, because my Java expertise is quite limited, but it should
> be conceptually sound. The algorithm is adapted from C code supplied
> by the Unicode Consortium.
>
> http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
>
> static final byte[] TRAILING_BYTES_FOR_UTF8 = {
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
> 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> };
>
> public final void readChars(char[] buffer, int start, int length)
>     throws IOException {
>   int end = start + length; // No longer a final int.
>   for (int i = start; i < end; i++) {
>     int b = readByte(); // NOTE: b changed from byte to int.
>     switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
>       case 0:
>         buffer[i] = (char)(b & 0x7F);
>         break;
>       case 1:
>         buffer[i] = (char)(((b & 0x1F) << 6)
>           | (readByte() & 0x3F));
>         break;
>       case 2:
>         buffer[i] = (char)(((b & 0x0F) << 12)
>           | ((readByte() & 0x3F) << 6)
>           | (readByte() & 0x3F));
>         break;
>       case 3:
>         // A 4-byte lead byte is 11110xxx: only 3 payload bits.
>         int utf32 = (((b & 0x07) << 18)
>           | ((readByte() & 0x3F) << 12)
>           | ((readByte() & 0x3F) << 6)
>           | (readByte() & 0x3F));
>         // These are just for illustration.
>         int firstSurrogate = (utf32 >> 10) + 0xD7C0;
>         int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
>         // If the current buffer isn't long enough,
>         // create a new buffer with length one greater than
>         // the current buffer, copy the entire contents,
>         // enter the first surrogate, increment both i and end,
>         // enter the second surrogate.
>         // This is extremely inefficient, but also
>         // likely to be invoked extremely rarely.
>         // Problem: In Perl I'd do this with references, and
>         // in C I'd do it with pointers. Not sure how to
>         // make it work in Java.
>         break;
>     }
>   }
> }
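
One way to sidestep the buffer-growing problem in Java, sketched under
some assumptions: the input is a byte array of well-formed UTF-8, the
count is in Unicode code points, and the result is returned as a String
instead of being written into a caller-supplied char[]. The class and
method names are hypothetical and this is not a drop-in replacement for
Lucene's readChars. StringBuilder.appendCodePoint writes the surrogate
pair and grows the buffer itself.

```java
final class Utf8Reader {
    /**
     * Decodes codePointCount code points of well-formed UTF-8 starting at
     * offset. No validation is attempted; malformed input gives garbage
     * or an ArrayIndexOutOfBoundsException.
     */
    static String readCodePoints(byte[] in, int offset, int codePointCount) {
        StringBuilder sb = new StringBuilder(codePointCount);
        int pos = offset;
        for (int n = 0; n < codePointCount; n++) {
            int b = in[pos++] & 0xFF;
            int cp;
            if (b < 0x80) {                       // 1 byte: 0xxxxxxx
                cp = b;
            } else if (b < 0xE0) {                // 2 bytes: 110xxxxx
                cp = ((b & 0x1F) << 6)
                   | (in[pos++] & 0x3F);
            } else if (b < 0xF0) {                // 3 bytes: 1110xxxx
                cp = ((b & 0x0F) << 12)
                   | ((in[pos++] & 0x3F) << 6)
                   | (in[pos++] & 0x3F);
            } else {                              // 4 bytes: 11110xxx
                cp = ((b & 0x07) << 18)
                   | ((in[pos++] & 0x3F) << 12)
                   | ((in[pos++] & 0x3F) << 6)
                   | (in[pos++] & 0x3F);
            }
            sb.appendCodePoint(cp);               // emits a surrogate pair if cp > 0xFFFF
        }
        return sb.toString();
    }
}
```

Whether this matches the speed of the char[] version is untested here.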
>
>
> Initial benchmarking experiments appear to indicate negligible impact
> on performance.
>
> > So I doubt
> > this would be a slam-dunk in the Lucene community.
>
> I appreciate your willingness to at least weigh the matter, and I
> understand the potential reluctance. Hopefully the comparable
> performance of the standards-compliant code above will render the
> issue moot, and the next release of Lucene will use legal UTF-8.
>
> Best,
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> ================================================================
>
> From: Ken Krugler <[EMAIL PROTECTED]>
> Date: August 27, 2005 2:11:34 PM PDT
> To: [email protected]
> Subject: Re: Lucene does NOT use UTF-8.
> Reply-To: [email protected]
>
>
> > I've delved into the matter of Lucene and UTF-8 a little further,
> > and I am discouraged by what I believe I've uncovered.
> >
> > Lucene should not be advertising that it uses "standard UTF-8" --
> > or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
> >
>
> Unfortunately this is how Sun documents the format they use for
> serialized strings.
>
>
> > The two distinguishing characteristics of "Modified UTF-8" are the
> > treatment of codepoints above the BMP (which are written as
> > surrogate pairs), and the encoding of null bytes as 1100 0000 1000
> > 0000 rather than 0000 0000. Both of these became illegal as of
> > Unicode 3.1 (IIRC), because they are not shortest-form and
> > non-shortest-form UTF-8 presents a security risk.
> >
>
> For UTF-8 these were always invalid, but the standard wasn't very
> clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs
> encouraged some sloppy implementations.
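
Both differences are easy to observe from Java. The sketch below uses
the JDK's DataOutputStream.writeUTF as a stand-in for "Modified UTF-8"
(it is not Lucene's own OutputStream code) and compares its output with
String.getBytes for standard UTF-8. The class and method names are
illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    /** Bytes produced by writeUTF, with its 2-byte length prefix removed. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            byte[] all = bytes.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // Only reachable if the encoded form exceeds 65535 bytes.
            throw new IllegalStateException(e);
        }
    }

    /** Bytes of the standard, shortest-form UTF-8 encoding. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For a string holding only U+0000, modifiedUtf8 gives the two bytes
C0 80 while standardUtf8 gives the single byte 00. For U+1D50A,
modifiedUtf8 gives six bytes (two 3-byte encoded surrogates) while
standardUtf8 gives four.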
>
>
> > The documentation should really state that Lucene stores strings in
> > a Java-only adulteration of UTF-8,
> >
>
> Yes, good point. I don't know who's in charge of that page, but it
> should be fixed.
>
>
> > unsuitable for interchange.
> >
>
> Other than as an internal representation for Java serialization.
>
>
> > Since Perl uses true shortest-form UTF-8 as its native encoding,
> > Plucene would have to jump through two efficiency-killing hoops in
> > order to write files that would not choke Lucene: instead of
> > writing out its true, legal UTF-8 directly, it would be necessary
> > to first translate to UTF-16, then duplicate the Lucene encoding
> > algorithm from OutputStream. In theory.
> >
>
> Actually I don't think it would be all that bad. Since a null in the
> middle of a string is rare, as is a character outside of the BMP, a
> quick scan of the text should be sufficient to determine if it can be
> written as-is.
>
> The ICU project has C code that can be used to quickly walk a string.
> I believe these would find/report such invalid code points, if you
> use the safe (versus faster unsafe) versions.
>
>
> > Below you will find a simple Perl script which illustrates what
> > happens when Perl encounters malformed UTF-8. Run it (you need
> > Perl 5.8 or higher) and you will see why even if I thought it was a
> > good idea to emulate the Java hack for encoding "Modified UTF-8",
> > trying to make it work in practice would be a nightmare.
> >
> > If Plucene were to write legal UTF-8 strings to its index files,
> > Java Lucene would misbehave and possibly blow up any time a string
> > contained either a 4-byte character or a null byte. On the flip
> > side, Perl will spew warnings like crazy and possibly blow up
> > whenever it encounters a Lucene-encoded null or surrogate pair.
> > The potential blowups are due to the fact that Lucene and Plucene
> > will not agree on how many characters a string contains, resulting
> > in overruns or underruns.
> >
> > I am hoping that the answer to this will be a fix to the encoding
> > mechanism in Lucene so that it really does use legal UTF-8. The
> > most efficient way to go about this has not yet presented itself.
> >
>
> I'd need to look at the code more, but using something other than the
> Java serialized format would probably incur a performance penalty for
> the Java implementation. Or at least make it harder to handle the
> strings using the standard Java serialization support. So I doubt
> this would be a slam-dunk in the Lucene community.
>
> -- Ken
>
>
>
> > #----------------------------------------
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > # illegal_null.plx -- Perl complains about non-shortest-form null.
> >
> > my $data = "foo\xC0\x80\n";
> >
> > open (my $virtual_filehandle, "+<:utf8", \$data);
> > print <$virtual_filehandle>;
> >
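
For comparison, a rough Java counterpart to the Perl script above: the
JDK's standard UTF-8 decoder, when asked to report errors instead of
substituting a replacement character, also rejects the non-shortest-form
null. This is a sketch with a hypothetical class name, and it says
nothing about how Lucene itself reads strings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    /**
     * Returns true if the bytes decode as legal UTF-8. A fresh decoder
     * reports malformed input by throwing, which is treated as "not legal".
     */
    static boolean isLegalUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Passing the bytes of "foo" followed by C0 80 and a newline returns
false, while plain "foo\n" returns true.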
>
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
>