Re: Lucene does NOT use UTF-8.

Otis Gospodnetic Sun, 28 Aug 2005 20:22:01 -0700

I'm not familiar enough with UTF-8 to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue.... anyone raising a hand?


Otis

--- Marvin Humphrey <[EMAIL PROTECTED]> wrote:

> Ken Krugler sent a reply to the user list.  In an effort to keep all
> the developers informed, I'm sending my reply to the developer list
> and including his entire original post below my sig.
> 
> Ken writes...
> 
>  > Since a null in the
>  > middle of a string is rare, as is a character outside of the BMP, a
>  > quick scan of the text should be sufficient to determine if it can be
>  > written as-is.
> 
> Let's see.  I think we are looking at two scans, (one index(), one
> regex), or a regex that uses alternation.  I strongly suspect two
> scans are faster.
> 
>      if (  (index($string, "\xC0\x80") != -1)
>         or ($string =~ /[\xF0-\xF7]/ ) # only exists in 4-byte UTF-8
>      ) {
>          # Process string...
>      }
> 
> That would tell us whether the string needed to be specially encoded
> for Java's sake on output.  Yes, I suspect that's considerably more
> efficient than always converting first to UTF-16 and then to
> "Modified UTF-8".
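
For readers who think in Java rather than Perl, here is a minimal sketch of the "quick scan" idea. It assumes the input is already well-formed standard UTF-8, and the class and method names (ModifiedUtf8Scan, sameInModifiedUtf8) are made up for illustration; nothing like this exists in Lucene or Plucene.

```java
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Scan {
    /**
     * Returns true if the given bytes (assumed to be well-formed standard
     * UTF-8) would be byte-for-byte identical in Java's "Modified UTF-8".
     */
    static boolean sameInModifiedUtf8(byte[] utf8) {
        for (byte raw : utf8) {
            int b = raw & 0xFF;
            // 0x00 is U+0000, which Modified UTF-8 writes as C0 80.
            // 0xF0 and above is the lead byte of a 4-byte sequence (a code
            // point outside the BMP), which Modified UTF-8 writes as two
            // 3-byte encoded surrogates instead.
            if (b == 0x00 || b >= 0xF0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bmpOnly = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] withNull = "foo\0bar".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameInModifiedUtf8(bmpOnly));   // true
        System.out.println(sameInModifiedUtf8(withNull));  // false
    }
}
```

This is a single pass over the bytes with no decoding, which is the property the two-scan Perl version is after.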
> 
> It's also completely unnecessary, as you'll see from the patch below,
> so I'm going to press ahead and make these XS ports of InputStream and
> OutputStream work with legal UTF-8.
> 
> It would actually make a lot more sense for Plucene if the integer at
> the head of a string measured *bytes* instead of either Unicode code
> points or Java chars.  Then it's just a straight up copy!  No scanning
> OR decoding required.
> 
> (Hmm... I wonder if there's a way to make Lucene work quickly if the
> VInt were redefined to be "length in bytes"...)
> 
> Speaking of which, the Lucene file formats document also says this...
> 
>      "Lucene writes strings as a VInt representing the length,
>      followed by the character data."
> 
> The ambiguity of the word "length" in this sentence left me scratching
> my head.  Length in bytes or length in UTF-8 characters?  Of course
> the real answer is... neither. :\
> 
> It's length in Java chars, or, if you prefer to further Sun's
> disinformation campaign, ;) "Modified UTF-8 characters".  If the Lucene
> docs had stated "Java chars" explicitly, I would have had a better idea
> about why the value of that VInt is what it is -- a Java-specific
> quirk at odds with a widely-accepted standard -- and about what
> it was going to take to adhere to the spec.
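
To make the three competing meanings of "length" concrete, here is a small sketch using only standard JDK calls. The class name StringLengths and its helper methods are invented for this example.

```java
import java.nio.charset.StandardCharsets;

final class StringLengths {
    // Length in Java chars (UTF-16 code units): what the VInt holds today.
    static int javaChars(String s) {
        return s.length();
    }

    // Length in Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes of the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), outside the BMP.
        String s = "a\uD834\uDD1E";
        System.out.println(javaChars(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(utf8Bytes(s));   // 5
    }
}
```

The three numbers agree for pure ASCII and start to diverge as soon as the text leaves it; Java chars and code points only diverge outside the BMP.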
> 
>  > I'd need to look at the code more, but using something other than the
>  > Java serialized format would probably incur a performance penalty for
>  > the Java implementation. Or at least make it harder to handle the
>  > strings using the standard Java serialization support.
> 
> I believe that the following true-UTF-8 replacement for the
> readChars function is at least as fast as the current implementation,
> unless your text contains characters outside the BMP.  It's incomplete,
> because my Java expertise is quite limited, but it should be
> conceptually sound.  The algo is adapted from C code supplied by the
> Unicode consortium.
> 
> http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
> 
>    static final byte[] TRAILING_BYTES_FOR_UTF8 = {
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
> 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
>    };
> 
>    public final void readChars(char[] buffer, int start, int length)
>         throws IOException {
>      int end = start + length; // No longer a final int.
>      for (int i = start; i < end; i++) {
>        int b = readByte();   // NOTE: b changed from byte to int.
>        switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
>          case 0:
>            buffer[i] = (char)(b & 0x7F);
>            break;
>          case 1:
>            buffer[i] = (char)(((b & 0x1F) << 6)
>              | (readByte() & 0x3F));
>            //buffer[i] = (char)(((b & 0x1F) << 6)
>            //  | (readByte() & 0x3F));
>            break;
>          case 2:
>            buffer[i] = (char)(((b & 0x0F) << 12)
>              | ((readByte() & 0x3F) << 6)
>              |  (readByte() & 0x3F));
>            break;
>          case 3:
>            int utf32 = (((b & 0x0F) << 18)
>              | ((readByte() & 0x3F) << 12)
>              | ((readByte() & 0x3F) << 6)
>              |  (readByte() & 0x3F));
>            // These are just for illustration.
>            int firstSurrogate  = (utf32 >> 10) + 0xD7C0;
>            int secondSurrogate = (utf32 & 0x03FF) + 0xDC00;
>            // If the current buffer isn't long enough,
>            // create a new buffer with length one greater than
>            // the current buffer, copy the entire contents,
>            // enter the first surrogate, increment both i and end,
>            // enter the second surrogate.
>            // This is extremely inefficient, but also
>            // likely to be invoked extremely rarely.
>            // Problem: In Perl I'd do this with references, and
>            // in C I'd do it with pointers.  Not sure how to
>            // make it work in Java.
>            break;
>        }
>      }
>    }
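
One possible way around the buffer-growth problem in the 4-byte case is to let a StringBuilder do the growing. The following is only a sketch under stated assumptions: it reads from a byte array rather than Lucene's InputStream, it treats the length as a count of Unicode code points, and it assumes well-formed input (no validation of continuation bytes). Utf8Reader and this readChars signature are hypothetical, not Lucene API.

```java
final class Utf8Reader {
    /**
     * Decodes `count` code points of well-formed standard UTF-8,
     * starting at in[pos], into a String.
     */
    static String readChars(byte[] in, int pos, int count) {
        StringBuilder out = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            int b = in[pos++] & 0xFF;
            int cp;
            if (b < 0x80) {                 // 1 byte: 0xxxxxxx
                cp = b;
            } else if (b < 0xE0) {          // 2 bytes: 110xxxxx 10xxxxxx
                cp = ((b & 0x1F) << 6)
                   | (in[pos++] & 0x3F);
            } else if (b < 0xF0) {          // 3 bytes: 1110xxxx + 2 trailing
                cp = ((b & 0x0F) << 12)
                   | ((in[pos++] & 0x3F) << 6)
                   |  (in[pos++] & 0x3F);
            } else {                        // 4 bytes: 11110xxx + 3 trailing
                cp = ((b & 0x07) << 18)
                   | ((in[pos++] & 0x3F) << 12)
                   | ((in[pos++] & 0x3F) << 6)
                   |  (in[pos++] & 0x3F);
            }
            // Appends one char for BMP code points and a surrogate pair
            // for supplementary ones, growing the buffer as needed.
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x61, (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E};
        String s = readChars(data, 0, 2);
        System.out.println(s.length());  // 3: 'a' plus one surrogate pair
    }
}
```

Whether the extra StringBuilder copy is acceptable on Lucene's hot path is a separate question; a char[] that is sized for the worst case up front would avoid it.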
> 
> 
> Initial benchmarking experiments appear to indicate negligible impact
> on performance.
> 
>  > So I doubt
>  > this would be a slam-dunk in the Lucene community.
> 
> I appreciate your willingness to at least weigh the matter, and I
> understand the potential reluctance.  Hopefully the comparable
> performance of the standards-compliant code above will render the
> issue moot, and the next release of Lucene will use legal UTF-8.
> 
> Best,
> 
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> 
> ================================================================
> 
> From: Ken Krugler <[EMAIL PROTECTED]>
> Date: August 27, 2005 2:11:34 PM PDT
> To: [email protected]
> Subject: Re: Lucene does NOT use UTF-8.
> Reply-To: [email protected]
> 
> 
> > I've delved into the matter of Lucene and UTF-8 a little further,  
> > and I am discouraged by what I believe I've uncovered.
> >
> > Lucene should not be advertising that it uses "standard UTF-8" --  
> > or even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
> >
> 
> Unfortunately this is how Sun documents the format they use for  
> serialized strings.
> 
> 
> > The two distinguishing characteristics of "Modified UTF-8" are the
> > treatment of codepoints above the BMP (which are written as
> > surrogate pairs), and the encoding of null bytes as 1100 0000 1000
> > 0000 rather than 0000 0000.  Both of these became illegal as of
> > Unicode 3.1 (IIRC), because they are not shortest-form and
> > non-shortest-form UTF-8 presents a security risk.
> >
> 
> For UTF-8 these were always invalid, but the standard wasn't very  
> clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs  
> encouraged some sloppy implementations.
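
The two differences can be seen directly from the JDK. This sketch uses DataOutputStream.writeUTF, which is documented to emit modified UTF-8 after a 2-byte length prefix, as a stand-in for Lucene's own writer; that substitution is an assumption, and the class ModifiedVsStandard and its helpers are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedVsStandard {
    // Bytes produced by writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes of the standard (shortest-form) UTF-8 encoding.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";
        String clef = "\uD834\uDD1E";  // U+1D11E, outside the BMP
        System.out.println(hex(modifiedUtf8(nul)));   // C0 80
        System.out.println(hex(standardUtf8(nul)));   // 00
        System.out.println(hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
        System.out.println(hex(standardUtf8(clef)));  // F0 9D 84 9E
    }
}
```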
> 
> 
> > The documentation should really state that Lucene stores strings in
> > a Java-only adulteration of UTF-8,
> >
> 
> Yes, good point. I don't know who's in charge of that page, but it  
> should be fixed.
> 
> 
> > unsuitable for interchange.
> >
> 
> Other than as an internal representation for Java serialization.
> 
> 
> > Since Perl uses true shortest-form UTF-8 as its native encoding,
> > Plucene would have to jump through two efficiency-killing hoops in
> > order to write files that would not choke Lucene: instead of
> > writing out its true, legal UTF-8 directly, it would be necessary
> > to first translate to UTF-16, then duplicate the Lucene encoding
> > algorithm from OutputStream.  In theory.
> >
> 
> Actually I don't think it would be all that bad. Since a null in the
> middle of a string is rare, as is a character outside of the BMP, a
> quick scan of the text should be sufficient to determine if it can be
> written as-is.
> 
> The ICU project has C code that can be used to quickly walk a string.
> I believe these would find/report such invalid code points, if you
> use the safe (versus faster unsafe) versions.
> 
> 
> > Below you will find a simple Perl script which illustrates what
> > happens when Perl encounters malformed UTF-8.  Run it (you need
> > Perl 5.8 or higher) and you will see why even if I thought it was a
> > good idea to emulate the Java hack for encoding "Modified UTF-8",
> > trying to make it work in practice would be a nightmare.
> >
> > If Plucene were to write legal UTF-8 strings to its index files,
> > Java Lucene would misbehave and possibly blow up any time a string
> > contained either a 4-byte character or a null byte.  On the flip
> > side, Perl will spew warnings like crazy and possibly blow up
> > whenever it encounters a Lucene-encoded null or surrogate pair.
> > The potential blowups are due to the fact that Lucene and Plucene
> > will not agree on how many characters a string contains, resulting
> > in overruns or underruns.
> >
> > I am hoping that the answer to this will be a fix to the encoding  
> > mechanism in Lucene so that it really does use legal UTF-8.  The  
> > most efficient way to go about this has not yet presented itself.
> >
> 
> I'd need to look at the code more, but using something other than the
> Java serialized format would probably incur a performance penalty for
> the Java implementation. Or at least make it harder to handle the
> strings using the standard Java serialization support. So I doubt
> this would be a slam-dunk in the Lucene community.
> 
> -- Ken
> 
> 
> 
> > #----------------------------------------
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > # illegal_null.plx -- Perl complains about non-shortest-form null.
> >
> > my $data = "foo\xC0\x80\n";
> >
> > open (my $virtual_filehandle, "+<:utf8", \$data);
> > print <$virtual_filehandle>;
> >
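
A rough Java counterpart to the Perl script, for comparison: the JDK's own UTF-8 decoder, when asked to report errors instead of substituting U+FFFD, also rejects the non-shortest-form null. This is a sketch; StrictUtf8 and isLegalUtf8 are names invented here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** Returns true if the JDK's strict UTF-8 decoder accepts the bytes. */
    static boolean isLegalUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] legal = {'f', 'o', 'o', '\n'};
        // Same bytes as the Perl example: "foo" + C0 80 + newline.
        byte[] modifiedNull = {'f', 'o', 'o', (byte) 0xC0, (byte) 0x80, '\n'};
        System.out.println(isLegalUtf8(legal));         // true
        System.out.println(isLegalUtf8(modifiedNull));  // false
    }
}
```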
> 
> -- 
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]