Lucene doesn't currently output totally valid UTF-8
Patches to make it do so are here:
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01987.html

Should this be tackled pre or post 2.0?

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

On 3/30/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> I was reading up on conversion of characters to UTF-8 and I now understand
> why it is writing out UTF-8 (to be able to support most of the worlds
> languages with minimal space?). But after reading up on the algorithms for
> conversion as given below, does the writeChars method not support the
> U+10000→U+10FFFF conversions or am I misreading the code?
>
>
>
>
> Character Range
>
> Bit Encoding
>
>
> U+0000→U+007F
>
> 0xxxxxxx
>
>
> U+0080→U+07FF
>
> 110xxxxx 10xxxxxx
>
>
> U+0800→U+FFFF
>
> 1110xxxx 10xxxxxx 10xxxxxx
>
>
> U+10000→U+10FFFF
>
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>
>
>
>   public void writeChars(String s, int start, int length)
>
>     throws IOException {
>
>
>
>     final int end = start + length;
>
>     for (int i = start; i < end; i++) {
>
>
>
>       final int code = (int)s.charAt(i);
>
>
>
>       if (code >= 0x01 && code <= 0x7F)
>
>         writeByte((byte)code);
>
>       else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
>
>         writeByte((byte)(0xC0 | (code >> 6)));
>
>         writeByte((byte)(0x80 | (code & 0x3F)));
>
>       }
>
>       else {
>
>         writeByte((byte)(0xE0 | (code >>> 12)));
>
>         writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
>
>         writeByte((byte)(0x80 | (code & 0x3F)));
>
>       }
>
>     }
>
>   }

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to