On 3/30/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Is this modified UTF-8 such as is found in DataInput interface?

Yes, I believe so.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

> -----Original Message-----
> From: Yonik Seeley [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 30, 2006 11:56 AM
> To: [email protected]
> Subject: Re: writeChars method in IndexOutput
>
> Lucene doesn't currently output totally valid UTF-8.
> Patches to make it do so are here:
> http://www.mail-archive.com/[email protected]/msg01987.html
>
> Should this be tackled pre or post 2.0?
>
> -Yonik
> http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
>
> On 3/30/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> > I was reading up on conversion of characters to UTF-8 and I now understand
> > why it is writing out UTF-8 (to be able to support most of the world's
> > languages with minimal space?). But after reading up on the algorithms for
> > conversion as given below, does the writeChars method not support the
> > U+10000→U+10FFFF conversions or am I misreading the code?
> >
> > Character Range       Bit Encoding
> > U+0000→U+007F         0xxxxxxx
> > U+0080→U+07FF         110xxxxx 10xxxxxx
> > U+0800→U+FFFF         1110xxxx 10xxxxxx 10xxxxxx
> > U+10000→U+10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> >
> > public void writeChars(String s, int start, int length)
> >     throws IOException {
> >
> >   final int end = start + length;
> >   for (int i = start; i < end; i++) {
> >
> >     final int code = (int)s.charAt(i);
> >
> >     if (code >= 0x01 && code <= 0x7F)
> >       writeByte((byte)code);
> >     else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
> >       writeByte((byte)(0xC0 | (code >> 6)));
> >       writeByte((byte)(0x80 | (code & 0x3F)));
> >     }
> >     else {
> >       writeByte((byte)(0xE0 | (code >>> 12)));
> >       writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
> >       writeByte((byte)(0x80 | (code & 0x3F)));
> >     }
> >   }
> > }
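
For anyone following along, here is a minimal sketch of what the quoted loop does to a character above U+FFFF. It is not Lucene code: the class name Utf8Sketch and the method encodePerChar are made up for illustration, and a ByteArrayOutputStream stands in for the writeByte calls. The loop sees the two UTF-16 surrogates as separate chars, so each one falls into the three-byte branch and the character comes out as six bytes rather than the four-byte form in the last table row.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Lucene.
class Utf8Sketch {

    // Same branching as the quoted writeChars loop, writing into a buffer
    // instead of an IndexOutput. Each 16-bit char is encoded on its own.
    public static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            final int code = s.charAt(i);
            if (code >= 0x01 && code <= 0x7F) {
                out.write(code);                          // one byte
            } else if ((code >= 0x80 && code <= 0x7FF) || code == 0) {
                out.write(0xC0 | (code >> 6));            // two bytes (also used for U+0000)
                out.write(0x80 | (code & 0x3F));
            } else {
                out.write(0xE0 | (code >>> 12));          // three bytes, including lone surrogates
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so Java stores it as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(encodePerChar(clef).length);                    // 6
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```

So the answer to the original question appears to be that supplementary characters are written, but as a pair of three-byte surrogate encodings, which is the same convention the DataInput documentation describes as modified UTF-8.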
