On 8/2/2012 8:46 AM, Dmitry Olshansky wrote:
>> Keep a 6 character buffer in your consumer. If you read a char with the
>> high bit set, start filling that buffer and then decode it.
>
> 4 bytes is enough.
>
> Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF
> specifically so that it could be encoded in 4 bytes of UTF-8.

Yeah, but I thought 6 bytes would future-proof it! (Inevitably, the Unicode committee will add more.)
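
For illustration, here is a rough sketch of the buffering consumer discussed above, written in Python and sized for the 4-byte case. The names Utf8Consumer, put and decode_all are made up for this example and are not from any library or from the code under discussion. It assumes the input arrives one byte at a time and that the final validation can be left to Python's strict UTF-8 codec.

```python
class Utf8Consumer:
    """Hypothetical consumer that decodes UTF-8 one byte at a time."""

    def __init__(self):
        self.buf = bytearray()  # never grows past 4 bytes
        self.need = 0           # total length of the sequence being collected

    def put(self, byte):
        """Feed one byte; return a character, or None if more bytes are needed."""
        if self.need == 0:
            if byte < 0x80:
                return chr(byte)          # plain ASCII, no buffering
            if 0xC2 <= byte <= 0xDF:
                self.need = 2
            elif 0xE0 <= byte <= 0xEF:
                self.need = 3
            elif 0xF0 <= byte <= 0xF4:
                self.need = 4             # longest sequence for 0..0x10FFFF
            else:
                raise ValueError("invalid UTF-8 lead byte: 0x%02X" % byte)
            self.buf.append(byte)
            return None
        if byte & 0xC0 != 0x80:
            self.buf.clear()
            self.need = 0
            raise ValueError("expected continuation byte, got 0x%02X" % byte)
        self.buf.append(byte)
        if len(self.buf) < self.need:
            return None
        data = bytes(self.buf)
        self.buf.clear()
        self.need = 0
        # The strict codec rejects overlong forms, surrogates and values
        # above 0x10FFFF by raising UnicodeDecodeError (a ValueError).
        return data.decode("utf-8")


def decode_all(data):
    """Run a whole bytes object through the consumer and return the text."""
    consumer = Utf8Consumer()
    out = []
    for b in data:
        ch = consumer.put(b)
        if ch is not None:
            out.append(ch)
    if consumer.need:
        raise ValueError("input ended in the middle of a sequence")
    return "".join(out)
```

Feeding it the encoding of 0x10FFFF fills the buffer to exactly 4 bytes, and lead bytes that would start a 5 or 6 byte sequence (0xF8 and up) are rejected outright. A 6-byte variant would only need a larger lead-byte table, but no such sequences are valid UTF-8 today.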


> P.S. Looks like I'm too late for this party ;)



It affects you strongly, too, so I'm glad to see you join in.
