Re: UTF-8 ill-formed question

Asmus Freytag Tue, 11 Dec 2012 12:39:58 -0800

On 12/11/2012 11:50 AM, [email protected] wrote:

From: James Lin <James_Lin_at_symantec.com>

Hi
Does anyone know why ill-formed sequences occur in UTF-8? Besides the fact that
they don't follow the pattern of UTF-8 byte sequences, I'm just wondering how or why.
If I have a code point: U+4E8C or "二"
In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C". Where does this "BA"
come from?


thanks
-James

Each of the UTF encodings represents the binary data in a different way. So we
need to break the scalar value, U+4E8C, into its binary representation before
we proceed.

4E8C -> 0100 1110 1000 1100

Then, we need to look up the rules for UTF-8. They state that code points
from U+0800 to U+FFFF are encoded with three bytes, in the form 1110xxxx
10xxxxxx 10xxxxxx. So plugging in our data, we get

         4      E    8     C
       0100   1110 10-00 1100
       ||||   ||||//   \\||||
+ 1110xxxx 10xxxxxx 10xxxxxx

= 11100100 10111010 10001100
or  E  4     B  A     8  C
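
The same bit-shuffling can be sketched in a few lines of Python. This is only
an illustration of the three-byte case above, not a full encoder; the function
name encode_three_byte is made up for this example.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode a scalar value in U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper for illustration; it handles only the
    1110xxxx 10xxxxxx 10xxxxxx form discussed above.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("three-byte form covers U+0800..U+FFFF only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")

    first = 0b11100000 | (code_point >> 12)           # 1110xxxx: top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    third = 0b10000000 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([first, second, third])


encoded = encode_three_byte(0x4E8C)
print(encoded.hex(" ").upper())  # E4 BA 8C
```

The middle byte takes the low two bits of "E" and the high two bits of "8"
(1110 10 -> 111010), and adding the 10 prefix gives 10111010, which is the
"BA" in question.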

-Van Anderson

Nice!

A./

PS: I fixed a missing "\"
