Followup to:  <[EMAIL PROTECTED]>
By author:    Jeu George <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
>
> 
> Hello,
> 
>       The utf-8 encoding scheme goes like this
>   for
>   1-byte characters 0xxxxxxx 
>   2-byte characters 110xxxxx 10xxxxxx
>   3-byte characters 1110xxxx 10xxxxxx 10xxxxxx
> 

4-byte characters       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5-byte characters       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6-byte characters       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

> Here the bits marked x are used for the actual encoding of characters.
> I would like to know the way these bits are used to encode a particular
> character. Also, is this dependent on the operating system? Can you
> provide a program which checks this, or any link that provides
> information about this?

The bits are encoded big-endian (MSB first), i.e. in the order you
would read them when written in the above form.
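
For concreteness, here is a minimal encoder sketch (the name
utf8_encode is arbitrary) that packs the bits of a scalar value MSB
first into the forms shown above.  It covers the 1- to 4-byte forms;
the 5- and 6-byte forms follow the same pattern.

#include <stdint.h>
#include <stddef.h>

/* Sketch: encode one scalar value cp into buf, MSB first, always
 * using the shortest possible form.  Returns the number of bytes
 * written (1..4), or 0 for values that would need the 5- or 6-byte
 * forms, which are omitted here. */
static size_t utf8_encode(uint32_t cp, unsigned char buf[4])
{
	if (cp < 0x80) {		/* 0xxxxxxx */
		buf[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {	/* 110xxxxx 10xxxxxx */
		buf[0] = 0xC0 | (cp >> 6);
		buf[1] = 0x80 | (cp & 0x3F);
		return 2;
	} else if (cp < 0x10000) {	/* 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = 0xE0 | (cp >> 12);
		buf[1] = 0x80 | ((cp >> 6) & 0x3F);
		buf[2] = 0x80 | (cp & 0x3F);
		return 3;
	} else if (cp < 0x200000) {	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		buf[0] = 0xF0 | (cp >> 18);
		buf[1] = 0x80 | ((cp >> 12) & 0x3F);
		buf[2] = 0x80 | ((cp >> 6) & 0x3F);
		buf[3] = 0x80 | (cp & 0x3F);
		return 4;
	}
	return 0;
}

For example, utf8_encode(0x4B, buf) yields the single byte 01001011,
as in the example further down.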

It is also very important to realize that ONLY THE SHORTEST POSSIBLE
SEQUENCE IS LEGAL.  Any misguided attempt to "be liberal in what you
accept" without the addition of an explicit canonicalization step
leads to the kind of security holes that Microsoft web-related
applications have been so full of, because MS operating systems have
way too many ways to say the same thing.

Thus, the character K <U+004B> is encoded as:

        01001011

The alternate spelling

        11000001 10001011

... is not the character K <U+004B> but an INVALID SEQUENCE.  One
possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
CHARACTER on encountering illegal sequences.
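
A minimal decoder sketch along those lines (the name utf8_decode_one
is arbitrary; additional checks such as rejecting surrogates are
omitted), rejecting non-shortest forms and returning U+FFFD for
illegal sequences:

#include <stdint.h>
#include <stddef.h>

/* Sketch: decode one UTF-8 sequence of up to 4 bytes from p (len
 * bytes available), returning the scalar value and the number of
 * bytes consumed.  Overlong forms such as 11000001 10001011, and any
 * other illegal sequence, yield U+FFFD REPLACEMENT CHARACTER. */
static uint32_t utf8_decode_one(const unsigned char *p, size_t len,
				size_t *consumed)
{
	static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t need, i;
	uint32_t cp;

	if (len == 0) {
		*consumed = 0;
		return 0xFFFD;
	}
	*consumed = 1;			/* on error, resynchronize past one byte */

	if (p[0] < 0x80)	{ cp = p[0];        need = 1; }
	else if (p[0] < 0xC0)	{ return 0xFFFD; }  /* stray continuation byte */
	else if (p[0] < 0xE0)	{ cp = p[0] & 0x1F; need = 2; }
	else if (p[0] < 0xF0)	{ cp = p[0] & 0x0F; need = 3; }
	else if (p[0] < 0xF8)	{ cp = p[0] & 0x07; need = 4; }
	else			{ return 0xFFFD; }  /* 5/6-byte leads not handled here */

	if (need > len)
		return 0xFFFD;		/* truncated sequence */

	for (i = 1; i < need; i++) {
		if ((p[i] & 0xC0) != 0x80)
			return 0xFFFD;	/* continuation byte expected */
		cp = (cp << 6) | (p[i] & 0x3F);
	}

	if (cp < min_for_len[need])
		return 0xFFFD;		/* overlong: not the shortest form */

	*consumed = need;
	return cp;
}

Fed the two bytes 11000001 10001011 above, this returns U+FFFD rather
than K, because the decoded value 0x4B is below the 0x80 minimum for a
2-byte sequence.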

        -hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
