On Thu, 13 Jul 2000, Jeu George wrote:
> > >   2-byte characters 110xxxxx 10xxxxxx
> > The bits are encoded bigendian (MSB first), i.e. the way you would
> > read the bits when written in the above form.
> 
> For a two-byte-long character, where will the MSB be:
> in the 4th bit of the first byte from the left, or in the 3rd bit of
> the second byte from the left?

As he said:  MSB first.  The 16-bit character 00000pqrstuvwxyz is encoded
as 110pqrst 10uvwxyz. 
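
To make the bit arrangement concrete, here is a minimal C sketch of the
two-byte case only; the function name and the example code point are my
own, for illustration, not from any particular library:

    #include <stdio.h>

    /* Encode a code point in the range U+0080..U+07FF as two UTF-8
     * bytes: the six low-order bits go in the second byte, the next
     * five bits in the first byte -- MSB first, as written above. */
    static void encode2(unsigned int c, unsigned char out[2])
    {
        out[0] = 0xC0 | (c >> 6);        /* 110pqrst */
        out[1] = 0x80 | (c & 0x3F);      /* 10uvwxyz */
    }

    int main(void)
    {
        unsigned char buf[2];
        encode2(0x00E9, buf);            /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
        printf("%02X %02X\n", buf[0], buf[1]);   /* prints: C3 A9 */
        return 0;
    }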

> Will this be OS dependent, i.e. the arrangement of bits?

No, the encoding fully defines the bit arrangement.

> How is the null character going to be encoded?  00000000?
> But you have mentioned something else below.

Properly, ASCII NUL (U+0000), the 16-bit character 0000000000000000,
should be encoded as just 00000000, since that is the shortest encoding
for it.  This is one case where there has been some violation of the rules
internally within some systems, although with luck it will remain an
internal oddity and won't become visible. 
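
For illustration only, a rough C sketch of the shortest-form check a
decoder might apply; only the two-byte case is shown, and the helper
name is made up:

    #include <stdio.h>

    /* Decode a two-byte sequence 110xxxxx 10xxxxxx, rejecting overlong
     * forms: any value below 0x80 must be encoded as a single byte, so
     * the two-byte form of NUL (C0 80) is not legal UTF-8. */
    static int decode2(unsigned char b0, unsigned char b1, unsigned int *out)
    {
        if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
            return -1;                   /* malformed sequence */
        *out = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
        if (*out < 0x80)
            return -1;                   /* overlong encoding: reject */
        return 0;
    }

    int main(void)
    {
        unsigned int c;
        printf("C0 80 -> %s\n",
               decode2(0xC0, 0x80, &c) == 0 ? "accepted" : "rejected");
        return 0;
    }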

> I thought that all the ASCII characters were retained in UTF-8.
> That is the major reason why 1-byte-long characters will always have
> the MSB as 0.  Am I right?

Correct.

> > One possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> > CHARACTER on encountering illegal sequences.
> 
> What is this U+FFFD SUBSTITUTION about exactly?  Could you elaborate
> on this also?

The character U+FFFD, whose Unicode 3.0 name is REPLACEMENT CHARACTER, is
"used to replace an incoming character whose value is unknown or
unrepresentable in Unicode".  That is, it marks the place where something
untranslatable used to be.
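
As a rough sketch only (this is not any particular decoder, and only
one- and two-byte forms are handled), substituting U+FFFD for illegal
sequences might look like this in C:

    #include <stdio.h>

    #define REPLACEMENT 0xFFFD           /* U+FFFD REPLACEMENT CHARACTER */

    /* Walk a byte string, reporting one code point per ASCII byte or
     * valid two-byte sequence; anything else comes out as U+FFFD,
     * marking the place where the untranslatable bytes were. */
    static void show(const unsigned char *s, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            unsigned int c;
            if (s[i] < 0x80) {
                c = s[i];
                i += 1;
            } else if (i + 1 < n && (s[i] & 0xE0) == 0xC0
                                 && (s[i+1] & 0xC0) == 0x80) {
                c = ((s[i] & 0x1F) << 6) | (s[i+1] & 0x3F);
                if (c < 0x80)
                    c = REPLACEMENT;     /* overlong: treat as illegal */
                i += 2;
            } else {
                c = REPLACEMENT;         /* illegal or truncated sequence */
                i += 1;
            }
            printf("U+%04X\n", c);
        }
    }

    int main(void)
    {
        const unsigned char bytes[] = { 'A', 0xFF, 0xC3, 0xA9 };
        show(bytes, sizeof bytes);       /* U+0041, U+FFFD, U+00E9 */
        return 0;
    }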

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
