Re: [sqlite] Encodings question

Michael Roth Mon, 19 Apr 2004 05:24:27 -0700

Bertrand Mansion wrote:

As far as I understand, UTF-8 will read 8859-1 without problem but
ISO-8859-1 will not be able to read UTF-8, unless everything in the UTF8
string uses only 8859-1 codes.

You're wrong, I think.

UTF-8 is a variable-length encoding of the character codes of the unicode code page. Iso8859-1 is the definition of a code page in which each character is encoded in exactly one byte.

Unicode itself is a code page with many more characters than iso8859-1.

The unicode, iso8859-1 and ASCII code pages share the following properties:

a.) character codes 0 up to 127 in unicode are equal to the ASCII codes.
b.) character codes 128 up to 255 in unicode are equal to the iso8859-1 codes.
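Properties (a) and (b) can be checked with a few lines of Python. This is only a sketch: Python is my choice for illustration and the function name is made up, but the codec names "ascii" and "iso-8859-1" are the standard ones.

```python
def code_pages_agree():
    """Check properties (a) and (b) by decoding single bytes."""
    # (a) bytes 0..127 decoded as ASCII give unicode codes 0..127
    for value in range(128):
        if ord(bytes([value]).decode("ascii")) != value:
            return False
    # (b) bytes 0..255 decoded as iso8859-1 give unicode codes 0..255
    for value in range(256):
        if ord(bytes([value]).decode("iso-8859-1")) != value:
            return False
    return True

print(code_pages_agree())  # True
```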

Please note: A 'character code' is _not_ a byte! It's the number of the position of that character in a code page. The code page in iso8859-1 is only 8 bits wide and has 256 entries. The unicode code page is 21 bits wide, and not all positions are assigned to characters.

In iso8859-1 all 256 character codes are encoded using just one byte. The value of the byte is the character position in the code page.

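In UTF-8 character codes 0 up to 127 are encoded in one byte, and character codes 128 up to 2047 (which covers every iso8859-1 code above 127) are encoded in _two_ bytes! Higher character codes need three or more bytes.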

That means the byte values of encoded character codes 0 up to 127 are equal in UTF-8 and iso8859-1, but character codes 128 up to 255 take two bytes in UTF-8 and one byte in iso8859-1.
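To make the difference concrete, here is a small Python sketch that encodes one character below 128 and one above it in both encodings. The helper name is hypothetical; the two sample characters are just examples I picked.

```python
def encoded_bytes(char):
    """Return (iso8859-1 bytes, UTF-8 bytes) for a single character."""
    return char.encode("iso-8859-1"), char.encode("utf-8")

# "A" is character code 65: one identical byte in both encodings.
print(encoded_bytes("A"))        # (b'A', b'A')

# "\u00e9" (e with acute accent) is character code 233:
# one byte in iso8859-1, two bytes in UTF-8.
print(encoded_bytes("\u00e9"))   # (b'\xe9', b'\xc3\xa9')
```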

In iso8859-1 the byte value is always the character code. In UTF-8 this is only true for character codes 0 up to 127.
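This is also why neither encoding can simply "read" the other once codes above 127 are involved. A rough Python sketch, with a made-up function name and character code 233 as an arbitrary example:

```python
def misread_examples():
    """Decode the bytes of character code 233 with the wrong encoding."""
    e_acute = "\u00e9"  # character code 233

    # UTF-8 bytes read as iso8859-1: each of the two bytes is taken
    # as a character of its own, so we get two wrong characters.
    utf8_as_latin1 = e_acute.encode("utf-8").decode("iso-8859-1")

    # iso8859-1 byte read as UTF-8: a lone 0xE9 is not a valid
    # UTF-8 sequence, so decoding fails.
    try:
        e_acute.encode("iso-8859-1").decode("utf-8")
        latin1_as_utf8_ok = True
    except UnicodeDecodeError:
        latin1_as_utf8_ok = False

    return utf8_as_latin1, latin1_as_utf8_ok

text, ok = misread_examples()
print(len(text), ok)  # 2 False
```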

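However, UTF-8 (the unicode code page encoding) as originally designed can encode character codes up to 31 bits wide, using up to 6 bytes. The current definition (RFC 3629) stops at 4 bytes, which is enough for the 21 bit wide unicode code page.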
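The number of bytes per character code can be seen with a short Python sketch. Note the assumption: Python's built-in UTF-8 codec only accepts character codes up to 0x10FFFF, so it never produces more than 4 bytes and the longer forms cannot be shown this way. The function name is hypothetical.

```python
def utf8_length(code):
    """Number of bytes UTF-8 needs for the given character code."""
    return len(chr(code).encode("utf-8"))

# First and last character code of each length class.
for code in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(hex(code), utf8_length(code))
# lengths printed: 1, 1, 2, 2, 3, 3, 4, 4
```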

