Re: [sqlite] Encodings question

Bertrand Mansion Mon, 19 Apr 2004 06:03:40 -0700

<[EMAIL PROTECTED]> wrote :

> Bertrand Mansion wrote:
> 
>> As far as I understand, UTF-8 will read 8859-1 without problem but
>> ISO-8859-1 will not be able to read UTF-8, unless everything in the UTF8
>> string uses only 8859-1 codes.
> 
> You're wrong, I think.
> 
> UTF-8 is a variable length encoding of character codes of the unicode
> code page. Iso8859-1 is a definition of a code page, each character is
> encoded in exactly one byte.
> 
> Unicode itself is a code page with many more characters than iso8859-1.
> 
> Unicode, iso8859-1 and ASCII code pages share the following properties:
> 
> a.) character codes 0 up to 127 in unicode are equal to ASCII codes.
> b.) character codes 128 up to 255 in unicode are equal to the iso8859-1
> codes.
> 
> Please note: A 'character code' is _not_ a byte! It's the number of the
> position of that character in a code page. The code page in iso8859-1 is
> only 8 bits wide and has 256 entries. The unicode code page is 21 bits
> wide, and not all positions are assigned to characters.
> 
> In iso8859-1 all 256 character codes are encoded using simply one byte.
> The value of the byte is the character position in the code page.
> 
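> In UTF-8 character codes 0 up to 127 are encoded in one byte and
> character codes above 127 are encoded in _two or more_ bytes: two bytes
> for codes 128 up to 2047 (which covers all of iso8859-1), three or four
> bytes for higher codes.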
> 
> That means the byte values of encoded character codes 0 up to 127 are
> equal in UTF-8 and iso8859-1, but character codes 128 up to 255 take two
> bytes in UTF-8 and one byte in iso8859-1.
> 
> In iso8859-1 the byte value is always the character code. In UTF-8 this
> is only true for character codes 0 up to 127.
> 
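> However, the original UTF-8 design (the unicode code page encoding) can
> encode character codes up to 31 bits wide, using up to 6 bytes. The
> current definition (RFC 3629) stops at 4 bytes, which is enough for the
> 21-bit unicode range.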


Thanks for the clear explanations :)
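
The byte counts described above can be checked with a minimal Python
sketch. The sample characters and the helper name encoded_lengths are
only illustrative; this uses Python's own codecs and says nothing about
how sqlite stores text.

```python
# Compare how many bytes UTF-8 and ISO-8859-1 need for a few characters.
# Samples: "A" (U+0041), e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return (code point, UTF-8 byte count, ISO-8859-1 byte count or None)."""
    utf8_len = len(ch.encode("utf-8"))
    try:
        latin1_len = len(ch.encode("iso-8859-1"))
    except UnicodeEncodeError:
        # The character is outside the 256-entry iso8859-1 code page.
        latin1_len = None
    return ord(ch), utf8_len, latin1_len

for ch in samples:
    print(encoded_lengths(ch))
```

Codes up to 127 take one byte in both encodings; e-acute takes two bytes
in UTF-8 but one in ISO-8859-1; the last two samples have no ISO-8859-1
encoding at all.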

Does this mean that as long as I only use ASCII in a UTF-8 compiled sqlite
library, the db will also be usable with an ISO-8859-1 compiled version of
the library, but if I use, for instance, accented characters, it won't be
compatible anymore?
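
At the byte level, the question can be sketched in Python as follows.
The helper name same_bytes is hypothetical, and this only compares the
raw encodings; it does not show what either build of the sqlite library
actually does with the data.

```python
def same_bytes(text):
    """True if UTF-8 and ISO-8859-1 produce identical bytes for `text`."""
    return text.encode("utf-8") == text.encode("iso-8859-1")

# Pure ASCII text is byte-for-byte identical in both encodings.
print(same_bytes("hello"))

# An accented character (e-acute, U+00E9) is one byte in ISO-8859-1
# but two bytes in UTF-8, so the stored bytes differ.
print(same_bytes("caf\u00e9"))
```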

I am asking because I once created an 8859-1 db and it could be read and
modified in the UTF-8 version of the library. I haven't tested the other way
though. What will happen if I update fields with accented characters in
my application compiled with UTF-8 and then try to open the db with, let's
say, the PHP sqlite extension? I'll try to see what happens.
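
As a rough illustration of what crossing the two encodings does to the
bytes themselves, here is a small Python sketch. It assumes strict
decoding as done by Python's codecs; how the sqlite library or the PHP
extension reacts is not shown here.

```python
# e-acute written by a UTF-8 application: two bytes, 0xC3 0xA9.
utf8_bytes = "\u00e9".encode("utf-8")

# The same bytes read as ISO-8859-1: every byte becomes one character,
# so the reader sees two characters (U+00C3 and U+00A9) instead of one.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread), [hex(ord(c)) for c in misread])

# The other direction: e-acute written as ISO-8859-1 is the single byte 0xE9,
# which is not a complete UTF-8 sequence, so a strict decoder rejects it.
latin1_bytes = "\u00e9".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```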

On the php site, they warn users:

<quote>
The default PHP distribution builds libsqlite in ISO-8859-1 encoding mode.
However, this is a misnomer; rather than handling ISO-8859-1, it operates
according to your current locale settings for string comparisons and sort
ordering. So, rather than ISO-8859-1, you should think of it as being
'8-bit' instead.
</quote>

I am not sure what this means.

<quote>
It is not recommended that you use PHP in a web-server configuration with a
version of the SQLite library compiled with UTF-8 support, since libsqlite
will abort the process if it detects a problem with the UTF-8 encoding.
</quote>

So it looks like using UTF-8 is not recommended. But how, then, can I deal
with characters like the euro symbol? I guess that I am stuck?
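
For reference, the euro sign (U+20AC) has no position in the ISO-8859-1
code page at all. A minimal Python sketch, with a hypothetical helper
try_encode, shows how a few encodings treat it; ISO-8859-15 is included
only as an example of an 8-bit code page that does contain the euro, not
as something the PHP extension is known to support.

```python
EURO = "\u20ac"  # euro sign

def try_encode(ch, encoding):
    """Return the encoded bytes, or None if the encoding has no such character."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None

for name in ("iso-8859-1", "iso-8859-15", "utf-8"):
    print(name, try_encode(EURO, name))
```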

Bertrand Mansion
Mamasam


