Re: [PHP-DEV] Where are we ACTUALLY on Unicode?

Andrey Hristov Tue, 16 Mar 2010 11:26:01 -0700

dreamcat four wrote:

On Tue, Mar 16, 2010 at 11:48 AM, dreamcat four <[email protected]> wrote:

On Tue, Mar 16, 2010 at 8:30 AM, Lester Caine <[email protected]> wrote:

'3' is not a very processor friendly number, so working with 4 even though
wasteful on memory, does make perfect sense. How long is it since we had a
640k limit on working memory? SERVERS should have a good amount of memory
for caching information anyway. So is UTF-16 the right approach for
processing wide strings? It needs special code to handle everything wider
than 16 bits, but at what gain really? If all core functionality is handled
as 32-bit characters, is there that much of an overhead compared with the
additional processing needed to get around characters of dissimilar sizes
in UTF-16?
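
To make the size trade-off concrete, here is a minimal sketch in Python
(not PHP, and not anything from the PHP source tree). The helper names
encoded_sizes and utf16_code_units are made up for illustration. It
compares the byte cost of the same text in each encoding form and shows
that a character outside the 16-bit range takes two UTF-16 code units.

```python
# Hypothetical helpers for illustration only; nothing here is PHP API.

def encoded_sizes(text):
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


ascii_text = "hello"        # every character fits in 7 bits
g_clef = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, wider than 16 bits

print(encoded_sizes(ascii_text))   # UTF-32 costs 4 bytes per character
print(encoded_sizes(g_clef))       # all three forms need 4 bytes here
print(len(g_clef), utf16_code_units(g_clef))  # 1 character, 2 code units
```

With a fixed 32-bit form the character count is just bytes / 4. With
UTF-16 the count depends on how many surrogate pairs the string holds,
which is the "special code" being discussed.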

Just to reinforce some of Lester's points above here.


4 bytes per character is never slower than 2 bytes per character... it's
faster if anything. Bear in mind that 4 bytes has been the de facto size
for all modern CPU registers / 32-bit microarchitectures since...
like... forever. Give a C compiler 4 bytes of data... it'll say: thank
you very much, and more of the same please! It keeps 'em happy ;)

Sure, UTF-16 can make sense. But only if your external representations
are also in UTF-16. So what are the default Unicode settings for MySQL,
PostgreSQL, etc.? Are they always set to UTF-8, or to UTF-16?


To answer my own question, I have done some further research.

It seems that MySQL and PostgreSQL recommend / default to Latin1
(an 8-bit superset of ASCII) and 'C' (7-bit ASCII) respectively. So that
is to say neither sets itself to any Unicode encoding by default.

In the case of PostgreSQL, the ASCII default is often overridden to UTF-8
by the distro / OS / package managers, based on the locale environment
variables (e.g. $LANG). So then it's UTF-8.

In the case of MySQL, it may be left as latin1. But most competent web
developers decide to set it to UTF-8. Again, it's not generally
believed that very many people (by comparison) actively choose
UTF-16. The most common encoding issue people run into is that their
web application has sent their database UTF-8 encoded data, but their
(usually MySQL) database still has the factory default encoding,
Latin-1. People who discover this almost always solve
the problem by converting their databases to UTF-8.
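
The mismatch described above can be reproduced without a database. This
is a rough sketch in Python, assuming its latin-1 codec is a fair
stand-in for a Latin-1 database setting (MySQL's own latin1 differs in
a few code points). The function names misread_as_latin1 and repair are
made up for illustration.

```python
# Hypothetical helpers; they only imitate the encoding mismatch and do
# not talk to MySQL or any other database.

def misread_as_latin1(text):
    """Encode text as UTF-8, then read the bytes back as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the misreading: recover the bytes, decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "caf\u00e9"                 # "café"
garbled = misread_as_latin1(original)  # the two UTF-8 bytes of "é" are
                                       # read as two separate characters
print(garbled)
print(repair(garbled) == original)
```

The repair step only works while the bytes are still intact, which is
why converting the database itself to UTF-8 is the usual fix.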


MySQL doesn't support UTF-16 in any GA release. UCS-2 can be used though.

As for text files on disk, if they are Unicode, they are most commonly
UTF-8 too. So then, why use UTF-16 as the internal Unicode representation
in PHP? It doesn't really make a lot of sense for most regular people
who want to use PHP for their web application. Unless they don't
really care how slow it's gonna be converting everything, constantly...



Andrey

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
