Re: [PHP-DEV] PHP Unicode extension in PHP6

Stefan Walk Sun, 20 May 2007 14:44:40 -0700

Disclaimer: I don't know much about the way Unicode is implemented in
PHP, I have only used it a bit, but I believe I can clear some things
up here.


On 20/05/07, Tomas Kuliavas <[EMAIL PROTECTED]> wrote:

0xC4 and 0x85 are hex codes for latin small letter a with ogonek in utf-8. ą

<?php
var_dump("ą" == "\xC4\x85");
echo "ą\n";
echo "\xC4\x85";
?>

If the script is written in utf-8, I expect bool(true) on the var_dump() line.


You expect the wrong thing. "\xC4\x85" is a Unicode string containing two
code points, U+00C4 and U+0085 (LATIN CAPITAL LETTER A WITH
DIAERESIS and NEXT LINE (NEL)), while "ą" is a Unicode string
containing one code point (U+0105, LATIN SMALL LETTER A WITH OGONEK);
see http://www.unicode.org/charts/PDF/U0080.pdf and
http://www.unicode.org/charts/PDF/U0100.pdf. These are different strings, so
the comparison should return false. If you want to type bytes, use the
"b" prefix: b"\xC4\x85", and compare that with the binary version of
your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you
bool(true) if your encoding is utf-8.
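
The code point versus byte distinction above can be tried out in a
language that has the same split between text strings and byte strings.
The sketch below uses Python 3 purely as an analogy (an assumption on my
part, it is not PHP 6 and says nothing about how PHP implements this):
"\xC4\x85" in a text string is read as two code points, while the "b"
form holds the two raw bytes. The variable names are made up for
illustration.

```python
import unicodedata

# Text string: \xC4 and \x85 are read as code points U+00C4 and U+0085.
escaped = "\xC4\x85"
literal = "\u0105"  # the single code point for "ą"

# Control characters such as U+0085 have no character name, hence the default.
code_point_names = [unicodedata.name(ch, "<control>") for ch in escaped]
text_equal = escaped == literal

# Byte strings: compare the UTF-8 encoding of the literal with the raw bytes.
bytes_equal = literal.encode("utf-8") == b"\xC4\x85"

print(len(escaped), code_point_names)
print(unicodedata.name(literal))
print(text_equal, bytes_equal)  # False True
```

The text comparison fails because the strings hold different code
points; the byte comparison succeeds because C4 85 is the UTF-8 encoding
of U+0105.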

It is bool(false) when unicode.semantics is turned on. Internal
SquirrelMail character set decoding functions write mapping tables in
hexadecimal or octal. In some cases they evaluate only the byte value and
not the whole symbol. Multibyte character set decoding can use recode, iconv
and mbstring, but most single-byte decoding is written with plain string
functions and stores hex-to-HTML mapping tables in associative arrays.

<?php
// example uses utf-8. similar code is used in iso-8859-2 -
// iso-8859-16 decoding. utf-8 decoding does not need mapping tables
// and is written in pcre.
$s1 = "ą";
$s2 = "\xC4\x85";
echo str_replace($s2,'&#261;',$s1);
?>

Expected result: &#261;
Got: ą


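Same thing: $s2 holds the two code points U+00C4 and U+0085, neither of
which occurs in $s1, so str_replace() finds nothing to replace. If you
want binary replacements, use binary strings, not Unicode strings.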
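
As a rough illustration of doing the replacement on bytes rather than
on text, here is the same idea sketched in Python 3 (again only an
analogy, assuming the subject string is encoded as utf-8; it is not the
PHP 6 API, and the variable names are hypothetical):

```python
s1 = "\u0105"  # "ą" as a text string

# Text-level replace: the needle is the two code points U+00C4 and U+0085,
# which do not occur in s1, so the string comes back unchanged.
text_result = s1.replace("\xC4\x85", "&#261;")

# Byte-level replace: encode the subject to UTF-8 first, then replace the
# byte pair C4 85 with the HTML entity, also given as bytes.
byte_result = s1.encode("utf-8").replace(b"\xC4\x85", b"&#261;")

print(text_result == s1)  # True, nothing was replaced
print(byte_result)        # b'&#261;'
```

Both the needle and the subject have to be byte strings for the mapping
table approach to match.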

Regards,
Stefan
