Disclaimer: I don't know much about how Unicode is implemented in PHP; I have only used it a bit, but I believe I can clear some things up here.
On 20/05/07, Tomas Kuliavas <[EMAIL PROTECTED]> wrote:
> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in utf-8.
> ą
>
> <?php
> var_dump("ą" == "\xC4\x85");
> echo "ą\n";
> echo "\xC4\x85";
> ?>
>
> If script is written in utf-8, I expect bool(true) on var_dump() line.
You expect wrong things. "\xC4\x85" is a Unicode string containing two code points, U+00C4 and U+0085 (LATIN CAPITAL LETTER A WITH DIAERESIS and NEXT LINE (NEL)), while "ą" is a Unicode string containing one code point (U+0105, LATIN SMALL LETTER A WITH OGONEK); see http://www.unicode.org/charts/PDF/U0080.pdf and http://www.unicode.org/charts/PDF/U0100.pdf. They are different strings, so the comparison should return false.

If you want to type bytes, use the "b" prefix: b"\xC4\x85", and compare that with the binary version of your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you bool(true) if your script is saved as utf-8.
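To make the difference between code points and bytes concrete, here is a rough sketch. It does not use unicode.semantics or the "b" prefix at all; it assumes the mbstring extension is available and uses plain byte strings, building the two readings explicitly from UTF-16BE code units. The variable names are made up for illustration.

```php
<?php
// Sketch only: assumes mbstring is installed; unicode.semantics is not involved.

// The two bytes from the original script.
$bytes = "\xC4\x85";

// Reading 1: one code point, U+0105, encoded as UTF-8.
$ogonek = mb_convert_encoding("\x01\x05", 'UTF-8', 'UTF-16BE');
var_dump($ogonek === $bytes);                  // bool(true): U+0105 is C4 85 in UTF-8
var_dump(mb_strlen($bytes, 'UTF-8'));          // int(1)

// Reading 2: two code points, U+00C4 and U+0085, encoded as UTF-8.
$twoCodePoints = mb_convert_encoding("\x00\xC4\x00\x85", 'UTF-8', 'UTF-16BE');
echo bin2hex($twoCodePoints), "\n";            // c384c285
var_dump(mb_strlen($twoCodePoints, 'UTF-8'));  // int(2)

// The two readings are different strings.
var_dump($twoCodePoints === $ogonek);          // bool(false)
```

The first reading is what the bytes mean to a utf-8 decoder; the second is what the escapes mean when each \x escape is taken as a code point, which is why the two literals do not compare equal.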
> It is bool(false), when unicode.semantics are turned on.
>
> Internal SquirrelMail character set decoding functions write mapping tables in hexadecimals or octals. In some cases they evaluate only byte value and not whole symbol. Multibyte character set decoding can use recode, iconv and mbstring, but most of single byte decoding is written in plain string functions and stores hex to html mapping tables in associative arrays.
>
> <?php
> // example uses utf-8. similar code is used in iso-8859-2 -
> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
> // and is written in pcre.
> $s1 = "ą";
> $s2 = "\xC4\x85";
> echo str_replace($s2,'ą',$s1);
> ?>
>
> Expected result: ą
> Got: ą
Same thing. If you want binary replacements, use binary strings, not Unicode strings.

Regards,
Stefan