>> --- test.php ---
>> <?php
>> $string1 = "ą";
>> $string2 = "\xC4\x85";
>> var_dump($string1 == $string2);
>
> How you expect one-character string to be equal to two-character string?

In PHP4/5, \xC4 and \x85 are not characters. They are bytes.

>> ą is in utf-8 (latin small letter a with ogonek, latin extended-a
>> range). It contains two bytes with 0xC4 0x85 values.
>
> It contains two bytes in the filesystem. It however contains one
> character in PHP. In unicode mode, bytes and characters are different
> things. You could make $string2 as binary and then convert it from utf-8
> to unicode, but without explicitly saying otherwise that string contains
> two characters - U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) and
> U+0085 (control character, no name). It doesn't mean escape sequences
> stop working, it means characters and bytes are no more the same. That's
> the price one has to pay for doing unicode.

I can't pay such a price. You are reducing the available coding options
and want me to rely on your functions, when existing code was doing fine
without unicode support. Your functions are not documented
(http://www.php.net/unicode) and don't provide a way to see the
difference between a 7bit and an 8bit string. Theoretically I might call
unicode_encode() with an ascii target, but doing charset conversions just
to detect 8bit is a hack and not a solution.

If I take a look at ext/unicode/unicode.c, I see more PHP_FUNCTION
functions than the manual lists. I don't know the PHP6 release schedule.
If PHP6 is approaching the RC stage, maybe the docs can be updated to
inform about these functions.

PHP provides an API for PHP script developers. The strongest part of an
API is good documentation. I shouldn't have to dig through C sources in
order to learn about available interpreter features. If you write code
now and document it later, either you won't document it at all, or it
will take some time and lots of bug reports to sync the sources with the
manual.

I think I'll be able to port scripts to PHP6 with unicode.semantics=on.
Currently the only thing I am not sure about is POP3 and IMAP streams
with data encoded in different character sets and MIME Q encoding.

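For illustration, the kind of 7bit/8bit check that existing PHP4/5 code
can do without any charset conversion might look like the sketch below.
It assumes byte-string semantics, where every element of a string is one
byte. The name has_8bit_bytes() is a hypothetical helper, not a PHP
built-in, and its behaviour with unicode.semantics=on is not covered
here.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether a byte string
// contains any byte with the high bit set, i.e. is not pure 7-bit ASCII.
// Assumption: PHP4/5 byte-string semantics, one string element == one byte.
function has_8bit_bytes($bytes)
{
    // No /u modifier, so PCRE matches the class against single bytes.
    return preg_match('/[\x80-\xFF]/', $bytes) === 1;
}

var_dump(has_8bit_bytes("plain ascii")); // bool(false)
var_dump(has_8bit_bytes("\xC4\x85"));    // bool(true), the UTF-8 bytes of "ą"
var_dump(strlen("\xC4\x85"));            // int(2), two bytes in byte mode
```

A check like this is what a mail client needs before deciding whether a
header has to be MIME encoded at all.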
--
Tomas

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php