>> --- test.php ---
>> <?php
>> $string1 = "ą";
>> $string2 = "\xC4\x85";
>> var_dump($string1 == $string2);
>
> How do you expect a one-character string to be equal to a two-character
> string?

In PHP4/5 \xC4 and \x85 are not characters. They are bytes.
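
To make that concrete, here is a minimal sketch of what these escapes
produce under PHP 4/5 byte semantics. It uses only strlen(), bin2hex() and
chr(); the variable names are just for illustration.

```php
<?php
// Under PHP 4/5 byte semantics each \xNN escape yields exactly one byte.
$string2 = "\xC4\x85";

$length = strlen($string2);   // strlen() counts bytes: 2
$hex    = bin2hex($string2);  // raw bytes as hex: "c485"

// The same two bytes can be built with chr().
$same = ($string2 === chr(0xC4) . chr(0x85));

var_dump($length, $hex, $same);
```

Whether this compares equal to a literal "ą" then depends only on the
encoding the source file was saved in.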

>> ą is in utf-8 (latin small letter a with ogonek, latin extended-a
>> range). It contains two bytes with 0xC4 0x85 values.
>
> It contains two bytes in the filesystem. However, it contains one
> character in PHP. In unicode mode, bytes and characters are different
> things. You could declare $string2 as binary and then convert it from
> utf-8 to unicode, but unless you say so explicitly, that string contains
> two characters - U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) and
> U+0085 (control character, no name). It doesn't mean escape sequences
> stop working; it means characters and bytes are no longer the same.
> That's the price one has to pay for doing unicode.
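
The two readings of "\xC4\x85" described above can be emulated with byte
strings and the iconv extension. This is only a sketch of the distinction,
not the PHP6 API: it assumes iconv is available and uses ISO-8859-1 as a
stand-in for "each escape is a code point", since that charset maps every
byte to the code point of the same value.

```php
<?php
$raw = "\xC4\x85";

// Reading 1: the two bytes are one UTF-8 encoded character (U+0105).
$utf8_chars = iconv_strlen($raw, 'UTF-8');

// Reading 2: each escape is a code point (U+00C4, then U+0085).
// Converting from ISO-8859-1 gives the UTF-8 form of those two code points.
$as_codepoints   = iconv('ISO-8859-1', 'UTF-8', $raw);
$codepoint_chars = iconv_strlen($as_codepoints, 'UTF-8');
$codepoint_hex   = bin2hex($as_codepoints);

var_dump($utf8_chars);       // int(1)
var_dump($codepoint_chars);  // int(2)
var_dump($codepoint_hex);    // string(8) "c384c285"
```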

I can't pay such a price. You are reducing the available coding options and
want me to rely on your functions, when existing code was doing fine
without unicode support. Those functions are not documented
(http://www.php.net/unicode) and don't provide a way to tell a 7-bit
string from an 8-bit one. Theoretically I might call unicode_encode() with
an ASCII target, but doing charset conversions just to detect 8-bit data is
a hack, not a solution.
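
For comparison, this is the kind of byte-level check that is enough under
PHP 4/5 semantics. It is a minimal sketch: has_8bit() is a hypothetical
helper name, not an existing PHP function, and I have not verified how the
pattern behaves with unicode.semantics=on.

```php
<?php
// Hypothetical helper: true if the string contains any byte >= 0x80.
// Assumes PHP 4/5 byte strings; the pattern is matched byte by byte
// because the /u modifier is not used.
function has_8bit($s)
{
    return preg_match('/[\x80-\xFF]/', $s) === 1;
}

var_dump(has_8bit("plain ascii"));  // bool(false)
var_dump(has_8bit("\xC4\x85"));     // bool(true)
```

No charset conversion is involved; the check only looks at byte values.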

If I look at ext/unicode/unicode.c, I see more PHP_FUNCTION definitions
than are documented. I don't know the PHP6 release schedule. If PHP6 is
approaching the RC stage, maybe the docs can be updated to cover these
functions. PHP provides an API for script developers, and the strongest
part of an API is good documentation. I shouldn't have to dig through C
sources to learn about available interpreter features. If you write code
now and document it later, either you won't document it at all, or it will
take time and lots of bug reports to sync the sources with the manual.

I think I'll be able to port scripts to PHP6 with unicode.semantics=on.
The only parts I am currently unsure about are POP3 and IMAP streams
carrying data in different character sets, and MIME Q encoding.
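
As an illustration of the data in question, here is a sketch of a
Q-encoded header word in ISO-8859-2 being decoded with byte strings. It
assumes the iconv extension is loaded; the header value is made up for the
example.

```php
<?php
// "=B1" is the Q-encoded byte 0xB1, which is "ą" in ISO-8859-2.
$header = '=?ISO-8859-2?Q?=B1?=';

// Decode the encoded word and convert the result to UTF-8.
$decoded = iconv_mime_decode($header, 0, 'UTF-8');

// The result should be the two UTF-8 bytes of "ą".
var_dump(bin2hex($decoded));
```

The charset comes from the message itself, so a script has to handle
whatever bytes and charset labels the server sends.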

-- 
Tomas

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
