On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals
[EMAIL PROTECTED] (Rasmus Lerdorf) wrote:
>> On the other hand utf8_decode() also expects the input to be UTF-8
>> encoded, but it replaces incomplete sequences with the character "?".
>
>utf8_decode() doesn't replace invalid chars with a ?
>
>eg.
>
>php -r '$a="abcd".chr(0xE0);echo
>iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
>
>0000000 61 62 63 64 0a 61 62 63 64 03

Yes it does, just not in your case, where the incomplete sequence is
the very last byte of the string :-)
However, with another byte after it:
$ php -r '$a="abcd".chr(0xE0)."e"; echo
iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);'|hd
00000000 61 62 63 64 0a 61 62 63 64 3f |abcd.abcd?|

$ php -r 'print utf8_decode("Fløde på æblegrød");'
Fl?p?blegr?

>It would be a horrendously bad idea to replace invalid chars with some
>other valid char. Way worse than returning nothing. Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

I don't disagree with you, and I have thought about the same issue
(although I suppose any sanitizing should happen after a given
conversion; charsets other than UTF-8 might be able to encode 7-bit
characters such as "?" as well, but this is beside the point).

I'm not fond of the "?" feature either, but it is present in
utf8_decode() and in non-PHP applications that do UTF-8 conversion.
My guess is still that some standard recommends this replacement as a
possible fallback for error handling.
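
As an aside, if the goal is to never end up with a substituted "?",
one option is to validate the input before converting it. The
following is only a sketch under a few assumptions:
strict_utf8_to_latin1() is a made-up name and not part of PHP, and
both ext/mbstring and ext/iconv are assumed to be available.

```php
<?php
// Hypothetical helper (not part of PHP): convert UTF-8 to ISO-8859-1,
// but return false for malformed input instead of emitting "?".
function strict_utf8_to_latin1($s)
{
    // Reject anything that is not well-formed UTF-8 (needs ext/mbstring).
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false;
    }
    // The input is well-formed at this point. iconv() can still fail for
    // characters without an ISO-8859-1 equivalent; what it returns in
    // that case depends on the PHP version, so only its notice is
    // silenced here and its result is passed through unchanged.
    return @iconv('UTF-8', 'ISO-8859-1', $s);
}

var_dump(strict_utf8_to_latin1("abcd" . chr(0xE0) . "e")); // malformed input
var_dump(strict_utf8_to_latin1("Fl\xC3\xB8de"));           // well-formed input
```

The first call returns false because 0xE0 followed by "e" is not a
valid sequence; the second returns the five-byte ISO-8859-1 string.
The caller then decides what to do with a rejected string, rather than
having a "?" slipped in silently.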
--
- Peter Brodersen
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php