On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals
[EMAIL PROTECTED] (Rasmus Lerdorf) wrote:

>> On the other hand utf8_decode() also expects the input to be UTF-8
>> encoded, but it replaces incomplete sequences with the character "?".
>
>utf8_decode() doesn't replace invalid chars with a ?
>
>eg.
>
>php -r '$a="abcd".chr(0xE0);echo
>iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
>
>0000000    61  62  63  64  0a  61  62  63  64  03

Yes it does, but not in your case, where the stray 0xE0 is the very
last byte of the string :-)

However, with another character after it:

$ php -r '$a="abcd".chr(0xE0)."e"; echo
iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);'|hd

00000000  61 62 63 64 0a 61 62 63  64 3f                  |abcd.abcd?|

$ php -r 'print utf8_decode("Fløde på æblegrød");'
Fl?p?blegr?


>It would be a horrendously bad idea to replace invalid chars with some
>other valid char.  Way worse than returning nothing.  Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

I don't disagree with you, and I have thought about the same issue
(although I suppose any sanitization should happen after a given
conversion; charsets other than UTF-8 might be able to encode low-bit
characters such as "?" as well, but this is beside the point).

I'm not fond of the "?" feature either, but it is present in
utf8_decode() and in other non-PHP applications that do UTF-8 conversion.
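
For what it's worth, a rough userland sketch of how one could sidestep
the "?" substitution: validate the input first and refuse anything
malformed. This assumes the mbstring extension is loaded, and
utf8_to_latin1_strict() is a made-up name for illustration, not an
existing PHP function.

```php
<?php
// Hypothetical helper (not part of PHP): convert UTF-8 to ISO-8859-1,
// returning null for malformed input instead of substituting "?".
function utf8_to_latin1_strict(string $s): ?string
{
    // mb_check_encoding() returns false for incomplete or invalid sequences.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return null;
    }
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

// Lone lead byte 0xE0 followed by "e": malformed, so the helper refuses it.
var_dump(utf8_to_latin1_strict("abcd" . chr(0xE0) . "e"));

// Well-formed UTF-8 ("Fl" + U+00F8 + "de") is converted normally.
var_dump(bin2hex(utf8_to_latin1_strict("Fl\xC3\xB8de")));
```

Note that this only covers malformed sequences. Well-formed characters
that have no ISO-8859-1 equivalent are still replaced by mbstring's
substitute character, so the caller would have to check for those
separately.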

My guess is still that some standard recommends this conversion as a
possible fallback for error handling.

-- 
- Peter Brodersen

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
