Re: [PHP-DEV] [PATCH] Bug #43896 htmlspecialchars returns empty string on invalid unicode sequence

Rasmus Lerdorf Mon, 28 Jan 2008 21:22:31 -0800

Peter Brodersen wrote:
> http://php.net/xml also documents this replacement:
> ==
> If PHP encounters characters in the parsed XML document that can not be
> represented in the chosen target encoding, the problem characters will be
> "demoted". Currently, this means that such characters are replaced by a
> question mark.
> ==


That was back in the expat days.  We don't use that XML parser anymore.

> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
> ==
> According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
> receiving UTF-8 shall interpret a "malformed sequence in the same way
> that it interprets a character that is outside the adopted subset" and
> "characters that are not within the adopted subset shall be indicated
> to the user" by a receiving device. A quite commonly used approach in
> UTF-8 decoders is to replace any malformed UTF-8 sequence by a
> replacement character (U+FFFD), which looks a bit like an inverted
> question mark, or a similar symbol. It might be a good idea to
> visually distinguish a malformed UTF-8 sequence from a correctly
> encoded Unicode character that is just not available in the current
> font but otherwise fully legal, even though ISO 10646-1 doesn't
> mandate this. In any case, just ignoring malformed sequences or
> unavailable characters does not conform to ISO 10646, will make
> debugging more difficult, and can lead to user confusion.
> ==

That part is completely different.  That's at the display level.
Replacing it in the backend makes no sense to me.  Don't use
utf8_decode.  Use iconv() so you know what the heck is going on.
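
For illustration only (this sketch is not from the thread): one way to make
the conversion failure visible with iconv() rather than getting a silent
substitution. The helper name utf8_to_latin1_strict and the ISO-8859-1
target are assumptions chosen for the example, and the type declarations
need a much newer PHP than the 2008 releases under discussion. On a
glibc-based iconv, an illegal input byte should make iconv() return false
and raise a notice; other iconv implementations may behave differently.

```php
<?php
// Hypothetical helper: convert UTF-8 to ISO-8859-1 and report a failed
// conversion explicitly as null, instead of returning altered text.
function utf8_to_latin1_strict(string $input): ?string
{
    // iconv() returns false when it cannot convert the input; the '@'
    // suppresses the accompanying notice so the caller decides what to do.
    $out = @iconv('UTF-8', 'ISO-8859-1', $input);
    return $out === false ? null : $out;
}

$valid   = "caf\xC3\xA9";  // "café" encoded as UTF-8
$invalid = "caf\xFFe";     // 0xFF can never appear in well-formed UTF-8

var_dump(utf8_to_latin1_strict($valid));
var_dump(utf8_to_latin1_strict($invalid));
```

The caller can then reject, log, or repair the input deliberately, which is
the "know what is going on" part of the advice above.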

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
