From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "'Pim Blokland'" <[EMAIL PROTECTED]>; "Unicode mailing list" <[EMAIL PROTECTED]>

> Pim Blokland wrote:
> > Not only that, but the process making the mistake of thinking it is
> > UTF-8 also makes the mistake of not generating an error for
> > encountering malformed byte sequences,
>
> BTW, this process has a name: "Internet Explorer".

Don't blame IE too much for attempting to interpret the text as UTF-8: the page is explicitly tagged with a UTF-8 charset. True, IE should stop using this erroneous charset tag as soon as it sees a violation of the UTF-8 rules, and should fall back to its "automatic selection" instead. But it is also true that IE still attempts to use the legacy UTF-8 encoding, which allowed interpreting non-shortest (overlong) sequences. I think this bug does not occur in recent updates of IE, notably since the security hole in MSHTML.DLL was corrected so that non-shortest sequences are no longer interpreted.

If IE really wants to keep some compatibility, it could accept CESU-8 only as one possible choice for its "automatic selection" of charsets, or display a visible replacement character (such as a narrow white box) for invalid sequences (which could internally be handled as if they represented U+FFFF). But if the user forces UTF-8 decoding in the GUI, IE should still not accept any invalid UTF-8 sequence: it should either interpret it as an invalid character like U+FFFF or, even better, disable the UTF-8 choice in the user interface.

So this is really the effect of several Unicode violations colliding: one in the User-Agent interpreting the coded strings, and one in the content of the page, which is labelled UTF-8 when it is not (here: complain to your web page designer, or blame yourself if you created the page with invalid meta-tags).

Beware, when editing a UTF-8 page that explicitly includes the UTF-8 charset meta-tag, that your editor does not save it as ISO-8859-1 just because it thinks this will save storage space... There are also some bogus "web site optimizers" that perform this kind of encoding optimization (in addition to removing unnecessary spaces and newlines, or compressing/obfuscating JavaScript code and CSS class names) and don't take care to change the value of this meta-tag...
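
To make the mislabelling problem concrete, here is a minimal sketch in Python of a check that a page's bytes really are valid under the charset its meta-tag declares. The helper names (declared_charset, label_matches_bytes) and the regular expression are my own assumptions for illustration; a real tool would use a proper HTML parser rather than a regex.

```python
import re

# Rough pattern for a charset declaration inside a <meta> tag.
# This is an illustrative assumption, not a full HTML parser.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)

def declared_charset(raw: bytes):
    """Return the charset named in the first matching meta-tag, or None."""
    m = META_CHARSET.search(raw)
    return m.group(1).decode("ascii").lower() if m else None

def label_matches_bytes(raw: bytes) -> bool:
    """True if the bytes decode without error under the declared charset."""
    name = declared_charset(raw)
    if name is None:
        return True  # nothing declared, so nothing to contradict
    try:
        raw.decode(name)  # strict error handling is the default
    except (UnicodeDecodeError, LookupError):
        return False  # invalid bytes, or a charset name Python does not know
    return True

page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><p>café</p>'
good = page.encode("utf-8")        # label and bytes agree
bad = page.encode("iso-8859-1")    # what a careless "optimizer" would produce
```

Note that the check only works in one direction: every byte sequence decodes under ISO-8859-1, so a page labelled ISO-8859-1 but saved as UTF-8 would pass it unnoticed.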

The internal encoding of a text file should never be changed automatically, without an explicit request from the user, unless the user confirms the change and the actions taken are logged.
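
Coming back to the decoding behaviour argued for earlier in this message, here is a small Python sketch of it: trust the declared charset only as long as the bytes are strictly valid, fall back to another charset otherwise, and show replacement characters when the user forces UTF-8. The function name decode_page, the forced flag and the choice of ISO-8859-1 as fallback are assumptions for illustration, and say nothing about how IE is actually implemented. Python's strict UTF-8 decoder rejects non-shortest forms and CESU-8 style encoded surrogates, and its "replace" handler substitutes U+FFFD rather than the U+FFFF suggested above.

```python
def decode_page(raw: bytes, declared: str = "utf-8",
                fallback: str = "iso-8859-1", forced: bool = False):
    """Return (text, charset actually used).

    forced=True models the user explicitly selecting the charset:
    invalid sequences become U+FFFD instead of being interpreted.
    """
    if forced:
        return raw.decode(declared, errors="replace"), declared
    try:
        # Strict decoding: any violation of the encoding rules raises.
        return raw.decode(declared, errors="strict"), declared
    except UnicodeDecodeError:
        # The label was wrong; stand-in for an "automatic selection" step.
        return raw.decode(fallback, errors="replace"), fallback

valid = b"caf\xc3\xa9"                    # well-formed UTF-8
latin1 = b"caf\xe9"                       # ISO-8859-1 bytes mislabelled as UTF-8
overlong = b"\xc0\xaf"                    # non-shortest form of "/"
cesu8 = b"\xed\xa0\x80\xed\xb0\x80"       # CESU-8 encoding of U+10000
```

With these inputs, only valid is accepted as UTF-8; the other three are handed to the fallback, or shown with replacement characters when forced=True.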