On 8/7/2012 12:48 PM, Joshua Bell wrote:
When Anne's spec appeared I gutted mine and deferred wherever possible to his. One consequence of that was getting the other encodings "for free" as far as the spec writing goes. If we achieve consensus that we only want to support UTF encodings, we can add the restrictions. There are use cases for supporting other encodings (parsing legacy data file formats, for example), but that could be deferred.

My main use case, and the only one I'm going to argue for, is being able to handle mail messages with this API, and the primary concern here is decoding. I'll agree with other sentiments in this thread that I don't particularly care about encoding to anything other than UTF-8 (it might be nice, but I can live without it); it's being able to decode $CHARSET that I'm concerned about. As far as edge cases in this scenario go, it pretty much boils down to "I want to produce the same JS string that I would get from the text content of the document at data:text/plain;charset=<charset>,<data>".
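A minimal sketch of what I mean, assuming the TextDecoder API as proposed in this thread; the GBK label and the byte values are just an illustrative example:

    // Hypothetical example: the bytes 0xC4 0xE3 0xBA 0xC3 are "你好" in GBK.
    const bytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3]);
    const decoded = new TextDecoder("gbk").decode(bytes);
    // The goal: `decoded` should equal the text content you would see
    // when navigating to data:text/plain;charset=gbk,%C4%E3%BA%C3
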

When encoding, I think it is absolutely necessary to enforce uniform guidelines for the output. When decoding, however, I think that most differences (beyond concerns like the BOM) are the result of "buggy" content creators as opposed to the browsers. Given that HTML display has apparently tolerated differences in charset decoding for legacy charsets, I suppose it is possible to live with differences in exact character decoding across charsets--in other words, treating the charset document as an advisory list of both the minimum charsets to support and how to decode them.
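To make that asymmetry concrete, here is a sketch assuming an encoder restricted to UTF-8 output (as suggested elsewhere in this thread) alongside a decoder that accepts legacy labels; the specific labels and byte values are illustrative:

    // Encoding is uniform: output is always UTF-8.
    const utf8 = new TextEncoder().encode("caf\u00e9");
    // -> Uint8Array [0x63, 0x61, 0x66, 0xc3, 0xa9]

    // Decoding is advisory: legacy charsets are accepted, and exact
    // results for edge-case bytes may differ slightly across browsers.
    const text = new TextDecoder("iso-8859-1")
        .decode(new Uint8Array([0x63, 0x61, 0x66, 0xe9])); // "café"
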

--
Beware of bugs in the above code; I have only proved it correct, not tried it. 
-- Donald E. Knuth
