Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote: http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. I was going to suggest doing so. We've gone UTF-8-only for new features (workers, webvtt, appcache manifest, etc). The Encoding spec says "New content and formats must exclusively use the utf-8 encoding." Is there a use case for utf-16/utf-16be? -- Simon Pieters Opera Software
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
I think the main reason would be if there are modern formats which use UTF16 which we want to allow people to create documents in. I asked on twitter for such formats and got some responses: https://twitter.com/SickingJ/status/234060964058763264 / Jonas On Tue, Aug 14, 2012 at 7:42 AM, Simon Pieters sim...@opera.com wrote: On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote: http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. I was going to suggest doing so. We've gone UTF-8-only for new features (workers, webvtt, appcache manifest, etc). The Encoding spec says "New content and formats must exclusively use the utf-8 encoding." Is there a use case for utf-16/utf-16be? -- Simon Pieters Opera Software
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 14, 2012 at 9:42 AM, Simon Pieters sim...@opera.com wrote: On Thu, 09 Aug 2012 19:42:07 +0200, Joshua Bell jsb...@chromium.org wrote: http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. I was going to suggest doing so. We've gone UTF-8-only for new features (workers, webvtt, appcache manifest, etc). The Encoding spec says "New content and formats must exclusively use the utf-8 encoding." Is there a use case for utf-16/utf-16be? Specs can't (meaningfully) place normative requirements on all new content and formats. This should be a note. -- Glenn Maynard
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
Sorry if this is a dupe; I replied to this from my phone and an incorrect address, and my earlier reply isn't showing in the archives. On Fri, Aug 10, 2012 at 9:16 PM, Jonas Sicking jo...@sicking.cc wrote: The spec now contains the following text: "NOTE: Because only UTF encodings are supported, and because of the algorithm used to convert a DOMString to a sequence of Unicode characters, no input can cause the encoding process to emit an encoder error." This is not correct. A DOMString is not a sequence of Unicode characters, it's a UTF16 encoded string (this is per EcmaScript). Thus it can contain unpaired surrogates and so the encoding process can result in encoder errors. As I've suggested earlier, I think we should deal with this by simply emitting Unicode replacement characters for these encoder errors (i.e. for unpaired surrogates). Already accounted for. Note the phrase: "and because of the algorithm used to convert a DOMString to a sequence of Unicode characters". This refers to the normative text that generates a sequence of Unicode code points from a DOMString by reference to the algorithm in WebIDL [1], which handles unpaired surrogates etc. This informative text should say "Unicode code points" rather than "Unicode characters", though. Fixing now, and referencing [1] even in the note. [1] http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
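For readers following along, the conversion described above can be sketched roughly as follows. This is an illustrative Python sketch of the general technique (pair up surrogates, replace unpaired ones with U+FFFD), not the WebIDL text itself; the function names `code_units_to_code_points` and `encode_utf8` are hypothetical and do not appear in any spec.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Turn a list of 16-bit code units into a list of Unicode code points.

    Well-formed surrogate pairs are combined; any unpaired surrogate is
    replaced with U+FFFD, so the result never contains a surrogate.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: valid only if a trail surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out


def encode_utf8(units):
    """Encode code units as UTF-8; cannot fail after the conversion above."""
    return "".join(chr(cp) for cp in code_units_to_code_points(units)).encode("utf-8")
```

Because every code point that comes out of the conversion is a valid scalar value, a UTF encoder fed with it has nothing left to reject, which is the point the note is trying to make.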
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Thu, Aug 9, 2012 at 10:42 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the "what to do on encoding error" issue nicely; still need to resolve the BOM issue though.) http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. Jury is still out on the decode-with-BOM issue - I need to reason through Glenn's suggestions on the open issues thread. I added a related open issue raised by Glenn, summarized as "... suggest that the .encoding attribute simply return the name that was passed to the constructor." - taking this further, perhaps the attribute should be eliminated as callers could apply it themselves. I could definitely live with removing the attribute. / Jonas
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Thu, Aug 9, 2012 at 10:42 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the "what to do on encoding error" issue nicely; still need to resolve the BOM issue though.) http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. Jury is still out on the decode-with-BOM issue - I need to reason through Glenn's suggestions on the open issues thread. I added a related open issue raised by Glenn, summarized as "... suggest that the .encoding attribute simply return the name that was passed to the constructor." - taking this further, perhaps the attribute should be eliminated as callers could apply it themselves.
The spec now contains the following text: "NOTE: Because only UTF encodings are supported, and because of the algorithm used to convert a DOMString to a sequence of Unicode characters, no input can cause the encoding process to emit an encoder error." This is not correct. A DOMString is not a sequence of Unicode characters, it's a UTF16 encoded string (this is per EcmaScript). Thus it can contain unpaired surrogates and so the encoding process can result in encoder errors. As I've suggested earlier, I think we should deal with this by simply emitting Unicode replacement characters for these encoder errors (i.e. for unpaired surrogates). / Jonas
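The failure mode described here is easy to reproduce outside a browser. The sketch below uses Python's built-in codecs purely as a stand-in for the proposed API: a Python string can hold a lone surrogate, a strict UTF-8 encoder rejects it, and substituting U+FFFD first makes the encode succeed. The helper `replace_lone_surrogates` is a hypothetical name chosen for this illustration.

```python
# A string containing an unpaired lead surrogate between two ASCII letters.
lone = "a\ud800b"

# A strict UTF-8 encoder has no byte sequence for a lone surrogate.
try:
    lone.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True


def replace_lone_surrogates(s):
    """Swap every surrogate code point for U+FFFD before encoding.

    In a Python str, any code point in the surrogate range is by
    definition unpaired, so a simple range check is sufficient here.
    """
    return "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s)


safe_bytes = replace_lone_surrogates(lone).encode("utf-8")
```

With the substitution in place the encoder sees only valid code points, which matches the "emit replacement characters" handling suggested above.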
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Wed, Aug 8, 2012 at 9:03 AM, Joshua Bell jsb...@chromium.org wrote: On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the "what to do on encoding error" issue nicely; still need to resolve the BOM issue though.) http://wiki.whatwg.org/wiki/StringEncoding has been updated to restrict the supported encodings for encoding to UTF-8, UTF-16 and UTF-16BE. I'm tempted to take it further to just UTF-8 and see if anyone complains. Jury is still out on the decode-with-BOM issue - I need to reason through Glenn's suggestions on the open issues thread. I added a related open issue raised by Glenn, summarized as "... suggest that the .encoding attribute simply return the name that was passed to the constructor." - taking this further, perhaps the attribute should be eliminated as callers could apply it themselves.
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-encodings encode-to-utf[8|16] position.
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Wed, Aug 8, 2012 at 2:48 AM, James Graham jgra...@opera.com wrote: On 08/07/2012 07:51 PM, Jonas Sicking wrote: I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. [...] However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16. FWIW, I agree with the decode-from-all-platform-encodings encode-to-utf[8|16] position. Any disagreement on limiting the supported encodings to utf-8, utf-16, and utf-16be, while permitting decoding of all encodings in the Encoding spec? (This eliminates the "what to do on encoding error" issue nicely; still need to resolve the BOM issue though.)
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On 8/7/2012 12:39 AM, Jonas Sicking wrote: Hi All, I seem to have a recollection that we discussed only allowing encoding to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats as well as stay in sync with other APIs like XMLHttpRequest. However I currently can't find any restrictions on which target encodings are supported in the current drafts. One wrinkle in this is if we want to support arbitrary encodings when encoding, that means that we can't insert the replacement character as default error handling since that isn't available in a lot of encoding formats. I found that the wiki version of the proposal cites http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to find encodings. -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote: I seem to have a recollection that we discussed only allowing encoding to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats as well as stay in sync with other APIs like XMLHttpRequest. Not an objection, but where does XHR limit sent data to those encodings? send(FormData) forces UTF-8 (which is even more restrictive); send(Document) seems to allow any encoding *except* for UTF-16 (presumably web compat since that's a weird criteria). I'm not sure that staying in sync with XHR--which has its own pile of legacy code to support--is worthwhile here anyway, but limiting to Unicode seems fine in its own right, especially since the restriction can always be lifted later if real needs come up. However I currently can't find any restrictions on which target encodings are supported in the current drafts. One wrinkle in this is if we want to support arbitrary encodings when encoding, that means that we can't insert the replacement character as default error handling since that isn't available in a lot of encoding formats. I don't think this part is a real hurdle. Just replace with "?" for non-Unicode encodings. On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote: I found that the wiki version of the proposal cites http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to find encodings. That spec documents the encodings which are used anywhere in the platform, but that doesn't necessarily mean every API needs to support all those encodings. It's almost all backwards-compatibility. -- Glenn Maynard
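As a rough illustration of the "replace with ?" fallback, here is how Python's built-in codecs behave; this is only a stand-in for the API under discussion, and the helper name `can_encode` is hypothetical. It also shows why U+FFFD itself is not usable as the fallback for a legacy target such as ISO-8859-1.

```python
def can_encode(ch, encoding):
    """Return True if the single character ch is representable in encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# "café €5": U+00E9 exists in ISO-8859-1, U+20AC (euro sign) does not.
text = "caf\u00e9 \u20ac5"

# Python's "replace" error handler substitutes "?" when encoding.
legacy = text.encode("iso-8859-1", errors="replace")  # b'caf\xe9 ?5'

# U+FFFD has no ISO-8859-1 representation, but every UTF encoding has one.
fffd_in_latin1 = can_encode("\ufffd", "iso-8859-1")  # False
fffd_in_utf8 = can_encode("\ufffd", "utf-8")  # True
```

The trade-off is that "?" is indistinguishable from a literal question mark in the output, whereas U+FFFD unambiguously marks a substitution when the target is a UTF encoding.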
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 8:32 AM, Glenn Maynard gl...@zewt.org wrote: On Mon, Aug 6, 2012 at 11:39 PM, Jonas Sicking jo...@sicking.cc wrote: I seem to have a recollection that we discussed only allowing encoding to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats as well as stay in sync with other APIs like XMLHttpRequest. It looks like the relevant discussion was at http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035038.html It doesn't appear we reached consensus - there was some desire expressed to scope to UTF-8, then perhaps expand to include UTF-16, definite consensus that any encoding supported should be handled by both encode and decode, then comments about XHR and form data encodings, but then the discussion wandered into stateful vs. stateless encodings which took us off topic. So Glenn's comment below pretty much reboots the conversation where it was: Not an objection, but where does XHR limit sent data to those encodings? send(FormData) forces UTF-8 (which is even more restrictive); send(Document) seems to allow any encoding *except* for UTF-16 (presumably web compat since that's a weird criteria). I'm not sure that staying in sync with XHR--which has its own pile of legacy code to support--is worthwhile here anyway, but limiting to Unicode seems fine in its own right, especially since the restriction can always be lifted later if real needs come up. However I currently can't find any restrictions on which target encodings are supported in the current drafts. When Anne's spec appeared I gutted mine and deferred wherever possible to his. One consequence of that was getting the other encodings for free as far as the spec writing goes. If we achieve consensus that we only want to support UTF encodings we can add the restrictions. There are use cases for supporting other encodings (parsing legacy data file formats, for example), but that could be deferred. 
One wrinkle in this is if we want to support arbitrary encodings when encoding, that means that we can't insert the replacement character as default error handling since that isn't available in a lot of encoding formats. I don't think this part is a real hurdle. Just replace with "?" for non-Unicode encodings. On Tue, Aug 7, 2012 at 8:10 AM, Joshua Cranmer pidgeo...@verizon.net wrote: I found that the wiki version of the proposal cites http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html as the way to find encodings. That spec documents the encodings which are used anywhere in the platform, but that doesn't necessarily mean every API needs to support all those encodings. It's almost all backwards-compatibility. There are also cross-browser differences in handling decoding of certain code points in certain encodings. Exposing those encodings in a new API would either require that the browser vendors expose those differences (bleah) or implement a compatibility switch in the affected codecs (bleah).
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 11:48 AM, Joshua Bell jsb...@chromium.org wrote: It doesn't appear we reached consensus - there was some desire expressed to scope to UTF-8, then perhaps expand to include UTF-16, definite consensus that any encoding supported should be handled by both encode and decode, then comments about XHR and form data encodings, but then the discussion wandered into stateful vs. stateless encodings which took us off topic. So Glenn's comment below pretty much reboots the conversation where it was: I don't agree that we necessarily need to support both encode and decode for every encoding. For example, an MP3 tag editor supporting legacy ID3 tags may want to be able to decode ISO-8859-1, since it allows tags in that encoding. However, there's no reason to ever write MP3 tags in anything but Unicode--they only need decode support for 8859-1, not encode. This pattern of decode support for legacy, but only encoding to Unicode, seems common today. Many email clients today (not a use case, just a comparison) also decode from any encoding but send only in UTF-8. That's not to say there are no use cases for encoding other encodings, but it's much easier to relax the restriction later and allow them if we really need to than it is to go the other way, and I think there's a danger of perpetuating legacy encodings if we're not careful. There are also cross-browser differences in handling decoding of certain code points in certain encodings. Exposing those encodings in a new API would either require that the browser vendors expose those differences (bleah) or implement a compatibility switch in the affected codecs (bleah). The real fix for this would be for browsers to implement the encodings in the correct, interoperable way when exposed by this API, even if that means that this API interprets data differently than eg. the HTML parser. 
MS has made it clear that they won't touch their encodings in any way, due to legacy support, but hopefully that doesn't apply to a new API with no legacy at all. (If you want to find that out you'll need to ask on webapps or through some other channel, since they're not on this list.) -- Glenn Maynard
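The decode-legacy, encode-Unicode pattern from the tag editor example can be sketched in a few lines. This uses Python's codecs as a stand-in for the proposed API; `normalize_tag` and its parameters are hypothetical names for illustration, and real ID3 parsing involves frame headers that are omitted here.

```python
def normalize_tag(raw: bytes, declared_encoding: str = "iso-8859-1") -> bytes:
    """Decode a legacy tag value and re-encode it as UTF-8.

    Decoding accepts whatever encoding the old data declares; the write
    path only ever produces UTF-8, so no legacy encoder is needed.
    """
    text = raw.decode(declared_encoding)
    return text.encode("utf-8")


# "Björk" stored as ISO-8859-1 bytes, as a legacy tag might hold it.
legacy_bytes = b"Bj\xf6rk"
utf8_bytes = normalize_tag(legacy_bytes)  # b'Bj\xc3\xb6rk'
```

Only the decode side needs to know about ISO-8859-1, which is why restricting the encode side costs this kind of application nothing.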
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 9:48 AM, Joshua Bell jsb...@chromium.org wrote: Not an objection, but where does XHR limit sent data to those encodings? send(FormData) forces UTF-8 (which is even more restrictive); send(Document) seems to allow any encoding *except* for UTF-16 (presumably web compat since that's a weird criteria). I'm not sure that staying in sync with XHR--which has its own pile of legacy code to support--is worthwhile here anyway, but limiting to Unicode seems fine in its own right, especially since the restriction can always be lifted later if real needs come up. However I currently can't find any restrictions on which target encodings are supported in the current drafts. When Anne's spec appeared I gutted mine and deferred wherever possible to his. One consequence of that was getting the other encodings for free as far as the spec writing goes. If we achieve consensus that we only want to support UTF encodings we can add the restrictions. There are use cases for supporting other encodings (parsing legacy data file formats, for example), but that could be deferred. I don't mind supporting *decoding* from basically any encoding that Anne's spec enumerates. I don't see a downside with that since I suspect most implementations will just call into a generic decoding backend anyway, and so supporting the same set of encodings as for other parts of the platform should be relatively easy. That also means that we don't have to figure out which encodings we need to support to support reading legacy file formats etc. However I think we should consider restricting support to a smaller set of encodings while *encoding*. There should be little reason for people today to produce text in non-utf formats. We might even be able to get away with only supporting UTF8, though I wouldn't be surprised if there are reasonably modern file formats which use utf16.
Restricting the encoding formats has the advantage that we can rely on the target encoding to support a consistent feature set. For example we don't need to deal with defining what to do if we receive a perfectly well formed string, but the target encoding doesn't support all the characters in that string. Likewise we don't have to deal with target encodings which don't support the replacement character. / Jonas
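A restricted encoder along these lines might look like the sketch below. It is written against Python's codecs as a stand-in, and the whitelist (UTF-8, UTF-16LE, UTF-16BE, taken from the first message in this thread), the constant `ALLOWED_ENCODE_TARGETS`, and the function `encode_text` are all assumptions made for illustration rather than anything the draft specifies.

```python
import codecs

# Hypothetical whitelist, using the names Python's codec registry reports.
ALLOWED_ENCODE_TARGETS = {"utf-8", "utf-16-le", "utf-16-be"}


def encode_text(text: str, label: str = "utf-8") -> bytes:
    """Encode text, refusing any target outside the UTF whitelist."""
    # codecs.lookup resolves aliases such as "UTF8" or "utf-16le" to one
    # canonical name, and raises LookupError for unknown labels.
    name = codecs.lookup(label).name
    if name not in ALLOWED_ENCODE_TARGETS:
        raise ValueError("encoding to %r is not supported" % label)
    return text.encode(name)


# Every allowed target can represent U+FFFD, so it is always a usable
# fallback; no per-encoding "unsupported character" rule is required.
fallbacks = {name: "\ufffd".encode(name) for name in sorted(ALLOWED_ENCODE_TARGETS)}
```

Since each whitelisted encoding covers all of Unicode, the only input that can still need special handling is an unpaired surrogate, and U+FFFD is available in every case.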
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 10:47 AM, Glenn Maynard gl...@zewt.org wrote: On Tue, Aug 7, 2012 at 11:48 AM, Joshua Bell jsb...@chromium.org wrote: It doesn't appear we reached consensus - there was some desire expressed to scope to UTF-8, then perhaps expand to include UTF-16, definite consensus that any encoding supported should be handled by both encode and decode, then comments about XHR and form data encodings, but then the discussion wandered into stateful vs. stateless encodings which took us off topic. So Glenn's comment below pretty much reboots the conversation where it was: I don't agree that we necessarily need to support both encode and decode for every encoding. For example, an MP3 tag editor supporting legacy ID3 tags may want to be able to decode ISO-8859-1, since it allows tags in that encoding. However, there's no reason to ever write MP3 tags in anything but Unicode--they only need decode support for 8859-1, not encode. This pattern of decode support for legacy, but only encoding to Unicode, seems common today. Many email clients today (not a use case, just a comparison) also decode from any encoding but send only in UTF-8. That's not to say there are no use cases for encoding other encodings, but it's much easier to relax the restriction later and allow them if we really need to than it is to go the other way, and I think there's a danger of perpetuating legacy encodings if we're not careful. Yup, that matches my feelings exactly. There are also cross-browser differences in handling decoding of certain code points in certain encodings. Exposing those encodings in a new API would either require that the browser vendors expose those differences (bleah) or implement a compatibility switch in the affected codecs (bleah). The real fix for this would be for browsers to implement the encodings in the correct, interoperable way when exposed by this API, even if that means that this API interprets data differently than eg. the HTML parser. 
MS has made it clear that they won't touch their encodings in any way, due to legacy support, but hopefully that doesn't apply to a new API with no legacy at all. (If you want to find that out you'll need to ask on webapps or through some other channel, since they're not on this list.) I'm hoping that browsers in general will be able to converge on the encoding databases that they have. Both as far as which encodings are supported, and as far as what encoding tables those encodings support. Anne's spec is a great first step in that direction. It'll definitely take time before we have full convergence, but I see no reason that we couldn't get there eventually. We were able to get there with HTML5 parsing after all :-) / Jonas
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On 8/7/2012 12:48 PM, Joshua Bell wrote: When Anne's spec appeared I gutted mine and deferred wherever possible to his. One consequence of that was getting the other encodings for free as far as the spec writing goes. If we achieve consensus that we only want to support UTF encodings we can add the restrictions. There are use cases for supporting other encodings (parsing legacy data file formats, for example), but that could be deferred. My main use case, and the only one I'm going to argue for, is being able to handle mail messages with this API, and the primary concern here is decoding. I'll agree with other sentiments in this thread that I don't particularly care about encoding to anything other than UTF-8 (it might be nice, but I can live without it); it's being able to decode $CHARSET that I'm concerned about. As far as edge cases in this scenario are concerned, it pretty much boils down to "I want to produce the same JS string that would be output if I looked at the text content of the document data:text/plain;charset=charset,data." When encoding, I think it is absolutely necessary to enforce uniform guidelines for the output. When decoding, however, I think that most differences (beyond concerns like the BOM) are a result of buggy content creators as opposed to the browser media. Given that HTML display has apparently tolerated differences in charset decoding for legacy charsets, I suppose it is possible to live with a difference of exact character decoding for various charsets--in other words, turning the charset document into an advisory list of both minimum charsets to support and how to do so. -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Re: [whatwg] StringEncoding: Allowed encodings for TextEncoder
On Tue, Aug 7, 2012 at 12:55 PM, Jonas Sicking jo...@sicking.cc wrote: I'm hoping that browsers in general will be able to converge on the encoding databases that they have. Both as far as which encodings are supported, and as far as what encoding tables those encodings support. Anne's spec is a great first step in that direction. It'll definitely take time before we have full convergence, but I see no reason that we couldn't get there eventually. We were able to get there with HTML5 parsing after all :-) MS has given a flat refusal to change the encoding tables in any way. http://permalink.gmane.org/gmane.ietf.charsets/588 Personally I'm inclined to not care which encoding tables we use for legacy encodings. Rather than fight this battle and end up without interoperable tables at all, it might be better to punt this one and standardize on Microsoft's tables and be done with it. (Sorry for the slight tangent; this is an Encoding topic, not a StringEncoding issue.) -- Glenn Maynard
[whatwg] StringEncoding: Allowed encodings for TextEncoder
Hi All, I seem to have a recollection that we discussed only allowing encoding to UTF8 and UTF16LE, UTF16BE. This in order to promote these formats as well as stay in sync with other APIs like XMLHttpRequest. However I currently can't find any restrictions on which target encodings are supported in the current drafts. One wrinkle in this is if we want to support arbitrary encodings when encoding, that means that we can't insert the replacement character as default error handling since that isn't available in a lot of encoding formats. / Jonas