Re: [whatwg] API for encoding/decoding ArrayBuffers into text
Any further input on Kenneth's suggestions? Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just DataView. As discussed below, data parsing/serialization operations will tend to be associated with DataViews. As Glenn has mentioned elsewhere recently, it is possible to accidentally do a buffer copy when mis-using typed array constructors, while DataView avoids this. DataViews are cheap to construct, and when I'm writing sample code for the proposed API I find I create throw-away DataViews anyway. Also, there is the potential for confusion when using a non-Uint8Array buffer, e.g.: are the elements being decoded using array[N] as the octets, or using the underlying buffer? For Uint16Array/UTF-16 encodings, what are the endianness concerns? DataView APIs have an explicit endianness and no index getter, which alleviates this somewhat. Re: writing into an existing buffer - as Glenn says, most of the input earlier in the thread advocated strongly for a very simple initial API, with streaming support as the only fancy feature beyond the minimal string = foo.decode(buffer) / buffer = foo.encode(string). Adding details = foo.encodeInto(string, buffer) later on is not precluded if there is demand. Also, I am planning to move the fatal option from the encode/decode methods to the TextEncoder/TextDecoder constructors. Objections? On Tue, Mar 27, 2012 at 7:43 PM, Kenneth Russell k...@google.com wrote: On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote: - I think it should reference DataView directly rather than ArrayBufferView.
The typed array spec was specifically designed with two use cases in mind: in-memory assembly of data to be sent to the graphics card or audio device, where the byte order must be that of the host architecture; This is wrong, broken, won't be implemented this way by any production browser, isn't how it's used in practice, and needs to be fixed in the spec. It violates the most basic web API requirement: interoperability. Please see earlier in the thread; the views affected by endianness need to be specced as little endian. That's what everyone is going to implement, and what everyone's pages are going to depend on, so it's what the spec needs to say. Separate types should be added for big-endian (eg. Int16BEArray). Thanks for your input. The design of the typed array classes was informed by requirements about how the OpenGL, and therefore WebGL, API work; and from prior experience with the design and implementation of Java's New I/O Buffer classes, which suffered from horrible performance pitfalls because of a design similar to that which you suggest. Production browsers already implement typed arrays with their current semantics. It is not possible to change them and have WebGL continue to function. I will go so far as to say that the semantics will not be changed. In the typed array specification, unlike Java's New I/O specification, the API was split between two use cases: in-memory data construction (for consumption by APIs like WebGL and Web Audio), and file and network I/O. The API was carefully designed to avoid roadblocks that would prevent maximum performance from being achieved for these use cases. Experience has shown that the moment an artificial performance barrier is imposed, it becomes impossible to build certain kinds of programs. I consider it unacceptable to prevent developers from achieving their goals. I also disagree that it should use DataView. Views are used to access arrays (including strings) within larger data structures. 
DataView is used to access packed data structures, where constructing a view for each variable in the struct is unwieldy. It might be useful to have a helper in DataView, but the core API should work on views. This is one point of view. The true design goal of DataView is to supply the primitives for fast file and network input/output, where the endianness is explicitly specified in the file format. Converting strings to and from binary encodings is obviously an operation associated with transfer of data to or from files or the network. According to this taxonomy, the string encoding and decoding operations should only be associated with DataView, and not the other typed array types, which are designed for in-memory data assembly for consumption by other hardware on the system. - It would be preferable if the encoding API had a way to avoid memory allocation, for example to encode into a passed-in DataView. This was an earlier design, and discussion led to it being removed as a premature optimization, to simplify the API. I'd recommend reading the rest of the thread. I do apologize for not being fully caught up on the thread, but hope that the input above was still useful. -Ken
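[Editorial note: the "encode into a passed-in buffer" idea Ken raises here did eventually ship, years later, as TextEncoder.prototype.encodeInto in the Encoding Standard, taking a Uint8Array rather than a DataView. A sketch of its eventual shape, for context:]

```javascript
const encoder = new TextEncoder(); // the shipped API is UTF-8 only
const target = new Uint8Array(8);  // caller-allocated storage, reusable across calls

// read = UTF-16 code units consumed from the string,
// written = bytes stored into target.
const {read, written} = encoder.encodeInto('hi\u20AC', target);
// 'h' and 'i' take 1 byte each in UTF-8; U+20AC takes 3, so written is 5.
```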
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Apr 4, 2012 at 11:09 AM, Joshua Bell jsb...@chromium.org wrote: Any further input on Kenneth's suggestions? I largely disagree with those suggestions, because I don't believe they align with the natural, intuitive usage of the API. Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just DataView. As discussed below, data parsing/serialization operations will tend to be associated with DataViews. I disagree. TypedArray is much more natural for processing arrays, since they can be accessed just like a regular JavaScript array; code generally doesn't have to care whether it's been given a JavaScript array or a TypedArray. For DataView, you need to rewrite everything. As Glenn has mentioned elsewhere recently, it is possible to accidentally do a buffer copy when mis-using typed array constructors, while DataView avoids this. That should be fixed, not used against TypedArray classes when they make sense. This can be fixed by adding a TypedArray(TypedArray, byteOffset, length) constructor, which creates a new shallow view from an existing view; this would be logically grouped with the similar TypedArray(ArrayBuffer, byteOffset, length) function. Unfortunately, the offset parameter would have to be required, so the method can be resolved against the TypedArray(TypedArray) constructor. (A cleaner design would have been to have a separate copy() function to create an explicit copy, but it's most likely too late to remove the TypedArray(TypedArray) ctor.) As (another) aside, all of the TypedArray constructors should be available on DataView, too, so they exist on all ArrayBufferView subtypes. DataViews are cheap to construct, and when I'm writing sample code for the proposed API I find I create throw-away DataViews anyway. Array views are cheap to construct, too. APIs returning DataViews feels unnatural; it's a helper class that isn't returned by anything else. If you don't return a view of a specific, contextually-meaningful type (eg. 
Int16LEArray for UTF-16LE) from encode(), then returning the ArrayBuffer itself seems preferable, like XHR2. Let's not split APIs, with some returning DataView and some ArrayBuffer. Also, there is the potential for confusion when using a non-Uint8Array buffer e.g. are the elements being decoded using array[N] as the octets or using the underlying buffer? for Uint16Array/UTF-16 encodings, what are the endianness concerns? The data is always decoded based on the encoding specified. It wouldn't make sense for decode() to only take a DataView. If I have an Int8Array, it's busywork to make me construct a DataView from it so I can pass it to decode(). Just take ArrayBufferView, so it doesn't care what the particular view type is. DataView APIs have an explicit endianness and no index getter, which alleviates this somewhat. Ideally, endian-explicit TypedArrays should be created, eg. Int16LEArray and Int16BEArray. I mentioned this in the other thread; the big-endian types seem important to have anyway (regardless of the encoding API), and the little-endian views are just so we can pretend the native endian issue isn't there. Also, I am planning to move the fatal option from the encode/decode methods to the TextEncoder/TextDecoder constructors. Objections? I don't have a strong feeling either way. Can you think of any cases where the encoder/decoder object would be handed off from one user to another, who might want different behavior? It seems unlikely. -- Glenn Maynard
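[Editorial note: the constructor pitfall Glenn describes can be shown in a few lines. This is an illustrative sketch of the behavior he is objecting to, not the proposed TypedArray(TypedArray, byteOffset, length) constructor:]

```javascript
const buffer = new ArrayBuffer(4);
const whole = new Uint8Array(buffer);

// Passing a view to a TypedArray constructor silently copies the data:
const copy = new Uint8Array(whole);
copy[0] = 1; // whole[0] is still 0

// Getting a shallow view today means going back through the buffer by
// hand, which is the boilerplate the proposed constructor would remove:
const alias = new Uint8Array(whole.buffer, whole.byteOffset, whole.length);
alias[0] = 2; // whole[0] is now 2
```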
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Sat, Mar 31, 2012 at 6:13 PM, Glenn Maynard gl...@zewt.org wrote: On Wed, Mar 28, 2012 at 1:44 AM, Jonas Sicking jo...@sicking.cc wrote: Scanning over the buffer twice will cause a lot more memory IO and will definitely be slower. That's what cache is for. But: benchmarks... We can argue whether it's meaningfully slower or harder. But it seems like we agree that it's slower and harder. I'm saying that if an API is better in every way then it doesn't seem like an interesting discussion how much better, we should clearly go with that API. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Sun, Apr 1, 2012 at 5:28 PM, Jonas Sicking jo...@sicking.cc wrote: I'm saying that if an API is better in every way then it doesn't seem like an interesting discussion how much better, we should clearly go with that API. It's not a different API, it's an *additional* API. (Assuming that the indexOf function is added anyway; the string-array use case wants it, and its general usefulness should be uncontroversial.) It doesn't remove anything else. There's always a cost to adding more API; what's not clear is whether it's worth it here, since it's essentially a four-line helper function (each way) that may or may not actually be used often enough to justify itself. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 28, 2012 at 1:44 AM, Jonas Sicking jo...@sicking.cc wrote: Scanning over the buffer twice will cause a lot more memory IO and will definitely be slower. That's what cache is for. But: benchmarks... We can argue whether it's meaningfully slower or harder. But it seems like we agree that it's slower and harder. What? Are you really arguing that we should do something because of *meaningless* differences? I still don't understand what that benefit you are seeing is. You hinted at some more generic argument, but I still don't understand it. So far the only reason that has been brought up is that it provides an API for simply finding null terminators which could be useful if you are doing things other than decoding. Is that what you are talking about when you are saying that it's more generic? Yes, I've said that repeatedly. It also avoids bloating the API with something that's merely a helper for something you can do in a couple lines of code, and allows you to tell how many bytes/words were consumed (eg. for packed string arrays). It can always be added later, but it feels unnecessary. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 27, 2012 at 4:45 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 27, 2012 at 12:41 AM, Jonas Sicking jo...@sicking.cc wrote: The memchr is purely overhead, i.e. we are comparing memchr+decoding to decoding. So I don't see what's backing up the probably the fastest thing claim. If you don't do it as an initial pass, then you have to embed null checks into the inner loop of your decoding algorithm. For example, an ASCII decoder may look like:

// char *input = input buffer
// char *input_end = one past last byte of input buffer
// wchar_t *output = output buffer
input_end = memchr(input, 0, input_end - input);
while(input < input_end) {
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

If you don't do the initial search, then it becomes:

while(input < input_end && *input != 0) {
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

which means that you have an additional branch each time through the loop to check for the null terminator. That's likely to be slower than just doing another pass. But anyway, please either make a benchmark or two to show the differences we're talking about, or drop performance as an argument. This is all just a distraction otherwise. I don't think the speed of conversion is even a serious issue, much less the microseconds taken by memchr. The extra null-check is basically free since you are going to be bound on memory IO. I.e. the extra nullcheck will just happen in the bubble in the CPU pipeline while waiting for data from memory. Scanning over the buffer twice will cause a lot more memory IO and will definitely be slower. It doesn't seem materially harder (a little more code, yes, but that's not the same thing), and it's more general-purpose. I agree it doesn't seem materially harder. I also agree that I don't have data to show that it's materially slower.
But it sounds like we're in agreement that keeping the logic outside is both harder and slower which honestly doesn't speak strongly in its favor. Sorry, I'm confused--you're saying that it isn't harder, but we're in agreement that it's harder. Please clarify what you mean. I don't believe it's meaningfully slower or harder. I'm saying that having separate functions for
* finding the null terminator
* decoding a set number of bytes
is both harder and slower for the webpage, than having a single function which just decodes to the null terminator. We can argue whether it's meaningfully slower or harder. But it seems like we agree that it's slower and harder. I don't understand the argument that the alternative is more general-purpose. The API is already generic in that you can use whatever delimiter you want since you pass in a view. The only functionality which is not available is finding a null-terminator in an arraybuffer which you are arguing below shouldn't be part of the decoder (which I agree with). I'm confused. What are you arguing? The alternative--taking the null terminator search out of the decoder--you seem to argue against (first sentence), then to agree with (last sentence). Can you back up and restate what you're saying from scratch? If you agree that creating separate functions for finding the null terminator and then decoding to it is harder and slower than having a single function which does both things, while still maintaining that separate functions are better, then clearly you must think that having separate functions brings some other benefit. I still don't understand what that benefit you are seeing is. You hinted at some more generic argument, but I still don't understand it. So far the only reason that has been brought up is that it provides an API for simply finding null terminators which could be useful if you are doing things other than decoding. Is that what you are talking about when you are saying that it's more generic? / Jonas
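[Editorial note: for concreteness, the two shapes under debate look roughly like this, using the modern TextDecoder name; the combined form is hypothetical, per the nullTerminator idea discussed later in the thread:]

```javascript
const bytes = new Uint8Array([0x68, 0x69, 0x00, 0x78]); // "hi", NUL, trailing junk

// Separate primitives: find the terminator, then decode a subarray.
let end = bytes.indexOf(0);
if (end === -1) end = bytes.length;
const str = new TextDecoder('utf-8').decode(bytes.subarray(0, end));

// Combined form (hypothetical option name, not a shipped API):
// const str2 = new TextDecoder('utf-8').decode(bytes, {nullTerminator: true});
```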
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 27, 2012 at 12:41 AM, Jonas Sicking jo...@sicking.cc wrote: The memchr is purely overhead, i.e. we are comparing memchr+decoding to decoding. So I don't see what's backing up the probably the fastest thing claim. If you don't do it as an initial pass, then you have to embed null checks into the inner loop of your decoding algorithm. For example, an ASCII decoder may look like:

// char *input = input buffer
// char *input_end = one past last byte of input buffer
// wchar_t *output = output buffer
input_end = memchr(input, 0, input_end - input);
while(input < input_end) {
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

If you don't do the initial search, then it becomes:

while(input < input_end && *input != 0) {
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

which means that you have an additional branch each time through the loop to check for the null terminator. That's likely to be slower than just doing another pass. But anyway, please either make a benchmark or two to show the differences we're talking about, or drop performance as an argument. This is all just a distraction otherwise. I don't think the speed of conversion is even a serious issue, much less the microseconds taken by memchr. I admit I missed the previous discussion which led to the agreement to keep the length measuring outside, so I don't know what arguments were presented. Any pointers would be appreciated. You've already mentioned one of them: being able to tell how many bytes were consumed. Having a view.indexOf function is also obviously generally useful, and it simplifies the API. Beyond that, having a feature--whether a wrapper or a flag to the actual decoder/encoder--that's just a shortcut for all of four or five lines of code is just a minor convenience. I don't think it's something so common that we need to save people a few lines of trivial wrapper code that they can write themselves.
It doesn't seem materially harder (a little more code, yes, but that's not the same thing), and it's more general-purpose. I agree it doesn't seem materially harder. I also agree that I don't have data to show that it's materially slower. But it sounds like we're in agreement that keeping the logic outside is both harder and slower which honestly doesn't speak strongly in its favor. Sorry, I'm confused--you're saying that it isn't harder, but we're in agreement that it's harder. Please clarify what you mean. I don't believe it's meaningfully slower or harder. I don't understand the argument that the alternative is more general-purpose. The API is already generic in that you can use whatever delimiter you want since you pass in a view. The only functionality which is not available is finding a null-terminator in an arraybuffer which you are arguing below shouldn't be part of the decoder (which I agree with). I'm confused. What are you arguing? The alternative--taking the null terminator search out of the decoder--you seem to argue against (first sentence), then to agree with (last sentence). Can you back up and restate what you're saying from scratch? -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 10:28 PM, Jonas Sicking jo...@sicking.cc wrote: On Mon, Mar 26, 2012 at 6:11 PM, Kenneth Russell k...@google.com wrote: On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote: On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote: * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. array of null terminated strings w/o length) would be impossible with just the nullTerminator flag. Requiring callers to find the null character first, and then use that will require one additional pass over the encoded binary data though. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. Though I guess the best solution would be to add methods to DataView which allow consuming an ArrayBuffer up to a null terminated point and return the decoded string. Potentially such a method could take a Decoder object as argument. The rationale for specifying the string encoding and decoding functionality outside the typed array specification is to keep the typed array spec small and easily implementable. The indexed property getters and setters on the typed array views, and methods on DataView, are designed to be implementable with a small amount of assembly code in JavaScript engines. I'd strongly prefer to continue to design the encoding/decoding functionality separately from the typed array views.
Is there a reason you couldn't keep the current set of functions on DataView implemented using a small amount of assembly code, and let the new functions fall back to slower C++ functions? That's possible. Another motivation for keeping encoding/decoding functionality separate is that it is likely that it will require a lot of spec text, which would dramatically increase the size of the typed array spec. Perhaps once all of the details have been hammered out on this thread it will be more obvious whether these methods would be much clearer if added directly to DataView. A couple of comments on the current StringEncoding proposal: - I think it should reference DataView directly rather than ArrayBufferView. The typed array spec was specifically designed with two use cases in mind: in-memory assembly of data to be sent to the graphics card or audio device, where the byte order must be that of the host architecture; and assembly of data for network transmission, where the byte order needs to be explicit. DataView covers the latter case. - It would be preferable if the encoding API had a way to avoid memory allocation, for example to encode into a passed-in DataView. -Ken
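[Editorial note: the "array of null terminated strings w/o length" case Jonas describes can be handled with the generic primitives alone. A sketch using the modern TextDecoder name and a hypothetical helper of our own, not part of any proposal:]

```javascript
// Decode a buffer holding NUL-terminated UTF-8 strings packed back to back.
// decodePackedStrings is an illustrative helper, not a shipped API.
function decodePackedStrings(bytes) {
  const decoder = new TextDecoder('utf-8');
  const strings = [];
  let pos = 0;
  while (pos < bytes.length) {
    let end = bytes.indexOf(0, pos);    // find the next terminator
    if (end === -1) end = bytes.length; // tolerate a missing final NUL
    strings.push(decoder.decode(bytes.subarray(pos, end)));
    pos = end + 1;                      // skip past the terminator
  }
  return strings;
}
```

Each string still costs an extra pass to locate its terminator, which is exactly the overhead Jonas is worried about.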
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote: - I think it should reference DataView directly rather than ArrayBufferView. The typed array spec was specifically designed with two use cases in mind: in-memory assembly of data to be sent to the graphics card or audio device, where the byte order must be that of the host architecture; This is wrong, broken, won't be implemented this way by any production browser, isn't how it's used in practice, and needs to be fixed in the spec. It violates the most basic web API requirement: interoperability. Please see earlier in the thread; the views affected by endianness need to be specced as little endian. That's what everyone is going to implement, and what everyone's pages are going to depend on, so it's what the spec needs to say. Separate types should be added for big-endian (eg. Int16BEArray). I also disagree that it should use DataView. Views are used to access arrays (including strings) within larger data structures. DataView is used to access packed data structures, where constructing a view for each variable in the struct is unwieldy. It might be useful to have a helper in DataView, but the core API should work on views. - It would be preferable if the encoding API had a way to avoid memory allocation, for example to encode into a passed-in DataView. This was an earlier design, and discussion led to it being removed as a premature optimization, to simplify the API. I'd recommend reading the rest of the thread. -- Glenn Maynard
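[Editorial note: to make the endianness point concrete, an illustrative sketch (not part of either proposal) of how Uint16Array element access depends on the host's byte order while DataView is explicit:]

```javascript
// The bytes of "h" and "i" in UTF-16LE.
const buffer = new Uint8Array([0x68, 0x00, 0x69, 0x00]).buffer;

// Uint16Array reads elements in the host's byte order, so words[0]
// is 0x0068 on a little-endian machine but 0x6800 on a big-endian one.
const words = new Uint16Array(buffer);

// DataView makes the byte order explicit on every access.
const view = new DataView(buffer);
const first = view.getUint16(0, /* littleEndian */ true); // always 0x0068
```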
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote: - I think it should reference DataView directly rather than ArrayBufferView. The typed array spec was specifically designed with two use cases in mind: in-memory assembly of data to be sent to the graphics card or audio device, where the byte order must be that of the host architecture; This is wrong, broken, won't be implemented this way by any production browser, isn't how it's used in practice, and needs to be fixed in the spec. It violates the most basic web API requirement: interoperability. Please see earlier in the thread; the views affected by endianness need to be specced as little endian. That's what everyone is going to implement, and what everyone's pages are going to depend on, so it's what the spec needs to say. Separate types should be added for big-endian (eg. Int16BEArray). Thanks for your input. The design of the typed array classes was informed by requirements about how the OpenGL, and therefore WebGL, API work; and from prior experience with the design and implementation of Java's New I/O Buffer classes, which suffered from horrible performance pitfalls because of a design similar to that which you suggest. Production browsers already implement typed arrays with their current semantics. It is not possible to change them and have WebGL continue to function. I will go so far as to say that the semantics will not be changed. In the typed array specification, unlike Java's New I/O specification, the API was split between two use cases: in-memory data construction (for consumption by APIs like WebGL and Web Audio), and file and network I/O. The API was carefully designed to avoid roadblocks that would prevent maximum performance from being achieved for these use cases. Experience has shown that the moment an artificial performance barrier is imposed, it becomes impossible to build certain kinds of programs. 
I consider it unacceptable to prevent developers from achieving their goals. I also disagree that it should use DataView. Views are used to access arrays (including strings) within larger data structures. DataView is used to access packed data structures, where constructing a view for each variable in the struct is unwieldy. It might be useful to have a helper in DataView, but the core API should work on views. This is one point of view. The true design goal of DataView is to supply the primitives for fast file and network input/output, where the endianness is explicitly specified in the file format. Converting strings to and from binary encodings is obviously an operation associated with transfer of data to or from files or the network. According to this taxonomy, the string encoding and decoding operations should only be associated with DataView, and not the other typed array types, which are designed for in-memory data assembly for consumption by other hardware on the system. - It would be preferable if the encoding API had a way to avoid memory allocation, for example to encode into a passed-in DataView. This was an earlier design, and discussion led to it being removed as a premature optimization, to simplify the API. I'd recommend reading the rest of the thread. I do apologize for not being fully caught up on the thread, but hope that the input above was still useful. -Ken
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/25/12 7:45 AM, Geoffrey Sneddon wrote: On 21/03/12 04:31, Mark Callow wrote: On 17/03/2012 08:19, Boris Zbarsky wrote: I think that trying to get web developers to do this right is a lost cause, esp. because none of them (to a good approximation) have any big-endian systems to test on. On what do you base this oft-repeated assertion? ARM CPUs can work either way. I have no idea how the various licensees are actually setting them up. All major mobile OSes use LE on ARM — I believe we currently don't ship anything on BE ARM. (We do, however, currently ship on BE MIPS, though MIPS too is mostly LE nowadays). Yep, exactly. Sorry I missed the original mail from Mark, but Geoffrey is spot on: none of the licensees actually shipping anything resembling consumer hardware are setting up their processors to run BE, to my knowledge. -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Sat, Mar 24, 2012 at 6:52 AM, Glenn Maynard gl...@zewt.org wrote: On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com wrote: Another way would be to have a second optional argument that indicates whether more bytes are coming (defaults to false), but I'm not sure of the chances that would be used correctly. The reasons you outline are probably why many browser implementations deal with EOF poorly too. It might not improve it, but I don't think it'd be worse. If you didn't use it correctly for an encoding where it matters, the breakage would be obvious. Also, the previous automatically-streaming API has another possible misuse: constructing a single encoder, then calling it repeatedly for unrelated strings, without calling eof() between them (trailing bytes would become U+FFFD in the next string). That'd be a less likely mistake with this, too. Agreed. Simple things should be simple. Here's a suggestion, working from that:

encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: false});

An alternative way to end the stream:

encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: true});
view = encoder.encode("", {continues: false});
// or
view = encoder.encode(""); // equivalent; continues defaults to false
// or
view = encoder.encode(); // maybe equivalent, if the first parameter is optional

The simplest usage is concise enough that we don't really need a separate str.encode() method:

view = Encoder("euc-kr").encode(str);

If it has an eof() method, it'd just be a literal wrapper for encoder.encode(), but it can probably be omitted. Agreed, I'd omit it. Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for.
{eof:true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF| ? If there aren't immediate objections, I'll update my wiki draft with this style of API, and see about updating my JS polyfill as well. Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ? One object type is simpler for the non-streaming case, e.g.:

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere...
str = g_codec.decode(view); // okay
view = g_codec.encode(str); // fine, no state captured
str = g_codec.decode(view); // still okay

but IMHO someone unfamiliar with the internals of encodings might extend the above into:

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere in some stream handling code...
str = g_codec.decode(view, {continues: true}); // okay..
view = g_codec.encode(str, {continues: true}); // sure, now both an encode and decode state are captured by codec
str = g_codec.decode(view, {continues: true}); // okay only if this is more of the same stream; if there are two incoming streams, this is wrong

The same mistake is possible with Encoder / Decoder objects, of course (you just need two globals). But something about separating them makes it clearer to me that the |continues| flag is affecting state in the object rather than just affecting the output of the call.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org wrote: Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for. {eof:true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF| ? Peter Beverloo suggests stream on IRC. I like it. Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ? Two seems cleaner. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote: On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org wrote: Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for. {eof:true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF| ? Peter Beverloo suggests stream on IRC. I like it. +1 Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ? Two seems cleaner. I've gone ahead and updated the wiki/draft: http://wiki.whatwg.org/wiki/StringEncoding This includes:

* TextEncoder / TextDecoder objects, with |encode| and |decode| methods that take option dicts
* A |stream| option, per the above
* A |nullTerminator| option that eliminates the need for a stringLength method (hasta la vista, baby!)
* The |encodedLength| method is dropped, since you can't in-place encode anyway
* Decoding errors yield fallback code points by default, but setting a |fatal| option causes a DOMException to be thrown instead
* Exceptions are specified as DOMException of type EncodingError, as a placeholder

New issues resulting from this refactor:

* You can change the options (stream, nullTerminator, fatal) midway through decoding a stream. This would be silly to do, but as written I don't think this makes the implementation more difficult. Alternately, the non-stream options could be set on the TextDecoder object itself.
* BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allowing the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified.
I've also partially updated the JS polyfill proof-of-concept implementation, tests, and examples as well, but it does not implement streaming yet (i.e. a stream option is ignored, state is always lost); I need to do a tiny bit more refactoring first.
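For concreteness, here is the kind of usage the |stream| option enables; a hedged sketch assuming the decode(view, {stream: ...}) shape from the draft (method names and option shapes are still under discussion in this thread):

```javascript
// Decode a UTF-8 sequence that arrives split across two chunks.
// "é" is 0xC3 0xA9; the first chunk ends mid-character.
var decoder = new TextDecoder("utf-8");

var chunk1 = new Uint8Array([0x68, 0x69, 0xC3]); // "hi" + lead byte of "é"
var chunk2 = new Uint8Array([0xA9]);             // continuation byte of "é"

// With {stream: true} the dangling lead byte is buffered as decoder state
// instead of being emitted as U+FFFD.
var out = decoder.decode(chunk1, { stream: true });
out += decoder.decode(chunk2); // stream defaults to false: flush
// out === "hié"
```

Without the stream option, the first call would emit "hi" followed by a replacement character, and the second call would produce another replacement character for the orphaned continuation byte.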
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote: * A |stream| option, per the above Does this make sense when you're using stream: false to flush the stream? It's still a streaming operation. I guess it's close enough. * A |nullTerminator| option eliminates the need for a stringLength method (hasta la vista, baby!) I strongly disagree with this change. It's much cleaner and more generic for the decoding algorithm to not know anything about null terminators, and to have separate general-purpose methods to determine the length of the string (memchr/wmemchr analogs, which we should have anyway). We made this simplification a long time ago--why did you resurrect this?

    array = new Int8Array(myArrayBuffer);
    length = array.indexOf(0); // same semantics as String.indexOf
    if (length != -1)
        array = array.subarray(0, length);
    new TextDecoder('utf-8').decode(array);

* BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allowing the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified. The path of fewest errors is probably to have a BOM override the specified UTF-16 endianness, so saying UTF-16BE just changes the default. An aside: The TypedArray constructors have a depressing design bug: new Int8Array(someOtherView) makes a copy of the data. It's nonsensical that view constructors create a view when passed an ArrayBuffer, but a copy when passed another view; creating a view should create a *view* if it's passed an object that already has ArrayBuffer-based storage, and making a copy should have been its own operation.
This means we can't say creating a view is cheap; we have to qualify it: creating a view is cheap, as long as you're careful not to call a constructor that makes a copy. It's frustrating that we're now stuck with a confusing, inconsistent API like this. I'm sure it's much too late to fix this properly, but hopefully an option can be added to fix it, so a new TypedArray(TypedArray, {view: true}) call actually creates a view. -- Glenn Maynard
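To make the copy-vs-view asymmetry concrete (this much is runnable against typed arrays as specified today; the {view: true} option is the hypothetical addition):

```javascript
var buffer = new ArrayBuffer(4);
var a = new Uint8Array(buffer); // a view over `buffer`
a[0] = 1;

var b = new Int8Array(buffer); // passed the ArrayBuffer: shares storage
var c = new Int8Array(a);      // passed another view: silently COPIES

a[0] = 42;
// b[0] === 42 (b sees the write), but c[0] === 1 (c is a snapshot).

// To get a view rather than a copy, you must go through the underlying buffer:
var d = new Int8Array(a.buffer, a.byteOffset, a.length);
// d[0] === 42
```

The three-argument form on the last line is the workaround the thread keeps reaching for, and forgetting it is exactly the accidental-copy trap described above.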
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 2:49 PM, Joshua Bell jsb...@chromium.org wrote: On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote: On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org wrote: Bikeshed: The |continues| term doesn't completely thrill me; it's clear in context, but not necessarily what someone might go searching for. {eof:true} would be lovely except we want the default to be yes-EOF but a falsy JS value. |noEOF| ? Peter Beverloo suggests stream on IRC. I like it. +1 Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ? Two seems cleaner. I've gone ahead and updated the wiki/draft: http://wiki.whatwg.org/wiki/StringEncoding This includes: * TextEncoder / TextDecoder objects, with |encode| and |decode| methods that take option dicts * A |stream| option, per the above * A |nullTerminator| option eliminates the need for a stringLength method (hasta la vista, baby!) * |encodedLength| method is dropped since you can't in-place encode anyway * decoding errors yield fallback code points by default, but setting a |fatal| option causes a DOMException to be thrown instead * specified exceptions as DOMException of type EncodingError, as a placeholder New issues resulting from this refactor: * You can change the options (stream, nullTerminator, fatal) midway through decoding a stream. This would be silly to do, but as written I don't think this makes the implementation more difficult. Alternately, the non-stream options could be set on the TextDecoder object itself. * BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allowing the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified.
I've also partially updated the JS polyfill proof-of-concept implementation, tests, and examples, but it does not implement streaming yet (i.e. a stream option is ignored, state is always lost); I need to do a tiny bit more refactoring first. This looks awesome! A few comments: * It appears that we lost the ability to measure how long a resulting buffer was going to be and then decode into the buffer. I don't know if this is an issue. * It might be a performance problem to have to check for the fatal/nullTerminator options on each call. * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 4:12 PM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote: * A |stream| option, per the above Does this make sense when you're using stream: false to flush the stream? It's still a streaming operation. I guess it's close enough. * A |nullTerminator| option eliminates the need for a stringLength method (hasta la vista, baby!) I strongly disagree with this change. It's much cleaner and more generic for the decoding algorithm to not know anything about null terminators, and to have separate general-purpose methods to determine the length of the string (memchr/wmemchr analogs, which we should have anyway). We made this simplification a long time ago--why did you resurrect this? Ah, I'd forgotten that there was consensus that doing this outside the API was preferable. I'll remove the option when I touch the spec again. * BOM handling needs to be resolved. The Encoding spec makes the encoding label secondary to the BOM. With this API it's unclear if that should be the case. Options include having a mismatching BOM throw, treating a mismatching BOM as a decoding error (i.e. fallback or throw, depending on options), or allowing the BOM to actually switch the decoder used for this stream - possibly if-and-only-if the default encoding was specified. The path of fewest errors is probably to have a BOM override the specified UTF-16 endianness, so saying UTF-16BE just changes the default. This would apply only if the previous call had {stream: false} (implicitly or explicitly). Calling with {stream:false} would reset for the next call. Would it apply only to UTF-16, or to UTF-8 as well? Should there be any special behavior when not specifying an encoding in the constructor? On Mon, Mar 26, 2012 at 4:27 PM, Jonas Sicking jo...@sicking.cc wrote: A few comments: * It appears that we lost the ability to measure how long a resulting buffer was going to be and then decode into the buffer. 
I don't know if this is an issue. True. On the plus side, the examples in the page (encode/decode array-of-strings) didn't change size or IMHO readability at all. * It might be a performance problem to have to check for the fatal/nullTerminator options on each call. No comment here. Moving the fatal and other options to the TextDecoder object rather than the decode() call is a possibility. I'm not sure which I prefer. * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. an array of null-terminated strings w/o length) would be impossible with just the nullTerminator flag.
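For comparison, the packed case with the search done by the caller is still short; a sketch (decodePackedStrings is a hypothetical helper, and indexOf on the view stands in for the memchr analog discussed in this thread):

```javascript
// Decode a tightly packed buffer of null-terminated UTF-8 strings,
// with the caller doing the terminator search and the byte accounting.
function decodePackedStrings(bytes) {
  var decoder = new TextDecoder("utf-8");
  var out = [];
  var pos = 0;
  while (pos < bytes.length) {
    var end = bytes.indexOf(0, pos); // memchr analog on the view
    if (end === -1) end = bytes.length;
    out.push(decoder.decode(bytes.subarray(pos, end))); // subarray: no copy
    pos = end + 1; // the caller knows exactly how many bytes were consumed
  }
  return out;
}

var packed = new Uint8Array([0x61, 0x00, 0x62, 0x63, 0x00]); // "a\0bc\0"
// decodePackedStrings(packed) → ["a", "bc"]
```

Because the caller finds each terminator itself, the consumed-byte count falls out of the loop for free, which is the point made above about the length being handy already.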
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 6:27 PM, Jonas Sicking jo...@sicking.cc wrote: * It appears that we lost the ability to measure how long a resulting buffer was going to be and then decode into the buffer. I don't know if this is an issue. The theory is that it probably isn't a real performance issue to decode into a new buffer, then copy it where you want it. If you think there are any cases where it matters, we should look at it, though. The extra GC might matter if you're doing a lot of large conversions, but that's easily fixed by adding ArrayBuffer.close(). * It might be a performance problem to have to check for the fatal/nullTerminator options on each call. Are you thinking of people, say, feeding in a single byte at a time? That seems like it'll be slow no matter what. On Mon, Mar 26, 2012 at 6:40 PM, Joshua Bell jsb...@chromium.org wrote: The path of fewest errors is probably to have a BOM override the specified UTF-16 endianness, so saying UTF-16BE just changes the default. This would apply only if the previous call had {stream: false} (implicitly or explicitly). Right. The following two operations should be exactly identical, for every possible input and combination of options, and should leave the decoder in the same state:

    str1 = decoder.decode(bytes.subarray(0, 8), {stream: true});
    str2 = decoder.decode(bytes.subarray(8));
    return str1 + str2;

versus:

    return decoder.decode(bytes);

Calling with {stream:false} would reset for the next call. Right: after a {stream:false} call, a decoder or encoder should be equivalent to a newly-created one. Would it apply only to UTF-16 or UTF-8 as well? Should there be any special behavior when not specifying an encoding in the constructor? Do you mean, should decoding UTF-8 switch to UTF-16 if it starts with a UTF-16 BOM? I think that would be confusing. 
If people want to autodetect UTF-16 like that, they should probably do it themselves. I think browsers do this with text/html, but that's just a web-compatibility wart, not a feature... -- Glenn Maynard
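The equivalence described above can be stated as a testable property; a sketch using UTF-8 and the worst-case split (the input fed one byte at a time must decode identically to a one-shot call):

```javascript
var str = "h\u00e9llo \u2603"; // multi-byte characters to stress chunk boundaries
var bytes = new TextEncoder().encode(str); // UTF-8 bytes

var oneShot = new TextDecoder("utf-8").decode(bytes);

var decoder = new TextDecoder("utf-8");
var streamed = "";
for (var i = 0; i < bytes.length; i++) {
  // Feed one byte at a time; incomplete sequences stay buffered in the decoder.
  streamed += decoder.decode(bytes.subarray(i, i + 1), { stream: true });
}
streamed += decoder.decode(); // final flush, equivalent to {stream: false}
// streamed === oneShot
```

Any observable difference between the two results, for any split point, would mean the decoder's internal state is leaking into the output.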
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote: * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. an array of null-terminated strings w/o length) would be impossible with just the nullTerminator flag. Requiring callers to find the null character first, and then use that, will require one additional pass over the encoded binary data though. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. Though I guess the best solution would be to add methods to DataView which allow consuming an ArrayBuffer up to a null-terminated point and return the decoded string. Potentially such a method could take a Decoder object as argument. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote: On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote: * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. an array of null-terminated strings w/o length) would be impossible with just the nullTerminator flag. Requiring callers to find the null character first, and then use that, will require one additional pass over the encoded binary data though. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. Though I guess the best solution would be to add methods to DataView which allow consuming an ArrayBuffer up to a null-terminated point and return the decoded string. Potentially such a method could take a Decoder object as argument. The rationale for specifying the string encoding and decoding functionality outside the typed array specification is to keep the typed array spec small and easily implementable. The indexed property getters and setters on the typed array views, and methods on DataView, are designed to be implementable with a small amount of assembly code in JavaScript engines. I'd strongly prefer to continue to design the encoding/decoding functionality separately from the typed array views. -Ken
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 9:11 PM, Kenneth Russell k...@google.com wrote: The rationale for specifying the string encoding and decoding functionality outside the typed array specification is to keep the typed array spec small and easily implementable. The indexed property getters and setters on the typed array views, and methods on DataView, are designed to be implementable with a small amount of assembly code in JavaScript engines. I'd strongly prefer to continue to design the encoding/decoding functionality separately from the typed array views. However, if the browsers don't all implement this, then you can't rely on it being there. In apps where you compile separately for each browser, you only pay the cost where the browser doesn't implement it (for example, in GWT we emulate DataView and Uint8ClampedArray where it is missing). Even then, you may have to include both versions and do runtime detection, such as when later versions of the browser include the functionality -- that may be worse than simply not using the API at all if you care more about code size than execution speed of encoding/decoding text. So, personally I think whatever gets the most browsers to completely implement it is better, whether that is being part of the typed arrays spec or separate. Logically, it seems to fit most directly in DataView. -- John A. Tamplin Software Engineer (GWT), Google
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote: Requiring callers to find the null character first, and then use that will require one additional pass over the encoded binary data though. That's extremely fast (memchr), and it's probably the fastest thing to do anyway, compared to embedding null-termination logic into the inner loop of decode functions. Unless there's a concrete benchmark showing that it's slower, and slower enough to actually matter, this shouldn't be a consideration. It's a premature optimization. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. It doesn't seem materially harder (a little more code, yes, but that's not the same thing), and it's more general-purpose. The API for finding the character doesn't belong on Decoder. It should probably go on each View type, analogous to String.indexOf. Multi-byte views should search on the view's size; e.g. Int16Array.indexOf(i) maps to wmemchr. Though I guess the best solution would be to add methods to DataView which allows consuming an ArrayBuffer up to a null terminated point and returns the decoded string. Potentially such a method could take a Decoder object as argument. I guess. It doesn't seem that important, since it's just a few lines of code. If this is done, I'd suggest that this helper API *not* have any special support for streaming (not to disallow it, but not to have any special handling for it, either). I think streaming has little overlap with null-terminated fields, since null-termination is typically used with fixed-size buffers. It would complicate things; for example, you'd need some way to signal to the caller that a null terminator was encountered. 
That is, it'd basically look like:

    function decodeNullTerminated(decoder, options) {
        // Pick the correct element type, so indexOf and subarray work in
        // 16-bit units for UTF-16.
        var encoding = decoder.encoding.toLowerCase();
        var arrayType = (encoding == 'utf-16le' || encoding == 'utf-16be') ?
            Int16Array : Int8Array;
        var array = new arrayType(this.buffer, this.byteOffset,
            this.byteLength / arrayType.BYTES_PER_ELEMENT);
        var terminator = array.indexOf(0);
        if (terminator != -1)
            array = array.subarray(0, terminator);
        return decoder.decode(array, options);
    }

which doesn't specifically prohibit options including {stream: true}, but doesn't attempt to make it useful. (Side note: If you have null-terminated strings, you're almost always dealing with only multibyte encodings like UTF-8, or only wide encodings like UTF-16, so you'd just use the appropriate type. That is, the minor complication of the first lines above isn't something that users would normally actually need to do.) On Mon, Mar 26, 2012 at 8:11 PM, Kenneth Russell k...@google.com wrote: The rationale for specifying the string encoding and decoding functionality outside the typed array specification is to keep the typed array spec small and easily implementable. The indexed property getters and setters on the typed array views, and methods on DataView, are designed to be implementable with a small amount of assembly code in JavaScript engines. I'd strongly prefer to continue to design the encoding/decoding functionality separately from the typed array views. It doesn't need to go into the Typed Array spec. It can just be an addition to the interface provided by an external specification, which doesn't need to be implemented to implement typed arrays itself. I don't think it's an important thing to have, but this in particular doesn't seem like a problem. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote: I guess. It doesn't seem that important, since it's just a few lines of code. If this is done, I'd suggest that this helper API *not* have any special support for streaming (not to disallow it, but not to have any special handling for it, either). I think streaming has little overlap with null-terminated fields, since null-termination is typically used with fixed-size buffers. It would complicate things; for example, you'd need some way to signal to the caller that a null terminator was encountered. Agreed. Also worth relaying to this thread is that in addition to null termination there have been requests for other terminators, such as 0xFF which is an invalid byte in a UTF-8 stream and thus a lovely terminator. Other byte sequences were mentioned. (This was over in the Khronos WebGL list for anyone who wants to dig it up. It was tracked as an unresolved ISSUE in the spec.) This supports the assertion that we should not special case null terminators, but instead provide general (and highly optimizable) utilities like memchr operating on buffers, since we can't anticipate every usage in higher-level APIs like the one under discussion.
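A sketch of such a general utility, with the terminator byte supplied by the caller (decodeUpTo is a hypothetical helper; indexOf on the view plays the memchr role):

```javascript
// Decode bytes up to an arbitrary terminator byte (0x00, 0xFF, ...),
// or the whole buffer if the terminator is absent.
function decodeUpTo(bytes, terminator, label) {
  var end = bytes.indexOf(terminator); // memchr analog on the view
  var slice = (end === -1) ? bytes : bytes.subarray(0, end); // no copy
  return new TextDecoder(label || "utf-8").decode(slice);
}

var packed = new Uint8Array([0x61, 0x62, 0x63, 0xFF, 0x64]); // "abc", 0xFF, "d"
// decodeUpTo(packed, 0xFF) === "abc"
```

Because the terminator is just a parameter, the same helper covers null termination, the 0xFF convention from the WebGL list, or any other sentinel byte, without the decoder itself knowing about any of them.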
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 6:11 PM, Kenneth Russell k...@google.com wrote: On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote: On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote: * We lost the ability to decode from an arraybuffer and see how many bytes were consumed before a null-terminator was hit. One not terribly elegant solution would be to add a TextDecoder.decodeWithLength method which returns a DOMString+length tuple. Agreed, but of course see above - there was consensus earlier in the thread that searching for null terminators should be done outside the API, therefore the caller will have the length handy already. Yes, this would be a big flaw since decoding a tightly packed data structure (e.g. an array of null-terminated strings w/o length) would be impossible with just the nullTerminator flag. Requiring callers to find the null character first, and then use that, will require one additional pass over the encoded binary data though. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. Though I guess the best solution would be to add methods to DataView which allow consuming an ArrayBuffer up to a null-terminated point and return the decoded string. Potentially such a method could take a Decoder object as argument. The rationale for specifying the string encoding and decoding functionality outside the typed array specification is to keep the typed array spec small and easily implementable. The indexed property getters and setters on the typed array views, and methods on DataView, are designed to be implementable with a small amount of assembly code in JavaScript engines. I'd strongly prefer to continue to design the encoding/decoding functionality separately from the typed array views. 
Is there a reason you couldn't keep the current set of functions on DataView implemented using a small amount of assembly code, and let the new functions fall back to slower C++ functions? / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 26, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote: Requiring callers to find the null character first, and then use that will require one additional pass over the encoded binary data though. That's extremely fast (memchr), and it's probably the fastest thing to do anyway, compared to embedding null-termination logic into the inner loop of decode functions. The memchr is purely overhead, i.e. we are comparing memchr+decoding to decoding alone. So I don't see what's backing up the "probably the fastest thing" claim. Unless there's a concrete benchmark showing that it's slower, and slower enough to actually matter, this shouldn't be a consideration. It's a premature optimization. My argument is that it's both faster and more author friendly. I admit I missed the previous discussion which led to the agreement to keep the length measuring outside, so I don't know what arguments were presented. Any pointers would be appreciated. Also, if we put the API for finding the null character on the Decoder object it doesn't seem like we're creating an API which is easier to use, just one that has moved some of the logic from the API to every caller. It doesn't seem materially harder (a little more code, yes, but that's not the same thing), and it's more general-purpose. I agree it doesn't seem materially harder. I also agree that I don't have data to show that it's materially slower. But it sounds like we're in agreement that keeping the logic outside is both harder and slower, which honestly doesn't speak strongly in its favor. I don't understand the argument that the alternative is more general-purpose. The API is already generic in that you can use whatever delimiter you want since you pass in a view. The only functionality which is not available is finding a null-terminator in an arraybuffer, which you are arguing below shouldn't be part of the decoder (which I agree with). / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com wrote: Another way would be to have a second optional argument that indicates whether more bytes are coming (defaults to false), but I'm not sure of the chances that would be used correctly. The reasons you outline are probably why many browser implementations deal with EOF poorly too. It might not improve it, but I don't think it'd be worse. If you didn't use it correctly for an encoding where it matters, the breakage would be obvious. Also, the previous automatically-streaming API has another possible misuse: constructing a single encoder, then calling it repeatedly for unrelated strings, without calling eof() between them (trailing bytes would become U+FFFD in the next string). That'd be a less likely mistake with this, too. Here's a suggestion, working from that:

    encoder = Encoder("euc-kr");
    view = encoder.encode(str1, {continues: true});
    view = encoder.encode(str2, {continues: true});
    view = encoder.encode(str3, {continues: false});

An alternative way to end the stream:

    encoder = Encoder("euc-kr");
    view = encoder.encode(str1, {continues: true});
    view = encoder.encode(str2, {continues: true});
    view = encoder.encode(str3, {continues: true});
    view = encoder.encode("", {continues: false});
    // or
    view = encoder.encode(""); // equivalent; continues defaults to false
    // or
    view = encoder.encode(); // maybe equivalent, if the first parameter is optional

The simplest usage is concise enough that we don't really need a separate str.encode() method:

    view = Encoder("euc-kr").encode(str);

If it has an eof() method, it'd just be a literal wrapper for encoder.encode(), but it can probably be omitted. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, 22 Mar 2012 03:12:25 +0100, Mark Callow callow_m...@hicorp.co.jp wrote: This has encode and decode reversed from my understanding. I regard the string (wide-char) as the canonical form and the bytes as the encoded form. This view is reflected in the widely used terminology charset encodings which refers to the likes of euc-kr and shift_jis. Yeah, I suspect we'll get it right once put in a draft :-) -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, 21 Mar 2012 16:53:36 +0100, Joshua Bell jsb...@chromium.org wrote: Just to throw it out there - does anyone feel we can/should offer asymmetric encode/decode support, i.e. supporting more encodings for decode operations than for encode operations? XMLHttpRequest has that. You can only send (encode) UTF-8, receive (decode) everything. Forms can send everything. URL query parameters can encode everything (though the page itself has to be encoded in the encoding of choice). If we have no use cases just supporting encoding UTF-8 seems fine to me, but I think the design should allow for other encodings in the future. Bikeshedding on the name - we'd have to put String or Text in there somewhere, since audio/video/image codecs will likely want to use similar terms. They can use the prefixed variants :-) If we have to use a prefix String seems better, as Text is a node object in the platform. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, 22 Mar 2012 10:19:30 +0100, Anne van Kesteren ann...@opera.com wrote: They can use the prefixed variants :-) If we have to use a prefix String seems better, as Text is a node object in the platform. Simon pointed out Text as prefix is probably better (it is used elsewhere in the platform unrelated to nodes (e.g. TextTrack)), though I'd personally prefer simply Decoder/Encoder. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 03/21/2012 04:53 PM, Joshua Bell wrote: As for the API, how about:

    enc = new Encoder("euc-kr")
    string1 = enc.encode(bytes1)
    string2 = enc.encode(bytes2)
    string3 = enc.eof() // might return empty string if all is fine

And similarly you would have

    dec = new Decoder("shift_jis")
    bytes = dec.decode(string)

Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both:

    enc = new Encoding("gb18030")
    bytes1 = enc.decode(string1)
    string2 = enc.encode(bytes2)

I don't mind this API for complex usecases e.g. streaming, but it is massive overkill for the simple common case of I have a list of bytes that I want to decode to a string or I have a string that I want to encode to bytes. For those cases I strongly prefer the earlier API along the lines of String.prototype.encode(encoding) ArrayBufferView.prototype.decode(encoding)
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 21, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote: As for the API, how about: enc = new Encoder("euc-kr") string1 = enc.encode(bytes1) string2 = enc.encode(bytes2) string3 = enc.eof() // might return empty string if all is fine A problem with this is that the bugs resulting from not calling eof() are subtle. The only thing eof() would ever do, I think, is return U+FFFD characters if there are leftover characters in the internal buffer; if you never call eof(), you'll never get incorrect results unless you test with invalid inputs. It's minor, as subtle-edge-cases-that-people-won't-get-right go, but it's at least worth a mention. Maybe people who would use this API instead of the simpler non-streaming version (which could be a thin wrapper on this) in the first place are also more likely to get this right. I'm guessing a common, incorrect pattern would be:

    string = new Encoder("euc-kr").encode(bytes);

which would *not* be equivalent to bytes.encode("euc-kr"). -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, 22 Mar 2012 14:47:05 +0100, Glenn Maynard gl...@zewt.org wrote: A problem with this is that the bugs resulting from not calling eof() are subtle. The only thing eof() would ever do, I think, is return U+FFFD characters if there are leftover characters in the internal buffer; if you never call eof(), you'll never get incorrect results unless you test with invalid inputs. It's minor, as subtle-edge-cases-that-people-won't-get-right go, but it's at least worth a mention. Maybe people who would use this API instead of the simpler non-streaming version (which could be a thin wrapper on this) in the first place are also more likely to get this right. I'm guessing a common, incorrect pattern would be: string = new Encoder("euc-kr").encode(bytes); which would *not* be equivalent to bytes.encode("euc-kr"). Another way would be to have a second optional argument that indicates whether more bytes are coming (defaults to false), but I'm not sure of the chances that would be used correctly. The reasons you outline are probably why many browser implementations deal with EOF poorly too. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/22 Anne van Kesteren ann...@opera.com: As for the API, how about: enc = new Encoder("euc-kr") string1 = enc.encode(bytes1) string2 = enc.encode(bytes2) string3 = enc.eof() // might return empty string if all is fine And similarly you would have dec = new Decoder("shift_jis") bytes = dec.decode(string) Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both: enc = new Encoding("gb18030") bytes1 = enc.decode(string1) string2 = enc.encode(bytes2) Usually, strings are encoded to bytes. Therefore the encode/decode methods should be reversed, like:

    enc = new Encoding("gb18030")
    bytes1 = enc.encode(string1)
    string2 = enc.decode(bytes2)

Or if it may cause confusion use getBytes/getChars like Java and C#. http://docs.oracle.com/javase/7/docs/api/java/lang/String.html http://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx#Y1873 http://msdn.microsoft.com/en-us/library/system.text.Decoder(v=vs.110).aspx#Y1873

    enc = new Encoding("gb18030")
    bytes1 = enc.getBytes(string1)
    string2 = enc.getChars(bytes2)

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 20, 2012 at 10:39 AM, Joshua Bell jsb...@chromium.org wrote: On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote: Why are encodings different than other parts of the API where you indeed have to know what works and what doesn't. Do you memorize lists of encodings? I certainly don't. I look them up as needed. UTF8 is stateful, so I disagree. No, UTF-8 doesn't require a stateful decoder to support streaming. You decode up to the last codepoint that you can decode completely. The return values are the output data, the number of bytes output, and the number of bytes consumed; that's all you need to restart decoding later. That's the iconv(3) approach that we're probably all familiar with, which works with almost all encodings. ISO-2022 encodings are stateful: you have to persistently remember the character subsets activated by earlier escape sequences. An iconv-like streaming API is impossible; to support streamed decoding, you'd need to have a decoder object that the user keeps around in order to store that state. http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure Which seems like it leaves us with these options: 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8) 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers) 3. Expand the API to return encoder/decoder objects that capture state I'm pretty sure there is consensus for supporting UTF8. UTF8 is stateful though can be made not stateful by not consuming all characters and instead forcing the caller to keep the state (in the form of unconsumed text). So I would rephrase your 3 options above as: 1) Create an API which forces consumers to do state handling. 
Probably leading to people creating wrappers which essentially implement option 3 2) Don't support streaming 3) Have encoder/decoder objects which hold state I personally don't think 1 is a good option since it's basically the same as 3 but just with libraries doing some of the work. We might as well do that work so that libraries aren't needed. This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/21 Glenn Maynard gl...@zewt.org: On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote: 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8) 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers) 3. Expand the API to return encoder/decoder objects that capture state Any others? Trying to simplify the problem but take on both (1) and (2) without (3) would lead to an API that could not encompass (3) in the future, which would be a mistake. I don't think that's obviously a mistake. Only the nastiest, wartiest of legacy encodings require it. The categories feel strange. If the conversion is not streaming (whole strings/whole buffers), its implementation should simply be a wrapper over the browser's conversion functions. There is no need for a state object to save the state, because the conversion is done by the time the function completes, even for a stateful encoding. For streaming conversion, it needs state even if the encoding is stateless. When the given partial input ends in the middle of a character, like \xE3\x81\x82\xC2, the conversion consumes 4 bytes, outputs one character \u3042, and remembers the partial byte \xC2. That byte is the state. That said, it's fairly simple to later return an additional state object from the previously proposed streaming APIs, eg. result = decode(str, 0, outputView) // result.outputBytes == 15 // result.nextInputByte == 5 // result.state == opaque object result2 = decode(str, result.nextInputByte, outputView, {state: result.state}); You can refer to mbsrtowcs(3), which converts a character string to a wide-character string (restartable). It uses an opaque state. size_t mbsnrtowcs(wchar_t *restrict dst, const char **restrict src, size_t nmc, size_t len, mbstate_t *restrict ps); http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbsrtowcs.html Anyway, they need to signal an error if the byte sequence is invalid for the encoding. 
-- NARUSE, Yui nar...@airemix.jp
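NARUSE's \xE3\x81\x82\xC2 example maps directly onto the decoder-object approach (option 3): the dangling lead byte is exactly the state the object carries between calls. A sketch using the streaming TextDecoder API that later shipped:

```javascript
const dec = new TextDecoder("utf-8");

// E3 81 82 decodes to "\u3042"; C2 is the lead byte of a two-byte
// sequence, so the decoder buffers it as internal state.
const first = dec.decode(new Uint8Array([0xE3, 0x81, 0x82, 0xC2]),
                         { stream: true });

// The next chunk supplies the continuation byte: C2 A9 is "\u00A9" (©).
const second = dec.decode(new Uint8Array([0xA9]), { stream: true });
```

The caller never sees the buffered \xC2; the decoder object hides the state, which is the simplification option 3 buys over making callers juggle unconsumed bytes themselves.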
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/21 Jonas Sicking jo...@sicking.cc: I'm pretty sure there is consensus for supporting UTF8. UTF8 is stateful though can be made not stateful by not consuming all characters and instead forcing the caller to keep the state (in the form of unconsumed text). Your use of the word stateful involves a misunderstanding. Usually, a stateful encoding means one that keeps state between characters, not between bytes. What you mean is usually expressed by the word multibyte. UTF-8 is a multibyte encoding, and it needs to keep state when streaming. So I would rephrase your 3 options above as: 1) Create an API which forces consumers to do state handling. Probably leading to people creating wrappers which essentially implement option 3 2) Don't support streaming 3) Have encoder/decoder objects which hold state I personally don't think 1 is a good option since it's basically the same as 3 but just with libraries doing some of the work. We might as well do that work so that libraries aren't needed. This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. I think it should provide a non-streaming API, and if there is a concrete use case, provide a streaming API as a separate one. -- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc wrote: This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. For XMLHttpRequest it might be, yes. I think we should expose the same encoding set throughout the platform. One reason to limit the encoding set initially might be because we have not all converged yet on our encoding sets. Gecko, Safari, and Internet Explorer expose a lot more encodings than Opera and Chrome. As for the API, how about: enc = new Encoder(euc-kr) string1 = enc.encode(bytes1) string2 = enc.encode(bytes2) string3 = enc.eof() // might return empty string if all is fine And similarly you would have dec = new Decoder(shift_jis) bytes = dec.decode(string) Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both: enc = new Encoding(gb18030) bytes1 = enc.decode(string1) string2 = enc.encode(bytes2) -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 21, 2012 at 3:27 AM, Jonas Sicking jo...@sicking.cc wrote: 1) Create an API which forces consumers to do state handling. Probably leading to people creating wrappers which essentially implement option 3 It's not the same. Please look at how ISO-2022 works: the stream has *long-lived* state, with escape sequences that change the meaning of later code sequences in the stream. For example, you have to remember whether GR is encoding G1, G2 or G3. This can't be stored merely by remembering the next input byte you have to start at. As Yui said, the sort of state UTF-8 has isn't what people mean when we talk about stateful encodings. On Wed, Mar 21, 2012 at 3:34 AM, NARUSE, Yui nar...@airemix.jp wrote: For streaming conversion, it needs state even if the encoding is stateless. When the given partial input is finished at the middle of a character like \xE3\x81\x82\xC2, the conversion consumes 4 bytes, output one character \u3042, and remember the partial bytes \xC2. This bytes is the state. You don't need to do that. You can simply convert as many output codepoints as can be *completely* converted. In this example, you'd consume 3 bytes and output one codepoint. You don't consume data that you can't immediately convert, so you don't have to buffer anything. (We don't have to do it that way, of course; just pointing out that you don't *need* special state for streaming encodings like UTF-8.) Anyway, they need error if the byte sequence is invalid for the encoding. Errors were discussed previously: by default errors output U+FFFD (or another replacement character, for encoding unsupported characters to non-Unicode encodings), and we may have an option to turn it into an exception. -- Glenn Maynard
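Glenn's consume-only-complete-codepoints scheme needs one small piece of machinery: given a chunk, find where the last complete UTF-8 sequence ends so the remainder can be held back and retried with the next chunk. A hypothetical helper (not part of any proposal in this thread) sketching that split point:

```javascript
// Returns the length of the longest prefix of `bytes` consisting only
// of complete UTF-8 sequences. Invalid input is returned whole, so the
// actual decoder (not this helper) reports the error.
function utf8CompletePrefix(bytes) {
  let i = bytes.length;
  let back = 0;
  // Walk back over up to 3 trailing continuation bytes (10xxxxxx).
  while (back < 3 && i > 0 && (bytes[i - 1] & 0xC0) === 0x80) { i--; back++; }
  if (i === 0) return bytes.length;          // only continuations: invalid
  const lead = bytes[i - 1];
  let need;                                   // total length the lead claims
  if ((lead & 0x80) === 0x00) need = 1;       // ASCII
  else if ((lead & 0xE0) === 0xC0) need = 2;  // 2-byte sequence
  else if ((lead & 0xF0) === 0xE0) need = 3;  // 3-byte sequence
  else if ((lead & 0xF8) === 0xF0) need = 4;  // 4-byte sequence
  else return bytes.length;                   // invalid lead byte
  // Complete trailing sequence: keep everything; else cut before it.
  return (back + 1 >= need) ? bytes.length : i - 1;
}
```

With this, a streaming caller holds `bytes.subarray(utf8CompletePrefix(bytes))` for the next round; no decoder-side state is needed for UTF-8, which is exactly Glenn's point, and exactly what breaks down for ISO-2022's long-lived shift state.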
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesteren ann...@opera.comwrote: On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc wrote: This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. For XMLHttpRequest it might be, yes. I think we should expose the same encoding set throughout the platform. One reason to limit the encoding set initially might be because we have not all converged yet on our encoding sets. Gecko, Safari, and Internet Explorer expose a lot more encodings than Opera and Chrome. Just to throw it out there - does anyone feel we can/should offer asymmetric encode/decode support, i.e. supporting more encodings for decode operations than for encode operations? As for the API, how about: enc = new Encoder(euc-kr) string1 = enc.encode(bytes1) string2 = enc.encode(bytes2) string3 = enc.eof() // might return empty string if all is fine And similarly you would have dec = new Decoder(shift_jis) bytes = dec.decode(string) Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both: enc = new Encoding(gb18030) bytes1 = enc.decode(string1) string2 = enc.encode(bytes2) That's the direction my thinking was headed. Glenn pointed out that the state that's implicitly captured in the above objects could instead be returned as an explicit but opaque state object that's passed in and out of stateless functions. As a potential user of the API, I find the above object-oriented style easier to understand. Re: Encoding object vs. an Encoder/Decoder pair - I'd prefer the latter as it makes the state being captured and any methods/attributes to interrogate the state clearer. Bikeshedding on the name - we'd have to put String or Text in there somewhere, since audio/video/image codecs will likely want to use similar terms.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/21/2012 8:53 AM, Joshua Bell wrote: On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesterenann...@opera.comwrote: On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sickingjo...@sicking.cc wrote: This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it. For XMLHttpRequest it might be, yes. I think we should expose the same encoding set throughout the platform. One reason to limit the encoding set initially might be because we have not all converged yet on our encoding sets. Gecko, Safari, and Internet Explorer expose a lot more encodings than Opera and Chrome. Just to throw it out there - does anyone feel we can/should offer asymmetric encode/decode support, i.e. supporting more encodings for decode operations than for encode operations? In the past decade I've never had to encode into something other than UTF-8. I have had to decode many encoding sets. If I did need to do a special encoding, given the state of typed arrays, I'd probably just implement the encoding in JS. +1 for asymmetric from my experience. -Charles
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 22/03/2012 04:42, Anne van Kesteren wrote: ... As for the API, how about: enc = new Encoder(euc-kr) string1 = enc.encode(bytes1) string2 = enc.encode(bytes2) string3 = enc.eof() // might return empty string if all is fine And similarly you would have dec = new Decoder(shift_jis) bytes = dec.decode(string) Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both: enc = new Encoding(gb18030) bytes1 = enc.decode(string1) string2 = enc.encode(bytes2) This has encode and decode reversed from my understanding. I regard the string (wide-char) as the canonical form and the bytes as the encoded form. This view is reflected in the widely used terminology charset encodings which refers to the likes of euc-kr and shift_jis. Regards -Mark
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote: Why are encodings different than other parts of the API where you indeed have to know what works and what doesn't. Do you memorize lists of encodings? I certainly don't. I look them up as needed. UTF8 is stateful, so I disagree. No, UTF-8 doesn't require a stateful decoder to support streaming. You decode up to the last codepoint that you can decode completely. The return values are the output data, the number of bytes output, and the number of bytes consumed; that's all you need to restart decoding later. That's the iconv(3) approach that we're probably all familiar with, which works with almost all encodings. ISO-2022 encodings are stateful: you have to persistently remember the character subsets activated by earlier escape sequences. An iconv-like streaming API is impossible; to support streamed decoding, you'd need to have a decoder object that the user keeps around in order to store that state. http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote: Why are encodings different than other parts of the API where you indeed have to know what works and what doesn't. Do you memorize lists of encodings? I certainly don't. I look them up as needed. UTF8 is stateful, so I disagree. No, UTF-8 doesn't require a stateful decoder to support streaming. You decode up to the last codepoint that you can decode completely. The return values are the output data, the number of bytes output, and the number of bytes consumed; that's all you need to restart decoding later. That's the iconv(3) approach that we're probably all familiar with, which works with almost all encodings. ISO-2022 encodings are stateful: you have to persistently remember the character subsets activated by earlier escape sequences. An iconv-like streaming API is impossible; to support streamed decoding, you'd need to have a decoder object that the user keeps around in order to store that state. http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure Which seems like it leaves us with these options: 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8) 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers) 3. Expand the API to return encoder/decoder objects that capture state Any others? Trying to simplify the problem but take on both (1) and (2) without (3) would lead to an API that could not encompass (3) in the future, which would be a mistake. I'll throw out that the in-progress design of a Globalization API for ECMAScript - http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ - is currently spec'd both to build on the existing locale-aware methods on String/Number/Date prototypes as conveniences and to introduce the Collator and *Format objects. 
Should we start with UTF-8-only/non-streaming methods on DOMString/ArrayBufferView, and avoid constraining a future API supporting multiple, possibly stateful encodings and streaming?
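For reference, the UTF-8-only, non-streaming shape Joshua suggests starting from is essentially what later shipped as TextEncoder/TextDecoder (attached to constructors rather than to DOMString/ArrayBufferView, and with TextEncoder indeed UTF-8 only):

```javascript
// Whole-string encode and whole-buffer decode; no state objects needed.
const bytes = new TextEncoder().encode("caf\u00E9");   // Uint8Array
const str   = new TextDecoder("utf-8").decode(bytes);  // round-trips

// "é" is two bytes in UTF-8, so bytes.length is 5.
```

Because every call sees the complete input, even a stateful encoding could be decoded this way; streaming is the only case that forces the option-3 design.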
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote: 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8) 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers) 3. Expand the API to return encoder/decoder objects that capture state Any others? Trying to do simplify the problem but take on both (1) and (2) without (3) would lead to an API that could not encompass (3) in the future, which would be a mistake. I don't think that's obviously a mistake. Only the nastiest, wartiest of legacy encodings require it. That said, it's fairly simple to later return an additional state object from the previously proposed streaming APIs, eg. result = decode(str, 0, outputView) // result.outputBytes == 15 // result.nextInputByte == 5 // result.state == opaque object result2 = decode(str, result.nextInputByte, outputView, {state: result.state}); -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 17/03/2012 08:19, Boris Zbarsky wrote: I think that trying to get web developers to do this right is a lost cause, esp. because none of them (to a good approximation) have any big-endian systems to test on. On what do you base this oft-repeated assertion? ARM CPUs can work either way. I have no idea how the various licensees are actually setting them up. Regards -Mark
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 14, 2012 at 12:49 AM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); It saddens me that this allows non-UTF-8 encodings. However, since use cases for non-UTF-8 encodings were mentioned in this thread, I suggest that the set of supported encodings be an enumerated set of encodings stated in a spec and browsers MUST NOT support other encodings. The set should probably be the set offered in the encoding popup at http://validator.nu/?charset or a subset thereof (containing at least UTF-8 of course). (That set was derived by researching the intersection of the encodings supported by browsers, Python and the JDK.) would go a very long way. Are you sure that it's not necessary to support streaming conversion? The suggested API design assumes you always have the entire data sequence in a single DOMString or ArrayBufferView. The question is where to stick these functions. Internationalization doesn't have an obvious object we can hang functions off of (unlike, for example, crypto), and the above names are much too generic to turn into global functions. If we deem streaming conversion unnecessary, I'd put the methods on DOMString and ArrayBufferView. It would be terribly sad to let the schedules of various working groups affect the API design. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 7:00 AM, Henri Sivonen hsivo...@iki.fi wrote: On Wed, Mar 14, 2012 at 12:49 AM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); It saddens me that this allows non-UTF-8 encodings. However, since use cases for non-UTF-8 encodings were mentioned in this thread, I suggest that the set of supported encodings be an enumerated set of encodings stated in a spec and browsers MUST NOT support other encodings. The set should probably be the set offered in the encoding popup at http://validator.nu/?charset or a subset thereof (containing at least UTF-8 of course). (That set was derived by researching the intersection of the encodings supported by browsers, Python and the JDK.) Yes, I think we should enumerate the set of encodings supported. Ideally we'd for simplicity support the same set of enumerated encodings everywhere in the platform and over time try to shrink that set. would go a very long way. Are you sure that it's not necessary to support streaming conversion? The suggested API design assumes you always have the entire data sequence in a single DOMString or ArrayBufferView. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. If we deem streaming conversion unnecessary, I'd put the methods on DOMString and ArrayBufferView. It would be terribly sad to let the schedules of various working groups affect the API design. Streaming is a very good question. 
I hadn't thought about that. Especially now that we have chunked ArrayBuffer support in XHR streaming would seem like a much more interesting request. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 12:46 PM, Joshua Bell jsb...@chromium.org wrote: I have edited the proposal to base the list of encodings on http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html - is there any reason that would not be sufficient or appropriate? (this appears to be a superset of the validator.nu/?charset list, with only a small number of additional encodings) There are lots of encodings in that list which browsers need to support for legacy text/html content, which are probably completely unnecessary here. People may be storing Shift-JIS text in ID3 tags, but I doubt they're doing that with ISO-2022-JP. I'm undecided about legacy encodings in general, but that aside, I'd start from just [UTF-8], and add to the list based on concrete use cases. Don't start from the whole list and try to pare it down. I wonder if we can't limit the damage of extending more support to legacy encodings. We have a use case for decoding legacy charsets (ID3 tags), but do we have any use cases for encoding to them? If you're writing back changed ID3 tags, you should be writing it back in as ID3v2 (which is all most tagging software writes to now), which uses UTF-8. On Mon, Mar 19, 2012 at 5:54 PM, Jonas Sicking jo...@sicking.cc wrote: Yes, I think we should enumerate the set of encodings supported. Ideally we'd for simplicity support the same set of enumerated encodings everywhere in the platform and over time try to shrink that set. Shrinking the set supported for HTML will be much harder than keeping this set small to begin with. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 5:10 PM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 19, 2012 at 5:54 PM, Jonas Sicking jo...@sicking.cc wrote: Yes, I think we should enumerate the set of encodings supported. Ideally we'd for simplicity support the same set of enumerated encodings everywhere in the platform and over time try to shrink that set. Shrinking the set supported for HTML will be much harder than keeping this set small to begin with. What value are we adding, and to whom, by keeping the list the smallest it can be, even when that means keeping the lists of supported encodings different between different APIs? The concrete costs are that authors will have to learn which encodings work where, and that implementations need to keep separate lists of supported encodings in different APIs. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote: What value are we adding, and to whom, by keeping the list the smallest it can be, even when that means keeping the lists of supported encodings different between different APIs? Not needlessly extending support for legacy encodings means there's no chance of this API inadvertently causing proliferation of those encodings. That benefits everyone who might come in contact with that data, and increases the odds of being able to remove some of those encodings from the platform entirely. The concrete costs are that authors will have to learn which encodings work where, and that implementations need to keep separate lists of supported encodings in different APIs. Authors don't need to learn that; all they care about is if the encoding they're trying to use works. Nobody memorizes lists of encodings. Keeping a list of supported encodings is a trivial cost. It also means that browsers need to be able to encode to each of these encodings, and encoding for all of them needs to be specified, which I think is currently unneeded. (Unless we go the asymmetric encoding/decoding route, supporting only decoders for legacy charsets. If this is the only reason that'd all have to be specified, that's probably another reason to consider it...) Supporting streaming decoding for modal encodings, such as ISO-2022-CN, might also be a burden: it means implementations would be required to support stateful, incremental decoding for that charset, which is more complicated than most encodings (which are stateless). Many implementations probably do support that, but I don't think it's currently mandatory, and it would complicate any streaming API. Stateful encodings need to die even more than other legacy encodings; I hope this API doesn't have to support any of them. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 8:14 PM, Glenn Maynard gl...@zewt.org wrote: If this is the only reason that'd all have to be specified, that's probably another reason to consider it... (Well, there's form data either way. At least encoding is probably easier to spec, since it only has to deal with UTF-16 error handling...) -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Mon, Mar 19, 2012 at 6:14 PM, Glenn Maynard gl...@zewt.org wrote: On Mon, Mar 19, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote: What value are we adding, and to whom, by keeping the list the smallest it can be, even when that means keeping the lists of supported encodings different between different APIs? Not needlessly extending support for legacy encodings means there's no chance of this API inadvertently causing proliferation of those encodings. That benefits everyone who might come in contact with that data, and increases the odds of being able to remove some of those encodings from the platform entirely. It seems unlikely to me that adding support for an encoding here will make it harder to eradicate the encoding from the web. The concrete costs are that authors will have to learn which encodings work where, and that implementations need to keep separate lists of supported encodings in different APIs. Authors don't need to learn that; all they care about is if the encoding they're trying to use works. Nobody memorizes lists of encodings. Why are encodings different than other parts of the API where you indeed have to know what works and what doesn't. It also means that browsers need to be able to encode to each of these encodings, and encoding for all of them needs to be specified, which I think is currently unneeded. (Unless we go the asymmetric encoding/decoding route, supporting only decoders for legacy charsets. If this is the only reason that'd all have to be specified, that's probably another reason to consider it...) Supporting streaming decoding for modal encodings, such as ISO-2022-CN, might also be a burden: it means implementations would be required to support stateful, incremental decoding for that charset, which is more complicated than most encodings (which are stateless). Many implementations probably do support that, but I don't think it's currently mandatory, and it would complicate any streaming API. 
Stateful encodings need to die even more than other legacy encodings; I hope this API doesn't have to support any of them. UTF8 is stateful, so I disagree. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote: On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote: What's the use-case for the stringLength function? You can't decode into an existing data structure anyway, so you're ultimately forced to call decode, at which point the stringLength function hasn't helped you. stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (eg. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string. And just to be clear, the use case is decoding data formats where string fields are variable-length and null-terminated. Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function. I suggested eg. result = encode(string, utf-8, null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, utf-8). I think we want both encoding and destination to be optional. That leads us to an API like: out_dict = stringEncoding.encode(string, opt_dict); .. where both out_dict and opt_dict are WebIDL Dictionaries: opt_dict keys: view, encoding out_dict keys: charactersWritten, bytesWritten, output ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) If this instead is attached to String, it would look like: out_dict = my_string.encode(opt_dict); If it were attached to ArrayBufferView, having a right-size buffer allocated for the caller gets uglier unless we include a static version. It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. 
The implementation seems required both to check that the data can be decoded using the specified encoding, as well as check that the data will fit in the passed in buffer. Only then can the implementation start decoding the data. This seems problematic. Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint. Agreed. Input/output dicts mean the API documentation a caller needs to read to understand the usage is more complex than a function signature which is why I resisted them, but it does seem like the best approach. Thanks for pushing, Glenn! In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over allocating or reallocating/moving. I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently. Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided. Settling on an options dict means adding a flag to control this behavior (throws: true ?) doesn't extend the API surface significantly.
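The null-terminated-field use case behind stringLength can be sketched today with a typed-array search standing in for memchr() (stringLength itself was never standardized; indexOf is the stand-in, and the buffer contents here are made up for illustration):

```javascript
// A buffer holding "Hi\0" followed by unrelated bytes, as in a
// fixed-size null-terminated field.
const buf = new Uint8Array([0x48, 0x69, 0x00, 0xFF, 0xFF]);

// Find the terminator without decoding anything, like memchr().
const nul = buf.indexOf(0);
const end = nul === -1 ? buf.length : nul;

// Decode only the bytes before the terminator.
const field = new TextDecoder("utf-8").decode(buf.subarray(0, end));
```

For UTF-16 fields the same search would need a Uint16Array view instead, which is the point of Glenn's later suggestion that find() live on the typed-array subclasses.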
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote: And just to be clear, the use case is decoding data formats where string fields are variable length null terminated. ... and the spec should include normative guidance that length-prefixing is strongly recommended for new data formats.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote: And just to be clear, the use case is decoding data formats where string fields are variable length null terminated. A concrete example is ZIP central directories. I think we want both encoding and destination to be optional. That leads us to an API like: out_dict = stringEncoding.encode(string, opt_dict); .. where both out_dict and opt_dict are WebIDL Dictionaries: opt_dict keys: view, encoding out_dict keys: charactersWritten, byteWritten, output The return value should just be a [NoInterfaceObject] interface. Dictionaries are used for input fields. Something that came up on IRC that we should spend some time thinking about, though: Is it actually important to be able to encode into an existing buffer? This may be a premature optimization. You can always encode into a new buffer, and--if needed--copy the result where you need it. If we don't support that, most of this extra stuff in encode() goes away. ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) Uint8Array is correct. (Uint8ClampedArray is for image color data.) If UTF-16 or UTF-32 are supported, decoding to them should return Uint16Array and Uint32Array, respectively (with the return value being typed just to ArrayBufferView). If this instead is attached to String, it would look like: out_dict = my_string.encode(opt_dict); If it were attached to ArrayBufferView, having a right-size buffer allocated for the caller gets uglier unless we include a static version. If in-place decoding isn't really needed, we could have: newView = str.encode(utf-8); // or {encoding: utf-8} str2 = newView.decode(utf-8); len = newView.find(0); // replaces stringLength, searching for 0 in the view's type; you'd use Uint16Array for UTF-16 and encodedLength() would go away. newView.find(val) would live on subclasses of TypedArray. 
In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over allocating or reallocating/moving. But since that's all behind the scenes, the implementation can do it whichever way is most efficient for the particular encoding. In many cases, it may be possible to eliminate any reallocation, by making an educated guess about how big the buffer is likely to be. On Fri, Mar 16, 2012 at 11:21 AM, Joshua Bell jsb...@chromium.org wrote: ... and the spec should include normative guidance that length-prefixing is strongly recommended for new data formats. I think this would be a bit off-topic. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, 16 Mar 2012, Glenn Maynard wrote: On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote: And just to be clear, the use case is decoding data formats where string fields are variable length null terminated. A concrete example is ZIP central directories. I think we want both encoding and destination to be optional. That leads us to an API like: out_dict = stringEncoding.encode(string, opt_dict); .. where both out_dict and opt_dict are WebIDL Dictionaries: opt_dict keys: view, encoding out_dict keys: charactersWritten, byteWritten, output The return value should just be a [NoInterfaceObject] interface. Dictionaries are used for input fields. Something that came up on IRC that we should spend some time thinking about, though: Is it actually important to be able to encode into an existing buffer? This may be a premature optimization. You can always encode into a new buffer, and--if needed--copy the result where you need it. If we don't support that, most of this extra stuff in encode() goes away. Yes, I think we should focus on getting feature parity with e.g. python first -- i.e. not worry about decoding into existing buffers -- and add extra fancy stuff later if we find that there are actually usecases where avoiding the copy is critical. This should allow us to focus on getting the right API for the common case. If in-place decoding isn't really needed, we could have: newView = str.encode(utf-8); // or {encoding: utf-8} str2 = newView.decode(utf-8); len = newView.find(0); // replaces stringLength, searching for 0 in the view's type; you'd use Uint16Array for UTF-16 and encodedLength() would go away. This looks like a big win to me.
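The minimal, Python-parity shape argued for here is essentially what later shipped as the TextEncoder/TextDecoder interfaces; a sketch of the round trip, assuming a runtime that provides them:

```javascript
// The simple encode/decode round trip, written against the TextEncoder/
// TextDecoder interfaces that eventually standardized this proposal.
// TextEncoder is UTF-8 only; TextDecoder accepts an encoding label.
const encoder = new TextEncoder();          // always UTF-8
const decoder = new TextDecoder("utf-8");

const bytes = encoder.encode("héllo");      // newly allocated Uint8Array
console.log(bytes.length);                  // 6 ("é" is two bytes in UTF-8)

const roundTripped = decoder.decode(bytes); // back to a string
console.log(roundTripped === "héllo");      // true
```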
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote: On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote: ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) Uint8Array is correct. (Uint8ClampedArray is for image color data.) If UTF-16 or UTF-32 are supported, decoding to them should return Uint16Array and Uint32Array, respectively (with the return value being typed just to ArrayBufferView). FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness - the above would imply that either platform endianness dictated the output byte sequence (and le/be was ignored), or that encode(\uFFFD, utf-16).view[0] might != 0xFFFD on some platforms. There was consensus (among the two of us) that the output view's underlying buffer's byte order would be le/be depending on the selected encoding. There is not consensus over what the return view type should be - Uint8Array, or pursue BE/LE variants of Uint16Array to conceal platform endianness.
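The platform-endianness behavior under discussion can be probed directly. On effectively all shipping hardware the multi-byte typed array views come out little-endian, while DataView takes the byte order as an explicit argument:

```javascript
// Probe the byte order that multi-byte typed array views expose.
// Uint16Array uses platform byte order; DataView is explicit.
const probe = new Uint16Array([0x1122]);
const raw = new Uint8Array(probe.buffer);
const littleEndian = raw[0] === 0x22;       // low byte first => LE
console.log(littleEndian);                  // true on x86/ARM hardware

// DataView sidesteps the question entirely: byte order is a parameter.
const view = new DataView(new ArrayBuffer(2));
view.setUint16(0, 0x1122, /* littleEndian = */ true);
console.log(new Uint8Array(view.buffer)[0] === 0x22); // true on any platform
```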
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:12 PM, Joshua Bell wrote: FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness For what it's worth, it seems like this is something we should seriously consider changing so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 2:17 PM, Boris Zbarsky wrote: On 3/16/12 5:12 PM, Joshua Bell wrote: FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness For what it's worth, it seems like this is something we should seriously consider changing so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. If you see some evangelists skipping the endian check, send them an e-mail and let them know. -Charles
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, 16 Mar 2012, Charles Pritchard wrote: On 3/16/2012 2:17 PM, Boris Zbarsky wrote: On 3/16/12 5:12 PM, Joshua Bell wrote: FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness For what it's worth, it seems like this is something we should seriously consider changing so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. If you see some evangelists skipping the endian check, send them an e-mail and let them know. Not going to work. You can't evangelise people into making their code work on architectures that they don't own. It's hard enough to get people to work around differences between browsers when all the browsers are available for free and run on the platforms that they develop on. The reality is that on devices where typed arrays don't appear LE, content will break.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote: The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. That's wrong. This is web API design 101; everyone should know better than this by now. Exposing platform endianness is setting the platform up for massive incompatibilities down the road. In reality, the spec is moot here: if anyone does implement typed arrays on a production big-endian system, they're going to make these views little-endian, because doing otherwise would break countless applications, essentially all of which are tested only on little-endian systems. Web compatibility is a top priority to browser implementations. (DataView isn't relevant here; it's used for different access patterns. To access arrays of data embedded in an ArrayBuffer, you use views, not DataView. Use DataView if you have a packed data structure with variable-size fields, such as the metadata in a ZIP local file header.) -- Glenn Maynard
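The access-pattern split Glenn draws here can be illustrated with a DataView parse of a packed little-endian structure. The field offsets below follow the ZIP local file header layout, and the header bytes are fabricated for illustration:

```javascript
// DataView suits packed structures with mixed field sizes and explicit
// byte order. Offsets follow the ZIP local file header layout:
// u32 signature @0, u16 method @8, u32 compressed size @18,
// u16 file name length @26.
function parseLocalFileHeader(buffer, byteOffset = 0) {
  const dv = new DataView(buffer, byteOffset);
  return {
    signature: dv.getUint32(0, true),        // 0x04034b50 = "PK\x03\x04"
    compressionMethod: dv.getUint16(8, true),
    compressedSize: dv.getUint32(18, true),
    fileNameLength: dv.getUint16(26, true),
  };
}

// A minimal fabricated 30-byte header (other fields left zeroed).
const raw = new Uint8Array(30);
new DataView(raw.buffer).setUint32(0, 0x04034b50, true);
new DataView(raw.buffer).setUint16(26, 9, true);   // file name length: 9
const header = parseLocalFileHeader(raw.buffer);
console.log(header.signature.toString(16));        // "4034b50"
console.log(header.fileNameLength);                // 9
```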
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 3:26 PM, Glenn Maynard wrote: On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote: The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. That's wrong. This is web API design 101; everyone should know better than this by now. Exposing platform endianness is setting the platform up for massive incompatibilities down the road. I make mistakes all the time with UTF8 and raw string arrays. I make mistakes all the time with endianness. Low level API design 101; everyone working with low level APIs makes mistakes. In reality, the spec is moot here: if anyone does implement typed arrays on a production big-endian system, they're going to make these views little-endian, because doing otherwise would break countless applications, essentially all of which are tested only on little-endian systems. Web compatibility is a top priority to browser implementations. It's up to programmers to code defensively. More so with multi-platform multi-vendor deployments than walled gardens. Authors should be using the spec as written; it only takes one target system to use big-endian. It doesn't harm anything for a vendor to implement as little-endian, as most authors assume and test on little-endian. It may cause some harm to alter the spec so as to remove the requirement that coders account for both. (DataView isn't relevant here; it's used for different access patterns. To access arrays of data embedded in an ArrayBuffer, you use views, not DataView. Use DataView if you have a packed data structure with variable-size fields, such as the metadata in a ZIP local file header.) I use the subarray pattern frequently. DataView is not much different than using subarray. Use DataView when it's easier than ArrayBufferView and available.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:44 PM, Charles Pritchard wrote: The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. I haven't seen anyone actually using the DataView stuff in practice, or presenting it to developers much... If you see some evangelists skipping the endian check, send them an e-mail and let them know. I've done that... then I stopped because it just wasn't worth the effort. Every single WebGL demo I've seen recently was doing this. People were being told that typed arrays are a good way to load binary (integer and float) data from servers using the arraybuffer facilities of XHR at SXSW last week, with no mention of endianness. I think that trying to get web developers to do this right is a lost cause, esp. because none of them (to a good approximation) have any big-endian systems to test on. -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:25 PM, Brandon Jones wrote: Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are. I believe that recent Firefox on a SPARC processor would fit that description. Of course the number of web developers that have a SPARC-based machine is 0 to a very good approximation -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote: On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote: On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote: What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode at which point the stringLength function hasn't helped you. stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (eg. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string. And just to be clear, the use case is decoding data formats where string fields are variable length null terminated. Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function. I suggested eg. result = encode(string, utf-8, null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, utf-8). I think we want both encoding and destination to be optional. That leads us to an API like: out_dict = stringEncoding.encode(string, opt_dict); .. where both out_dict and opt_dict are WebIDL Dictionaries: opt_dict keys: view, encoding out_dict keys: charactersWritten, byteWritten, output ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??) If this instead is attached to String, it would look like: out_dict = my_string.encode(opt_dict); If it were attached to ArrayBufferView, having a right-size buffer allocated for the caller gets uglier unless we include a static version. 
Using input and output dictionaries is definitely messy, but I can't see a better way either. And I think ES6 is adding some syntax here that will make developers' lives better (destructuring assignments). It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. The implementation seems required both to check that the data can be decoded using the specified encoding, as well as check that the data will fit in the passed in buffer. Only then can the implementation start decoding the data. This seems problematic. Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint. Agreed. Input/output dicts mean the API documentation a caller needs to read to understand the usage is more complex than a function signature which is why I resisted them, but it does seem like the best approach. Thanks for pushing, Glenn! In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over allocating or reallocating/moving. The implementation can always figure out what strategy fits its own requirements best with regards to memory allocation. I suspect that right now in Firefox the fastest implementation would be to scan through the string once to measure the desired buffer size, then allocate and write into the allocated buffer. The problem is that the way that the encoding function is defined right now, you are not allowed to write any data if you are throwing for whatever reason, which means that you have to do a scan first to see if you need to throw, and then do a separate pass to actually encode the data. I think we need to change that such that when an exception is thrown that data should be written up to the point that causes the exception. I also don't think it's a good idea to throw an exception for encoding errors.
Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently. Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided. Settling on an options dict means adding a flag to control this behavior (throws: true ?) doesn't extend the API surface significantly. Sounds good to me. Though I would still strongly prefer the default to be non-throwing as to minimize the risk of website breakage in the case of bugs. Especially since these bugs are so data dependent and are likely to not happen on a developers computer. / Jonas
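The scan-once-then-allocate strategy Jonas describes can be sketched in plain JavaScript; `utf8Length` is an illustrative helper, not part of any proposal:

```javascript
// One pass to measure the UTF-8 byte length, so a single right-sized
// buffer can then be allocated -- the two-phase encode strategy above.
function utf8Length(str) {
  let bytes = 0;
  for (const ch of str) {               // for-of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;         // ASCII
    else if (cp <= 0x7ff) bytes += 2;
    else if (cp <= 0xffff) bytes += 3;
    else bytes += 4;                    // astral plane: 4 bytes in UTF-8
  }
  return bytes;
}

console.log(utf8Length("abc"));       // 3
console.log(utf8Length("é"));         // 2
console.log(utf8Length("\u{1F600}")); // 4
```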
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 4:25 PM, Boris Zbarsky wrote: On 3/16/12 5:25 PM, Brandon Jones wrote: Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are. I believe that recent Firefox on a SPARC processor would fit that description. Of course the number of web developers that have a SPARC-based machine is 0 to a very good approximation I've written some hash/encryption methods that could very well fail on Firefox on SPARC; many things fail on machines I've never tested with. Flip the implementation on SPARC, and it wouldn't harm anything. Cut it out of the spec, so that the behavior is undocumented, and implementations break. DataView is more complex than ArrayBufferView, so implementers started with the easy option. The coders using Float32Array are cowboys (web app gaming and encryption). We're talking about a few hundred people out of many millions.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 4:25 PM, Boris Zbarsky bzbar...@mit.edu wrote: On 3/16/12 5:25 PM, Brandon Jones wrote: Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are. I believe that recent Firefox on a SPARC processor would fit that description. Of course the number of web developers that have a SPARC-based machine is 0 to a very good approximation You can s/web developers/users/ and the statement would still apply, wouldn't it? - James -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 7:43 PM, James Robinson wrote: You can s/web developers/users/ and the statement would still apply, wouldn't it? Sure, but so what? The upshot is that people are writing code that assumes little-endian hardware all over. We should just clearly make the spec say that that's what typed arrays are so that an implementor can actually implement the spec and be web compatible. The value of a spec which can't be implemented as written is arguably lower than not having a spec at all... At least then you _know_ you have to reverse-engineer. -Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 5:25 PM, Boris Zbarsky wrote: On 3/16/12 7:43 PM, James Robinson wrote: You can s/web developers/users/ and the statement would still apply, wouldn't it? Sure, but so what? The upshot is that people are writing code that assumes little-endian hardware all over. We should just clearly make the spec say that that's what typed arrays are so that an implementor can actually implement the spec and be web compatible. The value of a spec which can't be implemented as written is arguably lower than not having a spec at all... At least then you _know_ you have to reverse-engineer. Isn't that an issue for TC39?
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, 14 Mar 2012 23:53:12 +0100, Glenn Maynard gl...@zewt.org wrote: On Wed, Mar 14, 2012 at 6:52 AM, Anne van Kesteren ann...@opera.com wrote: If we can make it a deterministic, unchanging, and defined algorithm, I think that would actually be acceptable. And ideally we do define that algorithm at some point so new browsers can enter the existing market more easily and existing browsers interpret existing content in the same way. We don't have any untagged content to support yet, so let's not create an API that guarantees it'll come into existence. The heuristics you need depend heavily on the content, anyway (for example, heuristics that work for HTML probably won't for ID3 tags, which are generally very short). What I replied to suggested reusing an existing undocumented code path which is definitely used to support existing content. From what I remember reading about the detector in Gecko it can be quite useful regardless of context. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 14, 2012 at 3:33 PM, Joshua Bell jsb...@chromium.org wrote: FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding A few comments: What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode at which point the stringLength function hasn't helped you. Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function. Could we add a function with something like the following signature: ArrayBufferView encode(DOMString value, optional DOMString encoding); It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. The implementation seems required both to check that the data can be decoded using the specified encoding, as well as check that the data will fit in the passed in buffer. Only then can the implementation start decoding the data. This seems problematic. I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently. / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote: What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode at which point the stringLength function hasn't helped you. stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (eg. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string. Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function. I suggested eg. result = encode(string, utf-8, null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, utf-8). It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. The implementation seems required both to check that the data can be decoded using the specified encoding, as well as check that the data will fit in the passed in buffer. Only then can the implementation start decoding the data. This seems problematic. Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint. I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently. Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided. -- Glenn Maynard
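The truncate-on-a-whole-codepoint guarantee discussed here matches the behavior TextEncoder.encodeInto later shipped with; a sketch, assuming a runtime that provides encodeInto:

```javascript
// encodeInto writes as much as fits and never splits a code point.
// It reports how many UTF-16 code units it read and bytes it wrote.
const encoder = new TextEncoder();
const dest = new Uint8Array(3);  // room for only 3 bytes

// "aé€" encodes as 1 + 2 + 3 bytes; "a" and "é" fit (3 bytes), "€" doesn't.
const { read, written } = encoder.encodeInto("aé€", dest);
console.log(read);    // 2 -- code units consumed ("a" and "é")
console.log(written); // 3 -- bytes written; "€" was cleanly truncated
```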
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, 14 Mar 2012 01:01:42 +0100, Ian Hickson i...@hixie.ch wrote: Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I agree we should support them; if that's the case, we should survey those use cases to work out what the set of encodings we need is, and add just those. And not go beyond what is defined/allowed in: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
Hi, On Wed, Mar 14, 2012 at 06:49, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). What are the 'late' use cases for this? The question might sound naive, but to me the encoding/decoding would have been really great to have during the time when we didn't have support for ArrayBuffers in general input/output APIs like we have now (XHR, WebSockets, File API, ...) - which sounds like the mainstream use cases to me. However there is one use case that is not supported that seems worth not overlooking imho: embedding of binary data (typed arrays) into textual formats such as XML or JSON. For this, base64 encoding/decoding is typically used (so that it doesn't conflict with the XML or JSON container) and is thus more or less efficiently implemented in JavaScript (just like we had to encode/decode strings in JS to/from XHR a while ago). Would it make sense to support encoding=base64 in this API? Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); This API proposal looks lean and mean. I hope we can move the current StringEncoding proposal to something closer to this. Regards,
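The base64 data-island case described here can be handled with the long-standing atob/btoa primitives plus a copy; a minimal sketch (base64 support was never folded into the string encoding API itself):

```javascript
// Convert a base64 data island (a DOMString, e.g. pulled out of JSON)
// into an ArrayBuffer, and back. atob/btoa translate between base64 and
// a "binary string"; the loops copy that into/out of real bytes.
function base64ToBuffer(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes.buffer;
}

function bufferToBase64(buffer) {
  let binary = "";
  for (const b of new Uint8Array(buffer)) binary += String.fromCharCode(b);
  return btoa(binary);
}

const decoded = base64ToBuffer("AQID");        // bytes 1, 2, 3
console.log(new Uint8Array(decoded));          // Uint8Array [ 1, 2, 3 ]
console.log(bufferToBase64(decoded));          // "AQID"
```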
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 03/14/2012 12:38 AM, Tab Atkins Jr. wrote: On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynardgl...@zewt.org wrote: The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html), conversion errors should use replacement (eg. U+FFFD), not throw exceptions. Python throws errors by default, but both functions have an additional argument specifying an alternate strategy. In particular, bytes.decode can either drop the invalid bytes, replace them with a replacement char (which I agree should be U+FFFD), or replace them with XML entities; str.encode can choose to drop characters the encoding doesn't support. For completeness I note that python also allows user-provided custom error handling. I'm not suggesting we want this, but I would strongly prefer it to providing an XML-entity-encode option :)
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, 14 Mar 2012 00:50:43 +0100, Joshua Bell jsb...@chromium.org wrote: For both of the above: initially suggested use cases included parsing data as esoteric as ID3 tags in MP3 files, where the encoding is unspecified and guessed at by decoders, and includes non-Unicode encodings. It was suggested that the encoding sniffing capabilities of browsers be leveraged. (Cue a strong nooo! from Anne.) If we can make it a deterministic, unchanging, and defined algorithm, I think that would actually be acceptable. And ideally we do define that algorithm at some point so new browsers can enter the existing market more easily and existing browsers interpret existing content in the same way. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding * Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms, encodings, and encoding selection, which greatly simplifies the spec. This implicitly adds support for all of the other encodings defined therein - we may still want to dictate a subset of encodings. A few minor issues noted throughout the spec. * Define a binary encoding, since that support was already in this spec. We may decide to kill this but I didn't want to remove it just yet. * Simplify methods to take ArrayBufferView instead of any/byteOffset/byteLength. The implication is that you may need to use temporary DataViews, and this is reflected in the examples. * Call out more of the big open issues raised on this thread (e.g. where should we hang this API) Nothing controversial added, or (alas) resolved.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 9:47 PM, John Tamplin j...@google.com wrote: I am fine with strongly suggesting that only UTF8 be used for new things, but leaving out legacy support will severely limit the utility of this library. Not all limitations are bad, and I'd disagree with seriously. At a minimum, the set of encodings should be very carefully selected. Limit it to Unicode to begin with, and if we're really going to put legacy encodings on yet more life support, only add an encoding where there's a clear, justified need for it. (There are many encodings that browsers need to support for text/html because they're used in legacy content, but which nobody is still using today in new content--those should not be supported here.) But stick with Unicode for now. Once an encoding is added, it's hard to ever remove it. On Wed, Mar 14, 2012 at 6:52 AM, Anne van Kesteren ann...@opera.com wrote: If we can make it a deterministic, unchanging, and defined algorithm, I think that would actually be acceptable. And ideally we do define that algorithm at some point so new browsers can enter the existing market more easily and existing browsers interpret existing content in the same way. We don't have any untagged content to support yet, so let's not create an API that guarantees it'll come into existence. The heuristics you need depend heavily on the content, anyway (for example, heuristics that work for HTML probably won't for ID3 tags, which are generally very short). On Wed, Mar 14, 2012 at 11:14 AM, Joshua Bell jsb...@chromium.org wrote: Having implemented a library that handled both text encodings and base16/base64 encoding, I can offer the opinion that the nomenclature gets very confusing since the encode/decode semantics are reversed. binary_buffer = encode(text_content) text_content = decode(binary_buffer) vs. binary_buffer = decode(base64_data) base64_data = encode(binary_buffer) It's more than a naming problem. 
With this string API, one side of the conversion is always a DOMString. Base64 conversion wants ArrayBuffer-ArrayBuffer conversions, so it would belong in a separate API. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Wed, Mar 14, 2012 at 3:53 PM, Glenn Maynard gl...@zewt.org wrote: It's more than a naming problem. With this string API, one side of the conversion is always a DOMString. Base64 conversion wants ArrayBuffer-ArrayBuffer conversions, so it would belong in a separate API. Huh. The scenarios I've run across are Base64-encoded binary data islands embedded in textual container formats like XML or JSON, which yield a DOMString I want to decode into an ArrayBuffer.
[whatwg] API for encoding/decoding ArrayBuffers into text
Hi All, Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as

DOMString decode(ArrayBufferView source, DOMString encoding);
ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination);

would go a very long way. The question is where to stick these functions. Internationalization doesn't have an obvious object we can hang functions off of (unlike, for example, crypto), and the above names are much too generic to turn into global functions. Ideas/opinions/bikesheds? / Jonas
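For comparison, the API this thread eventually led to handles the "(or part thereof)" requirement by accepting a view over a buffer rather than explicit offset/length arguments:

```javascript
// TextDecoder accepts any ArrayBufferView, so decoding part of a buffer
// is just a matter of constructing a cheap view over the relevant bytes:
const full = new TextEncoder().encode("--hello--");   // Uint8Array, 9 bytes
const middle = new Uint8Array(full.buffer, 2, 5);     // bytes 2..6, no copy
const s = new TextDecoder("utf-8").decode(middle);    // "hello"
```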
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote: Hi All, Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); would go a very long way. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. Ideas/opinions/bikesheds? Python3 just defines str.encode and bytes.decode. Can we not do this with String.encode and ArrayBuffer.decode? ~TJ
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, 13 Mar 2012, Jonas Sicking wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); would go a very long way. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? Incidentally I _strongly_ suggest we only support UTF-8 here. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
Joshua Bell has been working on a string encoding and decoding API that supports the needed encodings, and which is separable from the core typed array API: http://wiki.whatwg.org/wiki/StringEncoding This is the direction I prefer. String encoding and decoding seems to be a complex enough problem that it should be expressed separately from the typed array spec itself. -Ken On Tue, Mar 13, 2012 at 5:59 PM, Ian Hickson i...@hixie.ch wrote: On Tue, 13 Mar 2012, Jonas Sicking wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); would go a very long way. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? Incidentally I _strongly_ suggest we only support UTF-8 here. -- Ian Hickson, http://ln.hixie.ch/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 3:58 PM, Tab Atkins Jr. jackalm...@gmail.com wrote: On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote: Hi All, Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); would go a very long way. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. Ideas/opinions/bikesheds? Python3 just defines str.encode and bytes.decode. Can we not do this with String.encode and ArrayBuffer.decode? Unfortunately I suspect getting anything added on the String object will take a few years given that it's too late to get into ES6 (and in any case I suspect adding ArrayBuffer dependencies to ES6 would be controversial). / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com wrote: Joshua Bell has been working on a string encoding and decoding API that supports the needed encodings, and which is separable from the core typed array API: http://wiki.whatwg.org/wiki/StringEncoding This is the direction I prefer. String encoding and decoding seems to be a complex enough problem that it should be expressed separately from the typed array spec itself. Very cool. Where do I provide feedback to this? Here? / Jonas
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). There was discussion about this before: https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html http://wiki.whatwg.org/wiki/StringEncoding (I don't know why it was on the WebGL list; typed arrays are becoming infrastructural and this doesn't seem like it belongs there, even though ArrayBuffer was started there.) The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html), conversion errors should use replacement (eg. U+FFFD), not throw exceptions. The any arguments should be fixed. Encoding to UTF-16 should definitely not prefix a BOM, and UTF-16 having unspecified endianness is obviously bad. I'd also suggest that, unless there's serious, substantiated demand for it--which I doubt--only major Unicode encodings be supported. Don't make it easier for people to keep using legacy encodings. Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? I don't think so, because retrieving the N'th decoded/reencoded character isn't a constant-time operation. -- Glenn Maynard
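Replacement-by-default is the behavior the shipped API adopted: malformed input decodes to U+FFFD unless the caller explicitly opts into exceptions. For example:

```javascript
// 0xFF is never valid UTF-8; by default it decodes to U+FFFD rather than throwing:
const lenient = new TextDecoder("utf-8").decode(new Uint8Array([0x61, 0xff, 0x62]));
// lenient === "a\uFFFDb"

// Throwing is opt-in via the fatal flag (the behavior argued against as a default):
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xff]));
} catch (e) {
  threw = true;   // TypeError on malformed input
}
```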
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 6:10 PM, Jonas Sicking jo...@sicking.cc wrote: On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com wrote: Joshua Bell has been working on a string encoding and decoding API that supports the needed encodings, and which is separable from the core typed array API: http://wiki.whatwg.org/wiki/StringEncoding This is the direction I prefer. String encoding and decoding seems to be a complex enough problem that it should be expressed separately from the typed array spec itself. Very cool. Where do I provide feedback to this? Here? This list seems like a good place to discuss it. -Ken
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, 13 Mar 2012, Jonas Sicking wrote: Unfortunately I suspect getting anything added on the String object will take a few years given that it's too late to get into ES6 (and in any case I suspect adding ArrayBuffer dependencies to ES6 would be controversial). We can just define it outside the ES spec. -- Ian Hickson, http://ln.hixie.ch/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). There was discussion about this before: https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html http://wiki.whatwg.org/wiki/StringEncoding (I don't know why it was on the WebGL list; typed arrays are becoming infrastructural and this doesn't seem like it belongs there, even though ArrayBuffer was started there.) The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html), conversion errors should use replacement (eg. U+FFFD), not throw exceptions. The any arguments should be fixed. Encoding to UTF-16 should definitely not prefix a BOM, and UTF-16 having unspecified endianness is obviously bad. I'd also suggest that, unless there's serious, substantiated demand for it--which I doubt--only major Unicode encodings be supported. Don't make it easier for people to keep using legacy encodings. Two other pieces of feedback I received from Adam Barth off list: * take ArrayBufferView as input which both fixes any and simplifies the API to eliminate byteOffset and byteLength * support two versions of encode, one which takes a target ArrayBufferView, and one which allocates/returns a new Uint8Array of the appropriate length. Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? I don't think so, because retrieving the N'th decoded/reencoded character isn't a constant-time operation. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote: Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). There was discussion about this before: https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html http://wiki.whatwg.org/wiki/StringEncoding (I don't know why it was on the WebGL list; typed arrays are becoming infrastructural and this doesn't seem like it belongs there, even though ArrayBuffer was started there.) Purely historical; early adopters of Typed Arrays were folks prototyping with WebGL who wanted to parse data files containing strings. WHATWG makes sense, I just hadn't gotten around to shopping for a home. (Administrivia: Is there need to propose a charter addition?) The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html), conversion errors should use replacement (eg. U+FFFD), not throw exceptions. The any arguments should be fixed. Encoding to UTF-16 should definitely not prefix a BOM, and UTF-16 having unspecified endianness is obviously bad. I'd also suggest that, unless there's serious, substantiated demand for it--which I doubt--only major Unicode encodings be supported. Don't make it easier for people to keep using legacy encodings. Two other pieces of feedback I received from Adam Barth off list: * take ArrayBufferView as input which both fixes any and simplifies the API to eliminate byteOffset and byteLength * support two versions of encode, one which takes a target ArrayBufferView, and one which allocates/returns a new Uint8Array of the appropriate length. 
Shouldn't this just be another ArrayBufferView type with special semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a getString()/setString() method pair on DataView? I don't think so, because retrieving the N'th decoded/reencoded character isn't a constant-time operation. -- Glenn Maynard
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, 13 Mar 2012, Joshua Bell wrote: On Tue, Mar 13, 2012 at 4:10 PM, Jonas Sicking jo...@sicking.cc wrote: On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com wrote: Joshua Bell has been working on a string encoding and decoding API that supports the needed encodings, and which is separable from the core typed array API: http://wiki.whatwg.org/wiki/StringEncoding This is the direction I prefer. String encoding and decoding seems to be a complex enough problem that it should be expressed separately from the typed array spec itself. Some quick feedback: - [OmitConstructor] doesn't seem to be WebIDL - please don't allow UAs to implement other encodings. You should list the exact set of supported encodings and the exact labels that should be recognised as meaning those encodings, and disallow all others. Otherwise, we'll be in a never-ending game of reverse-engineering each others' lists of supported encodings and it'll keep growing. - What's the use case for supporting anything but UTF-8? - Having a mechanism that lets you encode the string and get a length separate from the mechanism that lets you encode the string and get the encoded string seems like it would encourage very inefficient code. Can we instead have a mechanism that returns both at once? Or is the idea that for some encodings getting the encoded length is much quicker than getting the actual string? - Seems weird that integers and strings would have such different APIs for doing the same thing. Why can't we handle them equivalently? As in:

len = view.setString(strings[i], offset + Uint32Array.BYTES_PER_ELEMENT, "UTF-8");
view.setUint32(offset, len);
offset += Uint32Array.BYTES_PER_ELEMENT + len;

HTH, -- Ian Hickson, http://ln.hixie.ch/
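Ian's setString() never landed, but the same length-prefixed layout falls out of the TextEncoder/DataView combination that did ship. A sketch (packStrings is a hypothetical helper name):

```javascript
// Pack strings into one buffer as [uint32 length][utf-8 bytes] records,
// mirroring the setString/setUint32 loop sketched above.
function packStrings(strings) {
  const enc = new TextEncoder();
  const parts = strings.map(s => enc.encode(s));               // encode once each
  const total = parts.reduce((n, p) => n + 4 + p.byteLength, 0);
  const buf = new ArrayBuffer(total);
  const view = new DataView(buf);     // explicit-endian integer writes
  const bytes = new Uint8Array(buf);  // byte-wise copies of the encoded strings
  let offset = 0;
  for (const p of parts) {
    view.setUint32(offset, p.byteLength, true);  // little-endian length prefix
    bytes.set(p, offset + 4);
    offset += 4 + p.byteLength;
  }
  return buf;
}
```

Encoding each string once up front avoids the encode-twice (length pass, then write pass) inefficiency Ian is worried about.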
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, 13 Mar 2012, Joshua Bell wrote: WHATWG makes sense, I just hadn't gotten around to shopping for a home. (Administrivia: Is there need to propose a charter addition?) You're welcome to use the WHATWG list for this. Charters are pointless and there's no need to worry about them here. -- Ian Hickson, http://ln.hixie.ch/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:08 PM, Jonas Sicking jo...@sicking.cc wrote: On Tue, Mar 13, 2012 at 3:58 PM, Tab Atkins Jr. jackalm...@gmail.com wrote: On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote: Hi All, Something that has come up a couple of times with content authors lately has been the desire to convert an ArrayBuffer (or part thereof) into a decoded string. Similarly being able to encode a string into an ArrayBuffer (or part thereof). Something as simple as DOMString decode(ArrayBufferView source, DOMString encoding); ArrayBufferView encode(DOMString source, DOMString encoding, [optional] ArrayBufferView destination); would go a very long way. The question is where to stick these functions. Internationalization doesn't have a obvious object we can hang functions off of (unlike, for example crypto), and the above names are much too generic to turn into global functions. Ideas/opinions/bikesheds? Python3 just defines str.encode and bytes.decode. Can we not do this with String.encode and ArrayBuffer.decode? Unfortunately I suspect getting anything added on the String object will take a few years given that it's too late to get into ES6 (and in any case I suspect adding ArrayBuffer dependencies to ES6 would be controversial). Like Ian said, I don't see anything particularly bad about the spec defining ArrayBuffers to define an ArrayBuffer-related method on String. There's no reason it has to be in the ES spec. ~TJ
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote: The API on that wiki page is a reasonable start. For the same reasons that we discussed in a recent thread ( http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html), conversion errors should use replacement (eg. U+FFFD), not throw exceptions. Python throws errors by default, but both functions have an additional argument specifying an alternate strategy. In particular, bytes.decode can either drop the invalid bytes, replace them with a replacement char (which I agree should be U+FFFD), or replace them with XML entities; str.encode can choose to drop characters the encoding doesn't support. ~TJ
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, 13 Mar 2012, Joshua Bell wrote: For both of the above: initially suggested use cases included parsing data as esoteric as ID3 tags in MP3 files, where the encoding is unspecified and guessed at by decoders, and includes non-Unicode encodings. It was suggested that the encoding sniffing capabilities of browsers be leveraged. [...] Whether we should restrict it as far as UTF-8 depends on whether we envision this API only used for parsing/serializing newly defined data formats, or whether there is consideration for interop with previously existing data formats and code. Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I agree we should support them; if that's the case, we should survey those use cases to work out what the set of encodings we need is, and add just those. - Having a mechanism that lets you encode the string and get a length separate from the mechanism that lets you encode the string and get the encoded string seems like it would encourage very inefficient code. Can we instead have a mechanism that returns both at once? Or is the idea that for some encodings getting the encoded length is much quicker than getting the actual string? The use case was to compute the size necessary to allocate a single buffer into which may be encoded multiple strings and other data, rather than allocating multiple small buffers and then copying strings into a larger buffer. Ignoring the issue of invalid code points, the length calculations for non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not be sanitized, that case is trivially 2x the JS string length.) Yeah, but surely we'll mainly be doing stuff with UTF-8... One option is to return an opaque object of the form:

interface EncodedString {
  readonly attribute unsigned long length;
  // internally has a copy of the encoded string
};

...and then have view.setString take this EncodedString object.
At least then you get it down to an extraneous copy, rather than an extraneous encode. Still not ideal though. - Seems weird that integers and strings would have such different APIs for doing the same thing. Why can't we handle them equivalently? As in:

len = view.setString(strings[i], offset + Uint32Array.BYTES_PER_ELEMENT, "UTF-8");
view.setUint32(offset, len);
offset += Uint32Array.BYTES_PER_ELEMENT + len;

Heh, that's where the discussion started, actually. We wanted to keep the DataView interface simple, and potentially support encoding into plain JS arrays and/or non-TypedArray support that appeared to be on the horizon for JS. I see where you're coming from, but I think we should look at the platform as a whole, not just one API. It doesn't help the platform as a whole if we just have the same features split across two interfaces; the complexity is even slightly higher than just having one consistent API that does ints and strings equivalently. -- Ian Hickson, http://ln.hixie.ch/
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
Using Views instead of specifying the offset and length sounds good. On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson i...@hixie.ch wrote: - What's the use case for supporting anything but UTF-8? Other Unicode encodings may be useful, to decode existing file formats containing (most likely at a minimum) UTF-16. I don't feel strongly about that, though; we're stuck with UTF-16 as an internal representation in the platform, but that doesn't necessarily mean we need to support it as a transfer encoding. For non-Unicode legacy encodings, I think that even if use cases exist, they should be given more than the usual amount of scrutiny before being supported. On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. jackalm...@gmail.com wrote: Python throws errors by default, but both functions have an additional argument specifying an alternate strategy. In particular, bytes.decode can either drop the invalid bytes, replace them with a replacement char (which I agree should be U+FFFD), or replace them with XML entities; str.encode can choose to drop characters the encoding doesn't support. Supporting throwing is okay if it's really wanted, but the default should be replacement. It reduces fatal errors to (usually) non-fatal replacement, for obscure cases that people generally don't test. It's a much more sane default failure mode. As another option, never throw, but allow returning the number of conversion errors:

results = encode("abc\uD800def", outputView, "UTF-8");

where results.inputConsumed is the number of words consumed from the input string, results.outputWritten is the number of UTF-8 bytes written, and results.errors is 1. That also allows block-by-block conversion; for example, to convert as many complete characters as possible into a fixed-size buffer for transmission, then starting again at the next unencoded character. One more idea, while I'm brainstorming: if outputView is null, allocate an ArrayBuffer of the necessary size, storing it in results.output.
That eliminates the need for a separate length pass, without bloating the API with another overload. On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell jsb...@chromium.org wrote: (Cue a strong nooo! from Anne.) (Count me in on that, too. Heuristics bad.) Ignoring the issue of invalid code points, the length calculations for non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not be sanitized, that case is trivially 2x the JS string length.) UTF-16 sanitization (replacing mismatched surrogates with U+FFFD) doesn't change the size of the output, actually. -- Glenn Maynard
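Glenn's "results" object closely anticipates TextEncoder.encodeInto(), which was eventually added with almost exactly this shape: it fills a caller-supplied view with as many complete characters as fit and reports progress.

```javascript
// encodeInto() writes into an existing view and returns how far it got:
// `read` is UTF-16 code units consumed, `written` is UTF-8 bytes produced.
const out = new Uint8Array(4);                        // deliberately too small
const res = new TextEncoder().encodeInto("héllo", out);
// "héllo" is 6 UTF-8 bytes total; only "h" (1) + "é" (2) + "l" (1) fit,
// so res.read === 3 and res.written === 4. A caller can resume at index res.read.
```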
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Tue, Mar 13, 2012 at 8:19 PM, Glenn Maynard gl...@zewt.org wrote: Using Views instead of specifying the offset and length sounds good. On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson i...@hixie.ch wrote: - What's the use case for supporting anything but UTF-8? Other Unicode encodings may be useful, to decode existing file formats containing (most likely at a minimum) UTF-16. I don't feel strongly about that, though; we're stuck with UTF-16 as an internal representation in the platform, but that doesn't necessarily mean we need to support it as a transfer encoding. For non-Unicode legacy encodings, I think that even if use cases exist, they should be given more than the usual amount of scrutiny before being supported. The whole idea is to be able to extract textual data out of some packed binary format. If you don't support the character sets people want to use, they will simply do like they have to do now and hand-code the character set conversion, where it will be slow and inaccurate. In particular, I think you have to include various ISO-8859-* character sets (especially Latin1) and the non-Unicode character sets still frequently used by Japanese and Chinese users. I am fine with strongly suggesting that only UTF8 be used for new things, but leaving out legacy support will severely limit the utility of this library. -- John A. Tamplin Software Engineer (GWT), Google
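For what it's worth, the Encoding Standard that grew out of this thread landed roughly between these positions: the legacy encodings John asks for (ISO-8859-*, Shift_JIS, and others) are supported for decoding only, while encoding is restricted to UTF-8. A sketch:

```javascript
// Legacy decoders exist for extracting text from existing binary formats:
const latin = new TextDecoder("iso-8859-1").decode(Uint8Array.of(0xe9));      // "é"
const kana = new TextDecoder("shift_jis").decode(Uint8Array.of(0x83, 0x4a));  // "カ"

// But there is no legacy *encoder*: TextEncoder always produces UTF-8,
// discouraging the creation of new content in legacy encodings.
const label = new TextEncoder().encoding;  // "utf-8"
```

(Note: legacy decoder labels require an ICU-backed runtime; UTF-8 support is universal.)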