Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

> On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:
>
>> What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode, at which point the stringLength function hasn't helped you.
>
> stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (eg. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string.

And just to be clear, the use case is decoding data formats where string fields are variable-length and null-terminated.

>> Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function.
>
> I suggested eg. result = encode(string, "utf-8", null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, "utf-8").

I think we want both the encoding and the destination to be optional. That leads us to an API like:

  out_dict = stringEncoding.encode(string, opt_dict);

... where both out_dict and opt_dict are WebIDL dictionaries:

  opt_dict keys: view, encoding
  out_dict keys: charactersWritten, bytesWritten, output

... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??).

If this were instead attached to String, it would look like:

  out_dict = my_string.encode(opt_dict);

If it were attached to ArrayBufferView, having a right-sized buffer allocated for the caller gets uglier unless we include a static version.

>> It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. The implementation seems required both to check that the data can be encoded using the specified encoding, and to check that the data will fit in the passed-in buffer. Only then can the implementation start encoding the data. This seems problematic.
>
> Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint.

Agreed. Input/output dicts mean the API documentation a caller needs to read to understand the usage is more complex than a function signature, which is why I resisted them, but it does seem like the best approach. Thanks for pushing, Glenn!

In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over-allocating or by reallocating/moving.

>> I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently.
>
> Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided.

Settling on an options dict means adding a flag to control this behavior (throws: true?) doesn't extend the API surface significantly.
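For concreteness, the dictionary-in/dictionary-out shape proposed above could be sketched on top of today's standard TextEncoder. The names stringEncoding, opt_dict, and the out_dict keys come from the proposal in this thread; using TextEncoder as the backing codec, restricting it to UTF-8, and the exact truncation behavior are my stand-ins, not anything agreed here:

```javascript
// Sketch of the proposed encode() shape, backed by the standard TextEncoder.
// Only UTF-8 is handled; "view"/"encoding" in, and charactersWritten,
// bytesWritten, output out, follow the shape proposed in this thread.
const stringEncoding = {
  encode(string, opt_dict = {}) {
    const encoding = (opt_dict.encoding || "utf-8").toLowerCase();
    if (encoding !== "utf-8") throw new Error("sketch handles utf-8 only");
    const bytes = new TextEncoder().encode(string); // never throws for utf-8
    let output = opt_dict.view;
    if (output) {
      // Whole-codepoint truncation is not modeled in this sketch.
      output.set(bytes.subarray(0, output.length));
    } else {
      output = bytes; // the create-a-buffer-on-the-fly case
    }
    return {
      charactersWritten: string.length,
      bytesWritten: Math.min(bytes.length, output.length),
      output,
    };
  },
};

const result = stringEncoding.encode("h\u00e9llo"); // 5 chars, 6 utf-8 bytes
```

A caller would then read result.output, or pass {view: someUint8Array} to reuse a buffer, matching the output === view rule above.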
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:

> And just to be clear, the use case is decoding data formats where string fields are variable length null terminated.

... and the spec should include normative guidance that length-prefixing is strongly recommended for new data formats.
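A length-prefixed layout of the sort this guidance recommends might look like the following. The format here (a little-endian uint32 byte count followed by UTF-8 payload) is purely a hypothetical example, and DataView/TextDecoder are today's APIs, not part of the proposal:

```javascript
// Read a length-prefixed UTF-8 string: uint32 (LE) byte count, then payload.
// No scanning for \0 is needed -- the prefix says exactly how far to read.
function readLengthPrefixed(bytes, offset = 0) {
  const dv = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const len = dv.getUint32(offset, true); // explicit little-endian read
  const start = offset + 4;
  return new TextDecoder().decode(bytes.subarray(start, start + len));
}

// A 4-byte prefix (5) followed by the 5 bytes of "hello":
const s = readLengthPrefixed(
  new Uint8Array([5, 0, 0, 0, 0x68, 0x65, 0x6c, 0x6c, 0x6f])
);
```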
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:

> And just to be clear, the use case is decoding data formats where string fields are variable length null terminated.

A concrete example is ZIP central directories.

> I think we want both encoding and destination to be optional. That leads us to an API like:
>
>   out_dict = stringEncoding.encode(string, opt_dict);
>
> ... where both out_dict and opt_dict are WebIDL dictionaries:
>
>   opt_dict keys: view, encoding
>   out_dict keys: charactersWritten, bytesWritten, output

The return value should just be a [NoInterfaceObject] interface. Dictionaries are used for input fields.

Something that came up on IRC that we should spend some time thinking about, though: is it actually important to be able to encode into an existing buffer? This may be a premature optimization. You can always encode into a new buffer and--if needed--copy the result where you need it. If we don't support that, most of this extra stuff in encode() goes away.

> ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??)

Uint8Array is correct. (Uint8ClampedArray is for image color data.) If UTF-16 or UTF-32 are supported, decoding to them should return Uint16Array and Uint32Array, respectively (with the return value being typed just as ArrayBufferView).

> If this instead is attached to String, it would look like:
>
>   out_dict = my_string.encode(opt_dict);
>
> If it were attached to ArrayBufferView, having a right-size buffer allocated for the caller gets uglier unless we include a static version.

If in-place decoding isn't really needed, we could have:

  newView = str.encode("utf-8"); // or {encoding: "utf-8"}
  str2 = newView.decode("utf-8");
  len = newView.find(0); // replaces stringLength, searching for 0 in the view's type

You'd use Uint16Array for UTF-16, and encodedLength() would go away. newView.find(val) would live on subclasses of TypedArray.

> In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over allocating or reallocating/moving.

But since that's all behind the scenes, the implementation can do it whichever way is most efficient for the particular encoding. In many cases it may be possible to eliminate any reallocation, by making an educated guess about how big the buffer is likely to be.

On Fri, Mar 16, 2012 at 11:21 AM, Joshua Bell jsb...@chromium.org wrote:

> ... and the spec should include normative guidance that length-prefixing is strongly recommended for new data formats.

I think this would be a bit off-topic.

-- Glenn Maynard
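The null-terminated-field use case behind stringLength/find(0) can be sketched with today's primitives. Here %TypedArray%.prototype.indexOf stands in for the proposed find(0), and TextDecoder for the proposed decode(); both substitutions are mine, not part of the proposal:

```javascript
// Decode a variable-length, null-terminated string field from a buffer,
// the way the proposed find(0)/stringLength would be used.
function decodeNullTerminated(bytes, offset = 0) {
  // indexOf(0) plays the role of find(0): locate the first \0, or use the
  // rest of the buffer if there is none (memchr semantics).
  let end = bytes.indexOf(0, offset);
  if (end === -1) end = bytes.length;
  const value = new TextDecoder("utf-8").decode(bytes.subarray(offset, end));
  // Skip past the terminator so the caller can read the next field.
  return { value, nextOffset: Math.min(end + 1, bytes.length) };
}

// The bytes of "ab\0cd":
const field = decodeNullTerminated(
  new Uint8Array([0x61, 0x62, 0x00, 0x63, 0x64])
);
```

For UTF-16 fields one would scan a Uint16Array instead, so the search is over code units of the view's own element type, as Glenn notes.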
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, 16 Mar 2012, Glenn Maynard wrote:

> On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:
>
>> And just to be clear, the use case is decoding data formats where string fields are variable length null terminated.
>
> A concrete example is ZIP central directories.
>
>> I think we want both encoding and destination to be optional. That leads us to an API like:
>>
>>   out_dict = stringEncoding.encode(string, opt_dict);
>>
>> ... where both out_dict and opt_dict are WebIDL dictionaries:
>>
>>   opt_dict keys: view, encoding
>>   out_dict keys: charactersWritten, bytesWritten, output
>
> The return value should just be a [NoInterfaceObject] interface. Dictionaries are used for input fields.
>
> Something that came up on IRC that we should spend some time thinking about, though: is it actually important to be able to encode into an existing buffer? This may be a premature optimization. You can always encode into a new buffer and--if needed--copy the result where you need it. If we don't support that, most of this extra stuff in encode() goes away.

Yes, I think we should focus on getting feature parity with e.g. Python first -- i.e. not worry about decoding into existing buffers -- and add extra fancy stuff later if we find that there are actually use cases where avoiding the copy is critical. This should allow us to focus on getting the right API for the common case.

> If in-place decoding isn't really needed, we could have:
>
>   newView = str.encode("utf-8"); // or {encoding: "utf-8"}
>   str2 = newView.decode("utf-8");
>   len = newView.find(0); // replaces stringLength, searching for 0 in the view's type
>
> You'd use Uint16Array for UTF-16, and encodedLength() would go away.

This looks like a big win to me.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote:

> On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:
>
>> ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??)
>
> Uint8Array is correct. (Uint8ClampedArray is for image color data.) If UTF-16 or UTF-32 are supported, decoding to them should return Uint16Array and Uint32Array, respectively (with the return value being typed just as ArrayBufferView).

FYI, there was some follow-up IRC conversation on this. With typed arrays as currently specified - that is, with Uint16Array having platform endianness - the above would imply either that platform endianness dictates the output byte sequence (and le/be is ignored), or that encode("\uFFFD", "utf-16").view[0] might != 0xFFFD on some platforms.

There was consensus (among the two of us) that the output view's underlying buffer's byte order would be LE/BE depending on the selected encoding. There is no consensus on what the returned view type should be - Uint8Array, or BE/LE variants of Uint16Array that conceal platform endianness.
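The platform-endianness question above can be made concrete. DataView reads and writes with an explicit byte order, while Uint16Array uses the platform's, so comparing the two reveals the host order. This is a common detection idiom, not anything from the proposal:

```javascript
// Detect platform endianness by writing a known 16-bit value with explicit
// byte order (DataView) and reading it back through a Uint16Array view.
function isLittleEndian() {
  const buf = new ArrayBuffer(2);
  new DataView(buf).setUint16(0, 0x1234, true); // write little-endian
  return new Uint16Array(buf)[0] === 0x1234;    // true only on LE hosts
}

// On a big-endian host this returns false, which is exactly the case where
// encode("\uFFFD", "utf-16le") viewed through a Uint16Array would disagree
// with the encoded byte sequence -- the ambiguity discussed above.
const le = isLittleEndian();
```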
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:12 PM, Joshua Bell wrote:

> FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness

For what it's worth, it seems like this is something we should seriously consider changing, so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway.

-Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 2:17 PM, Boris Zbarsky wrote:

> On 3/16/12 5:12 PM, Joshua Bell wrote:
>
>> FYI, there was some follow up IRC conversation on this. With Typed Arrays as currently specified - that is, that Uint16Array has platform endianness
>
> For what it's worth, it seems like this is something we should seriously consider changing so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway.

The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. If you see some evangelists skipping the endian check, send them an e-mail and let them know.

-Charles
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, 16 Mar 2012, Charles Pritchard wrote:

> On 3/16/2012 2:17 PM, Boris Zbarsky wrote:
>
>> For what it's worth, it seems like this is something we should seriously consider changing so as to make the web-visible endianness of typed arrays always be little-endian. Authors are actively writing code (and being encouraged to do so by technology evangelists) that makes that assumption anyway.
>
> The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness. If you see some evangelists skipping the endian check, send them an e-mail and let them know.

Not going to work. You can't evangelise people into making their code work on architectures that they don't own. It's hard enough to get people to work around differences between browsers when all the browsers are available for free and run on the platforms that they develop on.

The reality is that on devices where typed arrays don't appear LE, content will break.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote:

> The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness.

That's wrong. This is web API design 101; everyone should know better than this by now. Exposing platform endianness is setting the platform up for massive incompatibilities down the road.

In reality, the spec is moot here: if anyone does implement typed arrays on a production big-endian system, they're going to make these views little-endian, because doing otherwise would break countless applications, essentially all of which are tested only on little-endian systems. Web compatibility is a top priority for browser implementations.

(DataView isn't relevant here; it's used for different access patterns. To access arrays of data embedded in an ArrayBuffer, you use views, not DataView. Use DataView if you have a packed data structure with variable-size fields, such as the metadata in a ZIP local file header.)

-- Glenn Maynard
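Glenn's ZIP example can be sketched with DataView's explicit-endianness accessors. The field offsets follow the ZIP local file header layout; the header bytes here are a hand-built fragment for illustration, not a real archive:

```javascript
// Read fields from a (fabricated) ZIP local file header fragment.
// ZIP is defined as little-endian, so every read passes true for the
// littleEndian argument -- platform byte order never enters the picture.
const header = new Uint8Array([
  0x50, 0x4b, 0x03, 0x04, // signature "PK\x03\x04" = 0x04034b50 (LE)
  0x14, 0x00,             // version needed to extract = 20
  0x00, 0x00,             // general purpose bit flag
  0x08, 0x00,             // compression method = 8 (DEFLATE)
]);
const dv = new DataView(header.buffer);
const signature = dv.getUint32(0, true); // variable-size-field access pattern
const method = dv.getUint16(8, true);    // DataView shines for packed structs
```

Fixed-stride numeric arrays, by contrast, are the case for Uint16Array-style views, which is where the endianness dispute above actually bites.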
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 3:26 PM, Glenn Maynard wrote:

> On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote:
>
>> The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness.
>
> That's wrong. This is web API design 101; everyone should know better than this by now. Exposing platform endianness is setting the platform up for massive incompatibilities down the road.

I make mistakes all the time with UTF-8 and raw string arrays. I make mistakes all the time with endianness. Low-level API design 101: everyone working with low-level APIs makes mistakes.

> In reality, the spec is moot here: if anyone does implement typed arrays on a production big-endian system, they're going to make these views little-endian, because doing otherwise would break countless applications, essentially all of which are tested only on little-endian systems. Web compatibility is a top priority for browser implementations.

It's up to programmers to code defensively, more so with multi-platform multi-vendor deployments than walled gardens. Authors should be using the spec as written; it only takes one target system to use big-endian. It doesn't harm anything for a vendor to implement as little-endian, as most authors assume and test on little-endian. It may cause some harm to alter the spec so as to remove the requirement that coders account for both.

> (DataView isn't relevant here; it's used for different access patterns. To access arrays of data embedded in an ArrayBuffer, you use views, not DataView. Use DataView if you have a packed data structure with variable-size fields, such as the metadata in a ZIP local file header.)

I use the subarray pattern frequently. DataView is not much different than using subarray. Use DataView when it's easier than ArrayBufferView and available.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:44 PM, Charles Pritchard wrote:

> The DataView set of methods already does this work. The raw arrays are supposed to have platform endianness.

I haven't seen anyone actually using the DataView stuff in practice, or presenting it to developers much...

> If you see some evangelists skipping the endian check, send them an e-mail and let them know.

I've done that... then I stopped, because it just wasn't worth the effort. Every single WebGL demo I've seen recently was doing this. People were being told that typed arrays are a good way to load binary (integer and float) data from servers using the arraybuffer facilities of XHR at SXSW last week, with no mention of endianness.

I think that trying to get web developers to do this right is a lost cause, esp. because none of them (to a good approximation) have any big-endian systems to test on.

-Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 5:25 PM, Brandon Jones wrote:

> Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are.

I believe that recent Firefox on a SPARC processor would fit that description. Of course, the number of web developers that have a SPARC-based machine is 0 to a very good approximation...

-Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:

> On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:
>
>> On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:
>>
>>> What's the use-case for the stringLength function? You can't decode into an existing datastructure anyway, so you're ultimately forced to call decode, at which point the stringLength function hasn't helped you.
>>
>> stringLength doesn't return the length of the decoded string. It returns the byte offset of the first \0 (or the length of the whole buffer, if none), for decoding null-terminated strings. For multibyte encodings (eg. everything except UTF-16 and friends), it's just memchr(), so it's much faster than actually decoding the string.
>
> And just to be clear, the use case is decoding data formats where string fields are variable-length and null-terminated.
>
>>> Currently the use-case of simply wanting to convert a string to a binary buffer is a bit cumbersome. You first have to call the encodedLength function, then allocate a buffer of the right size, then call the encode function.
>>
>> I suggested eg. result = encode(string, "utf-8", null).output; which would create an ArrayBuffer of the required size. Presumably the null ArrayBufferView argument would be optional, so you could just say encode(string, "utf-8").
>
> I think we want both the encoding and the destination to be optional. That leads us to an API like:
>
>   out_dict = stringEncoding.encode(string, opt_dict);
>
> ... where both out_dict and opt_dict are WebIDL dictionaries:
>
>   opt_dict keys: view, encoding
>   out_dict keys: charactersWritten, bytesWritten, output
>
> ... where output === view if view is supplied, otherwise a new Uint8Array (or Uint8ClampedArray??)
>
> If this instead is attached to String, it would look like:
>
>   out_dict = my_string.encode(opt_dict);
>
> If it were attached to ArrayBufferView, having a right-sized buffer allocated for the caller gets uglier unless we include a static version.

Using input and output dictionaries is definitely messy, but I can't see a better way either. And I think ES6 is adding some syntax here that will make developers' lives better (destructuring assignment).

>>> It doesn't seem possible to implement the 'encode' function without doing multiple scans over the string. The implementation seems required both to check that the data can be encoded using the specified encoding, and to check that the data will fit in the passed-in buffer. Only then can the implementation start encoding the data. This seems problematic.
>>
>> Only if it guarantees that it doesn't write anything to the output buffer unless the entire result will fit. I don't think we need to do that; just guarantee that it'll be truncated on a whole codepoint.
>
> Agreed. Input/output dicts mean the API documentation a caller needs to read to understand the usage is more complex than a function signature, which is why I resisted them, but it does seem like the best approach. Thanks for pushing, Glenn!
>
> In the create-a-buffer-on-the-fly case there will be some memory juggling going on, either by initially over-allocating or by reallocating/moving.

The implementation can always figure out what strategy fits its own requirements best with regards to memory allocation. I suspect that right now in Firefox the fastest implementation would be to scan through the string once to measure the desired buffer size, then allocate and write into the allocated buffer.

The problem is that, the way the encoding function is defined right now, you are not allowed to write any data if you are throwing for whatever reason, which means that you have to do one scan to see if you need to throw, and then a separate pass to actually encode the data. I think we need to change that such that, when an exception is thrown, data should be written up to the point that caused the exception.

>>> I also don't think it's a good idea to throw an exception for encoding errors. Better to convert characters to the unicode replacement character. I believe we made a similar change to the WebSockets specification recently.
>>
>> Was that change made? I filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to be undecided.
>
> Settling on an options dict means adding a flag to control this behavior (throws: true?) doesn't extend the API surface significantly.

Sounds good to me. Though I would still strongly prefer the default to be non-throwing, so as to minimize the risk of website breakage in the case of bugs, especially since these bugs are so data-dependent and are likely not to happen on a developer's computer.

/ Jonas
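The replace-versus-throw trade-off discussed here is the same choice the decode side exposes today through TextDecoder's fatal option, which defaults to the lenient, replacement-character behavior Jonas prefers. A sketch using today's API (not the throws flag proposed in this thread):

```javascript
// Invalid UTF-8: a lone 0xFF byte can never appear in a well-formed sequence.
const bad = new Uint8Array([0x61, 0xff, 0x62]);

// Default (non-throwing): invalid bytes become U+FFFD, so buggy,
// data-dependent input degrades instead of breaking the page.
const lenient = new TextDecoder("utf-8").decode(bad); // "a\uFFFDb"

// Opt-in strict mode throws instead, analogous to an explicit throws: true.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = true; // TypeError on malformed input
}
```

Making strictness opt-in matches Jonas's point: errors that never show up on the developer's machine shouldn't take the site down for users.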
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 4:25 PM, Boris Zbarsky wrote:

> On 3/16/12 5:25 PM, Brandon Jones wrote:
>
>> Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are.
>
> I believe that recent Firefox on a SPARC processor would fit that description. Of course the number of web developers that have a SPARC-based machine is 0 to a very good approximation...

I've written some hash/encryption methods that could very well fail on Firefox on SPARC; many things fail on machines I've never tested with. Flip the implementation on SPARC, and it wouldn't harm anything. Cut it out of the spec, so that the behavior is undocumented, and implementations break.

DataView is more complex than ArrayBufferView, so implementers started with the easy option. The coders using Float32Array are cowboys (web app gaming and encryption); we're talking about a few hundred people out of many millions.
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On Fri, Mar 16, 2012 at 4:25 PM, Boris Zbarsky bzbar...@mit.edu wrote:

> On 3/16/12 5:25 PM, Brandon Jones wrote:
>
>> Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware of any devices available right now that support WebGL that are.
>
> I believe that recent Firefox on a SPARC processor would fit that description. Of course the number of web developers that have a SPARC-based machine is 0 to a very good approximation...
>
> -Boris

You can s/web developers/users/ and the statement would still apply, wouldn't it?

- James
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/12 7:43 PM, James Robinson wrote:

> You can s/web developers/users/ and the statement would still apply, wouldn't it?

Sure, but so what? The upshot is that people are writing code that assumes little-endian hardware all over. We should just clearly make the spec say that that's what typed arrays are, so that an implementor can actually implement the spec and be web-compatible.

The value of a spec which can't be implemented as written is arguably lower than not having a spec at all... At least then you _know_ you have to reverse-engineer.

-Boris
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
On 3/16/2012 5:25 PM, Boris Zbarsky wrote:

> On 3/16/12 7:43 PM, James Robinson wrote:
>
>> You can s/web developers/users/ and the statement would still apply, wouldn't it?
>
> Sure, but so what? The upshot is that people are writing code that assumes little-endian hardware all over. We should just clearly make the spec say that that's what typed arrays are, so that an implementor can actually implement the spec and be web-compatible. The value of a spec which can't be implemented as written is arguably lower than not having a spec at all... At least then you _know_ you have to reverse-engineer.

Isn't that an issue for TC39?