Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


 stringLength doesn't return the length of the decoded string.  It returns
 the byte offset of the first \0 (or the length of the whole buffer, if
 none), for decoding null-terminated strings.  For multibyte encodings (e.g.
 everything except UTF-16 and friends), it's just memchr(), so it's much
 faster than actually decoding the string.


And just to be clear, the use case is decoding data formats where string
fields are variable-length and null-terminated.
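
To make that concrete, a minimal sketch of the pattern (stringLength and
decode as proposed in this thread; the offset and field names are made up):

// bytes is a Uint8Array over the raw record data
var field = bytes.subarray(16);                    // hypothetical field offset
var len = stringLength(field, "utf-8");            // byte offset of first \0,
                                                   // or field.length if none
var name = decode(field.subarray(0, len), "utf-8");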


 Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


 I suggested e.g.

 result = encode(string, "utf-8", null).output;

 which would create an ArrayBuffer of the required size.  Presumably the
 null ArrayBufferView argument would be optional, so you could just say
 encode(string, "utf-8").


I think we want both encoding and destination to be optional. That leads us
to an API like:

out_dict = stringEncoding.encode(string, opt_dict);

... where both out_dict and opt_dict are WebIDL Dictionaries:

opt_dict keys: view, encoding
out_dict keys: charactersWritten, bytesWritten, output

... where output === view if view is supplied, otherwise a new Uint8Array
(or Uint8ClampedArray??)

If this instead is attached to String, it would look like:

out_dict = my_string.encode(opt_dict);

If it were attached to ArrayBufferView, having a right-size buffer
allocated for the caller gets uglier unless we include a static version.
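
To make the shape concrete, usage might look like this (names per the
sketch above; nothing here is final):

// Encode into a freshly allocated buffer:
var result = stringEncoding.encode("caf\u00E9", {encoding: "utf-8"});
// result.output is a new 5-byte Uint8Array; result.bytesWritten === 5

// Encode into a caller-supplied view:
var view = new Uint8Array(64);
var r2 = stringEncoding.encode("caf\u00E9", {view: view, encoding: "utf-8"});
// r2.output === view; r2.charactersWritten / r2.bytesWritten report progress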

 It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


 Only if it guarantees that it doesn't write anything to the output buffer
 unless the entire result will fit.  I don't think we need to do that; just
 guarantee that it'll be truncated on a whole codepoint.


Agreed. Input/output dicts mean the API documentation a caller needs to
read to understand the usage is more complex than a function signature,
which is why I resisted them, but it does seem like the best approach.
Thanks for pushing, Glenn!

In the create-a-buffer-on-the-fly case there will be some memory juggling
going on, either by initially over-allocating or by reallocating/moving.


 I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


 Was that change made?  I filed
 https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
 to be undecided.


Settling on an options dict means that adding a flag to control this behavior
(throws: true?) wouldn't extend the API surface significantly.
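
For example (flag name illustrative, not settled):

var s = "snowman: \u2603";  // U+2603 has no windows-1252 encoding
// Default: unencodable characters are replaced rather than throwing
var r = stringEncoding.encode(s, {encoding: "windows-1252"});
// Opting in to exceptions instead:
var r2 = stringEncoding.encode(s, {encoding: "windows-1252", throws: true});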


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:


 And just to be clear, the use case is decoding data formats where string
 fields are variable-length and null-terminated.


... and the spec should include normative guidance that length-prefixing is
strongly recommended for new data formats.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Glenn Maynard
On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:

 And just to be clear, the use case is decoding data formats where string
 fields are variable-length and null-terminated.


A concrete example is ZIP central directories.

 I think we want both encoding and destination to be optional. That leads us
 to an API like:

 out_dict = stringEncoding.encode(string, opt_dict);

 .. where both out_dict and opt_dict are WebIDL Dictionaries:

 opt_dict keys: view, encoding



 out_dict keys: charactersWritten, bytesWritten, output


The return value should just be a [NoInterfaceObject] interface.
Dictionaries are used for input fields.

Something that came up on IRC that we should spend some time thinking
about, though: Is it actually important to be able to encode into an
existing buffer?  This may be a premature optimization.  You can always
encode into a new buffer, and--if needed--copy the result where you need it.

If we don't support that, most of this extra stuff in encode() goes away.

... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)


Uint8Array is correct.  (Uint8ClampedArray is for image color data.)

If UTF-16 or UTF-32 are supported, decoding to them should return
Uint16Array and Uint32Array, respectively (with the return value being
typed just to ArrayBufferView).

 If this instead is attached to String, it would look like:

 out_dict = my_string.encode(opt_dict);

 If it were attached to ArrayBufferView, having a right-size buffer
 allocated for the caller gets uglier unless we include a static version.


If in-place decoding isn't really needed, we could have:

newView = str.encode("utf-8"); // or {encoding: "utf-8"}
str2 = newView.decode("utf-8");
len = newView.find(0); // replaces stringLength, searching for 0 in the
view's type; you'd use Uint16Array for UTF-16

and encodedLength() would go away.

newView.find(val) would live on subclasses of TypedArray.
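
Put together, decoding a null-terminated field would then be (a sketch;
find() is hypothetical here, and what it returns when no 0 is present
would need defining):

var field = bytes.subarray(fieldStart);   // Uint8Array over the record
var end = field.find(0);                  // index of first 0x00, a la memchr
var str = field.subarray(0, end).decode("utf-8");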

 In the create-a-buffer-on-the-fly case there will be some memory juggling
 going on, either by initially over-allocating or by reallocating/moving.


But since that's all behind the scenes, the implementation can do it
whichever way is most efficient for the particular encoding.  In many
cases, it may be possible to eliminate any reallocation by making an
educated guess about how big the buffer is likely to be.
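
For example, for UTF-8 a worst-case guess is cheap, since one UTF-16 code
unit never encodes to more than three UTF-8 bytes (a sketch of the
strategy, not a prescribed implementation):

var buf = new Uint8Array(str.length * 3);  // can never be too small for UTF-8
// ... encode into buf, tracking bytesWritten, then return a right-sized view:
var result = buf.subarray(0, bytesWritten);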

On Fri, Mar 16, 2012 at 11:21 AM, Joshua Bell jsb...@chromium.org wrote:

 ... and the spec should include normative guidance that length-prefixing is
 strongly recommended for new data formats.


I think this would be a bit off-topic.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Graham



On Fri, 16 Mar 2012, Glenn Maynard wrote:


On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:


And just to be clear, the use case is decoding data formats where string
fields are variable-length and null-terminated.



A concrete example is ZIP central directories.

I think we want both encoding and destination to be optional. That leads us
to an API like:

out_dict = stringEncoding.encode(string, opt_dict);

... where both out_dict and opt_dict are WebIDL Dictionaries:

opt_dict keys: view, encoding
out_dict keys: charactersWritten, bytesWritten, output



The return value should just be a [NoInterfaceObject] interface.
Dictionaries are used for input fields.

Something that came up on IRC that we should spend some time thinking
about, though: Is it actually important to be able to encode into an
existing buffer?  This may be a premature optimization.  You can always
encode into a new buffer, and--if needed--copy the result where you need it.

If we don't support that, most of this extra stuff in encode() goes away.


Yes, I think we should focus on getting feature parity with e.g. Python
first -- i.e. not worry about decoding into existing buffers -- and add
extra fancy stuff later if we find that there are actually use cases where
avoiding the copy is critical. This should allow us to focus on getting
the right API for the common case.



If in-place decoding isn't really needed, we could have:

newView = str.encode("utf-8"); // or {encoding: "utf-8"}
str2 = newView.decode("utf-8");
len = newView.find(0); // replaces stringLength, searching for 0 in the
view's type; you'd use Uint16Array for UTF-16

and encodedLength() would go away.


This looks like a big win to me.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote:

 On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:


 ... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)


 Uint8Array is correct.  (Uint8ClampedArray is for image color data.)

 If UTF-16 or UTF-32 are supported, decoding to them should return
 Uint16Array and Uint32Array, respectively (with the return value being
 typed just to ArrayBufferView).


FYI, there was some follow-up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness
- the above would imply that either platform endianness dictated the output
byte sequence (and le/be was ignored), or that encode("\uFFFD",
"utf-16").view[0] might != 0xFFFD on some platforms.

There was consensus (among the two of us) that the output view's underlying
buffer's byte order would be le/be depending on the selected encoding.
There is not consensus over what the return view type should be -
Uint8Array, or BE/LE variants of Uint16Array that conceal platform
endianness.
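
Concretely, the ambiguity (member and function names as sketched earlier;
assuming "utf-16" means little-endian output exposed as raw bytes):

var bytes = stringEncoding.encode("\uFFFD", {encoding: "utf-16"}).output;
// bytes holds FD FF; now reinterpret them with platform byte order:
var u16 = new Uint16Array(bytes.buffer);
u16[0]; // 0xFFFD on little-endian hardware, 0xFDFF on big-endian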


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:12 PM, Joshua Bell wrote:

FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness


For what it's worth, it seems like this is something we should seriously 
consider changing so as to make the web-visible endianness of typed 
arrays always be little-endian.  Authors are actively writing code (and 
being encouraged to do so by technology evangelists) that makes that 
assumption anyway


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 2:17 PM, Boris Zbarsky wrote:

On 3/16/12 5:12 PM, Joshua Bell wrote:
FYI, there was some follow up IRC conversation on this. With Typed 
Arrays
as currently specified - that is, that Uint16Array has platform 
endianness


For what it's worth, it seems like this is something we should 
seriously consider changing so as to make the web-visible endianness 
of typed arrays always be little-endian.  Authors are actively writing 
code (and being encouraged to do so by technology evangelists) that 
makes that assumption anyway


The DataView set of methods already does this work. The raw arrays are 
supposed to have platform endianness.


If you see some evangelists skipping the endian check, send them an 
e-mail and let them know.
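
(For reference, the endian check in question is a one-liner:)

// true on little-endian platforms, false on big-endian ones
var littleEndian = new Uint8Array(new Uint16Array([1]).buffer)[0] === 1;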



-Charles


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Graham

On Fri, 16 Mar 2012, Charles Pritchard wrote:


On 3/16/2012 2:17 PM, Boris Zbarsky wrote:

On 3/16/12 5:12 PM, Joshua Bell wrote:

FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness


For what it's worth, it seems like this is something we should seriously 
consider changing so as to make the web-visible endianness of typed arrays 
always be little-endian.  Authors are actively writing code (and being 
encouraged to do so by technology evangelists) that makes that assumption 
anyway


The DataView set of methods already does this work. The raw arrays are 
supposed to have platform endianness.


If you see some evangelists skipping the endian check, send them an e-mail 
and let them know.


Not going to work.

You can't evangelise people into making their code work on architectures 
that they don't own. It's hard enough to get people to work around 
differences between browsers when all the browsers are available for free 
and run on the platforms that they develop on.


The reality is that on devices where typed arrays don't appear LE, content 
will break.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Glenn Maynard
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote:

 The DataView set of methods already does this work. The raw arrays are
 supposed to have platform endianness.


That's wrong.  This is web API design 101; everyone should know better than
this by now.  Exposing platform endianness is setting the platform up for
massive incompatibilities down the road.

In reality, the spec is moot here: if anyone does implement typed arrays on
a production big-endian system, they're going to make these views
little-endian, because doing otherwise would break countless applications,
essentially all of which are tested only on little-endian systems.  Web
compatibility is a top priority for browser implementations.

(DataView isn't relevant here; it's used for different access patterns.  To
access arrays of data embedded in an ArrayBuffer, you use views, not
DataView.  Use DataView if you have a packed data structure with
variable-size fields, such as the metadata in a ZIP local file header.)
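
For instance, reading the fixed fields of a ZIP local file header with
DataView might look like this (offsets per the ZIP format; a sketch,
assuming buffer holds the header bytes):

var dv = new DataView(buffer);
// Every ZIP field is little-endian, so pass true for the littleEndian flag:
var signature = dv.getUint32(0, true);   // 0x04034b50 ("PK\x03\x04")
var method    = dv.getUint16(8, true);   // compression method
var nameLen   = dv.getUint16(26, true);  // file name length
var extraLen  = dv.getUint16(28, true);  // extra field length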

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 3:26 PM, Glenn Maynard wrote:
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote:


The DataView set of methods already does this work. The raw arrays
are supposed to have platform endianness.


That's wrong.  This is web API design 101; everyone should know better 
than this by now.  Exposing platform endianness is setting the 
platform up for massive incompatibilities down the road.


I make mistakes all the time with UTF-8 and raw string arrays. I make 
mistakes all the time with endianness.


Low-level API design 101; everyone working with low-level APIs makes 
mistakes.


In reality, the spec is moot here: if anyone does implement typed 
arrays on a production big-endian system, they're going to make these 
views little-endian, because doing otherwise would break countless 
applications, essentially all of which are tested only on 
little-endian systems.  Web compatibility is a top priority for browser 
implementations.


It's up to programmers to code defensively, more so with multi-platform,
multi-vendor deployments than with walled gardens.
Authors should be using the spec as written; it only takes one target
system to use big-endian.


It doesn't harm anything for a vendor to implement as little-endian, as
most authors assume and test on little-endian systems.
It may cause some harm to alter the spec so as to remove the requirement 
that coders account for both.




(DataView isn't relevant here; it's used for different access 
patterns.  To access arrays of data embedded in an ArrayBuffer, you 
use views, not DataView.  Use DataView if you have a packed data 
structure with variable-size fields, such as the metadata in a ZIP 
local file header.)
I use the subarray pattern frequently. DataView is not much different 
than using subarray.


Use DataView when it's easier than ArrayBufferView and available.




Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:44 PM, Charles Pritchard wrote:

The DataView set of methods already does this work. The raw arrays are
supposed to have platform endianness.


I haven't seen anyone actually using the DataView stuff in practice, or 
presenting it to developers much...



If you see some evangelists skipping the endian check, send them an
e-mail and let them know.


I've done that... then I stopped because it just wasn't worth the 
effort.  Every single WebGL demo I've seen recently was doing this. 
At SXSW last week, people were being told that typed arrays are a good 
way to load binary (integer and float) data from servers using the 
arraybuffer facilities of XHR, with no mention of endianness.


I think that trying to get web developers to do this right is a lost 
cause, esp. because none of them (to a good approximation) have any 
big-endian systems to test on.


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:25 PM, Brandon Jones wrote:

Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware
of any devices available right now that support WebGL that are.


I believe that recent Firefox on a SPARC processor would fit that 
description.  Of course the number of web developers that have a 
SPARC-based machine is 0 to a very good approximation


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Jonas Sicking
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:
 On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


 stringLength doesn't return the length of the decoded string.  It returns
 the byte offset of the first \0 (or the length of the whole buffer, if
 none), for decoding null-terminated strings.  For multibyte encodings (e.g.
 everything except UTF-16 and friends), it's just memchr(), so it's much
 faster than actually decoding the string.


 And just to be clear, the use case is decoding data formats where string
 fields are variable-length and null-terminated.


 Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


 I suggested e.g.

 result = encode(string, "utf-8", null).output;

 which would create an ArrayBuffer of the required size.  Presumably the
 null ArrayBufferView argument would be optional, so you could just say
 encode(string, "utf-8").


 I think we want both encoding and destination to be optional. That leads us
 to an API like:

 out_dict = stringEncoding.encode(string, opt_dict);

 ... where both out_dict and opt_dict are WebIDL Dictionaries:

 opt_dict keys: view, encoding
 out_dict keys: charactersWritten, bytesWritten, output

 ... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)

 If this instead is attached to String, it would look like:

 out_dict = my_string.encode(opt_dict);

 If it were attached to ArrayBufferView, having a right-size buffer
 allocated for the caller gets uglier unless we include a static version.

Using input and output dictionaries is definitely messy, but I can't
see a better way either. And I think ES6 is adding some syntax here
that will make developers' lives better (destructuring assignments).
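
For example (names as proposed earlier in the thread):

// ES6 destructuring pulls the dictionary members out directly:
let {output, charactersWritten, bytesWritten} =
    stringEncoding.encode(myString, {encoding: "utf-8"});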

 It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


 Only if it guarantees that it doesn't write anything to the output buffer
 unless the entire result will fit.  I don't think we need to do that; just
 guarantee that it'll be truncated on a whole codepoint.


 Agreed. Input/output dicts mean the API documentation a caller needs to
 read to understand the usage is more complex than a function signature,
 which is why I resisted them, but it does seem like the best approach.
 Thanks for pushing, Glenn!

 In the create-a-buffer-on-the-fly case there will be some memory juggling
 going on, either by initially over-allocating or by reallocating/moving.

The implementation can always figure out what strategy fits its own
requirements best with regard to memory allocation. I suspect that
right now in Firefox the fastest implementation would be to scan
through the string once to measure the desired buffer size, then
allocate and write into the allocated buffer.

The problem is that, the way the encoding function is defined right
now, you are not allowed to write any data if you are throwing for
whatever reason, which means that you have to do a scan first to see
if you need to throw, and then do a separate pass to actually encode
the data. I think we need to change that so that when an exception is
thrown, data is still written up to the point that causes the
exception.

 I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


 Was that change made?  I filed
 https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
 to be undecided.


 Settling on an options dict means that adding a flag to control this behavior
 (throws: true?) wouldn't extend the API surface significantly.

Sounds good to me. Though I would still strongly prefer the default to
be non-throwing, so as to minimize the risk of website breakage in the
case of bugs. Especially since these bugs are so data-dependent and
are likely to not happen on a developer's computer.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 4:25 PM, Boris Zbarsky wrote:

On 3/16/12 5:25 PM, Brandon Jones wrote:

Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware
of any devices available right now that support WebGL that are.


I believe that recent Firefox on a SPARC processor would fit that 
description.  Of course the number of web developers that have a 
SPARC-based machine is 0 to a very good approximation



I've written some hash/encryption methods that could very well fail 
on Firefox on SPARC; many things fail on machines I've never tested 
with.


Flip the implementation on SPARC, and it wouldn't harm anything. Cut it 
out of the spec so that the behavior is undocumented, and implementations 
break.
DataView is more complex than ArrayBufferView, so implementers started 
with the easy option.


The coders using Float32Array are cowboys (web app gaming and 
encryption). We're talking about a few hundred people out of many millions.




Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Robinson
On Fri, Mar 16, 2012 at 4:25 PM, Boris Zbarsky bzbar...@mit.edu wrote:

 On 3/16/12 5:25 PM, Brandon Jones wrote:

 Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware
 of any devices available right now that support WebGL that are.


 I believe that recent Firefox on a SPARC processor would fit that
 description.  Of course the number of web developers that have a
 SPARC-based machine is 0 to a very good approximation


You can s/web developers/users/ and the statement would still apply,
wouldn't it?

- James



 -Boris



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 7:43 PM, James Robinson wrote:

You can s/web developers/users/ and the statement would still apply,
wouldn't it?


Sure, but so what?

The upshot is that people are writing code that assumes little-endian 
hardware all over.  We should just make the spec clearly say that that's 
what typed arrays are, so that an implementor can actually implement the 
spec and be web compatible.


The value of a spec which can't be implemented as written is arguably 
lower than not having a spec at all...  At least then you _know_ you 
have to reverse-engineer.


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 5:25 PM, Boris Zbarsky wrote:

On 3/16/12 7:43 PM, James Robinson wrote:

You can s/web developers/users/ and the statement would still apply,
wouldn't it?


Sure, but so what?

The upshot is that people are writing code that assumes little-endian 
hardware all over.  We should just make the spec clearly say that 
that's what typed arrays are, so that an implementor can actually 
implement the spec and be web compatible.


The value of a spec which can't be implemented as written is arguably 
lower than not having a spec at all...  At least then you _know_ you 
have to reverse-engineer.


Isn't that an issue for TC39?