Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-04-04 Thread Joshua Bell
Any further input on Kenneth's suggestions?

Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just
DataView. As discussed below, data parsing/serialization operations will
tend to be associated with DataViews. As Glenn has mentioned elsewhere
recently, it is possible to accidentally do a buffer copy when mis-using
typed array constructors, while DataView avoids this. DataViews are cheap
to construct, and when I'm writing sample code for the proposed API I find
I create throw-away DataViews anyway. Also, there is the potential for
confusion when using a non-Uint8Array buffer: e.g., are the elements being
decoded using array[N] as the octets, or using the underlying buffer? For
Uint16Array/UTF-16 encodings, what are the endianness concerns? DataView
APIs have an explicit endianness and no index getter, which alleviates this
somewhat.
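
A minimal sketch of the ambiguity described above (illustrative only, not
part of the proposal):

// Two bytes, 0x41 0x00, shared by two views of the same buffer.
var buf = new ArrayBuffer(2);
new Uint8Array(buf).set([0x41, 0x00]);

// Typed array index getters use the host byte order:
var first = new Uint16Array(buf)[0];   // 0x0041 on little-endian hosts

// DataView makes the byte order explicit per call, and has no index getter:
var view = new DataView(buf);
view.getUint16(0, true);    // 0x0041 (read as little-endian)
view.getUint16(0, false);   // 0x4100 (read as big-endian)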

Re: writing into an existing buffer - as Glenn says, most of the input
earlier in the thread advocated strongly for a very simple initial API, with
streaming support as the only fancy feature beyond the minimal string =
foo.decode(buffer) / buffer = foo.encode(string). Adding something like
foo.encodeInto(string, buffer) later on is not precluded if there is demand.

Also, I am planning to move the fatal option from the encode/decode
methods to the TextEncoder/TextDecoder constructors. Objections?
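
For illustration, roughly what the minimal API above looks like with the
fatal option moved to the constructor (names per the draft; the exact option
shape is an assumption):

var encoder = new TextEncoder("utf-8");
var decoder = new TextDecoder("utf-8", {fatal: true});  // assumed constructor option

var bytes = encoder.encode("hello");   // buffer = foo.encode(string)
var text  = decoder.decode(bytes);     // string = foo.decode(buffer)
// With fatal set, malformed input would throw instead of yielding U+FFFD.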

On Tue, Mar 27, 2012 at 7:43 PM, Kenneth Russell k...@google.com wrote:

 On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote:
  On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote:
 
- I think it should reference DataView directly rather than
  ArrayBufferView. The typed array spec was specifically designed with
  two use cases in mind: in-memory assembly of data to be sent to the
  graphics card or audio device, where the byte order must be that of
  the host architecture;
 
 
  This is wrong, broken, won't be implemented this way by any production
  browser, isn't how it's used in practice, and needs to be fixed in the
  spec.  It violates the most basic web API requirement: interoperability.
  Please see earlier in the thread; the views affected by endianness need
 to
  be specced as little endian.  That's what everyone is going to implement,
  and what everyone's pages are going to depend on, so it's what the spec
  needs to say.  Separate types should be added for big-endian (eg.
  Int16BEArray).

 Thanks for your input.

 The design of the typed array classes was informed by requirements
 about how the OpenGL, and therefore WebGL, API works; and from prior
 experience with the design and implementation of Java's New I/O Buffer
 classes, which suffered from horrible performance pitfalls because of
 a design similar to that which you suggest.

 Production browsers already implement typed arrays with their current
 semantics. It is not possible to change them and have WebGL continue
 to function. I will go so far as to say that the semantics will not be
 changed.

 In the typed array specification, unlike Java's New I/O specification,
 the API was split between two use cases: in-memory data construction
 (for consumption by APIs like WebGL and Web Audio), and file and
 network I/O. The API was carefully designed to avoid roadblocks that
 would prevent maximum performance from being achieved for these use
 cases. Experience has shown that the moment an artificial performance
 barrier is imposed, it becomes impossible to build certain kinds of
 programs. I consider it unacceptable to prevent developers from
 achieving their goals.


  I also disagree that it should use DataView.  Views are used to access
  arrays (including strings) within larger data structures.  DataView is
 used
  to access packed data structures, where constructing a view for each
  variable in the struct is unwieldy.  It might be useful to have a helper
 in
  DataView, but the core API should work on views.

 This is one point of view. The true design goal of DataView is to
 supply the primitives for fast file and network input/output, where
 the endianness is explicitly specified in the file format. Converting
 strings to and from binary encodings is obviously an operation
 associated with transfer of data to or from files or the network.
 According to this taxonomy, the string encoding and decoding
 operations should only be associated with DataView, and not the other
 typed array types, which are designed for in-memory data assembly for
 consumption by other hardware on the system.


   - It would be preferable if the encoding API had a way to avoid
  memory allocation, for example to encode into a passed-in DataView.
 
 
  This was an earlier design, and discussion led to it being removed as a
  premature optimization, to simplify the API.  I'd recommend reading the
 rest
  of the thread.

 I do apologize for not being fully caught up on the thread, but hope
 that the input above was still useful.

 -Ken



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-04-04 Thread Glenn Maynard
On Wed, Apr 4, 2012 at 11:09 AM, Joshua Bell jsb...@chromium.org wrote:

 Any further input on Kenneth's suggestions?


I largely disagree with those suggestions, because I don't believe they
align with the natural, intuitive usage of the API.

Re: ArrayBufferView vs. DataView - I'm tempted to make the switch to just
 DataView. As discussed below, data parsing/serialization operations will
 tend to be associated with DataViews.


I disagree.  TypedArray is much more natural for processing arrays, since
they can be accessed just like a regular JavaScript array; code generally
doesn't have to care whether it's been given a JavaScript array or a
TypedArray.  For DataView, you need to rewrite everything.
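
A short sketch of that point: code written against array-style indexing
works unchanged for plain arrays and typed arrays, but has to be rewritten
for DataView:

function sum(a) {                  // works for Array and any typed array view
    var total = 0;
    for (var i = 0; i < a.length; i++)
        total += a[i];
    return total;
}

sum([1, 2, 3]);                    // plain JavaScript array
sum(new Uint8Array([1, 2, 3]));    // typed array: same code, no changes

function sumView(dv) {             // the DataView version must use getters
    var total = 0;
    for (var i = 0; i < dv.byteLength; i++)
        total += dv.getUint8(i);
    return total;
}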

As Glenn has mentioned elsewhere
 recently, it is possible to accidentally do a buffer copy when mis-using
 typed array constructors, while DataView avoids this.


That should be fixed, not used against TypedArray classes when they make
sense.

This can be fixed by adding a TypedArray(TypedArray, byteOffset, length)
constructor, which creates a new shallow view from an existing view; this
would be logically grouped with the similar TypedArray(ArrayBuffer,
byteOffset, length) function.  Unfortunately, the offset parameter would
have to be required, so the method can be resolved against the
TypedArray(TypedArray) constructor.  (A cleaner design would have been to
have a separate copy() function to create an explicit copy, but it's most
likely too late to remove the TypedArray(TypedArray) ctor.)
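
For comparison, a sketch of what copies and what doesn't today, with the
proposed offset/length constructor shown as a hypothetical:

var big = new Uint8Array(1024);

// No copy today: subarray() returns a new view over the same ArrayBuffer.
var slice = big.subarray(16, 32);

// No copy today, via the underlying buffer:
var sameBytes = new Uint8Array(big.buffer, 16, 16);

// Copies today: passing a view to a typed array constructor duplicates the data.
var copy = new Uint8Array(slice);

// Hypothetical (proposed above): a view-of-a-view constructor with a required
// offset, e.g. new Uint8Array(slice, 0, 16), that would *not* copy.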

As (another) aside, all of the TypedArray constructors should be available
on DataView, too, so they exist on all ArrayBufferView subtypes.

DataViews are cheap
 to construct, and when I'm writing sample code for the proposed API I find
 I create throw-away DataViews anyway.


Array views are cheap to construct, too.

APIs returning DataViews feel unnatural; it's a helper class that isn't
returned by anything else.  If you don't return a view of a specific,
contextually-meaningful type (eg. Int16LEArray for UTF-16LE) from encode(),
then returning the ArrayBuffer itself seems preferable, like XHR2.  Let's
not split APIs, with some returning DataView and some ArrayBuffer.

Also, there is the potential for
 confusion when using a non-Uint8Array buffer: e.g., are the elements being
 decoded using array[N] as the octets, or using the underlying buffer? For
 Uint16Array/UTF-16 encodings, what are the endianness concerns?


The data is always decoded based on the encoding specified.

It wouldn't make sense for decode() to only take a DataView.  If I have an
Int8Array, it's busywork to make me construct a DataView from it so I can
pass it to decode().  Just take ArrayBufferView, so it doesn't care what
the particular view type is.
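
A sketch of the usage difference being argued here (TextDecoder per the draft):

var bytes = new Int8Array([104, 101, 108, 108, 111]);   // "hello" as ASCII/UTF-8

// If decode() accepts any ArrayBufferView, the view can be passed directly:
var text = new TextDecoder("utf-8").decode(bytes);

// If decode() only took DataView, callers would first have to rewrap:
var text2 = new TextDecoder("utf-8").decode(
    new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength));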

DataView APIs have an explicit endianness and no index getter, which
 alleviates this
 somewhat.


Ideally, endian-explicit TypedArrays should be created, eg. Int16LEArray
and Int16BEArray.  I mentioned this in the other thread; the big-endian
types seem important to have anyway (regardless of the encoding API), and
the little-endian views are just so we can pretend the native endian
issue isn't there.

 Also, I am planning to move the fatal option from the encode/decode
 methods to the TextEncoder/TextDecoder constructors. Objections?


I don't have a strong feeling either way.  Can you think of any cases where
the encoder/decoder object would be handed off from one user to another,
who might want different behavior?  It seems unlikely.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-04-01 Thread Jonas Sicking
On Sat, Mar 31, 2012 at 6:13 PM, Glenn Maynard gl...@zewt.org wrote:
 On Wed, Mar 28, 2012 at 1:44 AM, Jonas Sicking jo...@sicking.cc wrote:

 Scanning over the buffer twice will cause a lot more memory IO and
 will definitely be slower.

 That's what cache is for.  But: benchmarks...

 We can argue whether it's meaningfully slower or harder. But it seems
 like we agree that it's slower and harder.

I'm saying that if an API is better in every way, then how much better
doesn't seem like an interesting discussion; we should clearly go with
that API.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-04-01 Thread Glenn Maynard
On Sun, Apr 1, 2012 at 5:28 PM, Jonas Sicking jo...@sicking.cc wrote:

 I'm saying that if an API is better in every way, then how much better
 doesn't seem like an interesting discussion; we should clearly go with
 that API.


It's not a different API, it's an *additional* API.  (Assuming that the
indexOf function is added anyway; the string-array use case wants it, and
its general usefulness should be uncontroversial.)  It doesn't remove
anything else.  There's always a cost to adding more API; what's not clear
is whether it's worth it here, since it's essentially a four-line helper
function (each way) that may or may not actually be used often enough to
justify itself.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-31 Thread Glenn Maynard
On Wed, Mar 28, 2012 at 1:44 AM, Jonas Sicking jo...@sicking.cc wrote:

 Scanning over the buffer twice will cause a lot more memory IO and
 will definitely be slower.


That's what cache is for.  But: benchmarks...

We can argue whether it's meaningfully slower or harder. But it seems
 like we agree that it's slower and harder.


What?  Are you really arguing that we should do something because of
*meaningless* differences?

I still don't understand what that benefit you are seeing is. You
 hinted at some more generic argument, but I still don't understand
 it. So far the only reason that has been brought up is that it
 provides an API for simply finding null terminators which could be
 useful if you are doing things other than decoding. Is that what you
 are talking about when you are saying that it's more generic?


Yes, I've said that repeatedly.  It also avoids bloating the API with
something that's merely a helper for something you can do in a couple lines
of code, and allows you to tell how many bytes/words were consumed (eg. for
packed string arrays).
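
A rough sketch of the packed-string-array case, assuming a view indexOf()
like the one discussed in this thread and the draft TextDecoder name:

// buf: Uint8Array of UTF-8 strings, each terminated by a 0x00 byte.
function decodePackedStrings(buf) {
    var decoder = new TextDecoder("utf-8");
    var strings = [];
    var pos = 0;
    while (pos < buf.length) {
        var end = buf.indexOf(0, pos);       // assumed, same semantics as String.indexOf
        if (end == -1)
            end = buf.length;
        strings.push(decoder.decode(buf.subarray(pos, end)));
        pos = end + 1;                       // bytes consumed, including the terminator
    }
    return strings;
}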

It can always be added later, but it feels unnecessary.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-28 Thread Jonas Sicking
On Tue, Mar 27, 2012 at 4:45 PM, Glenn Maynard gl...@zewt.org wrote:
 On Tue, Mar 27, 2012 at 12:41 AM, Jonas Sicking jo...@sicking.cc wrote:

 The memchr is purely overhead, i.e. we are comparing memchr+decoding
 to decoding. So I don't see what's backing up the "probably the
 fastest thing" claim.


 If you don't do it as an initial pass, then you have to embed null checks
 into the inner loop of your decoding algorithm.  For example, an ASCII
 decoder may look like:

 // char *input = input buffer
 // char *input_end = one past last byte of input buffer
 // wchar_t *output = output buffer
 input_end = memchr(input, 0, input_end - input);
 while(input < input_end)
 {
     if(*input >= 0x80)
         *output++ = 0xFFFD;
     else
         *output++ = *input;
     ++input;
 }

 If you don't do the initial search, then it becomes:

 while(input < input_end && *input != 0)
 {
     if(*input >= 0x80)
         *output++ = 0xFFFD;
     else
         *output++ = *input;
     ++input;
 }

 which means that you have an additional branch each time through the loop to
 check for the null terminator.  That's likely to be slower than just doing
 another pass.

 But anyway, please either make a benchmark or two to show the differences
 we're talking about, or drop performance as an argument.  This is all just
 a distraction otherwise.  I don't think the speed of conversion is even a
 serious issue, much less the microseconds taken by memchr.

The extra null check is basically free since you are going to be bound
by memory IO, i.e. the extra null check will just happen in the bubble
in the CPU pipeline while waiting for data from memory.

Scanning over the buffer twice will cause a lot more memory IO and
will definitely be slower.

  It doesn't seem materially harder (a little more code, yes, but that's
  not
  the same thing), and it's more general-purpose.

 I agree it doesn't seem materially harder. I also agree that I don't
 have data to show that it's materially slower. But it sounds like
 we're in agreement that keeping the logic outside is both harder and
 slower which honestly doesn't speak strongly in its favor.

 Sorry, I'm confused--you're saying that it isn't harder, but we're in
 agreement that it's harder.  Please clarify what you mean.

 I don't believe it's meaningfully slower or harder.

I'm saying that having separate functions for
* finding the null terminator
* decoding a set number of bytes

is both harder and slower for the webpage, than having a single
function which just decodes to the null terminator.

We can argue whether it's meaningfully slower or harder. But it seems
like we agree that it's slower and harder.

 I don't understand the argument that the alternative is more
 general-purpose. The API is already generic in that you can use
 whatever delimiter you want since you pass in a view. The only
 functionality which is not available is finding a null-terminator in
 an arraybuffer which you are arguing below shouldn't be part of the
 decoder (which I agree with).

 I'm confused.  What are you arguing?  The alternative--taking the null
 terminator search out of the decoder--you seem to argue against (first
 sentence), then to agree with (last sentence).  Can you back up and restate
 what you're saying from scratch?

If you agree that having separate functions for finding the null
terminator and then decoding to it is harder and slower than having a
single function which does both things, while still arguing that
separate functions are better, then clearly you must think that having
separate functions brings some other benefit.

I still don't understand what that benefit you are seeing is. You
hinted at some more generic argument, but I still don't understand
it. So far the only reason that has been brought up is that it
provides an API for simply finding null terminators which could be
useful if you are doing things other than decoding. Is that what you
are talking about when you are saying that it's more generic?

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-27 Thread Glenn Maynard
On Tue, Mar 27, 2012 at 12:41 AM, Jonas Sicking jo...@sicking.cc wrote:

 The memchr is purely overhead, i.e. we are comparing memchr+decoding
 to decoding. So I don't see what's backing up the "probably the
 fastest thing" claim.


If you don't do it as an initial pass, then you have to embed null checks
into the inner loop of your decoding algorithm.  For example, an ASCII
decoder may look like:

// char *input = input buffer
// char *input_end = one past last byte of input buffer
// wchar_t *output = output buffer
input_end = memchr(input, 0, input_end - input);
while(input < input_end)
{
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

If you don't do the initial search, then it becomes:

while(input < input_end && *input != 0)
{
    if(*input >= 0x80)
        *output++ = 0xFFFD;
    else
        *output++ = *input;
    ++input;
}

which means that you have an additional branch each time through the loop
to check for the null terminator.  That's likely to be slower than just
doing another pass.

But anyway, please either make a benchmark or two to show the differences
we're talking about, or drop performance as an argument.  This is all
just a distraction otherwise.  I don't think the speed of conversion is
even a serious issue, much less the microseconds taken by memchr.

I admit I missed the previous discussion which led to the agreement to
 keep the length measuring outside, so I don't know what arguments were
 presented. Any pointers would be appreciated.


You've already mentioned one of them: being able to tell how many bytes
were consumed.  Having a view.indexOf function is also obviously generally
useful, and it simplifies the API.

Beyond that, having a feature--whether a wrapper or a flag to the actual
decoder/encoder--that's just a shortcut for all of four or five lines of
code is just a minor convenience.  I don't think it's something so common
that we need to save people a few lines of trivial wrapper code that they
can write themselves.

  It doesn't seem materially harder (a little more code, yes, but that's
 not
  the same thing), and it's more general-purpose.

 I agree it doesn't seem materially harder. I also agree that I don't
 have data to show that it's materially slower. But it sounds like
 we're in agreement that keeping the logic outside is both harder and
 slower which honestly doesn't speak strongly in its favor.


Sorry, I'm confused--you're saying that it isn't harder, but we're in
agreement that it's harder.  Please clarify what you mean.

I don't believe it's meaningfully slower or harder.

I don't understand the argument that the alternative is more
 general-purpose. The API is already generic in that you can use
 whatever delimiter you want since you pass in a view. The only
 functionality which is not available is finding a null-terminator in
 an arraybuffer which you are arguing below shouldn't be part of the
 decoder (which I agree with).


I'm confused.  What are you arguing?  The alternative--taking the null
terminator search out of the decoder--you seem to argue against (first
sentence), then to agree with (last sentence).  Can you back up and restate
what you're saying from scratch?

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-27 Thread Kenneth Russell
On Mon, Mar 26, 2012 at 10:28 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Mon, Mar 26, 2012 at 6:11 PM, Kenneth Russell k...@google.com wrote:
 On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote:
 * We lost the ability to decode from an arraybuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.

 Agreed, but of course see above - there was consensus earlier in the thread
 that searching for null terminators should be done outside the API,
 therefore the caller will have the length handy already. Yes, this would be
 a big flaw since decoding a tightly packed data structure (e.g. array of
 null terminated strings w/o length) would be impossible with just the
 nullTerminator flag.

 Requiring callers to find the null character first, and then use that
 will require one additional pass over the encoded binary data though.
 Also, if we put the API for finding the null character on the Decoder
 object it doesn't seem like we're creating an API which is easier to
 use, just one that has moved some of the logic from the API to every
 caller.

 Though I guess the best solution would be to add methods to DataView
 which allows consuming an ArrayBuffer up to a null terminated point
 and returns the decoded string. Potentially such a method could take a
 Decoder object as argument.

 The rationale for specifying the string encoding and decoding
 functionality outside the typed array specification is to keep the
 typed array spec small and easily implementable. The indexed property
 getters and setters on the typed array views, and methods on DataView,
 are designed to be implementable with a small amount of assembly code
 in JavaScript engines. I'd strongly prefer to continue to design the
 encoding/decoding functionality separately from the typed array views.

 Is there a reason you couldn't keep the current set of functions on
 DataView implemented using a small amount of assembly code, and let
 the new functions fall back to slower C++ functions?

That's possible.

Another motivation for keeping encoding/decoding functionality
separate is that it is likely that it will require a lot of spec text,
which would dramatically increase the size of the typed array spec.
Perhaps once all of the details have been hammered out on this thread
it will be more obvious whether these methods would be much clearer if
added directly to DataView.

A couple of comments on the current StringEncoding proposal:

  - I think it should reference DataView directly rather than
ArrayBufferView. The typed array spec was specifically designed with
two use cases in mind: in-memory assembly of data to be sent to the
graphics card or audio device, where the byte order must be that of
the host architecture; and assembly of data for network transmission,
where the byte order needs to be explicit. DataView covers the latter
case.

  - It would be preferable if the encoding API had a way to avoid
memory allocation, for example to encode into a passed-in DataView.

-Ken


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-27 Thread Glenn Maynard
On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote:

   - I think it should reference DataView directly rather than
 ArrayBufferView. The typed array spec was specifically designed with
 two use cases in mind: in-memory assembly of data to be sent to the
 graphics card or audio device, where the byte order must be that of
 the host architecture;


This is wrong, broken, won't be implemented this way by any production
browser, isn't how it's used in practice, and needs to be fixed in the
spec.  It violates the most basic web API requirement: interoperability.
Please see earlier in the thread; the views affected by endianness need to
be specced as little endian.  That's what everyone is going to implement,
and what everyone's pages are going to depend on, so it's what the spec
needs to say.  Separate types should be added for big-endian (eg.
Int16BEArray).

I also disagree that it should use DataView.  Views are used to access
arrays (including strings) within larger data structures.  DataView is used
to access packed data structures, where constructing a view for each
variable in the struct is unwieldy.  It might be useful to have a helper in
DataView, but the core API should work on views.

 - It would be preferable if the encoding API had a way to avoid
 memory allocation, for example to encode into a passed-in DataView.


This was an earlier design, and discussion led to it being removed as a
premature optimization, to simplify the API.  I'd recommend reading the
rest of the thread.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-27 Thread Kenneth Russell
On Tue, Mar 27, 2012 at 6:44 PM, Glenn Maynard gl...@zewt.org wrote:
 On Tue, Mar 27, 2012 at 7:12 PM, Kenneth Russell k...@google.com wrote:

   - I think it should reference DataView directly rather than
 ArrayBufferView. The typed array spec was specifically designed with
 two use cases in mind: in-memory assembly of data to be sent to the
 graphics card or audio device, where the byte order must be that of
 the host architecture;


 This is wrong, broken, won't be implemented this way by any production
 browser, isn't how it's used in practice, and needs to be fixed in the
 spec.  It violates the most basic web API requirement: interoperability.
 Please see earlier in the thread; the views affected by endianness need to
 be specced as little endian.  That's what everyone is going to implement,
 and what everyone's pages are going to depend on, so it's what the spec
 needs to say.  Separate types should be added for big-endian (eg.
 Int16BEArray).

Thanks for your input.

The design of the typed array classes was informed by requirements
about how the OpenGL, and therefore WebGL, API works; and from prior
experience with the design and implementation of Java's New I/O Buffer
classes, which suffered from horrible performance pitfalls because of
a design similar to that which you suggest.

Production browsers already implement typed arrays with their current
semantics. It is not possible to change them and have WebGL continue
to function. I will go so far as to say that the semantics will not be
changed.

In the typed array specification, unlike Java's New I/O specification,
the API was split between two use cases: in-memory data construction
(for consumption by APIs like WebGL and Web Audio), and file and
network I/O. The API was carefully designed to avoid roadblocks that
would prevent maximum performance from being achieved for these use
cases. Experience has shown that the moment an artificial performance
barrier is imposed, it becomes impossible to build certain kinds of
programs. I consider it unacceptable to prevent developers from
achieving their goals.


 I also disagree that it should use DataView.  Views are used to access
 arrays (including strings) within larger data structures.  DataView is used
 to access packed data structures, where constructing a view for each
 variable in the struct is unwieldy.  It might be useful to have a helper in
 DataView, but the core API should work on views.

This is one point of view. The true design goal of DataView is to
supply the primitives for fast file and network input/output, where
the endianness is explicitly specified in the file format. Converting
strings to and from binary encodings is obviously an operation
associated with transfer of data to or from files or the network.
According to this taxonomy, the string encoding and decoding
operations should only be associated with DataView, and not the other
typed array types, which are designed for in-memory data assembly for
consumption by other hardware on the system.


  - It would be preferable if the encoding API had a way to avoid
 memory allocation, for example to encode into a passed-in DataView.


 This was an earlier design, and discussion led to it being removed as a
 premature optimization, to simplify the API.  I'd recommend reading the rest
 of the thread.

I do apologize for not being fully caught up on the thread, but hope
that the input above was still useful.

-Ken


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Boris Zbarsky

On 3/25/12 7:45 AM, Geoffrey Sneddon wrote:

On 21/03/12 04:31, Mark Callow wrote:


On 17/03/2012 08:19, Boris Zbarsky wrote:


I think that trying to get web developers to do this right is a lost
cause, esp. because none of them (to a good approximation) have any
big-endian systems to test on.


On what do you base this oft-repeated assertion? ARM CPUs can work
either way. I have no idea how the various licensees are actually
setting them up.


All major mobile OSes use LE on ARM — I believe we currently don't ship
anything on BE ARM. (We do, however, currently ship on BE MIPS, though
MIPS too is mostly LE nowadays).


Yep, exactly.  Sorry I missed the original mail from Mark, but Geoffrey 
is spot on: none of the licensees actually shipping anything resembling 
consumer hardware are setting up their processors to run BE, to my 
knowledge.


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Sat, Mar 24, 2012 at 6:52 AM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com
 wrote:

  Another way would be to have a second optional argument that indicates
  whether more bytes are coming (defaults to false), but I'm not sure of
 the
  chances that would be used correctly. The reasons you outline are
 probably
  why many browser implementations deal with EOF poorly too.


 It might not improve it, but I don't think it'd be worse.  If you didn't
 use it correctly for an encoding where it matters, the breakage would be
 obvious.

 Also, the previous automatically-streaming API has another possible
 misuse: constructing a single encoder, then calling it repeatedly for
 unrelated strings, without calling eof() between them (trailing bytes would
 become U+FFFD in the next string).  That'd be a less likely mistake with
 this, too.


Agreed. Simple things should be simple.


 Here's a suggestion, working from that:

 encoder = Encoder("euc-kr");
 view = encoder.encode(str1, {continues: true});
 view = encoder.encode(str2, {continues: true});
 view = encoder.encode(str3, {continues: false});

 An alternative way to end the stream:

 encoder = Encoder("euc-kr");
 view = encoder.encode(str1, {continues: true});
 view = encoder.encode(str2, {continues: true});
 view = encoder.encode(str3, {continues: true});
 view = encoder.encode(, {continues: false});
 // or view = encoder.encode(); // equivalent; continues defaults to false
 // or view = encoder.encode(); // maybe equivalent, if the first parameter
 is optional

 The simplest usage is concise enough that we don't really need a separate
 str.encode() method:

 view = Encoder("euc-kr").encode(str);

 If it has an eof() method, it'd just be a literal wrapper for
 encoder.encode(), but it can probably be omitted.


Agreed, I'd omit it.

Bikeshed: The |continues| term doesn't completely thrill me; it's clear in
context, but not necessarily what someone might go searching for.
{eof:true} would be lovely except we want the default to be yes-EOF but a
falsy JS value. |noEOF| ?

If there aren't immediate objections, I'll update my wiki draft with this
style of API, and see about updating my JS polyfill as well.

Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?

One object type is simpler for the non-streaming case, e.g.:

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere...
str = g_codec.decode(view); // okay
view = g_codec.encode(str); // fine, no state captured
str = g_codec.decode(view); // still okay

but IMHO someone unfamiliar with the internals of encodings might extend
the above into::

// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere in some stream handling code...
str = g_codec.decode(view, {continues: true}); // okay..
view = g_codec.encode(str, {continues: true}); // sure, now both an encode
and decode state are captured by codec
str = g_codec.decode(view, {continues: true}); // okay only if this is more
of the same stream; if there are two incoming streams, this is wrong

The same mistake is possible with Encoder / Decoder objects, of course (you
just need two globals). But something about separating them makes it
clearer to me that the |continues| flag is affecting state in the object
rather than just affecting the output of the call.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Anne van Kesteren
On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org  
wrote:
Bikeshed: The |continues| term doesn't completely thrill me; it's clear  
in context, but not necessarily what someone might go searching for.

{eof:true} would be lovely except we want the default to be yes-EOF but a
falsy JS value. |noEOF| ?


Peter Beverloo suggests "stream" on IRC. I like it.



Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?


Two seems cleaner.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote:

 On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org
 wrote:

 Bikeshed: The |continues| term doesn't completely thrill me; it's clear
 in context, but not necessarily what someone might go searching for.
 {eof:true} would be lovely except we want the default to be yes-EOF but a
 falsy JS value. |noEOF| ?


 Peter Beverloo suggests stream on IRC. I like it.


+1


 Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?


 Two seems cleaner.


I've gone ahead and updated the wiki/draft:
http://wiki.whatwg.org/wiki/StringEncoding

This includes:

* TextEncoder / TextDecoder objects, with |encode| and |decode| methods
that take option dicts
* A |stream| option, per the above
* A |nullTerminator| option eliminates the need for a stringLength method
(hasta la vista, baby!)
* |encodedLength| method is dropped since you can't in-place encode anyway
* decoding errors yield fallback code points by default, but setting a
|fatal| option causes a DOMException to be thrown instead
* specified exceptions as DOMException of type EncodingError, as a
placeholder

New issues resulting from this refactor:

* You can change the options (stream, nullTerminator, fatal) midway through
decoding a stream. This would be silly to do, but as written I don't think
this makes the implementation more difficult. Alternately, the non-stream
options could be set on the TextDecoder object itself.

* BOM handling needs to be resolved. The Encoding spec makes the encoding
label secondary to the BOM. With this API it's unclear if that should be
the case. Options include having a mismatching BOM throw, treating a
mismatching BOM as a decoding error (i.e. fallback or throw, depending on
options), or allow the BOM to actually switch the decoder used for this
stream - possibly if-and-only-if the default encoding was specified.

I've also partially updated the JS polyfill proof-of-concept
implementation, tests, and examples as well, but it does not implement
streaming yet (i.e. a stream option is ignored, state is always lost); I
need to do a tiny bit more refactoring first.
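
A usage sketch of the updated draft (names and option shapes as listed
above, treated as assumptions; the nullTerminator option was later dropped):

var decoder = new TextDecoder("utf-8");

// Streaming: "€" (E2 82 AC) split across two chunks.
var chunk1 = new Uint8Array([0xE2, 0x82]);
var chunk2 = new Uint8Array([0xAC]);
var text = decoder.decode(chunk1, {stream: true}) +
           decoder.decode(chunk2);            // final call flushes the state

// Fatal errors: with the fatal option, malformed input throws EncodingError
// instead of producing U+FFFD.
var bad = new Uint8Array([0xFF]);
new TextDecoder("utf-8").decode(bad, {fatal: true});   // throws per the draft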


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Glenn Maynard
On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote:

 * A |stream| option, per the above


Does this make sense when you're using stream: false to flush the stream?
It's still a streaming operation.  I guess it's close enough.

* A |nullTerminator| option eliminates the need for a stringLength method
 (hasta la vista, baby!)


I strongly disagree with this change.  It's much cleaner and more generic
for the decoding algorithm to not know anything about null terminators, and
to have separate general-purpose methods to determine the length of the
string (memchr/wmemchr analogs, which we should have anyway).  We made this
simplification a long time ago--why did you resurrect this?

array = new Int8Array(myArrayBuffer);
length = array.indexOf(0); // same semantics as String.indexOf
if(length != -1)
    array = array.subarray(0, length);
new TextDecoder('utf-8').decode(array);

* BOM handling needs to be resolved. The Encoding spec makes the encoding
 label secondary to the BOM. With this API it's unclear if that should be
 the case. Options include having a mismatching BOM throw, treating a
 mismatching BOM as a decoding error (i.e. fallback or throw, depending on
 options), or allow the BOM to actually switch the decoder used for this
 stream - possibly if-and-only-if the default encoding was specified.


The path of fewest errors is probably to have a BOM override the specified
UTF-16 endianness, so saying UTF-16BE just changes the default.


An aside:

The TypedArray constructors have a depressing design bug: new
Int8Array(someOtherView) makes a copy of the data.  It's nonsensical that
view constructors create a view when passed an ArrayBuffer, but a copy when
passed another view.  This doesn't make any kind of sense; creating a view
should create a *view* if it's passed an object that already has
ArrayBuffer-based storage, and making a copy should have been its own
operation.

This means we can't say creating a view is cheap; we have to qualify it:
creating a view is cheap, as long as you're careful not to call a
constructor that makes a copy.

It's frustrating that we're now stuck with a confusing, inconsistent API
like this.  I'm sure it's much too late to fix this properly, but hopefully
an option can be added to fix it, so a new TypedArray(TypedArray, {view:
true}) call  actually creates a view.
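
A small demonstration of the inconsistency described in this aside:

var buffer = new ArrayBuffer(4);
var a = new Uint8Array(buffer);          // view: shares storage with buffer
var b = new Uint8Array(buffer, 0, 4);    // also a view
var c = new Uint8Array(a);               // copy: new backing store

a[0] = 42;
b[0];   // 42 - b is a view, it sees the write
c[0];   // 0  - c copied the old contents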

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Jonas Sicking
On Mon, Mar 26, 2012 at 2:49 PM, Joshua Bell jsb...@chromium.org wrote:
 On Mon, Mar 26, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote:

 On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell jsb...@chromium.org
 wrote:

 Bikeshed: The |continues| term doesn't completely thrill me; it's clear
 in context, but not necessarily what someone might go searching for.
 {eof:true} would be lovely except we want the default to be yes-EOF but a
 falsy JS value. |noEOF| ?


 Peter Beverloo suggests stream on IRC. I like it.


 +1


 Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?


 Two seems cleaner.


 I've gone ahead and updated the wiki/draft:
 http://wiki.whatwg.org/wiki/StringEncoding

 This includes:

 * TextEncoder / TextDecoder objects, with |encode| and |decode| methods
 that take option dicts
 * A |stream| option, per the above
 * A |nullTerminator| option eliminates the need for a stringLength method
 (hasta la vista, baby!)
 * |encodedLength| method is dropped since you can't in-place encode anyway
 * decoding errors yield fallback code points by default, but setting a
 |fatal| option causes a DOMException to be thrown instead
 * specified exceptions as DOMException of type EncodingError, as a
 placeholder

 New issues resulting from this refactor:

 * You can change the options (stream, nullTerminator, fatal) midway through
 decoding a stream. This would be silly to do, but as written I don't think
 this makes the implementation more difficult. Alternately, the non-stream
 options could be set on the TextDecoder object itself.

 * BOM handling needs to be resolved. The Encoding spec makes the encoding
 label secondary to the BOM. With this API it's unclear if that should be
 the case. Options include having a mismatching BOM throw, treating a
 mismatching BOM as a decoding error (i.e. fallback or throw, depending on
 options), or allow the BOM to actually switch the decoder used for this
 stream - possibly if-and-only-if the default encoding was specified.

 I've also partially updated the JS polyfill proof-of-concept
 implementation, tests, and examples as well, but it does not implement
 streaming yet (i.e. a stream option is ignored, state is always lost); I
 need to do a tiny bit more refactoring first.

This looks awesome!

A few comments:

* It appears that we lost the ability to measure how long a resulting
buffer was going to be and then decode into the buffer. I don't know
if this is an issue.
* It might be a performance problem to have to check for the
fatal/nullTerminator options on each call.
* We lost the ability to decode from an arraybuffer and see how many
bytes were consumed before a null-terminator was hit. One not terribly
elegant solution would be to add a TextDecoder.decodeWithLength method
which returns a DOMString+length tuple.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 4:12 PM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell jsb...@chromium.org wrote:

 * A |stream| option, per the above


 Does this make sense when you're using stream: false to flush the stream?
 It's still a streaming operation.  I guess it's close enough.

 * A |nullTerminator| option eliminates the need for a stringLength method
 (hasta la vista, baby!)


 I strongly disagree with this change.  It's much cleaner and more generic
 for the decoding algorithm to not know anything about null terminators, and
 to have separate general-purpose methods to determine the length of the
 string (memchr/wmemchr analogs, which we should have anyway).  We made this
 simplification a long time ago--why did you resurrect this?


Ah, I'd forgotten that there was consensus that doing this outside the API
was preferable. I'll remove the option when I touch the spec again.

* BOM handling needs to be resolved. The Encoding spec makes the encoding
 label secondary to the BOM. With this API it's unclear if that should be
 the case. Options include having a mismatching BOM throw, treating a
 mismatching BOM as a decoding error (i.e. fallback or throw, depending on
 options), or allow the BOM to actually switch the decoder used for this
 stream - possibly if-and-only-if the default encoding was specified.


 The path of fewest errors is probably to have a BOM override the specified
 UTF-16 endianness, so saying UTF-16BE just changes the default.


This would apply only if the previous call had {stream: false} (implicitly or
explicitly). Calling with {stream:false} would reset for the next call.

Would it apply only to UTF-16, or to UTF-8 as well? Should there be any special
behavior when not specifying an encoding in the constructor?

On Mon, Mar 26, 2012 at 4:27 PM, Jonas Sicking jo...@sicking.cc wrote:

 A few comments:

 * It appears that we lost the ability to measure how long a resulting
 buffer was going to be and then decode into the buffer. I don't know
 if this is an issue.


True. On the plus side, the examples in the page (encode/decode
array-of-strings) didn't change size or IMHO readability at all.


 * It might be a performance problem to have to check for the
 fatal/nullTerminator options on each call.


No comment here. Moving the fatal and other options to the TextDecoder
object rather than the decode() call is a possibility. I'm not sure which I
prefer.


 * We lost the ability to decode from an arraybuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.


Agreed, but of course see above - there was consensus earlier in the thread
that searching for null terminators should be done outside the API,
therefore the caller will have the length handy already. Yes, this would be
a big flaw since decoding a tightly packed data structure (e.g. array of
null terminated strings w/o length) would be impossible with just the
nullTerminator flag.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Glenn Maynard
On Mon, Mar 26, 2012 at 6:27 PM, Jonas Sicking jo...@sicking.cc wrote:

 * It appears that we lost the ability to measure how long a resulting
 buffer was going to be and then decode into the buffer. I don't know
 if this is an issue.


The theory is that it probably isn't a real performance issue to decode
into a new buffer, then copy it where you want it.  If you think there are
any cases where it matters, we should look at it, though.

The extra GC might matter if you're doing a lot of large conversions, but
that's easily fixed by adding ArrayBuffer.close().
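
A sketch of the allocate-then-copy pattern under discussion, shown for the
encode direction (assuming encode() returns a Uint8Array view, per the draft):

var packet = new Uint8Array(64);               // pre-allocated destination

var encoded = new TextEncoder("utf-8").encode("hello");  // new buffer from the API
packet.set(encoded, 8);                        // copy it to the desired offset
// versus a hypothetical encodeInto(string, view) that would write in place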

* It might be a performance problem to have to check for the
 fatal/nullTerminator options on each call.


Are you thinking of people, say, feeding in a single byte at a time?  That
seems like it'll be slow no matter what.


On Mon, Mar 26, 2012 at 6:40 PM, Joshua Bell jsb...@chromium.org wrote:

  The path of fewest errors is probably to have a BOM override the
 specified
  UTF-16 endianness, so saying UTF-16BE just changes the default.

 This would apply only if the previous call had {stream: false} (implicitly or
 explicitly).


Right.  The following two operations should be exactly identical, for every
possible value of str and combination of options, and resulting in a
decoder in the same state:

view1 = decoder.decode(str.substr(0, 8), {stream: true});
view2 = decoder.decode(str.substr(8));
finalView = new Int8Array(view1.length + view2.length);
finalView.set(view1);
finalView.set(view2, view1.length);
return finalView;

return decoder.decode(str);

Calling with {stream:false} would reset for the next call.


Right: after a {stream:false} call, a decoder or encoder should be
equivalent to a newly-created one.

Would it apply only to UTF-16 or UTF-8 as well? Should there be any special
 behavior when not specifying an encoding in the constructor?


Do you mean, should decoding UTF-8 switch to UTF-16 if it starts with a
UTF-16 BOM?  I think that would be confusing.  If people want to autodetect
UTF-16 like that, they should probably do it themselves.  I think browsers
do this with text/html, but that's just a web-compatibility wart, not a
feature...

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Jonas Sicking
On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote:
 * We lost the ability to decode from an arraybuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.

 Agreed, but of course see above - there was consensus earlier in the thread
 that searching for null terminators should be done outside the API,
 therefore the caller will have the length handy already. Yes, this would be
 a big flaw since decoding a tightly packed data structure (e.g. array of
 null terminated strings w/o length) would be impossible with just the
 nullTerminator flag.

Requiring callers to find the null character first, and then use that
will require one additional pass over the encoded binary data though.
Also, if we put the API for finding the null character on the Decoder
object it doesn't seem like we're creating an API which is easier to
use, just one that has moved some of the logic from the API to every
caller.

Though I guess the best solution would be to add methods to DataView
which allows consuming an ArrayBuffer up to a null terminated point
and returns the decoded string. Potentially such a method could take a
Decoder object as argument.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Kenneth Russell
On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote:
 * We lost the ability to decode from an arraybuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.

 Agreed, but of course see above - there was consensus earlier in the thread
 that searching for null terminators should be done outside the API,
 therefore the caller will have the length handy already. Yes, this would be
 a big flaw since decoding a tightly packed data structure (e.g. array of
 null terminated strings w/o length) would be impossible with just the
 nullTerminator flag.

 Requiring callers to find the null character first, and then use that
 will require one additional pass over the encoded binary data though.
 Also, if we put the API for finding the null character on the Decoder
 object it doesn't seem like we're creating an API which is easier to
 use, just one that has moved some of the logic from the API to every
 caller.

 Though I guess the best solution would be to add methods to DataView
 which allows consuming an ArrayBuffer up to a null terminated point
 and returns the decoded string. Potentially such a method could take a
 Decoder object as argument.

The rationale for specifying the string encoding and decoding
functionality outside the typed array specification is to keep the
typed array spec small and easily implementable. The indexed property
getters and setters on the typed array views, and methods on DataView,
are designed to be implementable with a small amount of assembly code
in JavaScript engines. I'd strongly prefer to continue to design the
encoding/decoding functionality separately from the typed array views.

-Ken


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread John Tamplin
On Mon, Mar 26, 2012 at 9:11 PM, Kenneth Russell k...@google.com wrote:

 The rationale for specifying the string encoding and decoding
 functionality outside the typed array specification is to keep the
 typed array spec small and easily implementable. The indexed property
 getters and setters on the typed array views, and methods on DataView,
 are designed to be implementable with a small amount of assembly code
 in JavaScript engines. I'd strongly prefer to continue to design the
 encoding/decoding functionality separately from the typed array views.


However, if the browsers don't all implement this, then you can't rely on
it being there.  In apps where you compile separately for each browser, you
only pay the cost where the browser doesn't implement it (for example, in
GWT we emulate DataView and Uint8ClampedArray where it is missing).  Even
then, you may have to include both versions and do runtime detection, such
as when later versions of the browser include the functionality -- that may
be worse than simply not using the API at all if you care more about code
size than execution speed of encoding/decoding text.

So, personally I think whatever gets the most browsers to completely
implement it is better, whether that is being part of the typed arrays spec
or separate.  Logically, it seems to fit most directly in DataView.

-- 
John A. Tamplin
Software Engineer (GWT), Google


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Glenn Maynard
On Mon, Mar 26, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote:

 Requiring callers to find the null character first, and then use that
 will require one additional pass over the encoded binary data though.


That's extremely fast (memchr), and it's probably the fastest thing to do
anyway, compared to embedding null-termination logic into the inner loop of
decode functions.

Unless there's a concrete benchmark showing that it's slower, and slower
enough to actually matter, this shouldn't be a consideration.  It's a
premature optimization.

Also, if we put the API for finding the null character on the Decoder
 object it doesn't seem like we're creating an API which is easier to
 use, just one that has moved some of the logic from the API to every
 caller.


It doesn't seem materially harder (a little more code, yes, but that's not
the same thing), and it's more general-purpose.  The API for finding the
character doesn't belong on Decoder.  It should probably go on each View
type, analogous to String.indexOf.  Multi-byte views should search on the
view's size; eg. Int16Array.indexOf(i) maps to wmemchr.
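
A sketch of such a per-view search (indexOf here is the proposed/assumed
method, searching whole elements rather than bytes):

// UTF-16 code units with a 16-bit null terminator:
var units = new Uint16Array([0x0048, 0x0069, 0x0000, 0x1234]);   // "Hi", NUL, junk

var end = units.indexOf(0);               // wmemchr analog: compares 16-bit elements
var strUnits = units.subarray(0, end);    // code units of the string, terminator excluded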

Though I guess the best solution would be to add methods to DataView
 which allows consuming an ArrayBuffer up to a null terminated point
 and returns the decoded string. Potentially such a method could take a
 Decoder object as argument.


I guess.  It doesn't seem that important, since it's just a few lines of
code.  If this is done, I'd suggest that this helper API *not* have any
special support for streaming (not to disallow it, but not to have any
special handling for it, either).  I think streaming has little overlap
with null-terminated fields, since null-termination is typically used with
fixed-size buffers.  It would complicate things; for example, you'd need
some way to signal to the caller that a null terminator was encountered.

That is, it'd basically look like:

function decodeNullTerminated(decoder, options)
{
    // Create the correct array type, so array.find and array.subarray work
    // in 16-bit for UTF-16.
    var arrayType = (decoder.encoding.toLowerCase() == 'utf-16le' ||
        decoder.encoding.toLowerCase() == 'utf-16be')? Int16Array:Int8Array;
    var array = new arrayType(this.buffer, this.byteOffset, this.byteLength);
    var terminator = array.find(0);
    if(terminator != -1)
        array = array.subarray(0, terminator);
    return decoder.decode(array, options);
}

which doesn't specifically prohibit options including {stream: true}, but
doesn't attempt to make it useful.

(Side note: If you have null-terminated strings, you're almost always
dealing with only multibyte encodings like UTF-8, or only wide encodings
like UTF-16, so you'd just use the appropriate type.  That is, the minor
complication of the first line above isn't something that users would
normally actually need to do.)


On Mon, Mar 26, 2012 at 8:11 PM, Kenneth Russell k...@google.com wrote:

 The rationale for specifying the string encoding and decoding
 functionality outside the typed array specification is to keep the
 typed array spec small and easily implementable. The indexed property
 getters and setters on the typed array views, and methods on DataView,
 are designed to be implementable with a small amount of assembly code
 in JavaScript engines. I'd strongly prefer to continue to design the
 encoding/decoding functionality separately from the typed array views.


It doesn't need to go into the Typed Array spec.  It can just be an
addition to the interface provided by an external specification, which
doesn't need to be implemented to implement typed arrays itself.

I don't think it's an important thing to have, but this in particular
doesn't seem like a problem.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Joshua Bell
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote:

 I guess.  It doesn't seem that important, since it's just a few lines of
 code.  If this is done, I'd suggest that this helper API *not* have any
 special support for streaming (not to disallow it, but not to have any
 special handling for it, either).  I think streaming has little overlap
 with null-terminated fields, since null-termination is typically used with
 fixed-size buffers.  It would complicate things; for example, you'd need
 some way to signal to the caller that a null terminator was encountered.


Agreed.

Also worth relaying to this thread is that in addition to null termination
there have been requests for other terminators, such as 0xFF which is an
invalid byte in a UTF-8 stream and thus a lovely terminator. Other byte
sequences were mentioned. (This was over in the Khronos WebGL list for
anyone who wants to dig it up. It was tracked as an unresolved ISSUE in the
spec.)

This supports the assertion that we should not special case null
terminators, but instead provide general (and highly optimizable) utilities
like memchr operating on buffers, since we can't anticipate every usage in
higher-level APIs like the one under discussion.
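
A sketch of the general-terminator case, again assuming a view indexOf():

// 0xFF never appears in well-formed UTF-8, so it works as a terminator.
var bytes = new Uint8Array([0x68, 0x69, 0xFF, 0x00, 0x00]);   // "hi", then 0xFF

var end = bytes.indexOf(0xFF);
var text = new TextDecoder("utf-8").decode(
    end == -1 ? bytes : bytes.subarray(0, end));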


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Jonas Sicking
On Mon, Mar 26, 2012 at 6:11 PM, Kenneth Russell k...@google.com wrote:
 On Mon, Mar 26, 2012 at 5:33 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Mon, Mar 26, 2012 at 4:40 PM, Joshua Bell jsb...@chromium.org wrote:
 * We lost the ability to decode from an arraybuffer and see how many
 bytes were consumed before a null-terminator was hit. One not terribly
 elegant solution would be to add a TextDecoder.decodeWithLength method
 which returns a DOMString+length tuple.

 Agreed, but of course see above - there was consensus earlier in the thread
 that searching for null terminators should be done outside the API,
 therefore the caller will have the length handy already. Yes, this would be
 a big flaw since decoding a tightly packed data structure (e.g. array of
 null terminated strings w/o length) would be impossible with just the
 nullTerminator flag.

 Requiring callers to find the null character first, and then use that
 will require one additional pass over the encoded binary data though.
 Also, if we put the API for finding the null character on the Decoder
 object it doesn't seem like we're creating an API which is easier to
 use, just one that has moved some of the logic from the API to every
 caller.

 Though I guess the best solution would be to add methods to DataView
 which allows consuming an ArrayBuffer up to a null terminated point
 and returns the decoded string. Potentially such a method could take a
 Decoder object as argument.

 The rationale for specifying the string encoding and decoding
 functionality outside the typed array specification is to keep the
 typed array spec small and easily implementable. The indexed property
 getters and setters on the typed array views, and methods on DataView,
 are designed to be implementable with a small amount of assembly code
 in JavaScript engines. I'd strongly prefer to continue to design the
 encoding/decoding functionality separately from the typed array views.

Is there a reason you couldn't keep the current set of functions on
DataView implemented using a small amount of assembly code, and let
the new functions fall back to slower C++ functions?

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-26 Thread Jonas Sicking
On Mon, Mar 26, 2012 at 6:24 PM, Glenn Maynard gl...@zewt.org wrote:
 On Mon, Mar 26, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote:

 Requiring callers to find the null character first, and then use that
 will require one additional pass over the encoded binary data though.


 That's extremely fast (memchr), and it's probably the fastest thing to do
 anyway, compared to embedding null-termination logic into the inner loop of
 decode functions.

The memchr is purely overhead, i.e. we are comparing memchr+decoding
to decoding alone. So I don't see what's backing up the "probably the
fastest thing" claim.

 Unless there's a concrete benchmark showing that it's slower, and slower
 enough to actually matter, this shouldn't be a consideration.  It's a
 premature optimization.

My argument is that it's both faster and more author friendly.

I admit I missed the previous discussion which led to the agreement to
keep the length measuring outside, so I don't know what arguments were
presented. Any pointers would be appreciated.

 Also, if we put the API for finding the null character on the Decoder
 object it doesn't seem like we're creating an API which is easier to
 use, just one that has moved some of the logic from the API to every
 caller.

 It doesn't seem materially harder (a little more code, yes, but that's not
 the same thing), and it's more general-purpose.

I agree it doesn't seem materially harder. I also agree that I don't
have data to show that it's materially slower. But it sounds like
we're in agreement that keeping the logic outside is both harder and
slower, which honestly doesn't speak strongly in its favor.

I don't understand the argument that the alternative is more
general-purpose. The API is already generic in that you can use
whatever delimiter you want since you pass in a view. The only
functionality which is not available is finding a null-terminator in
an arraybuffer, which you are arguing below shouldn't be part of the
decoder (which I agree with).

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-24 Thread Glenn Maynard
On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren ann...@opera.com wrote:

 Another way would be to have a second optional argument that indicates
 whether more bytes are coming (defaults to false), but I'm not sure of the
 chances that would be used correctly. The reasons you outline are probably
 why many browser implementations deal with EOF poorly too.


It might not improve it, but I don't think it'd be worse.  If you didn't
use it correctly for an encoding where it matters, the breakage would be
obvious.

Also, the previous automatically-streaming API has another possible
misuse: constructing a single encoder, then calling it repeatedly for
unrelated strings, without calling eof() between them (trailing bytes would
become U+FFFD in the next string).  That'd be a less likely mistake with
this, too.

Here's a suggestion, working from that:

encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: false});

An alternative way to end the stream:

encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: true});
view = encoder.encode("", {continues: false});
// or view = encoder.encode(); // equivalent; continues defaults to false
// or view = encoder.encode(); // maybe equivalent, if the first parameter
is optional

The simplest usage is concise enough that we don't really need a separate
str.encode() method:

view = Encoder("euc-kr").encode(str);

If it has an eof() method, it'd just be a literal wrapper for
encoder.encode(), but it can probably be omitted.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread Anne van Kesteren
On Thu, 22 Mar 2012 03:12:25 +0100, Mark Callow callow_m...@hicorp.co.jp  
wrote:

This has encode and decode reversed from my understanding. I regard the
string (wide-char) as the canonical form and the bytes as the encoded
form. This view is reflected in the widely used terminology "charset
encodings", which refers to the likes of euc-kr and shift_jis.


Yeah, I suspect we'll get it right once put in a draft :-)


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread Anne van Kesteren
On Wed, 21 Mar 2012 16:53:36 +0100, Joshua Bell jsb...@chromium.org  
wrote:

Just to throw it out there - does anyone feel we can/should offer
asymmetric encode/decode support, i.e. supporting more encodings for  
decode operations than for encode operations?


XMLHttpRequest has that. You can only send (encode) UTF-8, receive  
(decode) everything. Forms can send everything. URL query parameters  
can encode everything (though the page itself has to be encoded in the  
encoding of choice).


If we have no use cases, just supporting encoding UTF-8 seems fine to me,  
but I think the design should allow for other encodings in the future.




Bikeshedding on the name - we'd have to put String or Text in there
somewhere, since audio/video/image codecs will likely want to use similar
terms.


They can use the prefixed variants :-) If we have to use a prefix, String  
seems better, as Text is a node object in the platform.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread Anne van Kesteren
On Thu, 22 Mar 2012 10:19:30 +0100, Anne van Kesteren ann...@opera.com  
wrote:
They can use the prefixed variants :-) If we have to use a prefix  
String seems better, as Text is a node object in the platform.


Simon pointed out Text as prefix is probably better (it is used elsewhere  
in the platform unrelated to nodes (e.g. TextTrack)), though I'd  
personally prefer simply Decoder/Encoder.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread James Graham

On 03/21/2012 04:53 PM, Joshua Bell wrote:


As for the API, how about:


  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

And similarly you would have

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

Or alternatively you could have a single object that exposes both encode()
and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  bytes1  = enc.decode(string1)
  string2 = enc.encode(bytes2)




I don't mind this API for complex use cases, e.g. streaming, but it is 
massive overkill for the simple common case of "I have a list of bytes 
that I want to decode to a string" or "I have a string that I want to 
encode to bytes". For those cases I strongly prefer the earlier API 
along the lines of


String.prototype.encode(encoding)
ArrayBufferView.prototype.decode(encoding)
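
(Illustrative usage of that shape, assuming those prototype methods existed:)

  var bytes = "año".encode("utf-8");   // hypothetical; an ArrayBufferView of 4 bytes
  var str = bytes.decode("utf-8");     // hypothetical; back to "año"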


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread Glenn Maynard
On Wed, Mar 21, 2012 at 2:42 PM, Anne van Kesteren ann...@opera.com wrote:

 As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine


A problem with this is that the bugs resulting from not calling eof() are
subtle.  The only thing eof() would ever do, I think, is return U+FFFD
characters if there are leftover characters in the internal buffer; if you
never call eof(), you'll never get incorrect results unless you test with
invalid inputs.

It's minor, as subtle-edge-cases-that-people-won't-get-right go, but it's
at least worth a mention.  Maybe people who would use this API instead of
the simpler non-streaming version (which could be a thin wrapper on this)
in the first place are also more likely to get this right.

I'm guessing a common, incorrect pattern would be:

string = new Encoder("euc-kr").encode(bytes);

which would *not* be equivalent to bytes.encode("euc-kr").
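
(Keeping the thread's encode-bytes-to-string naming at this point, the
correct equivalent would presumably be the two-step form, sketched here
purely for contrast:)

  var enc = new Encoder("euc-kr");
  var string = enc.encode(bytes) + enc.eof();  // bytes: same input as above;
                                               // eof() flushes any buffered trailing bytes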

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread Anne van Kesteren

On Thu, 22 Mar 2012 14:47:05 +0100, Glenn Maynard gl...@zewt.org wrote:

A problem with this is that the bugs resulting from not calling eof() are
subtle.  The only thing eof() would ever do, I think, is return U+FFFD
characters if there are leftover characters in the internal buffer; if  
you

never call eof(), you'll never get incorrect results unless you test with
invalid inputs.

It's minor, as subtle-edge-cases-that-people-won't-get-right go, but it's
at least worth a mention.  Maybe people who would use this API instead of
the simpler non-streaming version (which could be a thin wrapper on this)
in the first place are also more likely to get this right.

I'm guessing a common, incorrect pattern would be:

string = new Encoder("euc-kr").encode(bytes);

which would *not* be equivalent to bytes.encode("euc-kr").


Another way would be to have a second optional argument that indicates  
whether more bytes are coming (defaults to false), but I'm not sure of the  
chances that would be used correctly. The reasons you outline are probably  
why many browser implementations deal with EOF poorly too.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread NARUSE, Yui
2012/3/22 Anne van Kesteren ann...@opera.com:
 As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

 And similarly you would have

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

 Or alternatively you could have a single object that exposes both encode()
 and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  bytes1  = enc.decode(string1)
  string2 = enc.encode(bytes2)

Usually, strings are encoded to bytes.
Therefore that encode/decode methods should be reversed like:

 enc = new Encoding("gb18030")
 bytes1  = enc.encode(string1)
 string2 = enc.decode(bytes2)

Or if it may cause confusion, use getBytes/getChars like Java and C#.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
http://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx#Y1873
http://msdn.microsoft.com/en-us/library/system.text.Decoder(v=vs.110).aspx#Y1873

 enc = new Encoding("gb18030")
 bytes1  = enc.getBytes(string1)
 string2 = enc.getChars(bytes2)

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Jonas Sicking
On Tue, Mar 20, 2012 at 10:39 AM, Joshua Bell jsb...@chromium.org wrote:
 On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote:

 Why are encodings different than other parts of the API where you

 indeed have to know what works and what doesn't.


 Do you memorize lists of encodings?  I certainly don't.  I look them up as
 needed.

 UTF8 is stateful, so I disagree.


 No, UTF-8 doesn't require a stateful decoder to support streaming.  You
 decode up to the last codepoint that you can decode completely.  The return
 values are the output data, the number of bytes output, and the number of
 bytes consumed; that's all you need to restart decoding later.  That's the
 iconv(3) approach that we're probably all familiar with, which works with
 almost all encodings.

 ISO-2022 encodings are stateful: you have to persistently remember the
 character subsets activated by earlier escape sequences.  An iconv-like
 streaming API is impossible; to support streamed decoding, you'd need to
 have a decoder object that the user keeps around in order to store that
 state.  http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure


 Which seems like it leaves us with these options:

 1. Only support encodings with stateless coding (possibly down to a minimum
 of UTF-8)
 2. Only provide an API supporting non-streaming coding (i.e. whole
 strings/whole buffers)
 3. Expand the API to return encoder/decoder objects that capture state

I'm pretty sure there is consensus for supporting UTF8. UTF8 is
stateful, though it can be made not stateful by not consuming all
characters and instead forcing the caller to keep the state (in the
form of unconsumed text).

So I would rephrase your 3 options above as:

1) Create an API which forces consumers to do state handling. Probably
leading to people creating wrappers which essentially implement option
3
2) Don't support streaming
3) Have encoder/decoder objects which hold state

I personally don't think 1 is a good option since it's basically the
same as 3 but just with libraries doing some of the work. We might as
well do that work so that libraries aren't needed.

This leaves us with 2 or 3. So the question is if we should support
streaming or not. I suspect doing so would be worth it.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread NARUSE, Yui
2012/3/21 Glenn Maynard gl...@zewt.org:
 On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote:

 1. Only support encodings with stateless coding (possibly down to a minimum
 of UTF-8)
 2. Only provide an API supporting non-streaming coding (i.e. whole
 strings/whole buffers)
 3. Expand the API to return encoder/decoder objects that capture state

 Any others?

 Trying to simplify the problem but take on both (1) and (2) without (3)
 would lead to an API that could not encompass (3) in the future, which
 would be a mistake.

 I don't think that's obviously a mistake.  Only the nastiest, wartiest of
 legacy encodings require it.

The categories feel strange.

If the conversion is not streaming (whole strings/whole buffers), its
implementation should simply be a wrapper around the browser's
conversion functions.
There is no need for a state object to save the state, because the conversion
is done when the function completes, even if it is a stateful encoding.

For streaming conversion, it needs state even if the encoding is stateless.
When the given partial input ends in the middle of a character,
like \xE3\x81\x82\xC2, the conversion consumes 4 bytes, outputs one character,
\u3042, and remembers the partial byte \xC2. That byte is the state.

 That said, it's fairly simple to later return an additional state object
 from the previously proposed streaming APIs, eg.

 result = decode(str, 0, outputView)
 // result.outputBytes == 15
 // result.nextInputByte == 5
 // result.state == opaque object

 result2 = decode(str, result.nextInputByte, outputView, {state:
 result.state});

You can refer to mbsrtowcs(3) and its length-limited variant mbsnrtowcs(3),
which convert a character string to a wide-character string (restartable).
They use an opaque state.
size_t mbsnrtowcs(wchar_t *restrict dst, const char **restrict src,
   size_t nmc, size_t len, mbstate_t *restrict ps);
http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbsrtowcs.html

Anyway, they need to signal an error if the byte sequence is invalid for the encoding.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread NARUSE, Yui
2012/3/21 Jonas Sicking jo...@sicking.cc:
 I'm pretty sure there is consensus for supporting UTF8. UTF8 is
 stateful though can be made not stateful by not consuming all
 characters and instead forcing the caller to keep the state (in the
 form of unconsumed text).

Your use of the word "stateful" involves a misunderstanding.
Usually the term "stateful encoding" means that the encoding keeps state
between characters, not bytes.
What you mean is usually expressed by the word "multibyte".
UTF-8 is a multibyte encoding, and it needs to keep state when streaming.

 So I would rephrase your 3 options above as:

 1) Create an API which forces consumers to do state handling. Probably
 leading to people creating wrappers which essentially implement option
 3
 2) Don't support streaming
 3) Have encoder/decoder objects which hold state

 I personally don't think 1 is a good option since it's basically the
 same as 3 but just with libraries doing some of the work. We might as
 well do that work so that libraries aren't needed.

 This leaves us with 2 or 3. So the question is if we should support
 streaming or not. I suspect doing so would be worth it.

I think it should provide a non-streaming API.
And if there are concrete use cases, provide a streaming API as a separate one.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Anne van Kesteren

On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc wrote:

This leaves us with 2 or 3. So the question is if we should support
streaming or not. I suspect doing so would be worth it.


For XMLHttpRequest it might be, yes.

I think we should expose the same encoding set throughout the platform.  
One reason to limit the encoding set initially might be because we have  
not all converged yet on our encoding sets. Gecko, Safari, and Internet  
Explorer expose a lot more encodings than Opera and Chrome.


As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

And similarly you would have

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

Or alternatively you could have a single object that exposes both encode()  
and decode() and tracks state for both:


  enc = new Encoding("gb18030")
  bytes1  = enc.decode(string1)
  string2 = enc.encode(bytes2)


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Glenn Maynard
On Wed, Mar 21, 2012 at 3:27 AM, Jonas Sicking jo...@sicking.cc wrote:

 1) Create an API which forces consumers to do state handling. Probably
 leading to people creating wrappers which essentially implement option
 3


It's not the same.  Please look at how ISO-2022 works: the stream has
*long-lived* state, with escape sequences that change the meaning of later
code sequences in the stream.  For example, you have to remember whether GR
is encoding G1, G2 or G3.  This can't be stored merely by remembering the
next input byte you have to start at.

As Yui said, the sort of state UTF-8 has isn't what people mean when we
talk about stateful encodings.

On Wed, Mar 21, 2012 at 3:34 AM, NARUSE, Yui nar...@airemix.jp wrote:

 For streaming conversion, it needs state even if the encoding is stateless.
 When the given partial input ends in the middle of a character,
 like \xE3\x81\x82\xC2, the conversion consumes 4 bytes, outputs one
 character,
 \u3042, and remembers the partial byte \xC2. That byte is the state.


You don't need to do that.  You can simply convert as many output
codepoints as can be *completely* converted.  In this example, you'd
consume 3 bytes and output one codepoint.  You don't consume data that you
can't immediately convert, so you don't have to buffer anything.

(We don't have to do it that way, of course; just pointing out that you
don't *need* special state for streaming encodings like UTF-8.)
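
A minimal sketch of that restartable shape for UTF-8 (illustrative only;
it leans on the proposed whole-buffer TextDecoder for the actual decode,
and ignores invalid sequences):

  // Declared length of a UTF-8 sequence from its lead byte; 0 for
  // continuation or invalid lead bytes.
  function utf8SeqLen(b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
  }

  // Decode only complete sequences; the caller re-offers the unconsumed
  // tail together with the next chunk.
  function decodeUTF8Chunk(bytes) {  // bytes: Uint8Array
    var end = bytes.length;
    for (var i = Math.max(0, end - 3); i < end; i++) {
      var n = utf8SeqLen(bytes[i]);
      if (n > 1 && i + n > end) { end = i; break; }  // sequence runs past the chunk
    }
    return { string: new TextDecoder("utf-8").decode(bytes.subarray(0, end)),
             bytesConsumed: end };
  }

  // decodeUTF8Chunk(new Uint8Array([0xE3, 0x81, 0x82, 0xC2]))
  //   -> { string: "\u3042", bytesConsumed: 3 }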

Anyway, they need error if the byte sequence is invalid for the encoding.


Errors were discussed previously: by default errors output U+FFFD (or
another replacement character, for encoding unsupported characters to
non-Unicode encodings), and we may have an option to turn it into an
exception.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Joshua Bell
On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesteren ann...@opera.com wrote:

 On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc
 wrote:

 This leaves us with 2 or 3. So the question is if we should support
 streaming or not. I suspect doing so would be worth it.


 For XMLHttpRequest it might be, yes.

 I think we should expose the same encoding set throughout the platform.
 One reason to limit the encoding set initially might be because we have not
 all converged yet on our encoding sets. Gecko, Safari, and Internet
 Explorer expose a lot more encodings than Opera and Chrome.


Just to throw it out there - does anyone feel we can/should offer
asymmetric encode/decode support, i.e. supporting more encodings for decode
operations than for encode operations?

As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

 And similarly you would have

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

 Or alternatively you could have a single object that exposes both encode()
 and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  bytes1  = enc.decode(string1)
  string2 = enc.encode(bytes2)


That's the direction my thinking was headed. Glenn pointed out that the
state that's implicitly captured in the above objects could instead be
returned as an explicit but opaque state object that's passed in and out of
stateless functions. As a potential user of the API, I find the above
object-oriented style easier to understand.

Re: Encoding object vs. an Encoder/Decoder pair - I'd prefer the latter as
it makes the state being captured and any methods/attributes to interrogate
the state clearer.
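
(For contrast with the object style quoted above, the explicit-state
alternative would look roughly like the following; the function name,
option keys and return shape are all illustrative, not proposed spec text:)

  // Stateless function; the opaque state is threaded through by the caller.
  var r1 = decode(bytes1, {encoding: "euc-kr", state: null});
  var r2 = decode(bytes2, {encoding: "euc-kr", state: r1.state});
  var string = r1.string + r2.string;  // leftover state in r2 would be flushed/checked at end of input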

Bikeshedding on the name - we'd have to put String or Text in there
somewhere, since audio/video/image codecs will likely want to use similar
terms.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Charles Pritchard

On 3/21/2012 8:53 AM, Joshua Bell wrote:

On Wed, Mar 21, 2012 at 12:42 PM, Anne van Kesteren ann...@opera.com wrote:


  On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking jo...@sicking.cc
  wrote:


  This leaves us with 2 or 3. So the question is if we should support
  streaming or not. I suspect doing so would be worth it.



  For XMLHttpRequest it might be, yes.

  I think we should expose the same encoding set throughout the platform.
  One reason to limit the encoding set initially might be because we have not
  all converged yet on our encoding sets. Gecko, Safari, and Internet
  Explorer expose a lot more encodings than Opera and Chrome.


Just to throw it out there - does anyone feel we can/should offer
asymmetric encode/decode support, i.e. supporting more encodings for decode
operations than for encode operations?


In the past decade I've never had to encode into something other than 
UTF-8. I have had to decode many encoding sets.


If I did need to do a special encoding, given the state of typed arrays, 
I'd probably just implement the encoding in JS.


+1 for asymmetric from my experience.

-Charles







Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread Mark Callow

On 22/03/2012 04:42, Anne van Kesteren wrote:
 ...

 As for the API, how about:

   enc = new Encoder("euc-kr")
   string1 = enc.encode(bytes1)
   string2 = enc.encode(bytes2)
   string3 = enc.eof() // might return empty string if all is fine

 And similarly you would have

   dec = new Decoder("shift_jis")
   bytes = dec.decode(string)

 Or alternatively you could have a single object that exposes both
 encode() and decode() and tracks state for both:

   enc = new Encoding("gb18030")
   bytes1  = enc.decode(string1)
   string2 = enc.encode(bytes2)

This has encode and decode reversed from my understanding. I regard the
string (wide-char) as the canonical form and the bytes as the encoded
form. This view is reflected in the widely used terminology "charset
encodings", which refers to the likes of euc-kr and shift_jis.

Regards

-Mark


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-20 Thread Glenn Maynard
On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote:

 Why are encodings different than other parts of the API where you

indeed have to know what works and what doesn't.


Do you memorize lists of encodings?  I certainly don't.  I look them up as
needed.

UTF8 is stateful, so I disagree.


No, UTF-8 doesn't require a stateful decoder to support streaming.  You
decode up to the last codepoint that you can decode completely.  The return
values are the output data, the number of bytes output, and the number of
bytes consumed; that's all you need to restart decoding later.  That's the
iconv(3) approach that we're probably all familiar with, which works with
almost all encodings.

ISO-2022 encodings are stateful: you have to persistently remember the
character subsets activated by earlier escape sequences.  An iconv-like
streaming API is impossible; to support streamed decoding, you'd need to
have a decoder object that the user keeps around in order to store that
state.  http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-20 Thread Joshua Bell
On Tue, Mar 20, 2012 at 7:26 AM, Glenn Maynard gl...@zewt.org wrote:

 On Mon, Mar 19, 2012 at 11:52 PM, Jonas Sicking jo...@sicking.cc wrote:

 Why are encodings different than other parts of the API where you

 indeed have to know what works and what doesn't.


 Do you memorize lists of encodings?  I certainly don't.  I look them up as
 needed.

 UTF8 is stateful, so I disagree.


 No, UTF-8 doesn't require a stateful decoder to support streaming.  You
 decode up to the last codepoint that you can decode completely.  The return
 values are the output data, the number of bytes output, and the number of
 bytes consumed; that's all you need to restart decoding later.  That's the
 iconv(3) approach that we're probably all familiar with, which works with
 almost all encodings.

 ISO-2022 encodings are stateful: you have to persistently remember the
 character subsets activated by earlier escape sequences.  An iconv-like
 streaming API is impossible; to support streamed decoding, you'd need to
 have a decoder object that the user keeps around in order to store that
 state.  http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure


Which seems like it leaves us with these options:

1. Only support encodings with stateless coding (possibly down to a minimum
of UTF-8)
2. Only provide an API supporting non-streaming coding (i.e. whole
strings/whole buffers)
3. Expand the API to return encoder/decoder objects that capture state

Any others?

Trying to simplify the problem but take on both (1) and (2) without (3)
would lead to an API that could not encompass (3) in the future, which
would be a mistake.

I'll throw out that the in-progress design of a Globalization API for
ECMAScript -
http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ -
is currently spec'd to both build on the existing locale-aware methods on
String/Number/Date prototypes as conveniences, as well as introducing the
Collator and *Format objects.

Should we start with UTF-8-only/non-streaming methods on
DOMString/ArrayBufferView, and avoid constraining a future API supporting
multiple, possibly stateful encodings and streaming?


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-20 Thread Glenn Maynard
On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote:

 1. Only support encodings with stateless coding (possibly down to a minimum
 of UTF-8)
 2. Only provide an API supporting non-streaming coding (i.e. whole
 strings/whole buffers)
 3. Expand the API to return encoder/decoder objects that capture state

 Any others?

 Trying to simplify the problem but take on both (1) and (2) without (3)
 would lead to an API that could not encompass (3) in the future, which
 would be a mistake.


I don't think that's obviously a mistake.  Only the nastiest, wartiest of
legacy encodings require it.

That said, it's fairly simple to later return an additional state object
from the previously proposed streaming APIs, eg.

result = decode(str, 0, outputView)
// result.outputBytes == 15
// result.nextInputByte == 5
// result.state == opaque object

result2 = decode(str, result.nextInputByte, outputView, {state:
result.state});

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-20 Thread Mark Callow

On 17/03/2012 08:19, Boris Zbarsky wrote:

 I think that trying to get web developers to do this right is a lost
 cause, esp. because none of them (to a good approximation) have any
 big-endian systems to test on.

On what do you base this oft-repeated assertion? ARM CPUs can work
either way. I have no idea how the various licensees are actually
setting them up.

Regards

-Mark


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Henri Sivonen
On Wed, Mar 14, 2012 at 12:49 AM, Jonas Sicking jo...@sicking.cc wrote:
 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

It saddens me that this allows non-UTF-8 encodings. However, since use
cases for non-UTF-8 encodings were mentioned in this thread, I suggest
that the set of supported encodings be an enumerated set of encodings
stated in a spec and browsers MUST NOT support other encodings. The
set should probably be the set offered in the encoding popup at
http://validator.nu/?charset or a subset thereof (containing at least
UTF-8 of course). (That set was derived by researching the
intersection of the encodings supported by browsers, Python and the
JDK.)

 would go a very long way.

Are you sure that it's not necessary to support streaming conversion?
The suggested API design assumes you always have the entire data
sequence in a single DOMString or ArrayBufferView.

 The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

If we deem streaming conversion unnecessary, I'd put the methods on
DOMString and ArrayBufferView. It would be terribly sad to let the
schedules of various working groups affect the API design.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Jonas Sicking
On Mon, Mar 19, 2012 at 7:00 AM, Henri Sivonen hsivo...@iki.fi wrote:
 On Wed, Mar 14, 2012 at 12:49 AM, Jonas Sicking jo...@sicking.cc wrote:
 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

 It saddens me that this allows non-UTF-8 encodings. However, since use
 cases for non-UTF-8 encodings were mentioned in this thread, I suggest
 that the set of supported encodings be an enumerated set of encodings
 stated in a spec and browsers MUST NOT support other encodings. The
 set should probably be the set offered in the encoding popup at
 http://validator.nu/?charset or a subset thereof (containing at least
 UTF-8 of course). (That set was derived by researching the
 intersection of the encodings supported by browsers, Python and the
 JDK.)

Yes, I think we should enumerate the set of encodings supported.
Ideally we'd for simplicity support the same set of enumerated
encodings everywhere in the platform and over time try to shrink that
set.

 would go a very long way.

 Are you sure that it's not necessary to support streaming conversion?
 The suggested API design assumes you always have the entire data
 sequence in a single DOMString or ArrayBufferView.

 The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

 If we deem streaming conversion unnecessary, I'd put the methods on
 DOMString and ArrayBufferView. It would be terribly sad to let the
 schedules of various working groups affect the API design.

Streaming is a very good question. I hadn't thought about that.
Especially now that we have chunked ArrayBuffer support in XHR,
streaming would seem like a much more interesting request.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Glenn Maynard
On Mon, Mar 19, 2012 at 12:46 PM, Joshua Bell jsb...@chromium.org wrote:

 I have edited the proposal to base the list of encodings on

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html - is there any
 reason that would not be sufficient or appropriate? (this appears to be a
 superset of the validator.nu/?charset list, with only a small number of
 additional encodings)


There are lots of encodings in that list which browsers need to support for
legacy text/html content, which are probably completely unnecessary here.
 People may be storing Shift-JIS text in ID3 tags, but I doubt they're
doing that with ISO-2022-JP.

I'm undecided about legacy encodings in general, but that aside, I'd start
from just [UTF-8], and add to the list based on concrete use cases.
 Don't start from the whole list and try to pare it down.

I wonder if we can't limit the damage of extending more support to legacy
encodings.  We have a use case for decoding legacy charsets (ID3 tags), but
do we have any use cases for encoding to them?  If you're writing back
changed ID3 tags, you should be writing them back as ID3v2 (which is all
most tagging software writes to now), which uses UTF-8.

On Mon, Mar 19, 2012 at 5:54 PM, Jonas Sicking jo...@sicking.cc wrote:

 Yes, I think we should enumerate the set of encodings supported.
 Ideally we'd for simplicity support the same set of enumerated
 encodings everywhere in the platform and over time try to shrink that
 set.


Shrinking the set supported for HTML will be much harder than keeping this
set small to begin with.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Jonas Sicking
On Mon, Mar 19, 2012 at 5:10 PM, Glenn Maynard gl...@zewt.org wrote:
 On Mon, Mar 19, 2012 at 5:54 PM, Jonas Sicking jo...@sicking.cc wrote:

 Yes, I think we should enumerate the set of encodings supported.
 Ideally we'd for simplicity support the same set of enumerated
 encodings everywhere in the platform and over time try to shrink that
 set.

 Shrinking the set supported for HTML will be much harder than keeping this
 set small to begin with.

What value are we adding, and to whom, by keeping the list the
smallest it can be, even when that means keeping the lists of
supported encodings different between different APIs?

The concrete costs are that authors will have to learn which encodings
work where, and that implementations need to keep separate lists of
supported encodings in different APIs.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Glenn Maynard
On Mon, Mar 19, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote:

 What value are we adding, and to whom, by keeping the list the
 smallest it can be, even when that means keeping the lists of
 supported encodings different between different APIs?


Not needlessly extending support for legacy encodings means there's no
chance of this API inadvertently causing proliferation of those encodings.
That benefits everyone who might come in contact with that data, and
increases the odds of being able to remove some of those encodings from the
platform entirely.

The concrete costs are that authors will have to learn which encodings
 work where, and that implementations need to keep separate lists of
 supported encodings in different APIs.


Authors don't need to learn that; all they care about is if the encoding
they're trying to use works.  Nobody memorizes lists of encodings.

Keeping a list of supported encodings is a trivial cost.

It also means that browsers need to be able to encode to each of these
encodings, and encoding for all of them needs to be specified, which I
think is currently unneeded.  (Unless we go the asymmetric
encoding/decoding route, supporting only decoders for legacy charsets.  If
this is the only reason that'd all have to be specified, that's probably
another reason to consider it...)

Supporting streaming decoding for modal encodings, such as ISO-2022-CN,
might also be a burden: it means implementations would be required to
support stateful, incremental decoding for that charset, which is more
complicated than most encodings (which are stateless).  Many
implementations probably do support that, but I don't think it's currently
mandatory, and it would complicate any streaming API.  Stateful encodings
need to die even more than other legacy encodings; I hope this API doesn't
have to support any of them.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Glenn Maynard
On Mon, Mar 19, 2012 at 8:14 PM, Glenn Maynard gl...@zewt.org wrote:

 If this is the only reason that'd all have to be specified, that's
 probably another reason to consider it...


(Well, there's form data either way.  At least encoding is probably easier
to spec, since it only has to deal with UTF-16 error handling...)

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-19 Thread Jonas Sicking
On Mon, Mar 19, 2012 at 6:14 PM, Glenn Maynard gl...@zewt.org wrote:
 On Mon, Mar 19, 2012 at 7:33 PM, Jonas Sicking jo...@sicking.cc wrote:

 What value are we adding, and to whom, by keeping the list the
 smallest it can be, even when that means keeping the lists of
 supported encodings different between different APIs?

 Not needlessly extending support for legacy encodings means there's no
 chance of this API inadvertently causing proliferation of those encodings.
 That benefits everyone who might come in contact with that data, and
 increases the odds of being able to remove some of those encodings from the
 platform entirely.

It seems unlikely to me that adding support for an encoding here will
make it harder to eradicate the encoding from the web.

 The concrete costs are that authors will have to learn which encodings
 work where, and that implementations need to keep separate lists of
 supported encodings in different APIs.


 Authors don't need to learn that; all they care about is if the encoding
 they're trying to use works.  Nobody memorizes lists of encodings.

Why are encodings different than other parts of the API, where you
indeed have to know what works and what doesn't?

 It also means that browsers need to be able to encode to each of these
 encodings, and encoding for all of them needs to be specified, which I think
 is currently unneeded.  (Unless we go the asymmetric encoding/decoding
 route, supporting only decoders for legacy charsets.  If this is the only
 reason that'd all have to be specified, that's probably another reason to
 consider it...)

 Supporting streaming decoding for modal encodings, such as ISO-2022-CN,
 might also be a burden: it means implementations would be required to
 support stateful, incremental decoding for that charset, which is more
 complicated than most encodings (which are stateless).  Many implementations
 probably do support that, but I don't think it's currently mandatory, and it
 would complicate any streaming API.  Stateful encodings need to die even
 more than other legacy encodings; I hope this API doesn't have to support
 any of them.

UTF8 is stateful, so I disagree.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


 stringLength doesn't return the length of the decoded string.  It returns
 the byte offset of the first \0 (or the length of the whole buffer, if
 none), for decoding null-terminated strings.  For multibyte encodings (eg.
 everything except UTF-16 and friends), it's just memchr(), so it's much
 faster than actually decoding the string.


And just to be clear, the use case is decoding data formats where string
fields are variable length null terminated.


 Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


 I suggested eg.

 result = encode(string, "utf-8", null).output;

 which would create an ArrayBuffer of the required size.  Presumably the
 null ArrayBufferView argument would be optional, so you could just say
 encode(string, "utf-8").


I think we want both encoding and destination to be optional. That leads us
to an API like:

out_dict = stringEncoding.encode(string, opt_dict);

.. where both out_dict and opt_dict are WebIDL Dictionaries:

opt_dict keys: view, encoding
out_dict keys: charactersWritten, byteWritten, output

... where output === view if view is supplied, otherwise a new Uint8Array
(or Uint8ClampedArray??)

If this instead is attached to String, it would look like:

out_dict = my_string.encode(opt_dict);

If it were attached to ArrayBufferView, having a right-size buffer
allocated for the caller gets uglier unless we include a static version.
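
For concreteness, a usage sketch of the dictionary-in/dictionary-out shape
above (key names as proposed; everything else is illustrative):

  var my_string = "héllo";

  // Encode into a caller-supplied view:
  var view = new Uint8Array(1024);
  var out = stringEncoding.encode(my_string, {view: view, encoding: "utf-8"});
  // out.output === view; out.charactersWritten / out.byteWritten report progress.

  // Or let the implementation allocate:
  var out2 = stringEncoding.encode(my_string, {encoding: "utf-8"});
  var bytes = out2.output;  // a new Uint8Array of exactly the right size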

It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


 Only if it guarantees that it doesn't write anything to the output buffer
 unless the entire result will fit.  I don't think we need to do that; just
 guarantee that it'll be truncated on a whole codepoint.


Agreed. Input/output dicts mean the API documentation a caller needs to
read to understand the usage is more complex than a function signature
which is why I resisted them, but it does seem like the best approach.
Thanks for pushing, Glenn!

In the create-a-buffer-on-the-fly case there will be some memory juggling
going on, either by initially over-allocating or reallocating/moving.


 I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


 Was that change made?  I filed
 https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
 to be undecided.


Settling on an options dict means adding a flag to control this behavior
(throws: true ?) doesn't extend the API surface significantly.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:


 And just to be clear, the use case is decoding data formats where string
 fields are variable length null terminated.


... and the spec should include normative guidance that length-prefixing is
strongly recommended for new data formats.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Glenn Maynard
On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:

 And just to be clear, the use case is decoding data formats where string
 fields are variable length null terminated.


A concrete example is ZIP central directories.

 I think we want both encoding and destination to be optional. That leads us
 to an API like:

 out_dict = stringEncoding.encode(string, opt_dict);

 .. where both out_dict and opt_dict are WebIDL Dictionaries:

 opt_dict keys: view, encoding



 out_dict keys: charactersWritten, byteWritten, output


The return value should just be a [NoInterfaceObject] interface.
Dictionaries are used for input fields.

Something that came up on IRC that we should spend some time thinking
about, though: Is it actually important to be able to encode into an
existing buffer?  This may be a premature optimization.  You can always
encode into a new buffer, and--if needed--copy the result where you need it.

If we don't support that, most of this extra stuff in encode() goes away.

... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)


Uint8Array is correct.  (Uint8ClampedArray is for image color data.)

If UTF-16 or UTF-32 are supported, decoding to them should return
Uint16Array and Uint32Array, respectively (with the return value being
typed just to ArrayBufferView).

If this instead is attached to String, it would look like:

 out_dict = my_string.encode(opt_dict);

 If it were attached to ArrayBufferView, having a right-size buffer
 allocated for the caller gets uglier unless we include a static version.


If in-place decoding isn't really needed, we could have:

newView = str.encode("utf-8"); // or {encoding: "utf-8"}
str2 = newView.decode("utf-8");
len = newView.find(0); // replaces stringLength, searching for 0 in the
view's type; you'd use Uint16Array for UTF-16

and encodedLength() would go away.

newView.find(val) would live on subclasses of TypedArray.

In the create-a-buffer-on-the-fly case there will be some memory juggling
 going on, either by initially over allocating or reallocating/moving.


But since that's all behind the scenes, the implementation can do it
whichever way is most efficient for the particular encoding.  In many
cases, it may be possible to eliminate any reallocation, by making an
educated guess about how big the buffer is likely to be.
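
(One such educated guess for UTF-8: a UTF-16 code unit never encodes to
more than 3 bytes, so a caller using the encode-into-a-view shape from
earlier in the thread could over-allocate once and trim; sketch only:)

  var buf = new Uint8Array(my_string.length * 3);  // safe upper bound for UTF-8
  var res = stringEncoding.encode(my_string, {view: buf, encoding: "utf-8"});
  var bytes = buf.subarray(0, res.byteWritten);    // trim to the bytes actually written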

On Fri, Mar 16, 2012 at 11:21 AM, Joshua Bell jsb...@chromium.org wrote:

 ... and the spec should include normative guidance that length-prefixing is
 strongly recommended for new data formats.


I think this would be a bit off-topic.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Graham



On Fri, 16 Mar 2012, Glenn Maynard wrote:


On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:


And just to be clear, the use case is decoding data formats where string
fields are variable length null terminated.



A concrete example is ZIP central directories.

I think we want both encoding and destination to be optional. That leads us

to an API like:

out_dict = stringEncoding.encode(string, opt_dict);

.. where both out_dict and opt_dict are WebIDL Dictionaries:

opt_dict keys: view, encoding





out_dict keys: charactersWritten, byteWritten, output



The return value should just be a [NoInterfaceObject] interface.
Dictionaries are used for input fields.

Something that came up on IRC that we should spend some time thinking
about, though: Is it actually important to be able to encode into an
existing buffer?  This may be a premature optimization.  You can always
encode into a new buffer, and--if needed--copy the result where you need it.

If we don't support that, most of this extra stuff in encode() goes away.


Yes, I think we should focus on getting feature parity with e.g. python 
first -- i.e. not worry about decoding into existing buffers -- and add 
extra fancy stuff later if we find that there are actually use cases where 
avoiding the copy is critical. This should allow us to focus on getting 
the right API for the common case.



If in-place decoding isn't really needed, we could have:

newView = str.encode("utf-8"); // or {encoding: "utf-8"}
str2 = newView.decode("utf-8");
len = newView.find(0); // replaces stringLength, searching for 0 in the
view's type; you'd use Uint16Array for UTF-16

and encodedLength() would go away.


This looks like a big win to me.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Joshua Bell
On Fri, Mar 16, 2012 at 10:35 AM, Glenn Maynard gl...@zewt.org wrote:

 On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell jsb...@chromium.org wrote:


 ... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)


 Uint8Array is correct.  (Uint8ClampedArray is for image color data.)

 If UTF-16 or UTF-32 are supported, decoding to them should return
 Uint16Array and Uint32Array, respectively (with the return value being
 typed just to ArrayBufferView).


FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness
- the above would imply that either platform endianness dictated the output
byte sequence (and le/be was ignored), or that encode("\uFFFD",
"utf-16").view[0] might != 0xFFFD on some platforms.

There was consensus (among the two of us) that the output view's underlying
buffer's byte order would be le/be depending on the selected encoding.
There is not consensus over what the return view type should be -
Uint8Array, or pursue BE/LE variants of Uint16Array to conceal platform
endianness.
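
(Whichever view type is returned, a caller can always inspect the encoded
bytes with an explicit byte order via DataView, which takes a little-endian
flag per access; sketch, assuming the dictionary-returning encode above:)

  var res = stringEncoding.encode("\uFFFD", {encoding: "utf-16le"});
  var dv = new DataView(res.output.buffer, res.output.byteOffset, res.output.byteLength);
  var unit = dv.getUint16(0, true);  // true = little-endian; 0xFFFD on every platform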


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:12 PM, Joshua Bell wrote:

FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness


For what it's worth, it seems like this is something we should seriously 
consider changing so as to make the web-visible endianness of typed 
arrays always be little-endian.  Authors are actively writing code (and 
being encouraged to do so by technology evangelists) that makes that 
assumption anyway


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 2:17 PM, Boris Zbarsky wrote:

On 3/16/12 5:12 PM, Joshua Bell wrote:
FYI, there was some follow up IRC conversation on this. With Typed 
Arrays
as currently specified - that is, that Uint16Array has platform 
endianness


For what it's worth, it seems like this is something we should 
seriously consider changing so as to make the web-visible endianness 
of typed arrays always be little-endian.  Authors are actively writing 
code (and being encouraged to do so by technology evangelists) that 
makes that assumption anyway


The DataView set of methods already does this work. The raw arrays are 
supposed to have platform endianness.


If you see some evangelists skipping the endian check, send them an 
e-mail and let them know.



-Charles


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Graham

On Fri, 16 Mar 2012, Charles Pritchard wrote:


On 3/16/2012 2:17 PM, Boris Zbarsky wrote:

On 3/16/12 5:12 PM, Joshua Bell wrote:

FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness


For what it's worth, it seems like this is something we should seriously 
consider changing so as to make the web-visible endianness of typed arrays 
always be little-endian.  Authors are actively writing code (and being 
encouraged to do so by technology evangelists) that makes that assumption 
anyway


The DataView set of methods already does this work. The raw arrays are 
supposed to have platform endianness.


If you see some evangelists skipping the endian check, send them an e-mail 
and let them know.


Not going to work.

You can't evangelise people into making their code work on architectures 
that they don't own. It's hard enough to get people to work around 
differences between browsers when all the browsers are available for free 
and run on the platforms that they develop on.


The reality is that on devices where typed arrays don't appear LE, content 
will break.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Glenn Maynard
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com wrote:

 The DataView set of methods already does this work. The raw arrays are
 supposed to have platform endianness.


That's wrong.  This is web API design 101; everyone should know better than
this by now.  Exposing platform endianness is setting the platform up for
massive incompatibilities down the road.

In reality, the spec is moot here: if anyone does implement typed arrays on
a production big-endian system, they're going to make these views
little-endian, because doing otherwise would break countless applications,
essentially all of which are tested only on little-endian systems.  Web
compatibility is a top priority to browser implementations.

(DataView isn't relevant here; it's used for different access patterns.  To
access arrays of data embedded in an ArrayBuffer, you use views, not
DataView.  Use DataView if you have a packed data structure with
variable-size fields, such as the metadata in a ZIP local file header.)
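
(For example, the fixed-size fields of a ZIP local file header are defined
as little-endian and read naturally with DataView on any platform; sketch,
with buf assumed to be an ArrayBuffer holding the header bytes:)

  var dv = new DataView(buf);
  var signature = dv.getUint32(0, true);         // 0x04034B50, i.e. "PK\3\4"
  var compressedSize = dv.getUint32(18, true);
  var fileNameLength = dv.getUint16(26, true);
  var extraFieldLength = dv.getUint16(28, true);
  // The variable-length file name starts at byte offset 30.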

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 3:26 PM, Glenn Maynard wrote:
On Fri, Mar 16, 2012 at 4:44 PM, Charles Pritchard ch...@jumis.com 
mailto:ch...@jumis.com wrote:


The DataView set of methods already does this work. The raw arrays
are supposed to have platform endianness.


That's wrong.  This is web API design 101; everyone should know better 
than this by now.  Exposing platform endianness is setting the 
platform up for massive incompatibilities down the road.


I make mistakes all the time with UTF8 and raw string arrays. I make 
mistakes all the time with endianness.


Low level API design 101; everyone working with low level APIs makes 
mistakes.


In reality, the spec is moot here: if anyone does implement typed 
arrays on a production big-endian system, they're going to make these 
views little-endian, because doing otherwise would break countless 
applications, essentially all of which are tested only on 
little-endian systems.  Web compatibility is a top priority to browser 
implementations.


It's up to programmers to code defensively. More so with multi-platform, 
multi-vendor deployments than walled gardens.
Authors should be using the spec as written; it only takes one target 
system to use big-endian.


It doesn't harm anything for a vendor to implement as little-endian, as 
most authors assume and test on little endian.
It may cause some harm to alter the spec so as to remove the requirement 
that coders account for both.




(DataView isn't relevant here; it's used for different access 
patterns.  To access arrays of data embedded in an ArrayBuffer, you 
use views, not DataView.  Use DataView if you have a packed data 
structure with variable-size fields, such as the metadata in a ZIP 
local file header.)
I use the subarray pattern frequently. DataView is not much different 
than using subarray.


Use DataView when it's easier than ArrayBufferView and available.




Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:44 PM, Charles Pritchard wrote:

The DataView set of methods already does this work. The raw arrays are
supposed to have platform endianness.


I haven't seen anyone actually using the DataView stuff in practice, or 
presenting it to developers much...



If you see some evangelists skipping the endian check, send them an
e-mail and let them know.


I've done that... then I stopped because it just wasn't worth the 
effort.  Every single WebGL demo I've seen recently was doing this. 
People were being told that typed arrays are a good way to load binary 
(integer and float) data from servers using the arraybuffer facilities 
of XHR at SXSW last week, with no mention of endianness.


I think that trying to get web developers to do this right is a lost 
cause, esp. because none of them (to a good approximation) have any 
big-endian systems to test on.


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 5:25 PM, Brandon Jones wrote:

Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware
of any devices available right now that support WebGL that are.


I believe that recent Firefox on a SPARC processor would fit that 
description.  Of course the number of web developers that have a 
SPARC-based machine is 0 to a very good approximation


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Jonas Sicking
On Fri, Mar 16, 2012 at 9:19 AM, Joshua Bell jsb...@chromium.org wrote:
 On Thu, Mar 15, 2012 at 5:20 PM, Glenn Maynard gl...@zewt.org wrote:

 On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


 stringLength doesn't return the length of the decoded string.  It returns
 the byte offset of the first \0 (or the length of the whole buffer, if
 none), for decoding null-terminated strings.  For multibyte encodings (eg.
 everything except UTF-16 and friends), it's just memchr(), so it's much
 faster than actually decoding the string.


 And just to be clear, the use case is decoding data formats where string
 fields are variable-length and null-terminated.
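
A sketch of that use case against the proposed API as described in this
thread (the exact namespace and signatures of stringLength and decode are
assumptions, not a shipped interface):

  // Pull a NUL-terminated string field out of a packed record.
  function readCString(buffer, byteOffset, maxBytes, encoding) {
    var field = new DataView(buffer, byteOffset, maxBytes);
    var len = stringLength(field, encoding);   // byte offset of the first 0x00, or maxBytes
    return decode(new DataView(buffer, byteOffset, len), encoding);
  }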


 Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


 I suggested eg.

 result = encode(string, "utf-8", null).output;

 which would create an ArrayBuffer of the required size.  Presumably the
 null ArrayBufferView argument would be optional, so you could just say
 encode(string, "utf-8").


 I think we want both encoding and destination to be optional. That leads us
 to an API like:

 out_dict = stringEncoding.encode(string, opt_dict);

 .. where both out_dict and opt_dict are WebIDL Dictionaries:

 opt_dict keys: view, encoding
 out_dict keys: charactersWritten, bytesWritten, output

 ... where output === view if view is supplied, otherwise a new Uint8Array
 (or Uint8ClampedArray??)

 If this instead is attached to String, it would look like:

 out_dict = my_string.encode(opt_dict);

 If it were attached to ArrayBufferView, having a right-size buffer
 allocated for the caller gets uglier unless we include a static version.

Using input and output dictionaries is definitely messy, but I can't
see a better way either. And I think ES6 is adding some syntax here
that will make developers' lives better (destructuring assignment).
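
A sketch of how that might read, using the dictionary keys from the message
above (nothing here is a shipped API; myString and existingView are
placeholders):

  // With an ES6 destructuring assignment:
  var { output, charactersWritten, bytesWritten } =
      stringEncoding.encode(myString, { encoding: "utf-8" });
  // No view was supplied, so output is a freshly allocated Uint8Array.

  // Encoding into a caller-supplied view instead:
  var result = stringEncoding.encode(myString, { view: existingView, encoding: "utf-8" });
  // result.output === existingView; result.bytesWritten says how much of it was filled.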

 It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


 Only if it guarantees that it doesn't write anything to the output buffer
 unless the entire result will fit.  I don't think we need to do that; just
 guarantee that it'll be truncated on a whole codepoint.


 Agreed. Input/output dicts mean the API documentation a caller needs to
 read to understand the usage is more complex than a function signature,
 which is why I resisted them, but it does seem like the best approach.
 Thanks for pushing, Glenn!

 In the create-a-buffer-on-the-fly case there will be some memory juggling
 going on, either by initially over allocating or reallocating/moving.

The implementation can always figure out what strategy fits its own
requirements best with regards to memory allocation. I suspect that
right now in Firefox the fastest implementation would be to scan
through the string once to measure the desired buffer size, then
allocate and write into the allocated buffer.

The problem is that, the way the encoding function is defined right
now, you are not allowed to write any data if you are throwing for
whatever reason. That means you have to do a scan first to see whether
you need to throw, and then do a separate pass to actually encode the
data. I think we need to change that such that when an exception is
thrown, data is still written up to the point that causes the
exception.

 I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the Unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


 Was that change made?  I filed
 https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
 to be undecided.


 Settling on an options dict means that adding a flag to control this behavior
 (throws: true ?) doesn't extend the API surface significantly.

Sounds good to me. Though I would still strongly prefer the default to
be non-throwing, so as to minimize the risk of website breakage in the
case of bugs, especially since these bugs are so data-dependent and are
likely not to happen on a developer's computer.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 4:25 PM, Boris Zbarsky wrote:

On 3/16/12 5:25 PM, Brandon Jones wrote:

Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware
of any devices available right now that support WebGL that are.


I believe that recent Firefox on a SPARC processor would fit that 
description.  Of course the number of web developers that have a 
SPARC-based machine is 0 to a very good approximation



I've written some hash/encryption methods that could very well fail on 
Firefox on SPARC; many things fail on machines I've never tested 
with.


Flip the implementation on SPARC, and it wouldn't harm anything. Cut it 
out of the spec so that the behavior is undocumented, and implementations 
break.
DataView is more complex than ArrayBufferView, so implementers started 
with the easy option.


The coders using Float32Array are cowboys (web app gaming and 
encryption). We're talking about a few hundred people out of many millions.




Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread James Robinson
On Fri, Mar 16, 2012 at 4:25 PM, Boris Zbarsky bzbar...@mit.edu wrote:

 On 3/16/12 5:25 PM, Brandon Jones wrote:

 Everyone knows that typed arrays /can/ be Big Endian, but I'm not aware

 of any devices available right now that support WebGL that are.


 I believe that recent Firefox on a SPARC processor would fit that
 description.  Of course the number of web developers that have a
 SPARC-based machine is 0 to a very good approximation


You can s/web developers/users/ and the statement would still apply,
wouldn't it?

- James



 -Boris



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Boris Zbarsky

On 3/16/12 7:43 PM, James Robinson wrote:

You can s/web developers/users/ and the statement would still apply,
wouldn't it?


Sure, but so what?

The upshot is that people are writing code that assumes little-endian 
hardware all over.  We should just clearly make the spec say that that's 
what typed arrays are so that an implementor can actually implement the 
spec and be web compatible.


The value of a spec which can't be implemented as written is arguably 
lower than not having a spec at all...  At least then you _know_ you 
have to reverse-engineer.


-Boris


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-16 Thread Charles Pritchard

On 3/16/2012 5:25 PM, Boris Zbarsky wrote:

On 3/16/12 7:43 PM, James Robinson wrote:

You can s/web developers/users/ and the statement would still apply,
wouldn't it?


Sure, but so what?

The upshot is that people are writing code that assumes little-endian 
hardware all over.  We should just clearly make the spec say that 
that's what typed arrays are so that an implementor can actually 
implement the spec and be web compatible.


The value of a spec which can't be implemented as written is arguably 
lower than not having a spec at all...  At least then you _know_ you 
have to reverse-engineer.


Isn't that an issue for TC39?




Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-15 Thread Anne van Kesteren

On Wed, 14 Mar 2012 23:53:12 +0100, Glenn Maynard gl...@zewt.org wrote:
On Wed, Mar 14, 2012 at 6:52 AM, Anne van Kesteren ann...@opera.com  
wrote:

If we can make it a deterministic, unchanging, and defined algorithm, I
think that would actually be acceptable. And ideally we do define that
algorithm at some point so new browsers can enter the existing market  
more easily and existing browsers interpret existing content in the  
same way.


We don't have any untagged content to support yet, so let's not create an
API that guarantees it'll come into existence.  The heuristics you need
depend heavily on the content, anyway (for example, heuristics that work
for HTML probably won't for ID3 tags, which are generally very short).


What I replied to suggested reusing an existing undocumented code path  
which is definitely used to support existing content. From what I remember  
reading about the detector in Gecko it can be quite useful regardless of  
context.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-15 Thread Jonas Sicking
On Wed, Mar 14, 2012 at 3:33 PM, Joshua Bell jsb...@chromium.org wrote:
 FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding

A few comments:

What's the use-case for the stringLength function? You can't decode
into an existing datastructure anyway, so you're ultimately forced to
call decode at which point the stringLength function hasn't helped
you.


Currently the use-case of simply wanting to convert a string to a
binary buffer is a bit cumbersome. You first have to call the
encodedLength function, then allocate a buffer of the right size,
then call the encode function. Could we add a function with
something like the following signature:

ArrayBufferView encode(DOMString value, optional DOMString encoding);


It doesn't seem possible to implement the 'encode' function without
doing multiple scans over the string. The implementation seems
required both to check that the data can be decoded using the
specified encoding, as well as check that the data will fit in the
passed in buffer. Only then can the implementation start decoding the
data. This seems problematic.


I also don't think it's a good idea to throw an exception for encoding
errors. Better to convert characters to the Unicode replacement
character. I believe we made a similar change to the WebSockets
specification recently.

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-15 Thread Glenn Maynard
On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking jo...@sicking.cc wrote:

 What's the use-case for the stringLength function? You can't decode
 into an existing datastructure anyway, so you're ultimately forced to
 call decode at which point the stringLength function hasn't helped
 you.


stringLength doesn't return the length of the decoded string.  It returns
the byte offset of the first \0 (or the length of the whole buffer, if
none), for decoding null-terminated strings.  For multibyte encodings (eg.
everything except UTF-16 and friends), it's just memchr(), so it's much
faster than actually decoding the string.

Currently the use-case of simply wanting to convert a string to a
 binary buffer is a bit cumbersome. You first have to call the
 encodedLength function, then allocate a buffer of the right size,
 then call the encode function.


I suggested eg.

result = encode(string, "utf-8", null).output;

which would create an ArrayBuffer of the required size.  Presumably the
null ArrayBufferView argument would be optional, so you could just say
encode(string, "utf-8").

It doesn't seem possible to implement the 'encode' function without
 doing multiple scans over the string. The implementation seems
 required both to check that the data can be decoded using the
 specified encoding, as well as check that the data will fit in the
 passed in buffer. Only then can the implementation start decoding the
 data. This seems problematic.


Only if it guarantees that it doesn't write anything to the output buffer
unless the entire result will fit.  I don't think we need to do that; just
guarantee that it'll be truncated on a whole codepoint.

I also don't think it's a good idea to throw an exception for encoding
 errors. Better to convert characters to the Unicode replacement
 character. I believe we made a similar change to the WebSockets
 specification recently.


Was that change made?  I filed
https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to
be undecided.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Anne van Kesteren

On Wed, 14 Mar 2012 01:01:42 +0100, Ian Hickson i...@hixie.ch wrote:
Seems reasonable. If we have specific use cases for non-UTF-8 encodings,  
I agree we should support them; if that's the case, we should survey those  
use cases to work out what the set of encodings we need is, and add just  
those.


And not go beyond what is defined/allowed in:  
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Cedric Vivier
Hi,

On Wed, Mar 14, 2012 at 06:49, Jonas Sicking jo...@sicking.cc wrote:
 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

What are the 'late' use cases for this?
The question might sound naive, but to me the encoding/decoding would
have been really great to have during the time when we didn't have
support for ArrayBuffers in general input/output APIs like we have now
(XHR, WebSockets, File API, ...) - which sounds like the mainstream
use cases to me.

However, there is one use case that is not supported and that seems worth
not overlooking imho: embedding binary data (typed arrays) into textual
formats such as XML or JSON.

For this, base64 encoding/decoding is typically used (so that it doesn't
conflict with the XML or JSON container) and is thus implemented more or
less efficiently in JavaScript (just like we had to encode/decode strings
in JS to/from XHR a while ago).

Would it make sense to support encoding=base64 in this API?
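
For reference, the status quo described above is roughly the following
hand-rolled approach (a sketch, not part of the proposed API):

  function uint8ToBase64(bytes) {
    var binary = "";
    for (var i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);   // one byte per "character"
    }
    return btoa(binary);                         // browser base64 encoder
  }

  var payload = JSON.stringify({ data: uint8ToBase64(new Uint8Array([1, 2, 3])) });
  // payload === '{"data":"AQID"}'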


 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

This API proposal looks lean and mean.
I hope we can move the current StringEncoding proposal to something
closer to this.

Regards,


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread James Graham

On 03/14/2012 12:38 AM, Tab Atkins Jr. wrote:

On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote:

The API on that wiki page is a reasonable start.  For the same reasons that
we discussed in a recent thread (
http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
conversion errors should use replacement (eg. U+FFFD), not throw
exceptions.


Python throws errors by default, but both functions have an additional
argument specifying an alternate strategy.  In particular,
bytes.decode can either drop the invalid bytes, replace them with a
replacement char (which I agree should be U+FFFD), or replace them
with XML entities; str.encode can choose to drop characters the
encoding doesn't support.


For completeness I note that Python also allows user-provided custom 
error handling. I'm not suggesting we want this, but I would strongly 
prefer it to providing an XML-entity-encode option :)


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Anne van Kesteren
On Wed, 14 Mar 2012 00:50:43 +0100, Joshua Bell jsb...@chromium.org  
wrote:
For both of the above: initially suggested use cases included parsing  
data as esoteric as ID3 tags in MP3 files, where the encoding is unspecified  
and guessed at by decoders, and includes non-Unicode encodings. It was  
suggested that the encoding sniffing capabilities of browsers be  
leveraged. (Cue a strong nooo! from Anne.)


If we can make it a deterministic, unchanging, and defined algorithm, I  
think that would actually be acceptable. And ideally we do define that  
algorithm at some point so new browsers can enter the existing market more  
easily and existing browsers interpret existing content in the same way.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Joshua Bell
FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding

* Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms,
encodings, and encoding selection, which greatly simplifies the spec. This
implicitly adds support for all of the other encodings defined therein - we
may still want to dictate a subset of encodings. A few minor issues noted
throughout the spec.
* Define a binary encoding, since that support was already in this spec.
We may decide to kill this but I didn't want to remove it just yet.
* Simplify methods to take ArrayBufferView instead of
any/byteOffset/byteLength. The implication is that you may need to use
temporary DataViews, and this is reflected in the examples.
* Call out more of the big open issues raised on this thread (e.g. where
should we hang this API)

Nothing controversial added, or (alas) resolved.


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Glenn Maynard
On Tue, Mar 13, 2012 at 9:47 PM, John Tamplin j...@google.com wrote:
 I am fine with strongly suggesting that only UTF8 be used for new things,
 but leaving out legacy support will severely limit the utility of this
 library.

Not all limitations are bad, and I'd disagree with "seriously".

At a minimum, the set of encodings should be very carefully selected.
Limit it to Unicode to begin with, and if we're really going to put legacy
encodings on yet more life support, only add an encoding where there's a
clear, justified need for it.  (There are many encodings that browsers need
to support for text/html because they're used in legacy content, but which
nobody is still using today in new content--those should not be supported
here.)

But stick with Unicode for now.  Once an encoding is added, it's hard to
ever remove it.

On Wed, Mar 14, 2012 at 6:52 AM, Anne van Kesteren ann...@opera.com wrote:

 If we can make it a deterministic, unchanging, and defined algorithm, I
 think that would actually be acceptable. And ideally we do define that
 algorithm at some point so new browsers can enter the existing market more
 easily and existing browsers interpret existing content in the same way.


We don't have any untagged content to support yet, so let's not create an
API that guarantees it'll come into existence.  The heuristics you need
depend heavily on the content, anyway (for example, heuristics that work
for HTML probably won't for ID3 tags, which are generally very short).


On Wed, Mar 14, 2012 at 11:14 AM, Joshua Bell jsb...@chromium.org wrote:

 Having implemented a library that handled both text encodings and
 base16/base64 encoding, I can offer the opinion that the nomenclature gets
 very confusing since the encode/decode semantics are reversed.

 binary_buffer = encode(text_content)
 text_content = decode(binary_buffer)

 vs.

 binary_buffer = decode(base64_data)
 base64_data = encode(binary_buffer)


It's more than a naming problem.  With this string API, one side of the
conversion is always a DOMString.  Base64 conversion wants
ArrayBuffer-to-ArrayBuffer conversions, so it would belong in a separate API.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-14 Thread Joshua Bell
On Wed, Mar 14, 2012 at 3:53 PM, Glenn Maynard gl...@zewt.org wrote:


 It's more than a naming problem.  With this string API, one side of the
 conversion is always a DOMString.  Base64 conversion wants
 ArrayBuffer-to-ArrayBuffer conversions, so it would belong in a separate API.


Huh. The scenarios I've run across are Base64-encoded binary data islands
embedded in textual container formats like XML or JSON, which yield a
DOMString I want to decode into an ArrayBuffer.
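
That direction is roughly the following today (a hand-rolled sketch using
atob, not a proposed API):

  function base64ToUint8(b64) {
    var binary = atob(b64);                      // browser base64 decoder
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
      bytes[i] = binary.charCodeAt(i);
    }
    return bytes;
  }

  var island = JSON.parse('{"data":"AQID"}').data;
  var bytes = base64ToUint8(island);             // Uint8Array of [1, 2, 3]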


[whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Jonas Sicking
Hi All,

Something that has come up a couple of times with content authors
lately has been the desire to convert an ArrayBuffer (or part thereof)
into a decoded string. Similarly being able to encode a string into an
ArrayBuffer (or part thereof).

Something as simple as

DOMString decode(ArrayBufferView source, DOMString encoding);
ArrayBufferView encode(DOMString source, DOMString encoding,
[optional] ArrayBufferView destination);

would go a very long way. The question is where to stick these
functions. Internationalization doesn't have an obvious object we can
hang functions off of (unlike, for example crypto), and the above
names are much too generic to turn into global functions.

Ideas/opinions/bikesheds?

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Tab Atkins Jr.
On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote:
 Hi All,

 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

 would go a very long way. The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

 Ideas/opinions/bikesheds?

Python3 just defines str.encode and bytes.decode.  Can we not do this
with String.encode and ArrayBuffer.decode?

~TJ


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Ian Hickson
On Tue, 13 Mar 2012, Jonas Sicking wrote:
 
 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).
 
 Something as simple as
 
 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);
 
 would go a very long way. The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

Shouldn't this just be another ArrayBufferView type with special 
semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a 
getString()/setString() method pair on DataView?

Incidentally I _strongly_ suggest we only support UTF-8 here.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Kenneth Russell
Joshua Bell has been working on a string encoding and decoding API
that supports the needed encodings, and which is separable from the
core typed array API:

http://wiki.whatwg.org/wiki/StringEncoding

This is the direction I prefer. String encoding and decoding seems to
be a complex enough problem that it should be expressed separately
from the typed array spec itself.

-Ken


On Tue, Mar 13, 2012 at 5:59 PM, Ian Hickson i...@hixie.ch wrote:
 On Tue, 13 Mar 2012, Jonas Sicking wrote:

 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

 would go a very long way. The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

 Shouldn't this just be another ArrayBufferView type with special
 semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
 getString()/setString() method pair on DataView?

 Incidentally I _strongly_ suggest we only support UTF-8 here.

 --
 Ian Hickson               U+1047E                )\._.,--,'``.    fL
 http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
 Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Jonas Sicking
On Tue, Mar 13, 2012 at 3:58 PM, Tab Atkins Jr. jackalm...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote:
 Hi All,

 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

 would go a very long way. The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

 Ideas/opinions/bikesheds?

 Python3 just defines str.encode and bytes.decode.  Can we not do this
 with String.encode and ArrayBuffer.decode?

Unfortunately I suspect getting anything added on the String object
will take a few years given that it's too late to get into ES6 (and in
any case I suspect adding ArrayBuffer dependencies to ES6 would be
controversial).

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Jonas Sicking
On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com wrote:
 Joshua Bell has been working on a string encoding and decoding API
 that supports the needed encodings, and which is separable from the
 core typed array API:

 http://wiki.whatwg.org/wiki/StringEncoding

 This is the direction I prefer. String encoding and decoding seems to
 be a complex enough problem that it should be expressed separately
 from the typed array spec itself.

Very cool. Where do I provide feedback to this? Here?

/ Jonas


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Glenn Maynard
On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote:

 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).


There was discussion about this before:

https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html
http://wiki.whatwg.org/wiki/StringEncoding

(I don't know why it was on the WebGL list; typed arrays are becoming
infrastructural and this doesn't seem like it belongs there, even though
ArrayBuffer was started there.)

The API on that wiki page is a reasonable start.  For the same reasons that
we discussed in a recent thread (
http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
conversion errors should use replacement (eg. U+FFFD), not throw
exceptions.  The "any" arguments should be fixed.  Encoding to UTF-16
should definitely not prefix a BOM, and UTF-16 having unspecified
endianness is obviously bad.

I'd also suggest that, unless there's serious, substantiated demand for
it--which I doubt--only major Unicode encodings be supported.  Don't make
it easier for people to keep using legacy encodings.

 Shouldn't this just be another ArrayBufferView type with special
 semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
 getString()/setString() method pair on DataView?

I don't think so, because retrieving the N'th decoded/reencoded character
isn't a constant-time operation.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Kenneth Russell
On Tue, Mar 13, 2012 at 6:10 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com wrote:
 Joshua Bell has been working on a string encoding and decoding API
 that supports the needed encodings, and which is separable from the
 core typed array API:

 http://wiki.whatwg.org/wiki/StringEncoding

 This is the direction I prefer. String encoding and decoding seems to
 be a complex enough problem that it should be expressed separately
 from the typed array spec itself.

 Very cool. Where do I provide feedback to this? Here?

This list seems like a good place to discuss it.

-Ken


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Ian Hickson
On Tue, 13 Mar 2012, Jonas Sicking wrote:

 Unfortunately I suspect getting anything added on the String object will 
 take a few years given that it's too late to get into ES6 (and in any 
 case I suspect adding ArrayBuffer dependencies to ES6 would be 
 controversial).

We can just define it outside the ES spec.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Joshua Bell
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote:

 On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote:

  Something that has come up a couple of times with content authors
  lately has been the desire to convert an ArrayBuffer (or part thereof)
  into a decoded string. Similarly being able to encode a string into an
  ArrayBuffer (or part thereof).
 

 There was discussion about this before:


 https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html
 http://wiki.whatwg.org/wiki/StringEncoding

 (I don't know why it was on the WebGL list; typed arrays are becoming
 infrastructural and this doesn't seem like it belongs there, even though
 ArrayBuffer was started there.)

 The API on that wiki page is a reasonable start.  For the same reasons that
 we discussed in a recent thread (
 http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
 conversion errors should use replacement (eg. U+FFFD), not throw
 exceptions.  The "any" arguments should be fixed.  Encoding to UTF-16
 should definitely not prefix a BOM, and UTF-16 having unspecified
 endianness is obviously bad.

 I'd also suggest that, unless there's serious, substantiated demand for
 it--which I doubt--only major Unicode encodings be supported.  Don't make
 it easier for people to keep using legacy encodings.


Two other pieces of feedback I received from Adam Barth off list:

* take ArrayBufferView as input, which both fixes "any" and simplifies the
API to eliminate byteOffset and byteLength
* support two versions of encode, one which takes a target ArrayBufferView,
and one which allocates/returns a new Uint8Array of the appropriate length.



  Shouldn't this just be another ArrayBufferView type with special
  semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
  getString()/setString() method pair on DataView?

 I don't think so, because retrieving the N'th decoded/reencoded character
 isn't a constant-time operation.

 --
 Glenn Maynard



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Joshua Bell
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote:

 On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking jo...@sicking.cc wrote:

  Something that has come up a couple of times with content authors
  lately has been the desire to convert an ArrayBuffer (or part thereof)
  into a decoded string. Similarly being able to encode a string into an
  ArrayBuffer (or part thereof).
 

 There was discussion about this before:


 https://www.khronos.org/webgl/public-mailing-list/archives//msg00017.html
 http://wiki.whatwg.org/wiki/StringEncoding

 (I don't know why it was on the WebGL list; typed arrays are becoming
 infrastructural and this doesn't seem like it belongs there, even though
 ArrayBuffer was started there.)


Purely historical; early adopters of Typed Arrays were folks prototyping
with WebGL who wanted to parse data files containing strings.

WHATWG makes sense, I just hadn't gotten around to shopping for a home.
(Administrivia: Is there need to propose a charter addition?)


 The API on that wiki page is a reasonable start.  For the same reasons that
 we discussed in a recent thread (
 http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
 conversion errors should use replacement (eg. U+FFFD), not throw
 exceptions.  The "any" arguments should be fixed.  Encoding to UTF-16
 should definitely not prefix a BOM, and UTF-16 having unspecified
 endianness is obviously bad.

 I'd also suggest that, unless there's serious, substantiated demand for
 it--which I doubt--only major Unicode encodings be supported.  Don't make
 it easier for people to keep using legacy encodings.


Two other pieces of feedback I received from Adam Barth off list:

* take ArrayBufferView as input, which both fixes "any" and simplifies the
API to eliminate byteOffset and byteLength
* support two versions of encode, one which takes a target ArrayBufferView,
and one which allocates/returns a new Uint8Array of the appropriate length.



  Shouldn't this just be another ArrayBufferView type with special
  semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
  getString()/setString() method pair on DataView?

 I don't think so, because retrieving the N'th decoded/reencoded character
 isn't a constant-time operation.

 --
 Glenn Maynard



Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Ian Hickson
On Tue, 13 Mar 2012, Joshua Bell wrote:
 On Tue, Mar 13, 2012 at 4:10 PM, Jonas Sicking jo...@sicking.cc wrote:
  On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell k...@google.com 
  wrote:
   Joshua Bell has been working on a string encoding and decoding API 
   that supports the needed encodings, and which is separable from the 
   core typed array API:
  
   http://wiki.whatwg.org/wiki/StringEncoding
  
   This is the direction I prefer. String encoding and decoding seems 
   to be a complex enough problem that it should be expressed 
   separately from the typed array spec itself.

Some quick feedback:

 - [OmitConstructor] doesn't seem to be WebIDL

 - please don't allow UAs to implement other encodings. You should list 
   the exact set of supported encodings and the exact labels that should 
   be recognised as meaning those encodings, and disallow all others. 
   Otherwise, we'll be in a never-ending game of reverse-engineering each 
   others' lists of supported encodings and it'll keep growing.

 - What's the use case for supporting anything but UTF-8?

 - Having a mechanism that lets you encode the string and get a length 
   separate from the mechanism that lets you encode the string and get the 
   encoded string seems like it would encourage very inefficient code. Can 
   we instead have a mechanism that returns both at once? Or is the idea 
   that for some encodings getting the encoded length is much quicker than 
   getting the actual string?

 - Seems weird that integers and strings would have such different APIs 
   for doing the same thing. Why can't we handle them equivalently? As in:

 len = view.setString(strings[i],
  offset + Uint32Array.BYTES_PER_ELEMENT,
  "UTF-8");
 view.setUint32(offset, len);
 offset += Uint32Array.BYTES_PER_ELEMENT + len;

HTH,
-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Ian Hickson
On Tue, 13 Mar 2012, Joshua Bell wrote:
 
 WHATWG makes sense, I just hadn't gotten around to shopping for a home. 
 (Administrivia: Is there need to propose a charter addition?)

You're welcome to use the WHATWG list for this. Charters are pointless and 
there's no need to worry about them here.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Tab Atkins Jr.
On Tue, Mar 13, 2012 at 4:08 PM, Jonas Sicking jo...@sicking.cc wrote:
 On Tue, Mar 13, 2012 at 3:58 PM, Tab Atkins Jr. jackalm...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 3:49 PM, Jonas Sicking jo...@sicking.cc wrote:
 Hi All,

 Something that has come up a couple of times with content authors
 lately has been the desire to convert an ArrayBuffer (or part thereof)
 into a decoded string. Similarly being able to encode a string into an
 ArrayBuffer (or part thereof).

 Something as simple as

 DOMString decode(ArrayBufferView source, DOMString encoding);
 ArrayBufferView encode(DOMString source, DOMString encoding,
 [optional] ArrayBufferView destination);

 would go a very long way. The question is where to stick these
 functions. Internationalization doesn't have an obvious object we can
 hang functions off of (unlike, for example crypto), and the above
 names are much too generic to turn into global functions.

 Ideas/opinions/bikesheds?

 Python3 just defines str.encode and bytes.decode.  Can we not do this
 with String.encode and ArrayBuffer.decode?

 Unfortunately I suspect getting anything added on the String object
 will take a few years given that it's too late to get into ES6 (and in
 any case I suspect adding ArrayBuffer dependencies to ES6 would be
 controversial).

Like Ian said, I don't see anything particularly bad about the spec
defining ArrayBuffers to define an ArrayBuffer-related method on
String.  There's no reason it has to be in the ES spec.

~TJ


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Tab Atkins Jr.
On Tue, Mar 13, 2012 at 4:11 PM, Glenn Maynard gl...@zewt.org wrote:
 The API on that wiki page is a reasonable start.  For the same reasons that
 we discussed in a recent thread (
 http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
 conversion errors should use replacement (eg. U+FFFD), not throw
 exceptions.

Python throws errors by default, but both functions have an additional
argument specifying an alternate strategy.  In particular,
bytes.decode can either drop the invalid bytes, replace them with a
replacement char (which I agree should be U+FFFD), or replace them
with XML entities; str.encode can choose to drop characters the
encoding doesn't support.

~TJ


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Ian Hickson
On Tue, 13 Mar 2012, Joshua Bell wrote:
 
 For both of the above: initially suggested use cases included parsing 
 data as esoteric as ID3 tags in MP3 files, where the encoding is unspecified 
 and guessed at by decoders, and includes non-Unicode encodings. It 
 was suggested that the encoding sniffing capabilities of browsers be 
 leveraged. [...]
 
 Whether we should restrict it as far as UTF-8 depends on whether we 
 envision this API only used for parsing/serializing newly defined data 
 formats, or whether there is consideration for interop with previously 
 existing formats data formats and code.

Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I 
agree we should support them; if that's the case, we should survey those 
use cases to work out what the set of encodings we need is, and add just 
those.


   - Having a mechanism that lets you encode the string and get a length
separate from the mechanism that lets you encode the string and get the
encoded string seems like it would encourage very inefficient code. Can
we instead have a mechanism that returns both at once? Or is the idea
that for some encodings getting the encoded length is much quicker than
getting the actual string?
 
 
 The use case was to compute the size necessary to allocate a single buffer
 into which may be encoded multiple strings and other data, rather than
 allocating multiple small buffers and then copying strings into a larger
 buffer.
 
 Ignoring the issue of invalid code points, the length calculations for
 non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
 be sanitized, that case is trivially 2x the JS string length.)

Yeah, but surely we'll mainly be doing stuff with UTF-8...

One option is to return an opaque object of the form:

   interface EncodedString {
     readonly attribute unsigned long length;
     // internally has a copy of the encoded string
   }

...and then have view.setString take this EncodedString object. At least 
then you get it down to an extraneous copy, rather than an extraneous 
encode. Still not ideal though.
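
A sketch of how that hypothetical object might be used (encodeString and the
argument order of setString are assumptions modelled on the earlier snippet,
not anything specified):

  var encoded = encodeString(strings[i], "UTF-8");   // hypothetical: returns an EncodedString
  view.setUint32(offset, encoded.length);
  view.setString(encoded, offset + Uint32Array.BYTES_PER_ELEMENT);
  offset += Uint32Array.BYTES_PER_ELEMENT + encoded.length;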


   - Seems weird that integers and strings would have such different APIs
for doing the same thing. Why can't we handle them equivalently? As in:
 
  len = view.setString(strings[i],
   offset + Uint32Array.BYTES_PER_ELEMENT,
   "UTF-8");
  view.setUint32(offset, len);
  offset += Uint32Array.BYTES_PER_ELEMENT + len;
 
 Heh, that's where the discussion started, actually. We wanted to keep 
 the DataView interface simple, and potentially support encoding into 
 plain JS arrays and/or non-TypedArray support that appeared to be on the 
 horizon for JS.

I see where you're coming from, but I think we should look at the platform 
as a whole, not just one API. It doesn't help the platform as a whole if 
we just have the same features split across two interfaces; the complexity 
is even slightly higher than just having one consistent API that does ints 
and strings equivalently.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread Glenn Maynard
Using Views instead of specifying the offset and length sounds good.

On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson i...@hixie.ch wrote:

  - What's the use case for supporting anything but UTF-8?


Other Unicode encodings may be useful, to decode existing file formats
containing (most likely at a minimum) UTF-16.  I don't feel strongly about
that, though; we're stuck with UTF-16 as an internal representation in the
platform, but that doesn't necessarily mean we need to support it as a
transfer encoding.

For non-Unicode legacy encodings, I think that even if use cases exist,
they should be given more than the usual amount of scrutiny before being
supported.



On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. jackalm...@gmail.comwrote:

 Python throws errors by default, but both functions have an additional
 argument specifying an alternate strategy.  In particular,
 bytes.decode can either drop the invalid bytes, replace them with a
 replacement char (which I agree should be U+FFFD), or replace them
 with XML entities; str.encode can choose to drop characters the
 encoding doesn't support.


Supporting throwing is okay if it's really wanted, but the default should
be replacement.  It reduces fatal errors to (usually) non-fatal
replacement, for obscure cases that people generally don't test.  It's a
much more sane default failure mode.

As another option, never throw, but allow returning the number of
conversion errors:

results = encode("abc\uD800def", outputView, "UTF-8");

where results.inputConsumed is the number of words consumed in myString,
results.outputWritten is the number of UTF-8 bytes written, and
results.errors is 1.

That also allows block-by-block conversion; for example, to convert as many
complete characters as possible into a fixed-size buffer for transmission,
then starting again at the next unencoded character.

One more idea, while I'm brainstorming: if outputView is null, allocate an
ArrayBuffer of the necessary size, storing it in results.output.  That
eliminates the need for a separate length pass, without bloating the API
with another overload.
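
A sketch of the block-by-block case against that brainstormed result shape
(encode, send, and longString are hypothetical stand-ins, not a shipped API):

  var chunk = new Uint8Array(4096);
  var pos = 0;
  while (pos < longString.length) {
    var results = encode(longString.slice(pos), chunk, "UTF-8");
    send(chunk.subarray(0, results.outputWritten));  // transmit only the bytes written
    pos += results.inputConsumed;                    // resume at the next unconsumed code unit
  }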


On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell jsb...@chromium.org wrote:

 (Cue a strong nooo! from Anne.)


(Count me in on that, too.  Heuristics bad.)

 Ignoring the issue of invalid code points, the length calculations for
 non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
 be sanitized, that case is trivially 2x the JS string length.)


UTF-16 sanitization (replacing mismatched surrogates with U+FFFD) doesn't
change the size of the output, actually: a lone surrogate is one UTF-16 code
unit, and U+FFFD also encodes as one code unit.

-- 
Glenn Maynard


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-13 Thread John Tamplin
On Tue, Mar 13, 2012 at 8:19 PM, Glenn Maynard gl...@zewt.org wrote:

 Using Views instead of specifying the offset and length sounds good.

 On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson i...@hixie.ch wrote:

   - What's the use case for supporting anything but UTF-8?
 

 Other Unicode encodings may be useful, to decode existing file formats
 containing (most likely at a minimum) UTF-16.  I don't feel strongly about
 that, though; we're stuck with UTF-16 as an internal representation in the
 platform, but that doesn't necessarily mean we need to support it as a
 transfer encoding.

 For non-Unicode legacy encodings, I think that even if use cases exist,
 they should be given more than the usual amount of scrutiny before being
 supported.


The whole idea is to be able to extract textual data out of some packed
binary format.  If you don't support the character sets people want to use,
they will simply do like they have to do now and hand-code the character
set conversion, where it will be slow and inaccurate.

In particular, I think you have to include various ISO-8859-* character
sets (especially Latin1) and the non-Unicode character sets still
frequently used by Japanese and Chinese users.

I am fine with strongly suggesting that only UTF8 be used for new things,
but leaving out legacy support will severely limit the utility of this
library.

-- 
John A. Tamplin
Software Engineer (GWT), Google