On 1/11/2012 4:22 PM, Boris Zbarsky wrote:
On 1/11/12 6:03 PM, Charles Pritchard wrote:
Is there any instance in practice where DOMString as exposed to the
scripting environment is not implemented as a unicode string?

I don't know what you mean by that.

The point is, it's trivial to construct JS strings that contain arbitrary sequences of 16-bit units (using fromCharCode or \u escapes). Nothing anywhere in JS or the DOM per se enforces that strings are valid UTF-16 (which is the way that an actual Unicode string would be encoded as a JS string).
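
To make that concrete, here is a minimal sketch of two strings that are legal in JS but are not valid UTF-16. The helper name isWellFormedUTF16 is made up for illustration and is not part of JS or the DOM; it only checks that surrogates come in high/low pairs.

```javascript
// A lone high surrogate: a perfectly legal JS string, but not valid UTF-16.
var lone = String.fromCharCode(0xD800);

// The same idea with \u escapes: a low surrogate followed by a high one.
var reversed = "\uDC00\uD800";

// Hypothetical helper (not a built-in): returns true only if every
// surrogate code unit in s is part of a proper high/low pair.
function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}
```

Both strings are accepted by the engine without complaint; only an explicit check like the one above tells them apart from well-formed text.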


My [wrong] understanding was that DOMString referred to valid unicode.

WebIDL:
"The DOMString type corresponds to the set of all possible sequences of 16 bit unsigned integer code units. Such sequences are commonly interpreted as UTF-16 encoded strings [RFC2781] although this is not required... Nothing in this specification requires a DOMString value to be a valid UTF-16 string."
http://www.w3.org/TR/WebIDL/#idl-DOMString

DOM3:
"The DOMString type is used to store [Unicode] characters as a sequence of 16-bit units using UTF-16 as defined in [Unicode] and Amendment 1 of [ISO/IEC 10646]." There are some normalization notes, but otherwise, it's close enough to saying it stores Unicode, but it can handle all 16bit combinations.
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578

For "historic reasons" WindowBase64 throws an error if input is not within Unicode range.
http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#atob


I realize that internally, DOMString may be implemented as a 16 bit
integer + length;

Not just internally. The JS spec and the DOM spec both explicitly say that this is what strings are: an array of 16-bit integers.

WebIDL and DOM define "DOMString", of course. JS defines "The String Type" in 8.4. They are intended to be the same.
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf

"The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values .... When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16."

Browsers do the same thing with WindowBase64: though it's specified as
DOMString, in practice (as the notes say) it's Unicode.
http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#atob

If you look at the actual processing model, you take the input array of 16-bit integers, throw if any is not in the set { 0x2B, 0x2F } union [0x30,0x39] union [0x41,0x5A] union [0x61,0x7A] and then treat the rest as ASCII data (which at that point it is).

It defines this in terms of "Unicode" but that's just because any JS string that satisfies the above constraints can be considered a "Unicode" string if one wishes.
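
As a rough illustration of that processing model, the per-unit check could be sketched as below. This assumes the standard base64 alphabet ("+", "/", digits, and ASCII letters) and deliberately leaves out the padding and whitespace handling that the spec performs; the function names are hypothetical, not taken from the spec or from any browser.

```javascript
// True if the 16-bit code unit is in the base64 alphabet:
// "+" (0x2B), "/" (0x2F), "0"-"9", "A"-"Z", "a"-"z".
function inBase64Alphabet(unit) {
  return unit === 0x2B || unit === 0x2F ||
    (unit >= 0x30 && unit <= 0x39) ||
    (unit >= 0x41 && unit <= 0x5A) ||
    (unit >= 0x61 && unit <= 0x7A);
}

// Hypothetical validation step: walks the string as an array of
// 16-bit integers and throws on the first unit outside the alphabet.
function checkAtobInput(s) {
  for (var i = 0; i < s.length; i++) {
    if (!inBase64Alphabet(s.charCodeAt(i))) {
      throw new Error("invalid character at index " + i);
    }
  }
  return true; // everything left is plain ASCII
}
```

Note that the check never asks whether the string is valid UTF-16; a lone surrogate is rejected simply because it is outside the allowed set.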

Web Storage, also, only works with unicode.

I'm not familiar with the relevant part of Web Storage. Can you cite the relevant part please?

The character code conversion gets weird. If you'd explain this in the proper terms, I'd appreciate it.

Load a binary resource via the old charset hack.

Save the resulting string into localStorage. There are some conversion issues. I am not using the right vocabulary. I know the list has seen the issue before, and I'll bet someone here can explain it succinctly.

Example:
// Image files are easiest to try this with.
https://developer.mozilla.org/En/XMLHttpRequest/Using_XMLHttpRequest#Receiving_binary_data_in_older_browsers
// From the article:
function load_binary_resource(url) {
  var req = new XMLHttpRequest();
  req.open('GET', url, false);
  // XHR binary charset opt by Marcus Granado 2006 [http://mgran.blogspot.com]
  req.overrideMimeType('text\/plain; charset=x-user-defined');
  req.send(null);
  if (req.status != 200) return '';
  return req.responseText;
}
var x = load_binary_resource('imageurl.png');
localStorage.fail = x;
localStorage.fail == x; // will return false.
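
For reference, here is a sketch of what the charset hack puts into the string, assuming the usual x-user-defined mapping in which bytes 0x80-0xFF come back as code units U+F780-U+F7FF and bytes below 0x80 are unchanged. The function names are made up for illustration, and this does not establish that these code units are what trips up localStorage; it only shows what the string contains and how the original bytes are recovered.

```javascript
// Simulates decoding raw bytes as x-user-defined (assumed mapping:
// 0x00-0x7F unchanged, 0x80-0xFF mapped to U+F780-U+F7FF).
function decodeXUserDefined(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    out += String.fromCharCode(b < 0x80 ? b : 0xF700 + b);
  }
  return out;
}

// Recovers the original byte values by keeping only the low 8 bits
// of each 16-bit code unit.
function toByteValues(text) {
  var out = [];
  for (var i = 0; i < text.length; i++) {
    out.push(text.charCodeAt(i) & 0xFF);
  }
  return out;
}

// The first eight bytes of any PNG file.
var pngSignature = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
var asText = decodeXUserDefined(pngSignature);
var recovered = toByteValues(asText);
```

So the string holds private-use code units plus control characters such as NUL, CR and LF, none of which are surrogates.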


