On 1/11/2012 4:22 PM, Boris Zbarsky wrote:
On 1/11/12 6:03 PM, Charles Pritchard wrote:
Is there any instance in practice where DOMString as exposed to the
scripting environment is not implemented as a unicode string?
I don't know what you mean by that.
The point is, it's trivial to construct JS strings that contain
arbitrary sequences of 16-bit units (using fromCharCode or \u
escapes). Nothing anywhere in JS or the DOM per se enforces that
strings are valid UTF-16 (which is the way that an actual Unicode
string would be encoded as a JS string).
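For instance (a minimal illustration; the exact values don't matter), either of these gives you a string containing an unpaired surrogate, which is not well-formed UTF-16, and nothing complains:

  var a = String.fromCharCode(0xD800); // lone high surrogate
  var b = "\uDC00";                    // lone low surrogate
  a.length;                            // 1
  a.charCodeAt(0);                     // 0xD800 (55296)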
My [wrong] understanding was that DOMString referred to valid unicode.
WebIDL:
"The DOMString type corresponds to the set of all possible sequences of
16 bit unsigned integer code units. Such sequences are commonly
interpreted as UTF-16 encoded strings [RFC2781] although this is not
required... Nothing in this specification requires a DOMString value to
be a valid UTF-16 string."
http://www.w3.org/TR/WebIDL/#idl-DOMString
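As a quick sketch of what that means in practice (what comes back out is implementation-dependent, since nothing requires valid UTF-16), a lone surrogate can be handed straight to a DOMString attribute:

  document.title = "\uD800";                  // not valid UTF-16, accepted anyway
  document.title.charCodeAt(0).toString(16);  // typically still "d800"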
DOM3:
"The DOMString type is used to store [Unicode] characters as a sequence
of 16-bit units using UTF-16 as defined in [Unicode] and Amendment 1 of
[ISO/IEC 10646]." There are some normalization notes, but otherwise,
it's close enough to saying it stores Unicode, but it can handle all
16bit combinations.
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-C74D1578
For "historic reasons" WindowBase64 throws an error if input is not
within Unicode range.
http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#atob
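For example, btoa() only accepts code units in the 0x00-0xFF range and throws on anything above that:

  btoa("hello");      // "aGVsbG8="
  try {
    btoa("\u0100");   // code unit > 0xFF
  } catch (e) {
    // INVALID_CHARACTER_ERR / InvalidCharacterError
  }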
I realize that internally, DOMString may be implemented as an array of
16-bit integers plus a length;
Not just internally. The JS spec and the DOM spec both explicitly say
that this is what strings are: an array of 16-bit integers.
WebIDL and DOM define "DOMString", of course. JS defines "The String
Type" in 8.4. They are intended to be the same.
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
"The String type is the set of all finite ordered sequences of zero or
more 16-bit unsigned integer values .... When a String contains actual
textual data, each element is considered to be a single UTF-16 code
unit. Whether or not this is the actual storage format of a String, the
characters within a String are numbered by their initial code unit
element position as though they were represented using UTF-16."
Browsers do the same thing with WindowBase64: though its argument is
specified as DOMString, in practice (as the notes say) it's treated as
Unicode.
http://www.whatwg.org/specs/web-apps/current-work/multipage/webappapis.html#atob
If you look at the actual processing model, you take the input array
of 16-bit integers, throw if any is not in the set { 0x2B, 0x2F }
union [0x30, 0x39] union [0x41, 0x5A] union [0x61, 0x7A] (padding '='
is handled separately), and then treat the rest as ASCII data (which
at that point it is).
It defines this in terms of "Unicode" but that's just because any JS
string that satisfies the above constraints can be considered a
"Unicode" string if one wishes.
Web Storage, also, only works with unicode.
I'm not familiar with the relevant part of Web Storage. Can you cite
the relevant part please?
The character code conversion gets weird. If you'd explain this in the
proper terms, I'd appreciate it.
Load a binary resource via the old charset hack.
Save the resulting string into localStorage. Some conversion issue
shows up along the way; I may not be using the right vocabulary.
I know the list has seen the issue before, and I'll bet someone here can
explain it succinctly.
Example:
// Image files are easiest to try this with. From the article at:
// https://developer.mozilla.org/En/XMLHttpRequest/Using_XMLHttpRequest#Receiving_binary_data_in_older_browsers
function load_binary_resource(url) {
  var req = new XMLHttpRequest();
  req.open('GET', url, false);
  // XHR binary charset opt by Marcus Granado 2006 [http://mgran.blogspot.com]
  req.overrideMimeType('text/plain; charset=x-user-defined');
  req.send(null);
  if (req.status != 200) return '';
  return req.responseText;
}

var x = load_binary_resource('imageurl.png');
localStorage.fail = x;
localStorage.fail == x; // will return false.
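FWIW, the usual way to get at the bytes from that hack (a sketch of the approach described in the MDN article above, using a hypothetical helper; the exact high-byte mapping, e.g. Gecko's 0xF700-0xF7FF range, is implementation-specific) is to mask each code unit down to its low byte rather than storing the raw string:

  function bytes_from_binary_string(s) {
    var bytes = [];
    for (var i = 0; i < s.length; i++) {
      bytes.push(s.charCodeAt(i) & 0xFF); // keep only the low byte of each code unit
    }
    return bytes;
  }

That sidesteps whatever conversion localStorage applies to the string on the way through, which is presumably why the direct comparison above comes back false.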