On May 16, 2011, at 1:37 PM, Mike Samuel wrote: > 2011/5/16 Allen Wirfs-Brock <al...@wirfs-brock.com>: >> >> ... >> >>> How would >>> >>> var oneSupplemental = "\U00010000"; > >> I don't think I understand your literal notation. \U is a 32-bit character >> value? In whose implementation? > > Sorry, please read this as > var oneSupplemental = String.fromCharCode(0x10000); >
In my proposal you would have to say String.fromCodepoint(0x10000). In ES5, String.fromCharCode(0x10000) produced the same string as "\0". That remains the case in my proposal. > >>> alert(oneSupplemental.length); // alerts 1 >>> >> I'll take your word for this > > If I understand, a string containing the single codepoint U+10000 > should have length 1. > > "The length of a String is the number of elements (i.e., > 16-bit\b\b\b\b\b\b 21-bit values) within it." Yes, it's 1. My gruff comment was in reference to not being sure of your literal notation. > >>> var utf16Encoded = encodeUTF16(oneSupplemental); >>> alert(utf16Encoded.length); // alerts 2 >> >> yes >> >>> var textNode = document.createTextNode(utf16Encoded); >>> alert(textNode.nodeValue.length); // alerts ? >> >> 2 >> >>> Does the DOM need to represent utf16Encoded internally so that it can >>> report 2 as the length on fetch of nodeValue? >> >> However the DOM represents DOMString values internally, to conform to >> the DOM spec it must act as if it is representing them using UTF-16. > > Ok. This seems to present two options: > (1) Break the internet by binding DOMStrings to a JavaScript host type > and not to the JavaScript string type. I'm not sure why this would break the internet. At the implementation level, a key point of my proposal is that implementations can (and even today some do) have multiple different internal representations for strings. These internal representation differences simply are not exposed to the JS program, except possibly in terms of measurable performance differences. > (2) DOMStrings never contain supplemental codepoints. That's how DOMStrings are currently defined, and I'm not proposing to change this. Adding full Unicode DOMStrings to the DOM spec seems like a task for the W3C. 
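The ES5 truncation behavior described above is observable today. As a side note, String.fromCodePoint was eventually standardized (in ES2015), but strings remained sequences of 16-bit code units, so a supplementary codepoint has length 2 rather than the length 1 this proposal envisioned. A small sketch illustrating both behaviors in a current engine:

```javascript
// ES5 String.fromCharCode applies ToUint16 to each argument,
// so 0x10000 wraps around to 0 and yields "\0".
console.log(String.fromCharCode(0x10000) === "\0"); // true

// String.fromCodePoint (standardized later, in ES2015) accepts the full
// codepoint and produces a surrogate pair in the UTF-16 string.
var s = String.fromCodePoint(0x10000);
console.log(s.length);                        // 2 (code units, not 1)
console.log(s.charCodeAt(0).toString(16));    // "d800"
console.log(s.charCodeAt(1).toString(16));    // "dc00"
```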
> > So for either alert(typeof > > var roundTripped = document.createTextNode(oneSupplemental).nodeValue > > either > > typeof roundTripped !== "string" Not really; I'm perfectly happy to allow the DOM to continue to report the type of DOMString as 'string'. It's no different from a user-constructed string that may or may not contain a UTF-16 character sequence, depending upon what the user code does. > > or > > roundTripped.length != oneSupplemental.length Yes, this may be the case, but only for new code that explicitly builds oneSupplemental to contain a supplemental character using \u+xxxxxx or String.fromCodepoint or some other new function. All existing valid code only produces strings with codepoints limited to 0xffff. > > >>> If so, how can it >>> represent that for systems that use a UTF-16 internal representation >>> for DOMString? >> >> Let me know if I haven't already answered this. > > You might have. If you reject my assertion about option 2 above, then > to clarify, > The UTF-16 representation of codepoint U+10000 is the code-unit pair > U+D8000 U+DC000. > The UTF-16 representation of codepoint U+D8000 is the single code-unit > U+D8000 and similarly for U+DC00. > > How can the codepoints U+D800 U+DC00 be distinguished in a DOMString > implementation that uses UTF-16 under the hood from the codepoint > U+10000? > I think you have an extra 0 at a couple of places above... A DOMString is defined by the DOM spec to consist of 16-bit elements that are to be interpreted as a UTF-16 encoding of Unicode characters. It doesn't matter what implementation-level representation is used for the string; the indexable positions within a DOMString are restricted to 16-bit values. At the representation level each position could even be represented by a 32-bit cell and it wouldn't matter. To be a valid DOMString, element values must be in the range 0-0xffff. 
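For readers following the surrogate arithmetic: the standard UTF-16 encoding of a supplementary codepoint subtracts 0x10000 and splits the remaining 20 bits across a high and a low surrogate. Here is a sketch of the `encodeUTF16` helper the thread refers to hypothetically (the name and signature are from the thread's example, not any standard API):

```javascript
// Sketch of UTF-16 encoding for a single codepoint, per the Unicode spec:
// codepoints above 0xFFFF become a high/low surrogate pair.
function encodeUTF16(codePoint) {
  if (codePoint <= 0xFFFF) return String.fromCharCode(codePoint);
  var v = codePoint - 0x10000;        // 20 significant bits remain
  var high = 0xD800 + (v >> 10);      // top 10 bits -> high surrogate
  var low  = 0xDC00 + (v & 0x3FF);    // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

var pair = encodeUTF16(0x10000);
console.log(pair.length);                     // 2
console.log(pair.charCodeAt(0).toString(16)); // "d800"
console.log(pair.charCodeAt(1).toString(16)); // "dc00"
```

So U+10000 encodes as the pair U+D800 U+DC00 (one zero fewer than in the quoted text above).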
I think you are unnecessarily mixing up the string semantics defined by the language, the encodings that might be used in implementing those semantics, and application-level processing of those strings.

To simplify things, just think of an ES string as if it were an array, each element of which could contain an arbitrary integer value. If we have such an array like [0xd800, 0xdc00], then at the language semantics level this is a two-element array containing two well-specified values. At the language implementation level there are all sorts of representations that might be used; maybe the implementation Huffman encodes the elements...

How the application processes that array is completely up to the application. It may treat the array simply as two integer values. It may treat each element as a 21-bit value encoding a Unicode codepoint and logically consider the array to be a Unicode string of length 2. It may consider each element to be a 16-bit value and interpret sequences of values as UTF-16 string encodings. In that case, it could consider it to represent a string of logical length 1.

This is no different from what people do today with 16-bit char JS strings. Many people just treat them as strings of BMP characters and ignore the possibility of supplemental characters or UTF-16 encodings. Other people (particularly when dealing with DOMStrings) treat strings as code units of a UTF-16 encoding. They need to use more complex string processing algorithms to deal with logical Unicode characters.

Allen
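The two interpretations described above can be made concrete. Below is a sketch using the [0xd800, 0xdc00] example; `utf16Length` is a hypothetical helper written for illustration, not a standard function:

```javascript
// The same two-element sequence, interpreted two ways.
var units = [0xD800, 0xDC00];

// Interpretation 1: two independent 21-bit codepoint values -> length 2.
console.log(units.length); // 2

// Interpretation 2: a UTF-16 code-unit sequence, where a high surrogate
// followed by a low surrogate counts as one logical character -> length 1.
function utf16Length(units) {
  var n = 0;
  for (var i = 0; i < units.length; i++) {
    var u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF &&          // high surrogate...
        i + 1 < units.length &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) { // ...low surrogate
      i++; // consume the low half of the pair
    }
    n++;
  }
  return n;
}
console.log(utf16Length(units)); // 1
```

An unpaired surrogate still counts as one element under either reading; only the pairing convention differs between the two interpretations.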
_______________________________________________ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss