Phillips, Addison wrote:

Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.


Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" without exceptions -- you'd be storing 16-bit values, whatever their source (including "\uXXXX" literals spelling invalid characters and unmatched surrogates) in at-least-21-bit elements of strings, and reading them back.
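A minimal sketch of that round trip in today's JS, where string elements are 16 bits wide. The helper names pack and unpack are made up for illustration and come from no library. The assumption, going by the description above, is that the same values would read back unchanged from the wider elements once the BRS is on.

```javascript
// Hypothetical helpers for illustration only: "pun" 16-bit values into a string.
function pack(values) {
  // String.fromCharCode truncates each argument to 16 bits (ToUint16)
  // and performs no Unicode validation.
  return String.fromCharCode(...values);
}

function unpack(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    out.push(str.charCodeAt(i)); // reads each element back as-is
  }
  return out;
}

// Includes unmatched surrogates and the noncharacters U+FFFE and U+FFFF.
const raw = [0x0041, 0xD800, 0xFFFE, 0xFFFF, 0xDC00, 0x0000];
const roundTripped = unpack(pack(raw));

console.log(roundTripped.every((v, i) => v === raw[i])); // true
```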

My concern, and my reason for advocating early or late errors on shenanigans, was that people who today write surrogate pairs literally and then take extra pains in JS or C++ (whatever the host language might be) to process them as single code points and characters would be broken by the BRS-enabled behavior of separating the parts into distinct code points.

But that's pessimistic. It could happen, but OTOH anyone coding surrogate pairs might want them to read back piece-wise when indexing. In that case what Allen proposes, storing each formerly 16-bit code unit, however expressed, in the wider 21-or-more-bits unit, and reading back likewise, would "just work".
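To illustrate both cases, here is a sketch of what a literally spelled surrogate pair looks like in today's JS: the piece-wise read-back that indexing gives, and the "extra pains" needed to treat the pair as one code point. The function name codePointFromPair is made up for illustration. The BRS-enabled behavior is not shown, since it cannot be run here; per the description above, the assumption is that the two escapes would still read back as two separate elements.

```javascript
// A surrogate pair written literally: U+1F600 spelled as two \uXXXX escapes.
const s = "\uD83D\uDE00";

// Today each escape occupies one 16-bit element and reads back piece-wise.
console.log(s.length);                     // 2
console.log(s.charCodeAt(0).toString(16)); // "d83d"
console.log(s.charCodeAt(1).toString(16)); // "de00"

// The "extra pains": combine the halves by hand into a single code point.
// Hypothetical helper name; the arithmetic is the standard UTF-16 decoding.
function codePointFromPair(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}

const cp = codePointFromPair(s.charCodeAt(0), s.charCodeAt(1));
console.log(cp.toString(16)); // "1f600"
```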

Sorry if this is all obvious. Mainly I want to throw in my lot with Allen's exception-free literal/constructor approach. The encoding APIs should throw on invalid Unicode but literals and strings as immutable 16-bit storage buffers should work as today.
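As an existing example of that split, encodeURIComponent already throws on an unmatched surrogate, while the literal itself is accepted and stored without complaint. The sketch below shows only current behavior; the wrapper name tryEncode is made up for illustration.

```javascript
// An unmatched high surrogate is legal in a literal and is stored as-is.
const lone = "\uD800";
console.log(lone.length);                   // 1
console.log(lone.charCodeAt(0) === 0xD800); // true

// Hypothetical wrapper: report whether the encoding API accepted the string.
function tryEncode(str) {
  try {
    return { ok: true, value: encodeURIComponent(str) };
  } catch (e) {
    return { ok: false, error: e.name };
  }
}

// The encoding API rejects the invalid sequence...
console.log(tryEncode(lone)); // { ok: false, error: 'URIError' }

// ...but accepts a well-formed pair and emits its UTF-8 bytes.
console.log(tryEncode("\uD83D\uDE00").value); // "%F0%9F%98%80"
```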

/be
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
