You might want to check out Norbert's proposal [1].

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

> -----Original Message-----
> From: Roger Andrews [mailto:[email protected]]
> Sent: Wednesday, November 14, 2012 6:07 AM
> To: [email protected]
> Subject: easy handling of UTF16 surrogates & well-formed strings
>
> This is rather long but the idea is to make handling UTF16 surrogates easier
> for the casual user without harming the ability of UTF16 experts to delve
> into details if surrogates are not well-paired (and hence the string is not
> well-formed).
>
> Under the current definitions (ed. 6_10-26-12) surprising things happen.
> E.g. a string converted to an array of codepoints with 'codePointAt' then
> back to a string with 'fromCodePoint' is not equal to the original string if
> it contains well-formed surrogate pairs.
>
> Here are some thoughts from a JavaScript enthusiast playing with Unicode
> outside the BMP.
>
>
> String.prototype.codePointAt
> ----------------------------
>
> The current definition of codePointAt has results:
>     out-of-bounds                  -> Undefined
>     normal BMP char                -> the codepoint
>     lead surrogate of a good pair  -> the codepoint
>     trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF]  !!ambiguous
>     bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
>     bad lead surrogate             -> codeunit in [0xD800:0xDBFF]
>
> Note that a well-paired trail surrogate still results in a value even though
> the previous codeunit "subsumed" it. So, if a caller is indexing down the
> string then it should take the well-paired trail surrogate value out of the
> sequence.
>
> UTF16 experts can write code to check these possibilities; but for general
> usability let's have:
>     Undefined for the trail surrogate of a good pair, and
>     NaN for bad surrogate.
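
To make the six cases in the table above concrete, here is a small sketch that probes every index of one sample string. It assumes an engine whose codePointAt behaves as the table describes (this is how the method was later standardized in ES2015); the names 'sample' and 'results' are only illustrative.

```javascript
// Index: 0 = "a", 1-2 = well-formed pair (U+1F600),
//        3 = lone lead surrogate, 4 = "b", 5 = lone trail surrogate.
var sample = "a\uD83D\uDE00\uD800b\uDC00";

var results = [];
// Go one past the end on purpose to show the out-of-bounds case.
for (var i = 0; i <= sample.length; i++) {
  results.push(sample.codePointAt(i));
}

// results[0] is 0x61     normal BMP char
// results[1] is 0x1F600  lead surrogate of a good pair
// results[2] is 0xDE00   trail surrogate of a good pair (the ambiguous case)
// results[3] is 0xD800   bad lead surrogate
// results[4] is 0x62     normal BMP char
// results[5] is 0xDC00   bad trail surrogate
// results[6] is undefined (out-of-bounds)
```

Note that results[2] and results[5] are both plain trail code units; the caller cannot tell them apart without also looking at the preceding code unit.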
>
> Then codePointAt would do the work for the casual user and experts can probe
> the string with charCodeAt (or codeUnitAt if it exists) if they really want to
> know the situation of bad surrogates.
>
> [Unchanged, users are called upon to write code patterns like the messy....
>
>   // if the indexed position is part of a well-formed surrogate pair
>   // then result is either the entire code-point (for lead surrogates)
>   // or undefined (for trail surrogates)
>   // result is NaN for bad surrogates
>   // (result is always undefined for out-of-bounds position)
>
>   cp = str.codePointAt( pos );
>   if (0xDC00 <= cp && cp <= 0xDFFF) {
>       cu = str.charCodeAt( pos-1 );
>       if (0xD800 <= cu && cu <= 0xDBFF) {
>           cp = undefined;  // trail surrogate of good pair
>       }
>   }
>   if (0xD800 <= cp && cp <= 0xDFFF) {
>       cp = NaN;  // bad surrogate
>   }
>
> ]
>
>
> String.prototype.charCodeAt / String.prototype.codeUnitAt
> ---------------------------
>
> The existing charCodeAt returns NaN (not Undefined) if the indexed position
> is out-of-bounds, unlike codePointAt.
>
> For consistency, there could be a method 'codeUnitAt' which behaves like (and
> is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
>
>
> String.prototype.charAt / String.prototype.unicodeCharAt
> -----------------------
>
> The existing charAt does not handle UTF16 surrogate pairs.
>
> For consistency with the above, there could be a method 'unicodeCharAt'
> which returns the 1- or 2-char string corresponding to the 'codePointAt'
> value and empty-string for out-of-bounds or a well-paired trail surrogate.
> Note that an array of such strings could be joined to form the original
> string.
>
> What to return for a bad surrogate? Null? Undefined?
>
>
> String.fromCodePoint
> --------------------
>
> The current definition of fromCodePoint does not convert a sequence produced
> by codePointAt back to the original string.
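
The round-trip problem can be reproduced with a short sketch. It assumes an engine that provides codePointAt and fromCodePoint with the behavior described in this message (as later standardized in ES2015). The two helper functions are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper: call codePointAt at every index, as a casual user might.
function toCodePointsByIndex(str) {
  var cps = [];
  for (var i = 0; i < str.length; i++) {
    cps.push(str.codePointAt(i));
  }
  return cps;
}

// Hypothetical helper: the workaround an expert has to write today, skipping
// the trail code unit whenever a supplementary code point was returned.
function toCodePointsSkippingTrails(str) {
  var cps = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.codePointAt(i);
    cps.push(cp);
    if (cp > 0xFFFF) i++; // the pair occupied two code units
  }
  return cps;
}

var original = "x\uD83D\uDE00y"; // "x", U+1F600, "y"

var naive = String.fromCodePoint.apply(null, toCodePointsByIndex(original));
var careful = String.fromCodePoint.apply(null, toCodePointsSkippingTrails(original));

// naive is "x\uD83D\uDE00\uDE00y": the trail surrogate appears twice,
// so naive !== original, while careful === original.
```

The extra 0xDE00 in the naive result is the well-paired trail surrogate that codePointAt reported a second time at index 2.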
>
> This is really due to codePointAt returning a trail surrogate value after a
> well-formed pair (which were just converted to a single codepoint).
>
> If codePointAt is changed to return Undefined for a good trail surrogate then
> fromCodePoint should simply ignore Undefined arguments. Currently I think it
> throws RangeError (or maybe converts Undefined values to NUL chars?).
>
>
> String.fromCharCode / String.fromCodeUnit
> -------------------
>
> The existing fromCharCode converts undefined,null,NaN,Infinity values into
> NUL chars (U+0000), and maps other naughty values into valid chars.
>
> For consistency, there could be a function 'fromCodeUnit' which behaves like
> (and is named like) fromCodePoint; i.e. throws RangeError for naughty values.
> This function should also have arity = 0 like fromCodePoint.
>
> If fromCodePoint is changed to ignore Undefined arguments then so should
> fromCodeUnit.
>
>
> String.isWellFormed
> -------------------
>
> To enable a user easily to detect a well-/ill-formed string how about a
> friendly predicate:
>     String.isWellFormed( str )
>
> Without this, the following regexp should test a string for well-formedness
> (no warranty implied):
>     /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
>
>
> String.prototype.repair
> -----------------------
>
> Following on from isWellFormed, what is the user to do with an ill-formed
> string? Here is one suggestion: a 'repair' method which replaces improper
> surrogates with something (like the Unicode replacement character U+FFFD).
> (Alternatively, the user may want to give up and throw an Error, see next.)
>
> [Here is a possible implementation which UTF16 experts could shim in....
>
>   var re_badsurrogate =
>       /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;
>
>   String.prototype.repair = function (replacer)
>   {
>       if (arguments.length == 0) replacer = "\uFFFD";
>
>       return this.replace( re_badsurrogate, "$1"+replacer );
>   };
>
> ]
>
>
> StringError (& URI functions)
> -----------
>
> The existing encodeURI & encodeURIComponent throw URIError if given an
> ill-formed string. (The URI decode functions behave similarly, both for
> ill-formed strings and for improper use of percent-coding.)
>
> A new Error, called StringError, could be thrown by URI functions and user
> functions which reject an ill-formed string *because* it is ill-formed
> (rather than trying to repair it).
>
> To avoid changing the existing URI functions, versions using StringError could
> be moved from global namespace to a "URI" namespace (à la "JSON"):
>     URI.encodeComponent, ...
> This seems quite neat, and declutters the global namespace too.
>

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
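
A note on the quoted 'repair' shim: because its second alternative consumes the character before a stray trail surrogate, the second of two consecutive unpaired trail surrogates is left in place (for example "a\uDC00\uDC00" becomes "a\uFFFD\uDC00"). One possible alternative, offered only as a sketch, is to match well-formed pairs first and pass them through unchanged. The standalone functions 'isWellFormed' and 'repair' here are hypothetical stand-ins for the proposed String.isWellFormed and String.prototype.repair, not an existing API.

```javascript
// The well-formedness regexp quoted in the message, wrapped in a
// hypothetical helper function.
var RE_WELL_FORMED =
  /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

function isWellFormed(str) {
  return RE_WELL_FORMED.test(str);
}

// Matches either a well-formed pair (two code units) or any single surrogate.
// Because the pair alternative comes first, a single-unit match is always
// an unpaired surrogate.
var RE_PAIR_OR_SURROGATE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

// Hypothetical alternative to the quoted shim: keep good pairs, replace the rest.
function repair(str, replacer) {
  if (replacer === undefined) replacer = "\uFFFD";
  return str.replace(RE_PAIR_OR_SURROGATE, function (match) {
    return match.length === 2 ? match : replacer;
  });
}

var fixed = repair("a\uDC00\uDC00");           // "a\uFFFD\uFFFD"
var mixed = repair("\uD83Dx\uD83D\uDE00");     // "\uFFFDx\uD83D\uDE00"
```

Using a replacement function also avoids any special treatment of "$" sequences inside the replacer string.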

