RE: easy handling of UTF16 surrogates & well-formed strings

Phillips, Addison Wed, 14 Nov 2012 09:06:04 -0800

You might want to check out Norbert's proposal [1].

Addison


Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

> -----Original Message-----
> From: Roger Andrews [mailto:[email protected]]
> Sent: Wednesday, November 14, 2012 6:07 AM
> To: [email protected]
> Subject: easy handling of UTF16 surrogates & well-formed strings
> 
> This is rather long, but the idea is to make handling UTF16 surrogates easier
> for the casual user without harming the ability of UTF16 experts to delve into
> details if surrogates are not well-paired (and hence the string is not
> well-formed).
> 
> Under the current definitions (ed. 6_10-26-12) surprising things happen.
> E.g. a string converted to an array of codepoints with 'codePointAt' and then
> back to a string with 'fromCodePoint' is not equal to the original string if
> it contains well-formed surrogate pairs.
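
The surprise can be reproduced with a short sketch. It assumes an engine that implements codePointAt and fromCodePoint as they were later standardized (the 2012 draft may differ in detail); the helper name toCodePointsNaive is hypothetical.

```javascript
// Hypothetical helper: call codePointAt at every code unit index,
// which is what a casual user indexing down the string would do.
function toCodePointsNaive(str) {
    var cps = [];
    for (var i = 0; i < str.length; i++) {
        cps.push(str.codePointAt(i));
    }
    return cps;
}

var original = "a\uD83D\uDE00b";           // "a", U+1F600, "b"
var cps = toCodePointsNaive(original);      // 0x61, 0x1F600, 0xDE00, 0x62
var rebuilt = String.fromCodePoint.apply(null, cps);

console.log(rebuilt === original);          // false
console.log(original.length, rebuilt.length);   // 4 5
```

The extra code unit in the rebuilt string is the trail surrogate 0xDE00, reported once as part of U+1F600 and once more on its own.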
> 
> Here are some thoughts from a JavaScript enthusiast playing with Unicode
> outside the BMP.
> 
> 
> String.prototype.codePointAt
> ----------------------------
> 
> The current definition of codePointAt has results:
>    out-of-bounds                  -> Undefined
>    normal BMP char                -> the codepoint
>    lead surrogate of a good pair  -> the codepoint
>    trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF] !!ambiguous
>    bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
>    bad lead surrogate             -> codeunit in [0xD800:0xDBFF]
> 
> Note that a well-paired trail surrogate still results in a value even though
> the previous codeunit "subsumed" it.  So, if a caller is indexing down the
> string, then it should remove the well-paired trail surrogate value from the
> sequence.
> 
> UTF16 experts can write code to check these possibilities; but for general
> usability let's have:
>    Undefined for the trail surrogate of a good pair, and
>    NaN for bad surrogate.
> 
> Then codePointAt would do the work for the casual user and experts can probe
> the string with charCodeAt (or codeUnitAt if it exists) if they really want to
> know the situation of bad surrogates.
> 
> [With codePointAt unchanged, users are called upon to write messy code
> patterns like this....
> 
>     // if the indexed position is part of a well-formed surrogate pair
>     // then result is either the entire code-point (for lead surrogates)
>     //                or undefined (for trail surrogates)
>     // result is NaN for bad surrogates
>     // (result is always undefined for out-of-bounds position)
> 
>     var cp = str.codePointAt( pos );
>     if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
>         var cu = str.charCodeAt( pos-1 );
>         if (0xD800 <= cu  &&  cu <= 0xDBFF) {
>             cp =  undefined;      // trail surrogate of good pair
>         }
>     }
>     if (0xD800 <= cp  &&  cp <= 0xDFFF) {
>         cp = NaN;                 // bad surrogate
>     }
> 
> ]
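
For experimentation, the quoted pattern can be wrapped in a standalone function. This is only a sketch of the proposed results table, assuming codePointAt behaves as later standardized; the name codePointAtProposed is hypothetical and not part of any draft.

```javascript
// Hypothetical helper. Returns:
//   undefined  for out-of-bounds or the trail half of a good pair,
//   NaN        for an unpaired surrogate,
//   otherwise  the code point at pos.
function codePointAtProposed(str, pos) {
    var cp = str.codePointAt(pos);
    if (cp === undefined) return undefined;      // out of bounds
    if (0xDC00 <= cp && cp <= 0xDFFF) {
        var cu = str.charCodeAt(pos - 1);        // NaN when pos is 0
        if (0xD800 <= cu && cu <= 0xDBFF) {
            return undefined;                    // trail surrogate of good pair
        }
    }
    if (0xD800 <= cp && cp <= 0xDFFF) {
        return NaN;                              // unpaired surrogate
    }
    return cp;
}

// Usage: dropping the undefined entries makes the round trip work.
var s = "a\uD83D\uDE00b";
var kept = [];
for (var i = 0; i < s.length; i++) {
    var cp = codePointAtProposed(s, i);
    if (cp !== undefined) kept.push(cp);
}
console.log(String.fromCodePoint.apply(null, kept) === s);   // true
```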
> 
> 
> String.prototype.charCodeAt / String.prototype.codeUnitAt
> ---------------------------
> 
> The existing charCodeAt returns NaN (not Undefined) if the indexed position
> is out-of-bounds, unlike codePointAt.
> 
> For consistency, there could be a method 'codeUnitAt' which behaves like (and
> is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
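
The inconsistency, and one way a 'codeUnitAt' could look, can be sketched as follows. The standalone function codeUnitAt is hypothetical (the proposal is for a String.prototype method), and codePointAt is assumed to behave as later standardized.

```javascript
console.log("abc".charCodeAt(5));    // NaN
console.log("abc".codePointAt(5));   // undefined

// Hypothetical: like charCodeAt, but undefined instead of NaN
// for an out-of-bounds position.
function codeUnitAt(str, pos) {
    var cu = str.charCodeAt(pos);
    return cu !== cu ? undefined : cu;   // only NaN is unequal to itself
}

console.log(codeUnitAt("abc", 1));   // 98
console.log(codeUnitAt("abc", 5));   // undefined
```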
> 
> 
> String.prototype.charAt / String.prototype.unicodeCharAt
> -----------------------
> 
> The existing charAt does not handle UTF16 surrogate pairs.
> 
> For consistency with the above, there could be a method 'unicodeCharAt'
> which returns the 1- or 2-char string corresponding to the 'codePointAt'
> value and empty-string for out-of-bounds or a well-paired trail surrogate.
> Note that an array of such strings could be joined to form the original
> string.
> 
> What to return for a bad surrogate?  Null?  Undefined?
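
A rough sketch of 'unicodeCharAt' as a standalone function follows. The function name and signature are hypothetical, and the bad-surrogate question above is left open: purely as an assumption for this sketch, a lone surrogate is returned unchanged, which at least keeps the join property.

```javascript
// Hypothetical helper: the 1- or 2-char string at pos, or "" for
// out-of-bounds and for the trail half of a good pair.
function unicodeCharAt(str, pos) {
    var cu = str.charCodeAt(pos);
    if (cu !== cu) return "";                            // out of bounds (NaN)
    if (0xDC00 <= cu && cu <= 0xDFFF) {
        var prev = str.charCodeAt(pos - 1);
        if (0xD800 <= prev && prev <= 0xDBFF) return ""; // trail of a good pair
    }
    if (0xD800 <= cu && cu <= 0xDBFF) {
        var next = str.charCodeAt(pos + 1);
        if (0xDC00 <= next && next <= 0xDFFF) {
            return str.charAt(pos) + str.charAt(pos + 1); // whole pair
        }
    }
    // BMP char, or (assumption of this sketch) a lone surrogate as-is.
    return str.charAt(pos);
}

var text = "a\uD83D\uDE00b";
var parts = [];
for (var i = 0; i < text.length; i++) {
    parts.push(unicodeCharAt(text, i));
}
console.log(parts.join("") === text);   // true
```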
> 
> 
> String.fromCodePoint
> --------------------
> 
> The current definition of fromCodePoint does not convert a sequence produced
> by codePointAt back to the original string.
> 
> This is really due to codePointAt returning a trail surrogate value after a
> well-formed pair (which was just converted to a single codepoint).
> 
> If codePointAt is changed to return Undefined for a good trail surrogate then
> fromCodePoint should simply ignore Undefined arguments.  Currently I think it
> throws RangeError (or maybe converts Undefined values to NUL chars?).
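
In the version of fromCodePoint that was later standardized, an Undefined argument does throw RangeError. A wrapper that skips Undefined arguments, as suggested above, might look like this sketch; the name fromCodePointSkippingUndefined is hypothetical.

```javascript
try {
    String.fromCodePoint(0x61, undefined);
} catch (e) {
    console.log(e instanceof RangeError);   // true
}

// Hypothetical wrapper: drop undefined arguments, then defer to the
// built-in, which still throws RangeError for other invalid values.
function fromCodePointSkippingUndefined() {
    var cps = [];
    for (var i = 0; i < arguments.length; i++) {
        if (arguments[i] !== undefined) cps.push(arguments[i]);
    }
    return String.fromCodePoint.apply(null, cps);
}

console.log(
    fromCodePointSkippingUndefined(0x61, 0x1F600, undefined, 0x62) ===
    "a\uD83D\uDE00b"
);                                          // true
```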
> 
> 
> String.fromCharCode / String.fromCodeUnit
> -------------------
> 
> The existing fromCharCode converts undefined,null,NaN,Infinity values into
> NUL chars (U+0000), and maps other naughty values into valid chars.
> 
> For consistency, there could be a function 'fromCodeUnit' which behaves like
> (and is named like) fromCodePoint; i.e. throws RangeError for naughty values.
> This function should also have arity = 0 like fromCodePoint.
> 
> If fromCodePoint is changed to ignore Undefined arguments then so should
> fromCodeUnit.
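
The lenient behaviour of fromCharCode, and a stricter 'fromCodeUnit', can be sketched as below. The function fromCodeUnit here is hypothetical and standalone; it accepts only integers in 0..0xFFFF after Number conversion, which is one reading of "naughty values" and an assumption of this sketch.

```javascript
console.log(String.fromCharCode(undefined) === "\u0000");   // true
console.log(String.fromCharCode(NaN) === "\u0000");         // true
console.log(String.fromCharCode(0x10041) === "A");          // true (taken modulo 0x10000)

// Hypothetical strict counterpart. Declared without formal parameters,
// so fromCodeUnit.length is 0.
function fromCodeUnit() {
    var out = "";
    for (var i = 0; i < arguments.length; i++) {
        var cu = Number(arguments[i]);
        // The negated range test also rejects NaN.
        if (!(cu >= 0 && cu <= 0xFFFF) || Math.floor(cu) !== cu) {
            throw new RangeError("Invalid code unit: " + arguments[i]);
        }
        out += String.fromCharCode(cu);
    }
    return out;
}

console.log(fromCodeUnit(0x61, 0x62));   // "ab"
```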
> 
> 
> String.isWellFormed
> -------------------
> 
> To let a user easily detect a well-/ill-formed string, how about a friendly
> predicate:
>    String.isWellFormed( str )
> 
> Without this, the following regexp should test a string for well-formedness
> (no warranty implied):
>    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
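
Wrapped as a function, the regexp above gives a small shim for trying the idea out. The standalone name isWellFormed and the constant RE_WELL_FORMED are hypothetical; the pattern is the one quoted, used without the "u" flag so that it matches individual code units.

```javascript
// Either a lead surrogate followed by a trail surrogate, or any
// non-surrogate code unit, repeated to the end of the string.
var RE_WELL_FORMED =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

function isWellFormed(str) {
    return RE_WELL_FORMED.test(str);
}

console.log(isWellFormed("a\uD83D\uDE00b"));   // true
console.log(isWellFormed("a\uD83Db"));         // false (lead without trail)
console.log(isWellFormed("\uDE00\uD83D"));     // false (wrong order)
```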
> 
> 
> String.prototype.repair
> -----------------------
> 
> Following on from isWellFormed, what is the user to do with an ill-formed
> string?  Here is one suggestion: a 'repair' method which replaces improper
> surrogates with something (like the Unicode replacement character U+FFFD).
> (Alternatively, the user may want to give up and throw an Error, see next.)
> 
> [Here is a possible implementation which UTF16 experts could shim in....
> 
>     var re_badsurrogate =
>         /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;
> 
>     String.prototype.repair = function (replacer)
>     {
>         if (arguments.length == 0)  replacer = "\uFFFD";
> 
>         return this.replace( re_badsurrogate, "$1"+replacer );
>     };
> 
> ]
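
One case the quoted shim appears to miss is two unpaired trail surrogates in a row: the first match consumes the character before the first trail surrogate, so the second one is left in place. A different formulation, offered only as a sketch, matches good pairs first and passes them through, so that any surrogate matched alone must be unpaired. The names repair (as a standalone function) and RE_PAIR_OR_LONE are hypothetical.

```javascript
// A good pair, or else any single surrogate code unit.
var RE_PAIR_OR_LONE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function repair(str, replacement) {
    if (replacement === undefined) replacement = "\uFFFD";
    return str.replace(RE_PAIR_OR_LONE, function (match) {
        // Two code units means a well-formed pair: keep it.
        return match.length === 2 ? match : replacement;
    });
}

console.log(repair("a\uDC00\uDC00") === "a\uFFFD\uFFFD");   // true
console.log(repair("\uD83D\uDE00") === "\uD83D\uDE00");     // true
```

Using a callback rather than a replacement pattern also means a replacement string containing "$" is inserted literally.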
> 
> 
> StringError (& URI functions)
> -----------
> 
> The existing encodeURI & encodeURIComponent throw URIError if given an
> ill-formed string.  (The URI decode functions behave similarly, both for
> ill-formed strings and for improper use of percent-coding.)
> 
> A new Error, called StringError, could be thrown by URI functions and user
> functions which reject an ill-formed string *because* it is ill-formed
> (rather than trying to repair it).
> 
> To avoid changing the existing URI functions, versions using StringError could
> be moved from the global namespace to a "URI" namespace (a la "JSON"):
>   URI.encodeComponent, ...
> This seems quite neat, and declutters the global namespace too.
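
As an illustration only, the idea might be prototyped like this. StringError, URISketch and RE_WELL_FORMED are hypothetical names (URISketch is used rather than URI to make clear that no such built-in namespace exists); the well-formedness check reuses the regexp from the isWellFormed section.

```javascript
// Existing behaviour: an ill-formed string yields URIError.
try {
    encodeURIComponent("\uD800");
} catch (e) {
    console.log(e instanceof URIError);   // true
}

// Hypothetical error type for strings rejected because they are ill-formed.
function StringError(message) {
    this.name = "StringError";
    this.message = message || "";
}
StringError.prototype = Object.create(Error.prototype);
StringError.prototype.constructor = StringError;

var RE_WELL_FORMED =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical namespace object, in the style of JSON.
var URISketch = {
    encodeComponent: function (str) {
        str = String(str);
        if (!RE_WELL_FORMED.test(str)) {
            throw new StringError("ill-formed string");
        }
        return encodeURIComponent(str);
    }
};

console.log(URISketch.encodeComponent("a b"));   // "a%20b"
```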
> 
> 

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
