Thanks for the ref to Norbert's proposal.
(I have been interested in i18n since writing an international telephony switch control system in 1987.)

Norbert's proposal has much interesting info about formats, locales, case-mapping & much else, but says little about the String.* functions or how the user can handle an ill-formed string (thinking from the perspective of a lowly software engineer working to achieve some task, rather than a top-down architect).

Head:  4.3.20 Surrogate pair
The proposal does confirm that an unpaired surrogate makes a UTF16 sequence ill-formed.
Head:  5.3 Text Interpretation
The proposal confirms that a valid surrogate pair is interpreted as a single codepoint, not a codepoint followed by an unpaired surrogate (as String.prototype.codePointAt does).

Towards the end of the page, in section Code Point Based String Accessors,
the proposal defines String.fromCodePoint and String.prototype.codePointAt in effectively the same manner as ES6 (ed. 6_10-26-12) - although the length property (arity) of fromCodePoint differs from ES6's.

This definition of codePointAt has the same usability issues as ES6's (ed. 6_10-26-12); i.e. it returns a value in [0xDC00:0xDFFF] for both the 2nd member of a well-formed surrogate pair and an unpaired trail surrogate, so the caller cannot tell the two apart. More generally, it returns a value in [0xD800:0xDFFF] for any unpaired surrogate - maybe it would be friendlier to the casual user to return NaN (UTF16 experts can probe the location with charCodeAt / codeUnitAt if they care to).

My original post tried to point to anomalies in:
  String.prototype.codePointAt   (of ES6)
  String.prototype.charCodeAt   (suggest String.prototype.codeUnitAt instead)
  String.prototype.charAt   (suggest String.prototype.unicodeCharAt too)
  String.fromCodePoint   (of ES6)
  String.fromCharCode   (suggest String.fromCodeUnit instead)
and floated:
  String.isWellFormed
  String.prototype.repair
  StringError   (& suggest URI functions mods)

Thanks again for the ref.


--------------------------------------------------
From: "Phillips, Addison" <[email protected]>
Sent: Wednesday, November 14, 2012 5:05 PM
To: "Roger Andrews" <[email protected]>; <[email protected]>
Subject: RE: easy handling of UTF16 surrogates & well-formed strings

You might want to check out Norbert's proposal [1]

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


[1] http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

-----Original Message-----
From: Roger Andrews [mailto:[email protected]]
Sent: Wednesday, November 14, 2012 6:07 AM
To: [email protected]
Subject: easy handling of UTF16 surrogates & well-formed strings

This is rather long but the idea is to make handling UTF16 surrogates easier for the casual user without harming the ability of UTF16 experts to delve into details if surrogates are not well-paired (and hence the string is not well-formed).

Under the current definitions (ed. 6_10-26-12) surprising things happen.
E.g. a string converted to an array of codepoints with 'codePointAt' then back to a string with 'fromCodePoint' is not equal to the original string if it contains
well-formed surrogate pairs.
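
The round trip described above can be reproduced with a short sketch. It assumes an engine that implements codePointAt and fromCodePoint with the results listed in the draft (the variable names are only illustrative):

```javascript
// "a", then U+1F600 encoded as the surrogate pair D83D DE00, then "b".
var original = "a\uD83D\uDE00b";

// Index down the string one code unit at a time, as a casual user might.
var codePoints = [];
for (var i = 0; i < original.length; i++) {
    codePoints.push(original.codePointAt(i));
}
// codePoints is [0x61, 0x1F600, 0xDE00, 0x62]: the trail surrogate at
// index 2 is reported again even though index 1 already covered the pair.

var rebuilt = String.fromCodePoint.apply(null, codePoints);
// rebuilt is "a\uD83D\uDE00\uDE00b", one code unit longer than original.
```

The extra 0xDE00 is the trail surrogate that the previous position had already "subsumed".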

Here are some thoughts from a JavaScript enthusiast playing with Unicode
outside the BMP.


String.prototype.codePointAt
----------------------------

The current definition of codePointAt has results:
   out-of-bounds                  -> Undefined
   normal BMP char                -> the codepoint
   lead surrogate of a good pair  -> the codepoint
   trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF] !!ambiguous
   bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
   bad lead surrogate             -> codeunit in [0xD800:0xDBFF]

Note that a well-paired trail surrogate still results in a value even though the previous codeunit "subsumed" it. So, if a caller is indexing down the string then
it should take the well-paired trail surrogate value out of the sequence.

UTF16 experts can write code to check these possibilities; but for general
usability let's have:
   Undefined for the trail surrogate of a good pair, and
   NaN for bad surrogate.

Then codePointAt would do the work for the casual user and experts can probe the string with charCodeAt (or codeUnitAt if it exists) if they really want to
know the situation of bad surrogates.

[Unchanged, users are called upon to write code patterns like the messy....

    // if the indexed position is part of a well-formed surrogate pair
    // then result is either the entire code-point (for lead surrogates)
    //                or undefined (for trail surrogates)
    // result is NaN for bad surrogates
    // (result is always undefined for out-of-bounds position)

    cp = str.codePointAt( pos );
    if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
        cu = str.charCodeAt( pos-1 );
        if (0xD800 <= cu  &&  cu <= 0xDBFF) {
            cp =  undefined;      // trail surrogate of good pair
        }
    }
    if (0xD800 <= cp  &&  cp <= 0xDFFF) {
        cp = NaN;                 // bad surrogate
    }

]


String.prototype.charCodeAt / String.prototype.codeUnitAt
---------------------------

The existing charCodeAt returns NaN (not Undefined) if the indexed position is
out-of-bounds, unlike codePointAt.

For consistency, there could be a method 'codeUnitAt' which behaves like (and
is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
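
A minimal sketch of the suggested behaviour, written as a free function rather than a prototype method. The name codeUnitAt and its exact semantics are assumptions taken from the suggestion above, not an existing API:

```javascript
// Hypothetical codeUnitAt: same as charCodeAt, except that an
// out-of-bounds position yields undefined instead of NaN.
function codeUnitAt(str, pos) {
    var cu = String(str).charCodeAt(pos);
    // charCodeAt only returns NaN when pos is out of bounds;
    // NaN is the only value that is not equal to itself.
    return cu !== cu ? undefined : cu;
}
```

For example, codeUnitAt("abc", 1) gives 98 and codeUnitAt("abc", 3) gives undefined, whereas "abc".charCodeAt(3) gives NaN.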


String.prototype.charAt / String.prototype.unicodeCharAt
-----------------------

The existing charAt does not handle UTF16 surrogate pairs.

For consistency with the above, there could be a method 'unicodeCharAt'
which returns the 1- or 2-char string corresponding to the 'codePointAt'
value and empty-string for out-of-bounds or a well-paired trail surrogate. Note that an array of such strings could be joined to form the original string.

What to return for a bad surrogate?  Null?  Undefined?
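
A rough sketch of what such a method might look like, as a free function. The name unicodeCharAt comes from the suggestion above; returning null for a bad surrogate is only a placeholder for the open question, not a recommendation:

```javascript
// Hypothetical unicodeCharAt: returns the 1- or 2-char string at pos,
// "" for out-of-bounds or the trail half of a good pair,
// and (as a placeholder assumption) null for an unpaired surrogate.
function unicodeCharAt(str, pos) {
    var cu = str.charCodeAt(pos);
    if (cu !== cu) return "";                         // NaN: out of bounds

    if (0xD800 <= cu && cu <= 0xDBFF) {               // lead surrogate
        var next = str.charCodeAt(pos + 1);
        if (0xDC00 <= next && next <= 0xDFFF) {
            return str.charAt(pos) + str.charAt(pos + 1);
        }
        return null;                                  // unpaired lead
    }
    if (0xDC00 <= cu && cu <= 0xDFFF) {               // trail surrogate
        var prev = str.charCodeAt(pos - 1);
        if (0xD800 <= prev && prev <= 0xDBFF) return "";
        return null;                                  // unpaired trail
    }
    return str.charAt(pos);                           // ordinary BMP char
}
```

With these results, collecting unicodeCharAt for every index of a well-formed string and joining the pieces with "" gives back the original string.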


String.fromCodePoint
--------------------

The current definition of fromCodePoint does not convert a sequence produced
by codePointAt back to the original string.

This is really due to codePointAt returning a trail surrogate value after a
well-formed pair (which was just converted to a single codepoint).

If codePointAt is changed to return Undefined for a good trail surrogate then fromCodePoint should simply ignore Undefined arguments. Currently I think it
throws RangeError (or maybe converts Undefined values to NUL chars?).
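
To illustrate how the two changes would fit together, here is a sketch with two hypothetical helpers: friendlyCodePointAt, which gives the suggested results, and fromCodePointsIgnoringUndefined, which skips Undefined entries. Both names are made up for this example, and the sketch assumes an engine that already provides codePointAt and fromCodePoint as drafted:

```javascript
// Hypothetical accessor: undefined for the trail half of a good pair
// (and for out-of-bounds), NaN for an unpaired surrogate.
function friendlyCodePointAt(str, pos) {
    var cp = str.codePointAt(pos);
    if (0xDC00 <= cp && cp <= 0xDFFF) {
        var cu = str.charCodeAt(pos - 1);
        if (0xD800 <= cu && cu <= 0xDBFF) return undefined;
    }
    if (0xD800 <= cp && cp <= 0xDFFF) return NaN;
    return cp;
}

// Hypothetical variant of String.fromCodePoint that drops undefined entries.
function fromCodePointsIgnoringUndefined(codePoints) {
    var defined = codePoints.filter(function (cp) {
        return cp !== undefined;
    });
    return String.fromCodePoint.apply(null, defined);
}

var original = "a\uD83D\uDE00b";
var codePoints = [];
for (var i = 0; i < original.length; i++) {
    codePoints.push(friendlyCodePointAt(original, i));
}
// codePoints is [0x61, 0x1F600, undefined, 0x62]
var roundTripped = fromCodePointsIgnoringUndefined(codePoints);
```

Here roundTripped is equal to original, which is the property the current definitions lack for strings containing surrogate pairs.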


String.fromCharCode / String.fromCodeUnit
-------------------

The existing fromCharCode converts undefined, null, NaN, Infinity values into
NUL chars (U+0000), and maps other naughty values into valid chars (by
truncating them and wrapping modulo 0x10000).

For consistency, there could be a function 'fromCodeUnit' which behaves like (and is named like) fromCodePoint; i.e. throws RangeError for naughty values.
This function should also have arity = 0 like fromCodePoint.

If fromCodePoint is changed to ignore Undefined arguments then so should
fromCodeUnit.
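
A minimal sketch of the stricter function, assuming "naughty" means anything that is not an integer in [0x0000:0xFFFF]. The name fromCodeUnit is the suggestion above, not an existing API, and this version does not implement the optional ignoring of Undefined arguments:

```javascript
// Hypothetical fromCodeUnit: like fromCharCode, but throws RangeError
// instead of silently mapping bad values to some other character.
// Declared with no formal parameters, so its length (arity) is 0.
function fromCodeUnit() {
    var out = "";
    for (var i = 0; i < arguments.length; i++) {
        var n = Number(arguments[i]);
        // Rejects NaN, +/-Infinity, negatives, values above 0xFFFF
        // and non-integers.
        if (!(n >= 0 && n <= 0xFFFF) || Math.floor(n) !== n) {
            throw new RangeError("Invalid code unit: " + arguments[i]);
        }
        out += String.fromCharCode(n);
    }
    return out;
}
```

So fromCodeUnit(0x61, 0x62) gives "ab", while fromCodeUnit(0x10000) throws, whereas String.fromCharCode(0x10000) quietly gives a NUL char.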


String.isWellFormed
-------------------

To enable a user easily to detect a well-/ill-formed string how about a friendly
predicate:
   String.isWellFormed( str )

Without this, the following regexp should test a string for well-formedness (no
warranty implied):
   /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
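
Wrapped as a function, the regexp test above might be used like this. This is only a sketch; isWellFormed is written here as a plain function rather than as a property of String, and very long strings may stress some regexp engines:

```javascript
// Either a lead surrogate immediately followed by a trail surrogate,
// or any single non-surrogate code unit, repeated to the end of the string.
var re_wellformed =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical predicate: true if every surrogate in str is well-paired.
function isWellFormed(str) {
    return re_wellformed.test(String(str));
}
```

For example, isWellFormed("a\uD83D\uDE00b") is true, while isWellFormed("a\uD83Db") and isWellFormed("\uDE00\uD83D") are both false.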


String.prototype.repair
-----------------------

Following on from isWellFormed, what is the user to do with an ill-formed
string? Here is one suggestion: a 'repair' method which replaces improper surrogates with something (like the Unicode replacement character U+FFFD). (Alternatively, the user may want to give up and throw an Error, see next.)

[Here is a possible implementation which UTF16 experts could shim in....

    var re_badsurrogate =
        /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;

    String.prototype.repair = function (replacer)
    {
        if (arguments.length == 0)  replacer = "\uFFFD";

        return this.replace( re_badsurrogate, "$1"+replacer );
    };

]
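
One case worth checking with the shim above is a run of adjacent unpaired trail surrogates: because the pattern consumes the character before a bad trail surrogate, it appears to leave the second one of "a\uDC00\uDC00" unrepaired. The following alternative sketch (an assumption of this note, not part of the proposal) matches good pairs first and keeps them, then replaces any surrogate left over. It is written as a free function, and the replacer is inserted literally rather than being interpreted as a replacement pattern:

```javascript
// A good pair is tried first at each position; anything else in the
// surrogate range that matches on its own must therefore be unpaired.
var re_surrogates = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

// Hypothetical repair function: replaces each unpaired surrogate with
// replacer (default U+FFFD) and leaves well-formed pairs untouched.
function repair(str, replacer) {
    if (replacer === undefined) replacer = "\uFFFD";
    return String(str).replace(re_surrogates, function (match) {
        return match.length === 2 ? match : replacer;
    });
}
```

With this version, repair("a\uDC00\uDC00b") gives "a\uFFFD\uFFFDb" and a well-formed string is returned unchanged.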


StringError (& URI functions)
-----------

The existing encodeURI & encodeURIComponent throw URIError if given an ill-formed string. (The URI decode functions behave similarly, both for ill-formed strings and
improper use of percent-coding.)

A new Error, called StringError, could be thrown by URI functions and user functions which reject an ill-formed string *because* it is ill-formed (rather
than trying to repair it).

To avoid changing the existing URI functions, versions using StringError could
be moved from the global namespace to a "URI" namespace (a la "JSON"):
  URI.encodeComponent, ...
This seems quite neat, and declutters the global namespace too.
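
A rough sketch of how this might look in user code. StringError and the URI object are hypothetical, following the names floated above, and the well-formedness test reuses the regexp from the isWellFormed section:

```javascript
var re_wellformed =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical error type for strings rejected because they are ill-formed.
function StringError(message) {
    this.name = "StringError";
    this.message = message || "ill-formed string";
}
StringError.prototype = Object.create(Error.prototype);
StringError.prototype.constructor = StringError;

// Hypothetical "URI" namespace; the global functions stay as they are.
var URI = {
    encodeComponent: function (str) {
        str = String(str);
        if (!re_wellformed.test(str)) {
            throw new StringError("ill-formed string passed to URI.encodeComponent");
        }
        return encodeURIComponent(str);   // cannot throw URIError from here on
    }
};
```

A caller can then distinguish the cases: URI.encodeComponent("a b") gives "a%20b", while URI.encodeComponent("\uD83D") throws a StringError instead of the URIError thrown by the global encodeURIComponent.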




_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
