Re: easy handling of UTF16 surrogates & well-formed strings

Roger Andrews Wed, 14 Nov 2012 13:49:38 -0800

Thanks for the ref to Norbert's proposal.

(I have been interested in i18n since writing an international telephony switch control system in 1987.)

Norbert's proposal has much interesting info about formats, locales, case-mapping & much else, but says little about the String.* functions or how the user can handle an ill-formed string (thinking from the perspective of a lowly software engineer working to achieve some task, rather than a top-down architect).


Head:  4.3.20 Surrogate pair

The proposal does confirm that an unpaired surrogate makes a UTF16 sequence ill-formed.

Head:  5.3 Text Interpretation

The proposal confirms that a valid surrogate pair is interpreted as a single codepoint, not a codepoint followed by an unpaired surrogate (as String.prototype.codePointAt does).


Towards the end of the page, in section Code Point Based String Accessors, the proposal defines String.fromCodePoint and String.prototype.codePointAt in effectively the same manner as ES6 (ed. 6_10-26-12) - although the length property (arity) of fromCodePoint differs from ES6's.

This definition of codePointAt has the same usability issues as ES6's (ed. 6_10-26-12); i.e. it returns a value in [0xDC00:0xDFFF] for both the 2nd member of a surrogate pair and an unpaired surrogate. It returns a value in [0xD800:0xDFFF] for an unpaired surrogate - maybe it would be friendlier to the casual user to return NaN (UTF16 experts can probe the location with charCodeAt / codeUnitAt if they care to).


My original post tried to point to anomalies in:
  String.prototype.codePointAt   (of ES6)
  String.prototype.charCodeAt   (suggest String.prototype.codeUnitAt instead)
  String.prototype.charAt   (suggest String.prototype.unicodeCharAt too)
  String.fromCodePoint   (of ES6)
  String.fromCharCode   (suggest String.fromCodeUnit instead)
and floated:
  String.isWellFormed
  String.prototype.repair
  StringError   (& suggest URI functions mods)

Thanks again for the ref.


--------------------------------------------------
From: "Phillips, Addison" <[email protected]>
Sent: Wednesday, November 14, 2012 5:05 PM
To: "Roger Andrews" <[email protected]>; <[email protected]>
Subject: RE: easy handling of UTF16 surrogates & well-formed strings

You might want to check out Norbert's proposal [1]

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.
[1]http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html
-----Original Message-----
From: Roger Andrews [mailto:[email protected]]
Sent: Wednesday, November 14, 2012 6:07 AM
To: [email protected]
Subject: easy handling of UTF16 surrogates & well-formed strings
This is rather long but the idea is to make handling UTF16 surrogates easier
for the casual user without harming the ability of UTF16 experts to delve into
details if surrogates are not well-paired (and hence the string is not
well-formed).

Under the current definitions (ed. 6_10-26-12) surprising things happen.
E.g. a string converted to an array of codepoints with 'codePointAt' then back
to a string with 'fromCodePoint' is not equal to the original string if it
contains well-formed surrogate pairs.
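
A small sketch of that round trip, assuming an engine where codePointAt and fromCodePoint behave as described in this post (which is also how they ended up in ES2015). The helper name toCodePointsNaive and the sample string are made up for illustration.

```javascript
// Collect codePointAt() for every index, as a caller indexing down the string would.
function toCodePointsNaive(str) {
    const cps = [];
    for (let i = 0; i < str.length; i++) {
        cps.push(str.codePointAt(i));
    }
    return cps;
}

const original = "a\uD83D\uDE00b";   // 'a', U+1F600 as a surrogate pair, 'b'
const codePoints = toCodePointsNaive(original);
const roundTripped = String.fromCodePoint(...codePoints);

console.log(codePoints.map(cp => cp.toString(16)));  // [ '61', '1f600', 'de00', '62' ]
console.log(roundTripped === original);               // false
console.log(roundTripped.length, original.length);    // 5 4
```

The extra code unit in the result is the trail surrogate 0xDE00, reported a second time at index 2.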

Here are some thoughts from a JavaScript enthusiast playing with Unicode
outside the BMP.


String.prototype.codePointAt
----------------------------

The current definition of codePointAt has results:
   out-of-bounds                  -> Undefined
   normal BMP char                -> the codepoint
   lead surrogate of a good pair  -> the codepoint
   trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF]  !!ambiguous
   bad trail surrogate            -> codeunit in [0xDC00:0xDFFF]
   bad lead surrogate             -> codeunit in [0xD800:0xDBFF]
Note that a well-paired trail surrogate still results in a value even though
the previous codeunit "subsumed" it. So, if a caller is indexing down the
string then it should take the well-paired trail surrogate value out of the
sequence.
UTF16 experts can write code to check these possibilities; but for general
usability let's have:
   Undefined for the trail surrogate of a good pair, and
   NaN for a bad surrogate.
Then codePointAt would do the work for the casual user and experts can probe
the string with charCodeAt (or codeUnitAt if it exists) if they really want to
know the situation of bad surrogates.
[Unchanged, users are called upon to write code patterns like the messy....
    // if the indexed position is part of a well-formed surrogate pair
    // then result is either the entire code-point (for lead surrogates)
    //                or undefined (for trail surrogates)
    // result is NaN for bad surrogates
    // (result is always undefined for out-of-bounds position)

    cp = str.codePointAt( pos );
    if (0xDC00 <= cp  &&  cp <= 0xDFFF) {
        cu = str.charCodeAt( pos-1 );
        if (0xD800 <= cu  &&  cu <= 0xDBFF) {
            cp =  undefined;      // trail surrogate of good pair
        }
    }
    if (0xD800 <= cp  &&  cp <= 0xDFFF) {
        cp = NaN;                 // bad surrogate
    }

]
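
For anyone wanting to try the pattern, here it is wrapped as a standalone function and run over every index of a test string. This is only a sketch: friendlyCodePointAt is a hypothetical name, and the sample string is invented to hit each row of the table above.

```javascript
// Hypothetical helper: codePointAt with the results proposed above.
function friendlyCodePointAt(str, pos) {
    let cp = str.codePointAt(pos);           // undefined when pos is out of bounds
    if (0xDC00 <= cp && cp <= 0xDFFF) {
        const cu = str.charCodeAt(pos - 1);  // NaN when pos is 0, so the next test fails
        if (0xD800 <= cu && cu <= 0xDBFF) {
            cp = undefined;                  // trail surrogate of a good pair
        }
    }
    if (0xD800 <= cp && cp <= 0xDFFF) {
        cp = NaN;                            // bad (unpaired) surrogate
    }
    return cp;
}

// 'a', good pair (U+1F600), unpaired trail, unpaired lead
const sample = "a\uD83D\uDE00\uDC00\uD800";
const results = [];
for (let i = 0; i <= sample.length; i++) {
    results.push(friendlyCodePointAt(sample, i));
}
console.log(results);  // [ 97, 128512, undefined, NaN, NaN, undefined ]
```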


String.prototype.charCodeAt / String.prototype.codeUnitAt
---------------------------
The existing charCodeAt returns NaN (not Undefined) if the indexed position is
out-of-bounds, unlike codePointAt.
For consistency, there could be a method 'codeUnitAt' which behaves like (and
is named like) codePointAt; i.e. returns Undefined for out-of-bounds.
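
To make the difference concrete, a minimal sketch follows. codeUnitAt does not exist; it is written here as a free function rather than a String.prototype method purely to keep the example self-contained.

```javascript
// Hypothetical 'codeUnitAt': like charCodeAt, but undefined (not NaN) out of bounds.
function codeUnitAt(str, pos) {
    const cu = str.charCodeAt(pos);
    return Number.isNaN(cu) ? undefined : cu;
}

console.log("A".charCodeAt(5));   // NaN
console.log("A".codePointAt(5));  // undefined
console.log(codeUnitAt("A", 5));  // undefined
console.log(codeUnitAt("A", 0));  // 65
```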


String.prototype.charAt / String.prototype.unicodeCharAt
-----------------------

The existing charAt does not handle UTF16 surrogate pairs.

For consistency with the above, there could be a method 'unicodeCharAt'
which returns the 1- or 2-char string corresponding to the 'codePointAt'
value and empty-string for out-of-bounds or a well-paired trail surrogate.
Note that an array of such strings could be joined to form the original
string.
What to return for a bad surrogate?  Null?  Undefined?
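
A rough sketch of how such a method might behave, again as a free function. unicodeCharAt is hypothetical, and returning undefined for a bad surrogate is just one arbitrary answer to the open question above, chosen so the example can run.

```javascript
// Hypothetical 'unicodeCharAt' following the description above.
function unicodeCharAt(str, pos) {
    const cu = str.charCodeAt(pos);
    if (Number.isNaN(cu)) return "";                    // out of bounds
    if (cu >= 0xD800 && cu <= 0xDBFF) {                 // lead surrogate
        const next = str.charCodeAt(pos + 1);
        if (next >= 0xDC00 && next <= 0xDFFF) {
            return str.slice(pos, pos + 2);             // good pair: 2-char string
        }
        return undefined;                               // bad lead (assumed result)
    }
    if (cu >= 0xDC00 && cu <= 0xDFFF) {                 // trail surrogate
        const prev = str.charCodeAt(pos - 1);
        if (prev >= 0xD800 && prev <= 0xDBFF) return "";  // trail of a good pair
        return undefined;                               // bad trail (assumed result)
    }
    return str.charAt(pos);                             // ordinary BMP char
}

const text = "x\uD83D\uDE00y";
const pieces = [];
for (let i = 0; i < text.length; i++) {
    pieces.push(unicodeCharAt(text, i));
}
console.log(pieces.length);             // 4
console.log(pieces.join("") === text);  // true
```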


String.fromCodePoint
--------------------
The current definition of fromCodePoint does not convert a sequence produced
by codePointAt back to the original string.
This is really due to codePointAt returning a trail surrogate value after a
well-formed pair (which were just converted to a single codepoint).
If codePointAt is changed to return Undefined for a good trail surrogate then
fromCodePoint should simply ignore Undefined arguments. Currently I think it
throws RangeError (or maybe converts Undefined values to NUL chars?).
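
A sketch of the "ignore Undefined" idea as a wrapper around the built-in. The name fromCodePointsSkippingUndefined is invented. The try/catch shows what String.fromCodePoint does with an undefined argument as eventually standardised in ES2015, which may not match the draft discussed here.

```javascript
// Hypothetical wrapper: String.fromCodePoint that ignores undefined arguments.
function fromCodePointsSkippingUndefined(...codePoints) {
    return String.fromCodePoint(...codePoints.filter(cp => cp !== undefined));
}

// In ES2015 engines an undefined argument is rejected, not mapped to NUL.
let threw = false;
try {
    String.fromCodePoint(undefined);
} catch (e) {
    threw = e instanceof RangeError;
}
console.log(threw);  // true

// Sequence a changed codePointAt would yield for 'a', U+1F600, 'b'.
const rebuilt = fromCodePointsSkippingUndefined(0x61, 0x1F600, undefined, 0x62);
console.log(rebuilt === "a\uD83D\uDE00b");  // true
```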


String.fromCharCode / String.fromCodeUnit
-------------------
The existing fromCharCode converts undefined, null, NaN, Infinity values into
NUL chars (U+0000), and maps other naughty values into valid chars.
For consistency, there could be a function 'fromCodeUnit' which behaves like
(and is named like) fromCodePoint; i.e. throws RangeError for naughty values.
This function should also have arity = 0 like fromCodePoint.

If fromCodePoint is changed to ignore Undefined arguments then so should
fromCodeUnit.
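
The existing behaviour and one possible shape for the stricter function, as a sketch only. fromCodeUnit is hypothetical; this version does no ToNumber coercion (a simplification), treats anything outside the integers 0..0xFFFF as "naughty", and skips undefined as suggested.

```javascript
// Existing behaviour: bad values are silently mapped to valid code units.
console.log(String.fromCharCode(undefined) === "\u0000");  // true
console.log(String.fromCharCode(0x10041) === "A");         // true (value taken modulo 0x10000)

// Hypothetical strict counterpart.
function fromCodeUnit(...units) {
    let out = "";
    for (const u of units) {
        if (u === undefined) continue;   // ignored, per the suggestion above
        if (!Number.isInteger(u) || u < 0 || u > 0xFFFF) {
            throw new RangeError("Invalid code unit: " + String(u));
        }
        out += String.fromCharCode(u);
    }
    return out;
}

console.log(fromCodeUnit(0x48, undefined, 0x69));  // Hi
console.log(fromCodeUnit.length);                  // 0
```

Using a rest parameter is what gives the function a length of 0.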


String.isWellFormed
-------------------
To enable a user easily to detect a well-/ill-formed string how about a
friendly predicate:
   String.isWellFormed( str )
Without this, the following regexp should test a string for well-formedness
(no warranty implied):
   /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/
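
The regexp above wrapped as a predicate, with a few probes. isWellFormedString is a made-up name standing in for the proposed String.isWellFormed; note the regexp has no 'u' flag, so it matches code unit by code unit.

```javascript
// The well-formedness regexp from above, unchanged.
const RE_WELL_FORMED =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical stand-in for the proposed String.isWellFormed( str ).
function isWellFormedString(str) {
    return RE_WELL_FORMED.test(str);
}

console.log(isWellFormedString(""));               // true
console.log(isWellFormedString("a\uD83D\uDE00"));  // true  (good pair)
console.log(isWellFormedString("\uD83D"));         // false (unpaired lead)
console.log(isWellFormedString("\uDE00\uD83D"));   // false (pair in wrong order)
```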


String.prototype.repair
-----------------------

Following on from isWellFormed, what is the user to do with an ill-formed
string? Here is one suggestion: a 'repair' method which replaces improper
surrogates with something (like the Unicode replacement character U+FFFD).
(Alternatively, the user may want to give up and throw an Error, see next.)
[Here is a possible implementation which UTF16 experts could shim in....

    var re_badsurrogate =
        /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|([^\uD800-\uDBFF])[\uDC00-\uDFFF]|^[\uDC00-\uDFFF]/g;

    String.prototype.repair = function (replacer)
    {
        if (arguments.length == 0)  replacer = "\uFFFD";

        return this.replace( re_badsurrogate, "$1"+replacer );
    };

]
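
Without lookbehind, a regexp appears to have trouble with runs of unpaired trail surrogates (e.g. "\uDC00\uDC00"), because the character that proves a trail is unpaired gets consumed by the match. A plain loop over code units sidesteps that. The following is an alternative sketch under that assumption, not the implementation above; repairString is a hypothetical name.

```javascript
// Hypothetical loop-based variant of 'repair': replaces every unpaired
// surrogate with 'replacer' (U+FFFD by default).
function repairString(str, replacer = "\uFFFD") {
    let out = "";
    for (let i = 0; i < str.length; i++) {
        const cu = str.charCodeAt(i);
        if (cu >= 0xD800 && cu <= 0xDBFF) {
            const next = str.charCodeAt(i + 1);   // NaN at end of string
            if (next >= 0xDC00 && next <= 0xDFFF) {
                out += str[i] + str[i + 1];       // good pair, keep both units
                i++;
            } else {
                out += replacer;                  // unpaired lead
            }
        } else if (cu >= 0xDC00 && cu <= 0xDFFF) {
            out += replacer;                      // trail not consumed by a pair
        } else {
            out += str[i];                        // ordinary BMP char
        }
    }
    return out;
}

console.log(repairString("a\uDC00\uDC00") === "a\uFFFD\uFFFD");      // true
console.log(repairString("\uD83D\uDE00") === "\uD83D\uDE00");        // true
console.log(repairString("\uD800x", "?") === "?x");                  // true
```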


StringError (& URI functions)
-----------
The existing encodeURI & encodeURIComponent throw URIError if given an
ill-formed string. (The URI decode functions behave similarly, both for
ill-formed strings and for improper use of percent-coding.)
A new Error, called StringError, could be thrown by URI functions and user
functions which reject an ill-formed string *because* it is ill-formed (rather
than trying to repair it).
To avoid changing the existing URI functions, versions using StringError could
be moved from global namespace to a "URI" namespace (a la "JSON"):
  URI.encodeComponent, ...
This seems quite neat, and declutters the global namespace too.
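
A rough idea of what that could look like, built on top of the existing global function. Everything named here (StringError, the URI object, URI.encodeComponent) is hypothetical and exists only in this sketch.

```javascript
// Hypothetical error type for strings rejected because they are ill-formed.
class StringError extends Error {
    constructor(message) {
        super(message);
        this.name = "StringError";
    }
}

// Same well-formedness regexp as suggested under String.isWellFormed.
const RE_WELL_FORMED_UTF16 =
    /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\uD7FF\uE000-\uFFFF])*$/;

// Hypothetical "URI" namespace object, in the manner of "JSON".
const URI = {
    encodeComponent(str) {
        const s = String(str);
        if (!RE_WELL_FORMED_UTF16.test(s)) {
            throw new StringError("ill-formed UTF-16 string");
        }
        return encodeURIComponent(s);
    }
};

console.log(URI.encodeComponent("a b"));  // a%20b

// Which error each version raises for an unpaired lead surrogate.
function errorNameFor(fn, arg) {
    try { fn(arg); return "none"; } catch (e) { return e.name; }
}
console.log(errorNameFor(encodeURIComponent, "\uD800"));   // URIError
console.log(errorNameFor(URI.encodeComponent, "\uD800"));  // StringError
```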

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
