Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Asmus Freytag Thu, 04 Nov 2010 23:06:17 -0700

On 11/4/2010 5:46 PM, Doug Ewell wrote:

Markus Scherer wrote:
While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.]
I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with.
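
The three behaviours mentioned so far (pass the lone surrogate through, convert it to U+FFFD, or throw an error) can be put side by side in a minimal sketch. It assumes the text arrives as a Python sequence of 16-bit code unit integers; the function name `decode_units` and the `policy` values are hypothetical, chosen only for illustration.

```python
def decode_units(units, policy="lenient"):
    """Decode a sequence of 16-bit code units into a list of code points.

    policy (hypothetical names, for illustration only):
      "lenient" - pass an unpaired surrogate through as a surrogate code point
      "replace" - substitute U+FFFD for an unpaired surrogate
      "strict"  - raise ValueError on an unpaired surrogate
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # A lead surrogate immediately followed by a trail surrogate
        # forms one supplementary code point.
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        # Any other surrogate code unit is unpaired.
        if 0xD800 <= u <= 0xDFFF:
            if policy == "strict":
                raise ValueError(f"unpaired surrogate 0x{u:04X} at index {i}")
            if policy == "replace":
                u = 0xFFFD
        out.append(u)
        i += 1
    return out
```

With this sketch, `decode_units([0xD83D, 0xDE00])` yields the single code point U+1F600, never the two surrogate code points U+D83D and U+DE00. That is the ambiguity described above: a lead+trail sequence of surrogate code points has no 16-bit encoding distinct from the supplementary code point.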

The question is whether you want every library that handles strings to perform the equivalent of a citizen's arrest, or whether you architect things so that the gatekeepers (border control) police the data stream.

During development, early and widespread error detection is helpful in debugging. After that, it's probably better to concentrate the handling of these errors, because that would tend to improve your options for implementing successful error recovery.

Malformed data shouldn't get in and shouldn't get perpetuated, but in the general case, there should be a facility for "repairing" faulty data, wherever that is reasonably possible.
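
One possible shape for such a gatekeeper with a repair facility is sketched below. It assumes the data stream is a Python sequence of 16-bit code unit integers and that replacing each unpaired surrogate with U+FFFD counts as an acceptable repair; the names `find_unpaired`, `repair`, and `admit` are hypothetical and not taken from any existing utility.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def find_unpaired(units):
    """Report (index, unit) for every surrogate code unit that is not part of a pair."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: fine only if a trail surrogate follows directly.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append((i, u))
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            bad.append((i, u))
        i += 1
    return bad


def repair(units):
    """Return a copy in which each unpaired surrogate is replaced by U+FFFD."""
    fixed = list(units)
    for index, _unit in find_unpaired(units):
        fixed[index] = REPLACEMENT
    return fixed


def admit(units, allow_repair=True):
    """Border control: return well-formed units, repairing or rejecting bad input."""
    problems = find_unpaired(units)
    if not problems:
        return list(units)
    if allow_repair:
        return repair(units)
    raise ValueError(f"{len(problems)} unpaired surrogate(s), first at index {problems[0][0]}")
```

In this arrangement only `admit` is called where data enters the system; code further inside, such as a case-conversion routine, can then assume well-formed input and needs no checks of its own.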

In the context of uppercasing a string, for example, repair is not a reasonable option, and neither is rejecting the string at that point - it should have been rejected / repaired much earlier.

A./
