On 11/4/2010 5:46 PM, Doug Ewell wrote:
Markus Scherer wrote:

While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.]
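
To make the decode/encode asymmetry concrete, here is a minimal sketch in Python. It works on plain lists of 16-bit integers; the function names decode_units_lenient and encode_naive are made up for this illustration and are not taken from any library or from the Standard.

```python
def decode_units_lenient(units):
    """Decode 16-bit code units into code points.

    A lead surrogate followed by a trail surrogate becomes one
    supplementary code point; any other surrogate is passed through
    unchanged as a (mostly inert) surrogate code point.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def encode_naive(code_points):
    """Encode code points as 16-bit units WITHOUT rejecting surrogates.

    This is deliberately not a conformant encoder; it only shows why
    surrogate code points cannot be encoded unambiguously.
    """
    out = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            out.append(0xD800 + (cp >> 10))
            out.append(0xDC00 + (cp & 0x3FF))
        else:
            out.append(cp)
    return out


# Decoding tolerates an unpaired surrogate between 'A' and 'B'.
lenient = decode_units_lenient([0x41, 0xD800, 0x42])

# Two surrogate code points and one supplementary code point
# produce the same 16-bit sequence, so the round trip is ambiguous.
same_units = encode_naive([0xD83D, 0xDE00]) == encode_naive([0x1F600])
```

Decoding is well defined in both cases, but once the two surrogate code points have been written out, a reader can no longer tell them apart from U+1F600.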

I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with.


The question is whether you want every library that handles strings to perform the equivalent of a citizen's arrest, or whether you architect things so that the gatekeepers (border control) police the data stream.

During development, early and widespread error detection is helpful in debugging. After that, it's probably better to concentrate the handling of these errors in a few places, because that tends to improve your options for implementing successful error recovery.

Malformed data shouldn't get in and shouldn't get perpetuated, but in the general case, there should be a facility for "repairing" faulty data, wherever that is reasonably possible.
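
As a rough sketch of what such a repair facility at the border might look like, the following Python function replaces every unpaired surrogate in a list of 16-bit code units with U+FFFD and reports how many repairs it made. The name repair_utf16_units and the choice of U+FFFD as the repair are assumptions made for this example; a real gatekeeper might equally well reject the input or log the offending offsets.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def repair_utf16_units(units):
    """Return (repaired_units, number_of_repairs).

    Well-formed lead+trail pairs are copied through unchanged; every
    unpaired lead or trail surrogate is replaced by U+FFFD.
    """
    repaired = []
    fixes = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # Proper surrogate pair: keep both units.
                repaired.extend((u, units[i + 1]))
                i += 2
                continue
            # Lead surrogate with no trail surrogate after it.
            repaired.append(REPLACEMENT)
            fixes += 1
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no lead surrogate before it.
            repaired.append(REPLACEMENT)
            fixes += 1
        else:
            repaired.append(u)
        i += 1
    return repaired, fixes


clean, fixes = repair_utf16_units([0x41, 0xD800, 0x42])
```

Returning the repair count lets the caller decide whether to accept the repaired data silently, warn, or refuse it, which is the kind of choice that is easier to make once at the border than in every string routine.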

In the context of uppercasing a string, for example, repair is not a reasonable option, and neither is rejecting the string at that point: it should have been rejected or repaired much earlier.
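
Python happens to behave this way, so it can serve as a small illustration (this is just one implementation's behavior, not something the Standard prescribes). A Python str may contain a lone surrogate code point; upper() leaves it alone, and the error only surfaces when the string is encoded on its way out.

```python
# A string with an unpaired lead surrogate between two letters.
s = "a\ud800b"

# Uppercasing neither repairs nor rejects: the surrogate passes through.
upper = s.upper()

# The gatekeeper here is the encoder at the output boundary.
try:
    upper.encode("utf-8")
    rejected_at_border = False
except UnicodeEncodeError:
    rejected_at_border = True
```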

A./
