On 11/4/2010 5:46 PM, Doug Ewell wrote:
> Markus Scherer wrote:
>> While processing 16-bit Unicode text which is not assumed to be
>> well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
>> mostly-inert surrogate code point. However, you cannot unambiguously
>> encode a surrogate code point in 16-bit text (because you could not
>> distinguish a sequence of lead+trail surrogate code points from one
>> supplementary code point), and therefore it is not allowed to encode
>> surrogate code points in any well-formed UTF-8/16/32. [All of this is
>> discussed in The Unicode Standard, Chapter 3.]
>
> I'm probably missing something here, but I don't agree that it's OK
> for a consumer of UTF-16 to accept an unpaired surrogate without
> throwing an error, or converting it to U+FFFD, or otherwise raising a
> fuss. Unpaired surrogates are ill-formed, and have to be caught and
> dealt with.

The question is whether you want every library that handles strings to
perform the equivalent of a citizen's arrest, or whether you architect
things so that the gatekeepers (border control) police the data stream.
During development, early and widespread error detection is helpful in
debugging. After that, it's probably better to concentrate the handling
of these errors at the gatekeepers, because that tends to improve your
options for implementing successful error recovery.
Malformed data shouldn't get in and shouldn't get perpetuated, but in
the general case, there should be a facility for "repairing" faulty
data, wherever that is reasonably possible.
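
To make the "gatekeeper" idea concrete, here is a minimal sketch in
Python of one possible repair policy applied at the border: every
unpaired surrogate in a sequence of 16-bit code units is replaced by
U+FFFD, and well-formed pairs are left alone. The function name
repair_utf16 and the list-of-integers representation are illustrative
assumptions only; a real gatekeeper might equally reject the data or
log the problem.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(u):
    """True if u is a lead (high) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_trail(u):
    """True if u is a trail (low) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def repair_utf16(units):
    """Hypothetical border-control repair: return a new list in which
    every unpaired surrogate is replaced by U+FFFD."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            # Well-formed pair: keep both code units.
            out.extend((u, units[i + 1]))
            i += 2
        elif is_lead(u) or is_trail(u):
            # Lone lead or lone trail: ill-formed, so repair it.
            out.append(REPLACEMENT)
            i += 1
        else:
            # Ordinary BMP code unit.
            out.append(u)
            i += 1
    return out
```

Code behind the border can then assume well-formed UTF-16 and does
not need to repeat the check.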
In the context of uppercasing a string, for example, repair is not a
reasonable option, and neither is rejecting the string at that point;
it should have been rejected or repaired much earlier.
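
As a rough illustration of the other side of the border, the sketch
below (Python again) uppercases a sequence of 16-bit code units while
treating any unpaired surrogate as an inert code point, as Markus
describes: it neither repairs nor rejects, it just passes the code
unit through. The names decode_lenient and upper_units are made up for
this example, and the casing itself is simply delegated to Python's
str.upper.

```python
def decode_lenient(units):
    """Yield code points from 16-bit code units. A lead+trail pair
    becomes one supplementary code point; an unpaired surrogate is
    yielded as its own (surrogate) code point."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def upper_units(units):
    """Hypothetical internal routine: uppercase without validating.
    Unpaired surrogates have no case mapping, so they come out
    unchanged."""
    out = []
    for cp in decode_lenient(units):
        # str.upper may expand one character into several.
        for ch in chr(cp).upper():
            c = ord(ch)
            if c >= 0x10000:
                # Re-encode a supplementary code point as a pair.
                c -= 0x10000
                out.extend((0xD800 + (c >> 10), 0xDC00 + (c & 0x3FF)))
            else:
                out.append(c)
    return out
```

For input such as [0x61, 0xD800, 0x62], the letters are uppercased
and the lone 0xD800 is left exactly where it was; catching it remains
the gatekeeper's job.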
A./