Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Doug Ewell Fri, 05 Nov 2010 07:16:47 -0700

Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

>> I'm probably missing something here, but I don't agree that it's OK
>> for a consumer of UTF-16 to accept an unpaired surrogate without
>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>> dealt with.
>
> The question is whether you want every library that handles strings
> to perform the equivalent of a citizen's arrest, or whether you architect
> things so that the gatekeepers (border control) police the data stream.


If you can have upstream libraries check for unpaired surrogates at the
time they convert UTF-16 to Unicode code points, then your point is well
taken, because then the downstream libraries are no longer dealing with
UTF-16, but with code points.  Doing conversion and validation at
different stages isn't a great idea; that's how character encodings get
involved with security problems.
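
To make "check at the time they convert" concrete, here is a minimal
sketch in Python of a converter that validates surrogate pairing in the
same pass that produces code points.  The function name
decode_utf16_strict and the choice to raise ValueError are assumptions
made for illustration, not anything prescribed by the standard;
substituting U+FFFD would be another way of raising a fuss.

```python
def decode_utf16_strict(units):
    """Convert a sequence of UTF-16 code units (ints) to code points.

    Hypothetical helper: validation happens in the same pass as
    conversion, so callers downstream only ever see code points.
    """
    code_points = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                code_points.append(
                    0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
                )
                i += 2
                continue
            raise ValueError(f"unpaired high surrogate U+{u:04X} at index {i}")
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate U+{u:04X} at index {i}")
        code_points.append(u)
        i += 1
    return code_points
```

With this arrangement there is no window in which unvalidated UTF-16 is
handed to a downstream library.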

Corrigendum #1 closed the door on interpretation of invalid UTF-8
sequences.  I'm not sure why the approach to handling UTF-16 should be
any different.
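
As an illustration of equal treatment, the sketch below uses Python's
built-in strict decoders, which reject both a non-shortest-form UTF-8
sequence and an unpaired surrogate in UTF-16.  The helper name
is_well_formed is an assumption made for this example only.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the strict error handler."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Non-shortest-form UTF-8 encoding of "/" (the case Corrigendum #1 ruled out).
overlong_utf8 = b"\xc0\xaf"
# UTF-16LE: a lone high surrogate D800 followed by "A".
unpaired_utf16 = b"\x00\xd8A\x00"

print(is_well_formed(overlong_utf8, "utf-8"))
print(is_well_formed(unpaired_utf16, "utf-16-le"))
```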

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
