Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Markus Scherer Thu, 04 Nov 2010 11:01:21 -0700

On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell <d...@ewellic.org> wrote:

> It may be that broken UTF-16 text doesn't appear that often in the real
> world.  Certainly it's a test case that should be detected and handled
> (and I always do so when rolling my own transcoders), but perhaps not
> many people besides you have actually been bitten such that they needed
> such a tool.
>


16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, the text is not well-formed UTF-16) you can usually just treat it
as a surrogate code point, which normally has default properties much like an
unassigned code point or noncharacter. It case-maps to itself, normalizes to
itself, has default Unicode property values (except for the general
category, which is Cs), etc.
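
As a rough illustration of those default properties, here is a small Python
check. It is only a sketch: Python strings are sequences of code points rather
than 16-bit units, so the lone surrogate below stands in for an unpaired
surrogate found in 16-bit text. The variable names are made up for this
example.

```python
import unicodedata

# A lone lead surrogate held as a single code point (illustrative stand-in
# for an unpaired surrogate in 16-bit Unicode text).
lone = "\ud800"

# General category is the one property that is not a plain default.
category = unicodedata.category(lone)

# Case mapping leaves the surrogate code point unchanged.
case_maps_to_self = lone.upper() == lone and lone.lower() == lone

# All four normalization forms leave it unchanged as well.
normalizes_to_self = all(
    unicodedata.normalize(form, lone) == lone
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(category, case_maps_to_self, normalizes_to_self)
```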

In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by a trail
surrogate. Hence little need for cleanup functions -- but if you need one,
it's trivial to write one for UTF-16.
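
A minimal sketch of both points, in Python, assuming the text is available as
a list of 16-bit code units (integers 0..0xFFFF). The function names and the
choice of U+FFFD as the replacement are assumptions made for this example, not
something prescribed above.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed repair policy)


def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units):
    """Yield code points from 16-bit code units.

    A supplementary code point is assembled only when a lead surrogate is
    really followed by a trail surrogate; any other surrogate is passed
    through unchanged as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def repair(units, replacement=REPLACEMENT):
    """Return a copy of units with every unpaired surrogate replaced."""
    out = []
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            out.extend((unit, units[i + 1]))  # well-formed pair, keep both
            i += 2
        elif is_lead(unit) or is_trail(unit):
            out.append(replacement)  # unpaired surrogate
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


decoded = list(code_points([0x41, 0xD83D, 0xDE00, 0xD800, 0x42]))
repaired = repair([0x41, 0xDC00, 0xD83D, 0xDE00, 0xD800])
```

Here `decoded` keeps the trailing unpaired 0xD800 as a code point in its own
right, while `repaired` substitutes the replacement value for the stray trail
and lead surrogates and leaves the valid pair alone.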

markus
