On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell <d...@ewellic.org> wrote:
> It may be that broken UTF-16 text doesn't appear that often in the real
> world. Certainly it's a test case that should be detected and handled
> (and I always do so when rolling my own transcoders), but perhaps not
> many people besides you have actually been bitten such that they needed
> such a tool.
>
16-bit Unicode is convenient in that when you find an unpaired surrogate (that is, the text is not well-formed UTF-16), you can usually just treat it as a surrogate code point, which normally has default properties much like an unassigned code point or a noncharacter. It case-maps to itself, normalizes to itself, has default Unicode property values (except for the general category), etc.

In other words, when you process 16-bit Unicode text, it takes no effort to handle unpaired surrogates, other than making sure that you only assemble a supplementary code point when a lead surrogate is really followed by a trail surrogate. Hence there is little need for cleanup functions. If you do need one, it is trivial to write for UTF-16.

markus
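
For illustration, here is a minimal Python sketch of the two points above: iterating code points so that a supplementary code point is assembled only when a lead surrogate is really followed by a trail surrogate, and a simple cleanup function. The function names, the list-of-integers representation of 16-bit code units, and the choice to replace unpaired surrogates with U+FFFD are all assumptions made for this example; nothing in this thread prescribes them.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed cleanup policy)


def is_lead(unit):
    """True for a lead (high) surrogate code unit, D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit, DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A supplementary code point is assembled only from a lead surrogate
    immediately followed by a trail surrogate. Any other surrogate is
    passed through unchanged as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def clean_utf16(units):
    """Return a new list in which every unpaired surrogate is replaced
    by U+FFFD; well-formed surrogate pairs are kept as they are."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            out.extend((u, units[i + 1]))  # keep the valid pair
            i += 2
        elif is_lead(u) or is_trail(u):
            out.append(REPLACEMENT)  # unpaired surrogate
            i += 1
        else:
            out.append(u)  # ordinary BMP code unit
            i += 1
    return out
```

With this sketch, `clean_utf16([0x41, 0xD800, 0x42])` gives `[0x41, 0xFFFD, 0x42]`, while `code_points` on the same input simply passes `0xD800` through as a surrogate code point. Other cleanup policies (dropping the unit, or raising an error) would fit the same loop.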