Re: Utility to report and repair broken surrogate pairs in UTF-16 text

Martin J. Dürst Fri, 05 Nov 2010 02:11:25 -0700

On 2010/11/05 2:46, Markus Scherer wrote:

> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like a
> surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes to
> itself, has default Unicode property values (except for the general
> category), etc.

Well, yes, you can handle it that way, but that's pretty much GIGO
(garbage in, garbage out) and dumping the problem on the next
person/software downwards in the datastream. Also, while some things
might still work, much stuff won't, e.g. when you try to find a word
(with some lone surrogate hidden in some place) starting with the same
word (but with some lone surrogate hidden in another place, or no such
surrogate).
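
To make the two failure modes above concrete, here is a minimal sketch in
Python, whose str type happens to tolerate lone surrogates. The sample
strings and the helper names starts_with_word and can_encode_utf8 are
made up for illustration and do not come from this thread.

```python
# Hypothetical sample data: the same text with and without a lone lead
# surrogate (U+D800) hidden in the middle of the word.
clean = "wordplay"
dirty = "wo\ud800rdplay"


def starts_with_word(text, word="word"):
    """Plain prefix test, as a simple search routine might perform it."""
    return text.startswith(word)


def can_encode_utf8(text):
    """Report whether the next stage (here: a UTF-8 encoder) accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Python's strict UTF-8 encoder refuses lone surrogates.
        return False


if __name__ == "__main__":
    print(starts_with_word(clean), starts_with_word(dirty))
    print(can_encode_utf8(clean), can_encode_utf8(dirty))
```

The string that carries the lone surrogate passes through string handling
without complaint, but the prefix match fails and the encoder further down
the pipeline rejects it.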

> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.


For some processing this is true, but it's rather short-sighted.
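
For reference, the pairing rule from the quoted paragraph (a lead
surrogate counts only when a trail surrogate really follows it) and a
matching report/repair pass might look roughly like the sketch below. It
assumes the text is available as a list of 16-bit code units. The names
find_unpaired and repair are hypothetical, and replacing with U+FFFD is
only one possible repair policy, not something this thread prescribes.

```python
REPLACEMENT = 0xFFFD  # assumed repair policy: substitute U+FFFD


def is_lead(unit):
    """True for a lead (high) surrogate code unit, D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit, DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def find_unpaired(units):
    """Return the indices of all unpaired surrogates in a code unit list."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit):
            if i + 1 < n and is_trail(units[i + 1]):
                i += 2  # well-formed pair: skip both code units
                continue
            bad.append(i)  # lead surrogate without a following trail
        elif is_trail(unit):
            bad.append(i)  # trail surrogate without a preceding lead
        i += 1
    return bad


def repair(units, replacement=REPLACEMENT):
    """Return a copy with every unpaired surrogate replaced."""
    bad = set(find_unpaired(units))
    return [replacement if i in bad else u for i, u in enumerate(units)]


if __name__ == "__main__":
    # 'A', a valid pair (U+1F600), a lone lead, 'B', a lone trail
    sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
    print(find_unpaired(sample))
    print([hex(u) for u in repair(sample)])
```

Reporting the indices separately from repairing keeps the decision about
what to do with bad input (reject, replace, or pass along) with the caller.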

Regards,    Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:[email protected]
