On 2010/11/05 2:46, Markus Scherer wrote:

16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, it's not well-formed UTF-16) you can usually just treat it like a
surrogate code point which normally has default properties much like an
unassigned code point or noncharacter. It case-maps to itself, normalizes to
itself, has default Unicode property values (except for the general
category), etc.

Well, yes, you can handle it that way, but that's pretty much GIGO (garbage in, garbage out): it dumps the problem on the next person or piece of software downstream. Also, while some things might still work, much won't. For example, a prefix search fails when the text contains a word with a lone surrogate hidden in one place while the search term has a lone surrogate in another place, or none at all.
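To make both sides of this concrete, here is a minimal sketch in Python. It assumes Python's `str`, which stores code points rather than 16-bit units but does allow a lone surrogate code point, so it can stand in for the situation described. The sample strings and variable names are made up for illustration.

```python
import unicodedata

lone = "\ud800"  # an unpaired lead surrogate (hypothetical sample data)

# The "pass it through" view: the lone surrogate has default-like behaviour.
category = unicodedata.category(lone)             # general category Cs
case_mapped = lone.upper()                        # case-maps to itself
normalized = unicodedata.normalize("NFC", lone)   # normalizes to itself

# The downstream problem: the same word no longer matches itself.
needle = "word"
clean_text = "wordplay"
dirty_text = "wo" + lone + "rdplay"   # lone surrogate hidden inside the word

found_clean = clean_text.startswith(needle)
found_dirty = dirty_text.startswith(needle)
```

Nothing in the first half raises an error, which is the convenience being described. The second half is where the garbage comes back out: `found_clean` is true and `found_dirty` is false, although a reader would call both texts the same word.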

In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by a trail
surrogate. Hence little need for cleanup functions -- but if you need one,
it's trivial to write one for UTF-16.

For some processing this is true, but it's rather short-sighted.
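For reference, the pairing rule and the cleanup function mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the 16-bit text is modelled as a list of Python ints, the function names `code_points` and `clean_utf16` are hypothetical, and replacing unpaired surrogates with U+FFFD is one possible cleanup policy, not the only one.

```python
def code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A supplementary code point is assembled only when a lead surrogate
    is really followed by a trail surrogate; any unpaired surrogate is
    passed through unchanged as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def clean_utf16(units, replacement=0xFFFD):
    """Return well-formed UTF-16 code units.

    Assumed policy: each unpaired surrogate becomes `replacement`.
    """
    out = []
    for cp in code_points(units):
        if 0xD800 <= cp <= 0xDFFF:
            out.append(replacement)          # unpaired surrogate
        elif cp >= 0x10000:
            cp -= 0x10000                    # re-encode as a surrogate pair
            out.extend((0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
        else:
            out.append(cp)
    return out
```

For example, `clean_utf16([0x41, 0xD800, 0x42])` gives `[0x41, 0xFFFD, 0x42]`, while a proper pair such as `[0xD83D, 0xDE00]` comes back unchanged. The code is indeed short, but whether to replace, drop, or reject the unpaired unit is a decision the code cannot make for you.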

Regards,    Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:[email protected]
