On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham <richard.wording...@ntlworld.com> wrote:
> On Mon, 12 Oct 2015 23:35:32 +0000
> David Starner <prosfil...@gmail.com> wrote:
>
> > Thus a Unicode string simply can't be in UTF-16 format
> > internally with unpaired surrogates; a Unicode string in a programmer
> > opaque format must do something with broken data on input.
>
> You're assuming that the source of the non-conformance is external to
> the program. In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself. Despite this fault, the program
> remains usable - it's little worse than a word processor that insists
> on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
>
> I presume you are expecting input of fractional characters to be
> buffered until there is a whole character to add to a string. For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.

A UTF-16 string could delete one surrogate, or add a fractional character. A Unicode string (not a "UTF-16 string") can't: it could be stored internally in, say, a Python-like format that uses Latin-1, UCS-2, or UTF-32, with conversions made as needed and the differences hidden from the user. If your code can delete or add a single surrogate (that is, if it interprets surrogates at all), it's a UTF-16 string; as is often the case in computing, that gives the programmer more power and control at the cost of being harder to use and easier to break.
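To make the buffering concrete, here is a minimal sketch in Python (my own illustration, not anyone's actual implementation; make_char_buffer and its behavior on broken input are assumptions) of what a Unicode string type has to do when fed UTF-16 code units one at a time, as with the two WM_CHAR messages: hold a high surrogate until its low surrogate arrives, and do something with broken sequences on input, here simply dropping them, rather than storing a lone surrogate.

    def make_char_buffer():
        # Hypothetical helper: pairs incoming UTF-16 code units into
        # whole characters before they ever reach the string.
        pending = []  # holds an unmatched high surrogate, if any

        def feed(unit: int) -> str:
            """Feed one UTF-16 code unit; return any complete character."""
            if 0xD800 <= unit <= 0xDBFF:      # high surrogate: buffer it,
                pending.clear()               # discarding any stale one
                pending.append(unit)
                return ""
            if 0xDC00 <= unit <= 0xDFFF:      # low surrogate
                if pending:                   # pair with the buffered high
                    high = pending.pop()
                    cp = 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
                    return chr(cp)
                return ""                     # lone low surrogate: drop it
            pending.clear()                   # BMP unit; a pending high
            return chr(unit)                  # surrogate is broken, drop it

        return feed

    feed = make_char_buffer()
    text = ""
    for unit in (0xD835, 0xDC00, 0x0041):  # surrogate pair for U+1D400, then 'A'
        text += feed(unit)
    assert text == "\U0001D400A"

The point of the sketch is the asymmetry in the last paragraph above: the buffer interprets surrogates so that the string never has to, which is exactly what makes the string a Unicode string rather than a UTF-16 string.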