On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham <richard.wording...@ntlworld.com> wrote:
> On Mon, 12 Oct 2015 23:35:32 +0000
> David Starner <prosfil...@gmail.com> wrote:
>
> > Thus a Unicode string simply can't be in UTF-16 format
> > internally with unpaired surrogates; a Unicode string in a programmer
> > opaque format must do something with broken data on input.
>
> You're assuming that the source of the non-conformance is external to
> the program. In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself. Despite this fault, the program
> remains usable - it's little worse than a word processor that insists
> on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
>
> I presume you are expecting input of fractional characters to be
> buffered until there is a whole character to add to a string. For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.

A UTF-16 string could delete one surrogate, or add a fractional character. A Unicode string (not a "UTF-16 string") can't: it could be stored internally in, say, a Python-like format that uses Latin-1, UCS-2, or UTF-32, with conversions made as needed and the differences hidden from the user. If your code can delete or add a single surrogate (that is, if it interprets surrogates at all), it's a UTF-16 string; as is often the case in computing, that gives the programmer more power and control at the cost of being harder to use and easier to break.
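To make the buffering concrete, here is a minimal sketch in Python (my own illustration, not anyone's actual implementation; make_char_buffer and its behavior on broken input are assumptions) of what a Unicode string type has to do when fed UTF-16 code units one at a time, as with the two WM_CHAR messages: hold a high surrogate until its low surrogate arrives, and do something with broken sequences on input, here simply dropping them, rather than storing a lone surrogate.

    def make_char_buffer():
        # Hypothetical helper: pairs incoming UTF-16 code units into
        # whole characters before they ever reach the string.
        pending = []  # holds an unmatched high surrogate, if any

        def feed(unit: int) -> str:
            """Feed one UTF-16 code unit; return any complete character."""
            if 0xD800 <= unit <= 0xDBFF:      # high surrogate: buffer it,
                pending.clear()               # discarding any stale one
                pending.append(unit)
                return ""
            if 0xDC00 <= unit <= 0xDFFF:      # low surrogate
                if pending:                   # pair with the buffered high
                    high = pending.pop()
                    cp = 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
                    return chr(cp)
                return ""                     # lone low surrogate: drop it
            pending.clear()                   # BMP unit; a pending high
            return chr(unit)                  # surrogate is broken, drop it

        return feed

    feed = make_char_buffer()
    text = ""
    for unit in (0xD835, 0xDC00, 0x0041):  # surrogate pair for U+1D400, then 'A'
        text += feed(unit)
    assert text == "\U0001D400A"

The point of the sketch is the asymmetry in the last paragraph above: the buffer interprets surrogates so that the string never has to, which is exactly what makes the string a Unicode string rather than a UTF-16 string.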