On Tue, 13 Oct 2015 00:49:29 +0200 Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2015-10-12 21:38 GMT+02:00 Richard Wordingham
> <richard.wording...@ntlworld.com>:
> > Graceful fallback is exactly where the issue arises. Throwing an
> > exception is not a useful answer to the question of how many code
> > points a 'Unicode string' (not a 'UTF-16 string') contains.
>
> If you get an invalid UTF-16 string and catch an exception, this is
> a sign that it is not UTF-16, and very frequently something else. The
> application may want to retry with another encoding, possibly using
> heuristic guessers, but the heuristic will only give a *probable
> answer*.

On Mon, 12 Oct 2015 23:35:32 +0000
David Starner <prosfil...@gmail.com> wrote:

> Thus a Unicode string simply can't be in UTF-16 format internally
> with unpaired surrogates; a Unicode string in a programmer-opaque
> format must do something with broken data on input.

You're assuming that the source of the non-conformance is external to the program. In the case that has caused me to ask about lone surrogates, they were actually caused by a faulty character deletion function within the program itself (the first sketch below shows how such a fault can arise). Despite this fault, the program remains usable - it's little worse than a word processor that insists on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.

I presume you are expecting input of fractional characters to be buffered until there is a whole character to add to a string. For example, an MSKLC keyboard will deliver a supplementary character in two WM_CHAR messages, one for the high surrogate and one for the low surrogate (the second sketch below illustrates that buffering).

Returning to the original questions, it would seem that there is not a unique answer to the question of how many code points a Unicode 16-bit string contains. Rather, the question must be the unwieldy one of how many scalar values and lone surrogates it contains in total (the third sketch below counts exactly that).

Richard.
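
To make the deletion fault concrete: the message does not say what the faulty function actually did, so the following is only a plausible guess, in C++ with std::u16string, at how removing a single 16-bit code unit can leave a lone high surrogate behind. Both function names are invented for the sketch.

#include <string>

// Naive "backspace": removes the last 16-bit code unit.  Applied after a
// supplementary character (stored as a surrogate pair) has been typed, it
// strips only the low surrogate and leaves the high surrogate unpaired.
void deleteLastCodeUnit(std::u16string& s)
{
    if (!s.empty())
        s.pop_back();
}

// Repaired version: if the removed unit is a low surrogate preceded by a
// high surrogate, remove the whole pair.
void deleteLastCharacter(std::u16string& s)
{
    if (s.empty())
        return;
    bool lowLast = s.back() >= 0xDC00 && s.back() <= 0xDFFF;
    s.pop_back();
    if (lowLast && !s.empty()
            && s.back() >= 0xD800 && s.back() <= 0xDBFF)
        s.pop_back();
}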
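
A rough sketch of the buffering described for MSKLC input: the 16-bit unit carried by each WM_CHAR message is modelled here as a char16_t, and a high surrogate is held back until the matching low surrogate arrives, so only whole scalar values reach the string. The variable and function names are assumptions, not part of any real Windows API.

#include <string>

static std::u16string text;        // the string being edited
static char16_t pendingHigh = 0;   // high surrogate waiting for its mate

// Called once per WM_CHAR-style message with the 16-bit code unit it carries.
void onChar(char16_t unit)
{
    if (unit >= 0xD800 && unit <= 0xDBFF) {   // high surrogate: buffer it
        pendingHigh = unit;
        return;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {   // low surrogate
        if (pendingHigh != 0) {
            text.push_back(pendingHigh);      // append the complete pair
            text.push_back(unit);
            pendingHigh = 0;
        }
        // else: a lone low surrogate; drop it (or substitute U+FFFD)
        return;
    }
    pendingHigh = 0;                          // BMP character; any buffered
    text.push_back(unit);                     // high surrogate was unpaired
}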
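
And a sketch of the unwieldy count the message ends with: the number of scalar values plus lone surrogates in a 16-bit string, where an unpaired surrogate counts as one rather than provoking an exception. The function name is an invention for the sketch.

#include <cstddef>
#include <string>

// Count scalar values and lone surrogates in a sequence of 16-bit code units.
// A well-formed surrogate pair counts as one; an unpaired surrogate also
// counts as one instead of raising an error.
std::size_t countScalarsAndLoneSurrogates(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size()
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;                 // skip the low half of a valid pair
        ++count;                 // pair, BMP unit, or lone surrogate
    }
    return count;
}

For the well-formed string u"a\U00010000" this returns 2 ('a' plus one supplementary character); if the faulty deletion strips the low surrogate, it still returns 2 (one scalar value plus one lone surrogate) instead of throwing.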