2015-10-19 20:53 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:
> On Mon, 19 Oct 2015 10:07:31 -0700
> "Doug Ewell" <d...@ewellic.org> wrote:
>
> > This discussion was originally about how to handle unpaired
> > surrogates, as if that were a normal use case.
>
> And the subject line was changed when the topic changed to traversing
> strings.
>
> > Regardless of what encoding model is used to handle characters under
> > the hood, and regardless of how the Delete key should work with actual
> > characters or clusters, there is never any excuse for software to
> > create unpaired surrogates, or any other sort of invalid code unit
> > sequences.
>
> The word
> 'codepoint' is even worse, as a supplementary plane codepoint is
> represented by two BMP codepoints.

No! The "supplementary code points" (or "supplementary characters" when they are assigned to characters) are represented in UTF-16 as two **code units**, NOT as two "code points" (even if their binary values are related).

The code points in the range U+D800..U+DFFF are NEVER characters: they are permanently reserved so that they can never be assigned to any character. So these code points are assigned, but not to characters (otherwise those characters would not be representable in valid UTF-16).

These code points also do not have any scalar value: there are no valid scalar values in the range 0xD800..0xDFFF (the valid scalar values form two ranges of integers, 0x0000..0xD7FF and 0xE000..0x10FFFF, separated by this hole).

So please don't mix up "code points" and "code units"!
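
To make the distinction concrete, here is a minimal Python sketch. The helper names `is_scalar_value` and `utf16_code_units` are mine, chosen for illustration; they are not from any library. It assumes the standard UTF-16 encoding rule: one code unit for scalar values below 0x10000, otherwise a surrogate pair computed from the value minus 0x10000.

```python
def is_scalar_value(n):
    """True if n is a Unicode scalar value (any code point except a surrogate)."""
    return 0x0000 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units encoding one scalar value."""
    if not is_scalar_value(code_point):
        # Surrogate code points (U+D800..U+DFFF) and out-of-range integers
        # have no valid UTF-16 encoding.
        raise ValueError("not a Unicode scalar value: %#x" % code_point)
    if code_point < 0x10000:
        # BMP: one code point, one code unit.
        return [code_point]
    # Supplementary plane: one code point, two code units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate code unit
    return [high, low]


# U+1F600 is ONE supplementary code point, encoded as TWO code units.
units = utf16_code_units(0x1F600)
print([hex(u) for u in units])  # ['0xd83d', '0xde00']
```

The two values returned for U+1F600 are code units. They happen to share their numeric values with the surrogate code points U+D83D and U+DE00, but the string still contains a single code point; in Python 3, `len("\U0001F600")` is 1 while its UTF-16 encoding occupies 4 bytes.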