On Mon, 19 Oct 2015 10:07:31 -0700 "Doug Ewell" <d...@ewellic.org> wrote:
> This discussion was originally about how to handle unpaired
> surrogates, as if that were a normal use case.

And the subject line was changed when the topic changed to traversing
strings.

> Regardless of what encoding model is used to handle characters under
> the hood, and regardless of how the Delete key should work with actual
> characters or clusters, there is never any excuse for software to
> create unpaired surrogates, or any other sort of invalid code unit
> sequences.

How about, 'The specification says that one must pass the number of
_characters_ in the string.'?  Even worse, some specifications talk of
'Unicode characters' when they mean UTF-16 code units.  The word
'codepoint' is even worse, as a supplementary plane codepoint is
represented in UTF-16 by two BMP codepoints.

ICU (but perhaps it's actually Java) seems to have a culture of
tolerating lone surrogates, and rules for handling lone surrogates are
strewn across the Unicode standards and annexes.  It was once the case
that basic Unicode support in regular expressions required a regular
expression engine to be able to search for specified lone surrogates -
a real show-stopper for an engine working in UTF-8.  The Unicode
collation algorithm conformance test once tested that implementations
of collation collated lone surrogates correctly.  Raising an exception
was an automatic test failure!

By contrast, no-one's proposed collation rules for broken bits of UTF-8
characters or non-minimal length forms.

> That is like having an image editor that deletes every
> 128th byte from a JPEG file, and then worrying about how to display
> the file.

1. Of course, telemetry streams may very well contain damaged JPEG
images!

2. The reason bad handling of supplementary characters seems to be
associated with UTF-16 is that the damage is rarely as obvious as every
128th code unit.  By contrast, bad UTF-8 handling usually comes to
light as soon as the text processing moves beyond ASCII.

Richard.
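
To make the 'number of characters' ambiguity above concrete, here is a
minimal Python sketch.  The sample strings and the helper name
has_lone_surrogate are assumptions made up for illustration; the codec
names and the 'surrogatepass' error handler are standard Python.  It
counts one string three ways, then shows a lone surrogate being
rejected by strict UTF-8 encoding but tolerated when asked for.

```python
# 'a' followed by one supplementary-plane character (U+10400).
s = "a\U00010400"

code_points = len(s)                           # Python str counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
utf8_bytes = len(s.encode("utf-8"))            # UTF-8 code units are bytes


def has_lone_surrogate(text: str) -> bool:
    """Hypothetical helper: True if text holds any surrogate code point.

    A Python str is a sequence of code points, so a surrogate in it is
    not part of a decoded pair and cannot be encoded as strict UTF-8.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


lone = "x\ud800y"  # a high surrogate with no low surrogate after it

# Strict UTF-8 encoding refuses the lone surrogate outright.
try:
    lone.encode("utf-8")
    strict_utf8_accepts = True
except UnicodeEncodeError:
    strict_utf8_accepts = False

# The 'surrogatepass' handler passes it through as a bare code unit.
tolerated_units = len(lone.encode("utf-16-le", "surrogatepass")) // 2
```

The three counts differ for the same string, which is why a
specification that says only 'characters' leaves the caller guessing.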