On Tue, 14 Mar 2017 08:51:18 +0000
Alastair Houghton <alast...@alastairs-place.net> wrote:
> On 14 Mar 2017, at 02:03, Richard Wordingham
> <richard.wording...@ntlworld.com> wrote:
> >
> > On Mon, 13 Mar 2017 19:18:00 +0000
> > Alastair Houghton <alast...@alastairs-place.net> wrote:
> >
> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused
> > over what string lengths mean.
>
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue. I presume you mean the issues of canonical
equivalence and detecting text boundaries. Again, there is the problem
of remembering to consider the whole surrogate pair when using UTF-16.
(I suppose this could be largely handled by avoiding the concept of
arrays.) Now, the supplementary characters where these issues arise
are very infrequently used. An error in UTF-16 code might easily
escape attention, whereas a problem with UCS-4 (or UTF-8) comes to
light as soon as one handles Thai or IPA.

> As for string lengths, string lengths in code points are no more
> meaningful than string lengths in UTF-16 code units. They don't tell
> you anything about the number of user-visible characters; or
> anything about the width the string will take up if rendered on the
> display (even in a fixed-width font); or anything about the number
> of glyphs that a given string might be transformed into by glyph
> mapping. The *only* thing a string length of a Unicode string will
> tell you is the number of code units.

A string length in code points does have the advantage of being
independent of encoding. I'm actually using an index for UTF-16 text
(I don't know whether it's denominated in code points or code units)
to index into the UTF-8 source code. However, the number of code units
is the more commonly used quantity, as it tells one how much memory is
required for simple array storage.

Richard.
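
P.S. Since the thread keeps circling round what "length" means, here
is a minimal sketch in Java (whose String type is a sequence of UTF-16
code units, so it shows the distinction directly). The example string
is my own choice, purely for illustration: one supplementary character
plus one combining sequence yields three different "lengths".

  import java.text.BreakIterator;

  public class Lengths {
      public static void main(String[] args) {
          // U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair in
          // UTF-16), followed by "e" + U+0301 COMBINING ACUTE ACCENT.
          String s = "\uD834\uDD1E" + "e\u0301";

          // UTF-16 code units: 4 (two for the clef, one each for the
          // "e" and the accent).
          System.out.println("code units:  " + s.length());

          // Code points: 3 (the clef now counts once).
          System.out.println("code points: "
              + s.codePointCount(0, s.length()));

          // User-visible characters: 2 (the clef, and the accented e).
          BreakIterator it = BreakIterator.getCharacterInstance();
          it.setText(s);
          int graphemes = 0;
          while (it.next() != BreakIterator.DONE) {
              graphemes++;
          }
          System.out.println("graphemes:   " + graphemes);
      }
  }

None of the three numbers predicts rendered width, of course; the
point is only that code units, code points and user-visible characters
already disagree on a four-unit string.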
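
P.P.S. On the cross-indexing point: assuming the UTF-16 index is
denominated in code units and falls on a code point boundary, the
conversion to a UTF-8 byte offset is a simple walk over the string.
The method name below is made up for the sketch:

  // Convert an index in UTF-16 code units into the byte offset of
  // the same position in the UTF-8 encoding of the string.
  static int utf16IndexToUtf8Offset(String s, int utf16Index) {
      int utf8Offset = 0;
      int i = 0;
      while (i < utf16Index) {
          int cp = s.codePointAt(i);
          // UTF-8 length of this code point.
          if (cp < 0x80)         utf8Offset += 1;
          else if (cp < 0x800)   utf8Offset += 2;
          else if (cp < 0x10000) utf8Offset += 3;
          else                   utf8Offset += 4;
          // Advances by 2 code units for a supplementary character.
          i += Character.charCount(cp);
      }
      return utf8Offset;
  }

If utf16Index actually pointed into the middle of a surrogate pair,
this loop would quietly step past it; real code would want to detect
that rather than assume the caller got it right.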