On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
>On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> >On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> >>On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>>Saying that operating at the code point level - UTF-32 - is correct
> >>>is like saying that operating at UTF-16 instead of UTF-8 is correct.
> >>
> >>Could you please substantiate that? My understanding is that code point
> >>is a higher-level Unicode notion independent of encoding, whereas code
> >>unit is an encoding-dependent representation detail. -- Andrei
> >
>Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that
that's incorrect behavior. It will not result in correct string processing
in the general case, because a code point is not guaranteed to be a
full character.
...

What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size.
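
To make that use case concrete, here is a minimal sketch, assuming a UTF-8 string as input. The helper name toCodePoints is hypothetical and only serves the illustration; walkLength and the decoding foreach are the standard Phobos/language facilities.

```d
import std.range : walkLength;

// Hypothetical helper (not from this thread): decode a UTF-8 string into a
// dchar[] whose size is obtained up front from walkLength.
dchar[] toCodePoints(string s)
{
    auto buf = new dchar[](s.walkLength); // one slot per code point
    size_t i = 0;
    foreach (dchar c; s)                  // foreach decodes UTF-8 into code points
        buf[i++] = c;
    return buf;
}

void main()
{
    string s = "日本語";
    assert(s.length == 9);                // UTF-8 code units
    assert(toCodePoints(s).length == 3);  // code points
    assert(toCodePoints(s) == "日本語"d);
}
```

For this purpose the code point count is exactly the number that is needed, regardless of how many graphemes the text contains.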

Generally, I don't see how algorithms become magically "incorrect" when applied to UTF code units.

walkLength does not report the length of a character as one in all cases
just like length does not report the length of a character as one in all
cases. walkLength is counting bigger units than length, but it's still
counting pieces of a character rather than counting full characters.


The 'length' of a character is not one in all contexts.
The following text takes six columns in my terminal:

日本語
123456
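
As a rough sketch of how the different notions of 'length' diverge, the following compares code units, code points and graphemes. The helper names are hypothetical; length, walkLength and std.uni.byGrapheme are the real facilities. I am not aware of a Phobos function that reports terminal column width, so the sketch does not try to compute the 6 above.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named here only for illustration.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded code points
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    string cjk = "日本語";
    assert(codeUnits(cjk) == 9);
    assert(codePoints(cjk) == 3);
    assert(graphemes(cjk) == 3);   // yet the text occupies six terminal columns

    string accented = "e\u0301";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(codeUnits(accented) == 3);
    assert(codePoints(accented) == 2);
    assert(graphemes(accented) == 1);
}
```

None of the three counts equals the column width of the first string, so which one is "the length" depends on what the caller intends to do with it.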
