Re: The Case Against Autodecode

Timon Gehr via Digitalmars-d Tue, 31 May 2016 12:26:50 -0700

On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:

On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:

>On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:

> >On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:

> >>On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:

> >>>Saying that operating at the code point level - UTF-32 - is correct
> >>>is like saying that operating at UTF-16 instead of UTF-8 is correct.

> >>
> >>Could you please substantiate that? My understanding is that code point
> >>is a higher-level Unicode notion independent of encoding, whereas code
> >>unit is an encoding-dependent representation detail. -- Andrei

> >

>Does walkLength yield the same number for all representations?

walkLength treats a code point like it's a character. My point is that
that's incorrect behavior. It will not result in correct string processing
in the general case, because a code point is not guaranteed to be a
full character.
...

What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size.

Generally, I don't see how algorithms become magically "incorrect" when applied to UTF code units.
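
To make the pre-allocation case concrete, here is a minimal D sketch, assuming Phobos with autodecoding as discussed in this thread. The sample string ('e' plus a combining accent) and the helper names codePoints and graphemes are illustrative assumptions, not something from the thread. It shows that .length varies with the encoding, that walkLength gives the same code point count for all three string types, and that this count is the right size for a dchar[] even though it is not a count of full characters.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: code points in a UTF-8 string.
/// Autodecoding makes walkLength count these.
size_t codePoints(string s) { return s.walkLength; }

/// Hypothetical helper: user-perceived characters (grapheme clusters).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // Illustrative input: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;
    dstring s32 = "e\u0301"d;

    // .length counts code units, so it varies with the encoding.
    assert(s8.length == 3 && s16.length == 2 && s32.length == 2);

    // walkLength counts code points, so it agrees across encodings.
    assert(s8.walkLength == 2 && s16.walkLength == 2 && s32.walkLength == 2);

    // That count is exactly the size a dchar[] copy needs.
    auto buf = new dchar[](codePoints(s8));
    size_t i = 0;
    foreach (dchar c; s8) buf[i++] = c;
    assert(buf == s32);

    // Counting full characters needs grapheme segmentation instead.
    assert(graphemes(s8) == 1);
}
```

Whether 2 or 1 is the "correct" answer here depends on what the caller wants to do with the number.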

walkLength does not report the length of a character as one in all cases
just like length does not report the length of a character as one in all
cases. walkLength is counting bigger units than length, but it's still
counting pieces of a character rather than counting full characters.


The 'length' of a character is not one in all contexts.
The following text takes six columns in my terminal:

日本語
123456
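
The same point as a rough D sketch. The columns helper is hypothetical and uses a deliberately crude rule (CJK Unified Ideographs, U+4E00 to U+9FFF, take two columns; everything else takes one). Real terminals follow the Unicode East Asian Width property, which this does not implement. It only shows that code units, code points and display columns are three different numbers for the same text.

```d
import std.range : walkLength;

/// Hypothetical helper: terminal columns under a crude assumption.
/// CJK Unified Ideographs count as 2 columns, everything else as 1.
size_t columns(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // decode UTF-8 into code points
        n += (c >= 0x4E00 && c <= 0x9FFF) ? 2 : 1;
    return n;
}

void main()
{
    string wide = "日本語";

    assert(wide.length == 9);      // UTF-8 code units
    assert(wide.walkLength == 3);  // code points
    assert(columns(wide) == 6);    // columns, under the rule above

    // ASCII digits: all three counts coincide.
    assert(columns("123456") == 6);
}
```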
