On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
> On 5/27/16 6:56 AM, Marc Schütz wrote:
>> It is not, which has been shown by various posts in this thread.
>
> Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei

There are several possibilities of what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over UTF-8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units).

AFTER the introduction of auto decoding, it iterates over Unicode code _points_, which is wrong for characters made up of combining sequences, e.g. äöüéòàñ in their decomposed form as commonly found on Mac OS X, and more "exotic" ones everywhere (except for the even more unlikely case where you really want code points).

That is, both the BEFORE and the AFTER behaviour are wrong; they just break for different kinds of input in different ways.
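
To make this concrete, here is a small sketch. It is in Python rather than D, purely as an illustration, and it is not how Phobos implements anything; the helper names are made up. It counts the same user-perceived character "ä" in both the BEFORE view (code units) and the AFTER view (code points), once precomposed and once decomposed:

```python
import unicodedata


def count_code_units(s: str) -> int:
    """Length in UTF-8 code units (the BEFORE view). Hypothetical helper."""
    return len(s.encode("utf-8"))


def count_code_points(s: str) -> int:
    """Length in code points (the AFTER view); Python's len() counts these."""
    return len(s)


precomposed = "\u00e4"                                   # ä as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)   # 'a' followed by U+0308

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, count_code_units(text), count_code_points(text))
# prints:
# precomposed 2 1
# decomposed 3 2
```

A user would call both strings "one character", yet only one of the four counts is 1, and which one that is depends on how the input happens to be normalized.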

So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and the remaining problems are harder to find during testing. That's like "improving" a bicycle so that it only breaks down after 30 minutes of riding instead of after just 10 minutes, so you won't notice it during a test ride.

But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on.
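
As a rough sketch of what the grapheme option buys you for something like `retro`, again in Python and only as an illustration: the function `approx_graphemes` below is a made-up helper that simply attaches combining marks to the preceding base character. That is a deliberate simplification; real grapheme segmentation (Unicode UAX #29) has more rules.

```python
import unicodedata


def approx_graphemes(s: str) -> list:
    """Group each base character with the combining marks that follow it.

    Simplified stand-in for real grapheme segmentation (UAX #29).
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark: stays with its base
        else:
            clusters.append(ch)     # new base character starts a cluster
    return clusters


text = "no\u0308l"  # "nöl" with a decomposed ö (o + U+0308)

# Reversing code point by code point moves the diaeresis onto the 'l'.
by_code_point = "".join(reversed(text))

# Reversing cluster by cluster keeps the diaeresis on the 'o'.
by_grapheme = "".join(reversed(approx_graphemes(text)))

print(len(text), len(approx_graphemes(text)))  # prints: 4 3
```

The extra pass over the string is the cost mentioned above; the point is only that the result matches what a user would expect from reversing "nöl".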

The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.

So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them.

[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/
