Re: The Case Against Autodecode

Marc Schütz via Digitalmars-d Sat, 28 May 2016 04:06:38 -0700

On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:

On 5/27/16 6:56 AM, Marc Schütz wrote:
It is not, which has been shown by various posts in this thread.
Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei

There are several possibilities of what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over UTF-8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units).
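
As a rough illustration of the BEFORE behaviour, here is a minimal sketch. It assumes that `std.utf.byCodeUnit` in current Phobos is a fair stand-in for iteration before auto decoding; the helper name `codeUnitCount` is made up for this example.

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

// Hypothetical helper: counts UTF-8 code units, which is roughly
// what walkLength would have reported before auto decoding.
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

void main()
{
    assert(codeUnitCount("a") == 1);      // ASCII: one code unit per character
    assert(codeUnitCount("\u00E4") == 2); // ä (U+00E4) takes two UTF-8 code units
}
```

The count is only right as long as the data stays within ASCII.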

AFTER the introduction of auto decoding, it iterates over UTF-8 code _points_, which is wrong for combined characters (a base letter followed by combining marks), e.g. äöüéòàñ on MacOS X, more "exotic" ones everywhere (except for the even more unlikely case where you really want code points).
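
A small sketch of the AFTER behaviour, assuming a Phobos version in which narrow strings are still auto decoded. The two literals below are illustrative test data: the same visible character, once precomposed and once as a base letter plus a combining mark.

```d
import std.algorithm.comparison : equal;
import std.range : retro, walkLength;

void main()
{
    string precomposed = "\u00E4"; // ä as a single code point
    string decomposed = "a\u0308"; // 'a' followed by U+0308 COMBINING DIAERESIS

    // Both render as "ä", yet iterating by code point counts them differently.
    assert(precomposed.walkLength == 1);
    assert(decomposed.walkLength == 2);

    // retro reverses code points, so the combining mark ends up
    // in front of its base letter.
    assert(decomposed.retro.equal("\u0308a"));
}
```

Tests that only use precomposed input would pass here, which is why this class of bug is easy to miss.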

That is, both the BEFORE and AFTER behaviour are wrong; both break for various kinds of input in different ways.

So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and it's harder to find these problems during testing. That's like "improving" a bicycle so that it only breaks down after riding it for 30 minutes instead of just after 10 minutes, so you won't notice it during a test ride.

But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on.
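
For comparison, a sketch of explicit grapheme iteration with `std.uni.byGrapheme`. The reversal follows the pattern shown in the `std.uni.byCodePoint` documentation; the sample strings are just illustrative.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    // Precomposed and decomposed ä each count as one grapheme.
    assert("\u00E4".byGrapheme.walkLength == 1);
    assert("a\u0308".byGrapheme.walkLength == 1);

    // Reversing by grapheme keeps the combining mark attached to its base letter.
    string s = "noe\u0308l"; // noël
    string reversed = s.byGrapheme.array.retro.byCodePoint.text;
    assert(reversed == "le\u0308on"); // lëon
}
```

The price is grapheme segmentation on every pass, plus an array allocation for the reversal, which is why this would be a questionable default but a reasonable explicit choice.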

The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.

So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them.


[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/
