On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.

Well, personally, I think that it's worth it even if the performance is identical (and it's guaranteed to be better without autodecoding - the only question is how much better - since there's simply less work to do). Simply operating at the code point level like we do now is the worst of all worlds in terms of flexibility and correctness. As long as the Unicode text is normalized, operating at the code unit level is the most efficient, and decoding is often unnecessary for correctness. And if you do need to decode, then you really need to go up to the grapheme level in order to operate on the full character, meaning that operating on code points has the same correctness problems as operating on code units. So, it's less performant without actually being correct. It just gives the illusion of correctness.
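For instance, here's a rough sketch of what the three levels look like for a single "character" built from a combining mark (my own illustration rather than anything from the Phobos docs, using std.range.walkLength and std.uni.byGrapheme):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme, two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.length == 3);                // code units (UTF-8)
    assert(s.walkLength == 2);            // code points - what autodecoding iterates
    assert(s.byGrapheme.walkLength == 1); // graphemes - the full character
}

Code point iteration splits that character in two just as surely as code unit iteration does; it just does more work while getting it wrong.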

By treating strings as ranges of code units, you don't take a performance hit when you don't need to, and it forces you to actually consider something like byDchar or byGrapheme if you want to operate on full Unicode characters. It's similar to how operating on UTF-16 code units as if they were characters (as Java and C# generally do) frequently gives the incorrect impression that you're handling Unicode correctly, because you have to work harder to come up with characters that don't fit in a single code unit, whereas with UTF-8, anything but ASCII is screwed if you treat code units as code points. Treating code points as if they were full characters, like we're doing now in Phobos with ranges, just makes it that much harder to notice that you're not handling Unicode correctly.
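Here's the UTF-16 version of that trap sketched in D (again, my own example - wstring is UTF-16, and anything outside the Basic Multilingual Plane requires a surrogate pair):

import std.range : walkLength;
import std.utf : byDchar;

void main()
{
    // U+1F600 is outside the BMP, so UTF-16 needs two code units for it
    // (a surrogate pair), and UTF-8 needs four.
    wstring w = "\U0001F600"w;
    string  s = "\U0001F600";

    assert(w.length == 2);             // UTF-16 code units - what Java/C# index by
    assert(s.length == 4);             // UTF-8 code units
    assert(s.byDchar.walkLength == 1); // a single code point either way
}

With UTF-16, you can go a long time before a test string happens to contain one of those, which is exactly why the bugs go unnoticed.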

Also, treating strings as ranges of code units makes it so that they're not so special and actually are treated like every other type of array, which eliminates a lot of the special-casing that we're forced to do right now. It also eliminates the confusion that folks keep running into when string doesn't work with many functions - because it's not a random-access range, because it doesn't have length, or because the resulting range isn't the same type (copy would be a prime example of a function that doesn't work with char[] when it should). By leaving in autodecoding, we're basically leaving technical debt in D permanently. We'll forever have to be explaining it to folks and forever have to be working around it in order to achieve either performance or correctness.
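To make the special-casing concrete, here's a quick sketch of the range traits involved (my own example, assuming the current std.range and std.utf definitions, where narrow strings are deliberately excluded from hasLength, hasSlicing, and isRandomAccessRange):

import std.range : hasLength, hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

// As a range, a string is presented as a bidirectional range of dchar,
// so the properties that the underlying array obviously has get hidden.
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// Wrapping it in byCodeUnit gives them all back.
static assert(hasLength!(typeof("".byCodeUnit)));
static assert(hasSlicing!(typeof("".byCodeUnit)));
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));

void main() {}

Every function that wants to take advantage of length, slicing, or random access either has to special-case strings internally or refuse to accept them, and that's where a lot of the cruft in Phobos comes from.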

What we have now isn't performant, correct, or flexible, and we'll be forever paying for that if we don't get rid of autodecoding.

I don't criticize Andrei in the least for coming up with it, since if you don't take graphemes into account (and he didn't know about them at the time), it seems like a great idea and allows us to be correct by default and performant if we put some effort into it. But after having seen how it's worked out - how much code has to be special-cased, how much confusion there is over it, and how it's not actually correct anyway - I think that it's quite clear that autodecoding was a mistake. And at this point, it's mainly a question of how we can get rid of it without being too disruptive and whether we can convince Andrei that it makes sense to make the change, since he still seems to think that autodecoding is fine in spite of the fact that it's neither performant nor correct.

It may be that the decision will be that removing autodecoding is too disruptive, but I think that that's really a question of whether we can find a way to do it without breaking tons of code, rather than whether the performance and correctness gains are worth it.

- Jonathan M Davis
