On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.

Well, personally, I think that it's worth it even if the performance is identical (and it's guaranteed to be better without autodecoding - the only question is how much better - since there's simply less work to do). Simply operating at the code point level like we do now is the worst of all worlds in terms of flexibility and correctness. As long as the Unicode text is normalized, operating at the code unit level is the most efficient, and decoding is often unnecessary for correctness. And if you do need to decode, then you really need to go up to the grapheme level in order to operate on the full character, meaning that operating on code points has the same correctness problems as operating on code units. So, it's less performant without actually being correct. It just gives the illusion of correctness.
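For instance, here's a rough sketch of what the three levels look like for a single "character" built from a combining mark (my own illustration rather than anything from the Phobos docs, using std.range.walkLength and std.uni.byGrapheme):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme, two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.length == 3);                // code units (UTF-8)
    assert(s.walkLength == 2);            // code points - what autodecoding iterates
    assert(s.byGrapheme.walkLength == 1); // graphemes - the full character
}

Code point iteration splits that character in two just as surely as code unit iteration does; it just does more work while getting it wrong.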

By treating strings as ranges of code units, you don't take a performance hit when you don't need to, and it forces you to actually consider something like byDchar or byGrapheme if you want to operate on full Unicode characters. It's similar to how operating on UTF-16 code units as if they were characters (as Java and C# generally do) frequently gives the incorrect impression that you're handling Unicode correctly, because you have to work harder to come up with characters that don't fit in a single code unit, whereas with UTF-8, anything but ASCII is screwed if you treat code units as code points. Treating code points as if they were full characters, like we're doing now in Phobos with ranges, just makes it that much harder to notice that you're not handling Unicode correctly.
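Here's the UTF-16 version of that trap sketched in D (again, my own example - wstring is UTF-16, and anything outside the Basic Multilingual Plane requires a surrogate pair):

import std.range : walkLength;
import std.utf : byDchar;

void main()
{
    // U+1F600 is outside the BMP, so UTF-16 needs two code units for it
    // (a surrogate pair), and UTF-8 needs four.
    wstring w = "\U0001F600"w;
    string  s = "\U0001F600";

    assert(w.length == 2);             // UTF-16 code units - what Java/C# index by
    assert(s.length == 4);             // UTF-8 code units
    assert(s.byDchar.walkLength == 1); // a single code point either way
}

With UTF-16, you can go a long time before a test string happens to contain one of those, which is exactly why the bugs go unnoticed.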

Also, treating strings as ranges of code units makes it so that they're not so special and actually are treated like every other type of array, which eliminates a lot of the special-casing that we're forced to do right now. It also eliminates the confusion that folks keep running into when string doesn't work with many functions - because it's not a random-access range, because it doesn't have length, or because the resulting range isn't the same type (copy would be a prime example of a function that doesn't work with char[] when it should). By leaving in autodecoding, we're basically leaving technical debt in D permanently. We'll forever have to be explaining it to folks and forever have to be working around it in order to achieve either performance or correctness.
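To make the special-casing concrete, here's a quick sketch of the range traits involved (my own example, assuming the current std.range and std.utf definitions, where narrow strings are deliberately excluded from hasLength, hasSlicing, and isRandomAccessRange):

import std.range : hasLength, hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

// As a range, a string is presented as a bidirectional range of dchar,
// so the properties that the underlying array obviously has get hidden.
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// Wrapping it in byCodeUnit gives them all back.
static assert(hasLength!(typeof("".byCodeUnit)));
static assert(hasSlicing!(typeof("".byCodeUnit)));
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));

void main() {}

Every function that wants to take advantage of length, slicing, or random access either has to special-case strings internally or refuse to accept them, and that's where a lot of the cruft in Phobos comes from.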

What we have now isn't performant, correct, or flexible, and we'll be forever paying for that if we don't get rid of autodecoding.

I don't criticize Andrei in the least for coming up with it, since if you don't take graphemes into account (and he didn't know about them at the time), it seems like a great idea and allows us to be correct by default and performant if we put some effort into it. But after having seen how it's worked out - how much code has to be special-cased, how much confusion there is over it, and how it's not actually correct anyway - I think that it's quite clear that autodecoding was a mistake. And at this point, it's mainly a question of how we can get rid of it without being too disruptive and whether we can convince Andrei that it makes sense to make the change, since he still seems to think that autodecoding is fine in spite of the fact that it's neither performant nor correct.

It may be that the decision will be that removing autodecoding is too disruptive, but I think that that's really a question of whether we can find a way to do it without breaking tons of code, rather than whether the performance and correctness gains are worth it.

- Jonathan M Davis
