Re: Major performance problem with std.array.front()

monarch_dodra Sun, 09 Mar 2014 06:05:48 -0700

On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:

On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of graphemes, it was still a design decision made of win.
I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.
Why do you think it is better?
Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either.
I think this is the main confusion: the belief that iterating by code point has utility.
If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets).
If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.

IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:

1. It occurs only in special cases that the program should be aware of beforehand.

2. Arguably, it should be taken care of eagerly, or in a special pass.

As for "the belief that iterating by code point has utility." I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.
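To make the code unit / code point distinction concrete, here is a minimal sketch of what Phobos' default decoding does for a `string`. The helper name `codePointCount` is made up for illustration and is not a Phobos function.

```d
import std.range : front, walkLength;

/// Hypothetical helper: counts code points by walking the string
/// through front/popFront, which decode UTF-8 on the fly.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    // 'é' written as the single precomposed code point U+00E9,
    // which takes two code units in UTF-8.
    string s = "h\u00E9llo";

    // The range interface hands out decoded code points...
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'h');
    assert(codePointCount(s) == 5);

    // ...while .length and indexing still work on code units.
    assert(s.length == 6);
    assert(s[1] == 0xC3); // first byte of the UTF-8 encoding of U+00E9
}
```

So the same `string` looks like five elements to range algorithms and like six to array operations.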

As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.

AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
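Point (a) can be reproduced with a small sketch. The helper `containsCodePoint` is a hypothetical name used only here; `canFind` and `normalize` are the existing Phobos functions from std.algorithm and std.uni.

```d
import std.algorithm : canFind;
import std.uni : NFC, normalize;

/// Hypothetical helper: code-point-level search, relying on the
/// default decoding of string ranges.
bool containsCodePoint(string s, dchar c)
{
    return s.canFind(c);
}

void main()
{
    dchar eAcute = '\u00E9';            // precomposed 'é'
    string precomposed = "caf\u00E9";   // ends in U+00E9
    string decomposed  = "cafe\u0301";  // 'e' followed by combining acute U+0301

    // Both render the same, but only one contains U+00E9 as a code point.
    assert(containsCodePoint(precomposed, eAcute));
    assert(!containsCodePoint(decomposed, eAcute));

    // Normalizing to NFC first makes the code point search find it.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(containsCodePoint(normalize!NFC(decomposed), eAcute));
}
```

This is the "special pass" kind of handling: the search itself stays by code point, and normalization is done up front by code that knows it needs it.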

But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

AFAIK, the most common algorithm "case insensitive search" *must* decode.

There may still be cases where it is still not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with code units.

To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer?

To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?

I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK).

On the other hand, there are plenty of cases of bugs from attempting to not decode strings, or from incorrectly decoding strings. They are being corrected on a continuous basis.

Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least*, the result is not a corrupted UTF-8 stream.
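A rough sketch of why sorting a char array in place is dangerous, assuming a string with at least one non-ASCII character. The helpers `sortCodePoints` and `sortCodeUnits` are illustrative names, not Phobos functions.

```d
import std.algorithm : sort;
import std.array : array;
import std.conv : to;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

/// Hypothetical helper: decode to dchar[], sort, re-encode as UTF-8.
string sortCodePoints(string s)
{
    dchar[] points = s.array; // array() on a narrow string yields dchar[]
    sort(points);
    return points.to!string;
}

/// Hypothetical helper: sort the raw UTF-8 bytes, ignoring the encoding.
char[] sortCodeUnits(string s)
{
    char[] raw = s.dup;
    sort(raw.representation); // sorts the same memory, viewed as ubyte[]
    return raw;
}

void main()
{
    // With default decoding, a char[] is not a random-access range,
    // so sorting it directly is rejected at compile time.
    char[] direct = "d\u00E9ca".dup;
    static assert(!__traits(compiles, sort(direct)));

    // Sorting by code point keeps the result valid UTF-8.
    assert(sortCodePoints("d\u00E9ca") == "acd\u00E9");

    // Sorting by code unit separates the two bytes of 'é' and
    // reorders them, so the result no longer validates.
    assertThrown!UTFException(validate(sortCodeUnits("d\u00E9ca")));
}
```

For pure ASCII input both helpers give the same answer, which is why the problem is easy to miss in testing.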

Walter keeps grinding on about "myCharArray.put('é')" not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work.

In particular, in all these cases, a simple call to "representation" will deactivate the feature, giving you the tools you want.
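As a minimal sketch of that opt-out: std.string.representation returns the same data typed as `immutable(ubyte)[]`, so range algorithms see bytes and do not decode. The helper `countByte` is a made-up name for illustration.

```d
import std.algorithm : count;
import std.range : ElementType;
import std.string : representation;

/// Hypothetical helper: counts occurrences of an ASCII character
/// by scanning raw UTF-8 code units, with no decoding.
size_t countByte(string s, char c)
{
    return s.representation.count(cast(ubyte) c);
}

void main()
{
    string s = "na\u00EFve caf\u00E9"; // 10 code points, 12 UTF-8 code units

    // As a range, a string yields decoded dchar elements...
    static assert(is(ElementType!string == dchar));

    // ...but its representation is a plain array of bytes.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 12);

    // Searching for an ASCII byte is safe on raw UTF-8, because ASCII
    // values never appear inside a multi-byte sequence.
    assert(countByte(s, 'a') == 2);
}
```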

I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.

Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding.

I think opt-out of decoding is just a much much much saner approach to string handling.
