On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
> My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

>> 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
>> both returning integers, yet possibly having different values in
>> circumstances that the developer may not foresee.

> I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

auto pos = s.countUntil(sub); // counts code points...
writeln(s[pos..$]);           // ...but slicing uses code-unit indices

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong. In certain situations, this can have devastating effects: consider, for example, code that extracts a slice from a string which elsewhere contains sensitive data (e.g. a configuration file containing, among other things, a password). An attacker could supply a Unicode string where the developer did not expect one, causing "pos" to have a smaller value than the corresponding indexOf result and revealing a slice of "s" that was not intended to be visible. As things stand, a developer needs to tread very carefully whenever slicing strings, so as not to accidentally use indices obtained from functions that count code points.
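
To make the failure mode concrete, here is a minimal, self-contained sketch (the string contents are invented for illustration; any non-ASCII character before the match point triggers the mismatch):

import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // "é" is one code point but two UTF-8 code units.
    string s = "héllo:password";

    auto byPoint = s.countUntil(':'); // 5 -- counts decoded code points
    auto byUnit  = s.indexOf(':');    // 6 -- counts UTF-8 code units

    writeln(s[byUnit .. $]);  // ":password" -- the intended slice
    writeln(s[byPoint .. $]); // "o:password" -- exposes an extra character
}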

>> 2. Very high complexity of implementations (the ElementEncodingType
>> problem previously mentioned).

I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT of weight if they were fixed to not treat strings specially.
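
The special-casing is easy to demonstrate; ElementType and ElementEncodingType are the existing Phobos templates involved:

import std.range : ElementEncodingType, ElementType;

void main()
{
    // A string is stored as immutable(char)[], but the range primitives
    // present it as a sequence of decoded code points, so generic code
    // has to juggle both element types.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementEncodingType!string == char));
}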

> Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages. Whether having it "by default" is an advantage of the current approach at all is debatable.
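
To be clear about what I mean: something along these lines, where decoding is requested explicitly instead of happening implicitly. This is a rough, hypothetical sketch, not an actual Phobos API:

import std.utf : decode;

// Hypothetical opt-in adapter: strings stay plain char ranges, and
// code-point iteration is spelled out, e.g.:
//     foreach (dchar c; byCodePoint(s)) { ... }
struct ByCodePoint
{
    string s;

    @property bool empty() { return s.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i); // decode the first code point
    }

    void popFront()
    {
        size_t i = 0;
        decode(s, i);   // measure the first code point's length
        s = s[i .. $];  // and drop it
    }
}

ByCodePoint byCodePoint(string s) { return ByCodePoint(s); }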

>> 3. Hidden, difficult-to-detect performance problems. The reason why this
>> thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via profiling (at which point you're wondering why the @#$% the thing is so slow) or by feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist at all.
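
For the record, once such a problem *is* found, the usual fix is to drop down to the raw representation, which sidesteps the implicit decoding (valid here only because the needle is a single ASCII code unit):

import std.algorithm : countUntil;
import std.string : representation;

void main()
{
    string s = "héllo:password";
    // representation yields immutable(ubyte)[], so countUntil compares
    // raw code units instead of decoding each element.
    auto pos = s.representation.countUntil(':'); // 6, a code-unit index
    auto slice = s[pos .. $]; // ":password" -- safe to slice with
}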

> Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.

Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think these are outweighed by the problems that the simpler by-char iteration approach would have.
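
One concrete instance of such a correctness problem: code points are still not "characters", so even fully decoded iteration produces wrong results for decomposed text. A small demonstration:

import std.conv : to;
import std.range : retro;
import std.stdio : writeln;

void main()
{
    // In decomposed form, "ë" is two code points: 'e' plus a combining
    // diaeresis. Reversing by code point attaches the accent to 'l'.
    string s = "noe\u0308l"; // "noël"
    writeln(s.retro.to!string); // prints "l̈eon", not "lëon"
}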

>> 4. Encourage D programmers to write Unicode-capable code that is correct
>> in the full sense of the word.

> I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.

> I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".

> But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first, to see how much they break. I believe that you are too eagerly dismissing all proposals without even evaluating them.

>> I think the above list has enough weight to merit at least considering
>> *some* breaking changes.

> I think a better approach is to figure what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.; see the sketch after this list)
- better documentation
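
A quick sketch of what this direction looks like in practice; normalize and byGrapheme are existing std.uni symbols:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize;

void main()
{
    string s = "noe\u0308l";          // decomposed "noël": 5 code points
    writeln(normalize(s));            // NFC by default: composed "noël"
    writeln(s.byGrapheme.walkLength); // 4 graphemes, in either form
}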
