On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
My only claim is that recognizing and iterating strings by code point
is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

Doing my best to weigh everything with the right measures.

1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
both returning integers, yet possibly having different values in
circumstances that the developer may not foresee.

I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

auto pos = s.countUntil(sub); // counts code points
writeln(s[pos..$]);           // but slicing is by code unit

This will compile, and work for English text. For someone without
complete knowledge of Phobos functions and how D handles Unicode, it is
not obvious that this code is actually wrong.
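
To make it concrete, here is a toy sketch (hypothetical values; 'ä'
occupies two UTF-8 code units):

import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    string s = "häystack";            // 'ä' is two code units
    auto pos = s.countUntil("stack"); // 3 - counts code points
    auto idx = s.indexOf("stack");    // 4 - counts code units
    writeln(s[pos .. $]);             // "ystack" - silently wrong
    writeln(s[idx .. $]);             // "stack" - correct
}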

I agree. At one point or another, the dual nature of strings (and the dual means of iterating them) will cause trouble for the unwary.

In certain situations,
this can have devastating effects: consider, for example, if this code
is extracting a slice from a string that elsewhere contains sensitive
data (e.g. a configuration file containing, among other data, a
password).

Whaaa, passwords in clear?

An attacker could supply a Unicode string where the developer did not
expect it, causing "pos" to have a smaller value than the corresponding
indexOf result and thus revealing a slice of "s" which was not intended
to be visible. A developer therefore currently needs to tread very
carefully wherever he is slicing strings, so as not to accidentally use
indices obtained from functions that count code points.
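
A hypothetical sketch of such a leak (toy data; the intent is to print
only what follows the newline):

import std.algorithm : countUntil;
import std.stdio : writeln;

void main()
{
    string s = "password=äëïöü\nmotd=hello"; // hypothetical data
    auto pos = s.countUntil('\n'); // 14 code points precede '\n',
                                   // but its byte offset is 19
    writeln(s[pos + 1 .. $]);      // prints "öü\nmotd=hello" -
                                   // leaking the password's tail
}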

Okay, though when you opened with "devastating" I was hoping for nothing short of death and dismemberment. Anyhow, the fix is obvious per this brief tutorial: http://www.youtube.com/watch?v=hkDD03yeLnU

2. Very high complexity of implementations (the ElementEncodingType
problem previously mentioned).

I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT of
weight if they were fixed to not treat strings specially.

I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need to be written anyway. You would, indeed, save a fair number of isSomeString conditionals (thus simplifying the scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.
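
A toy sketch of the pattern I mean - not the actual Phobos source - a
generic algorithm that detects the string case and forwards to a
code-unit fast path:

import std.traits : isSomeString;

size_t occurrences(R)(R haystack, dchar needle)
{
    static if (isSomeString!R)
    {
        if (needle < 0x80) // ASCII needle: raw code units suffice
        {
            size_t n;
            foreach (char c; haystack) // code-unit loop, no decoding
                if (c == needle) ++n;
            return n;
        }
    }
    size_t n;
    foreach (dchar c; haystack) // general path: decodes strings
        if (c == needle) ++n;
    return n;
}

unittest
{
    assert(occurrences("häy hay", 'h') == 2);
    assert(occurrences("häy hay", 'ä') == 1);
}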

Besides, if you want to do Unicode, you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding
range would have simplified things greatly while offering many of the
same advantages.
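
A minimal hand-rolled sketch of what I mean (the names are
illustrative): strings would iterate by code unit by default, and
decoding would happen only through an explicit range like this one:

import std.utf : decode;

struct ByCodePoint
{
    string s;
    @property bool empty() { return s.length == 0; }
    @property dchar front() { size_t i = 0; return decode(s, i); }
    void popFront() { size_t i = 0; decode(s, i); s = s[i .. $]; }
}

ByCodePoint byCodePoint(string s) { return ByCodePoint(s); }

unittest
{
    import std.algorithm : equal;
    assert("häy".byCodePoint.equal("häy"d));
}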

My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - it would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.

Whether the fact that it is there "by default" is an advantage of the
current approach at all is debatable.

Clearly. If I were doing things over again, I'd definitely change a thing or two. (I wouldn't go with Walter's proposal, which I think is worse than what we have now.)

But the current approach has something very difficult to talk away: it's there. And that makes a whole lotta difference. Do I believe it's perfect? Hell no. Does it blunt much of the point of this debate? I'm afraid so.

3. Hidden, difficult-to-detect performance problems. The reason why this
thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to
be via profiling (at which point you're wondering why the @#$% the
thing is so slow) or by feeding it invalid UTF. If you had made a
different choice for Unicode in D, this problem would not exist at all.

Disagree.

Also I'd add that I'd rather not have hidden, difficult to detect
correctness problems.

Except we already do. Arguments have already been presented in this
thread that demonstrate correctness problems with the current approach.
I don't think that these can stand up to the problems that the simpler
by-char iteration approach would have.

Sure there are, and you yourself illustrated a misuse of the APIs. My point is: code point is better than code unit and not all that much slower. Grapheme is better than code point but a lot slower. It seems we're quite in a sweet spot here wrt performance/correctness.
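
The trade-off in a nutshell (a toy illustration; byGrapheme is in
std.uni):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis
    assert(s.length == 6);                // code units (UTF-8 bytes)
    assert(s.walkLength == 5);            // code points - the default
    assert(s.byGrapheme.walkLength == 4); // graphemes: n, o, ë, l
}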

4. Encourage D programmers to write Unicode-capable code that is correct
in the full sense of the word.

I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them
either - we are instead setting an example of how to do it in a wrong
(incomplete) way.

Code unit is what it is. Those programming for natural languages for which code units are not sufficient would need to exercise due diligence. We ought to help them without crippling efficiency.

I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread
being shoved behind the one word "clearer".

What is the issue you are having? I don't see a much better API being proposed. I see a marginally improved API at the very best, and one possibly quite a bit more error-prone.

But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some
possible changes without releasing them. I'm going to try them with my
own programs first to see how much it will break.

I think that's great.

I believe that you are
too eagerly dismissing all proposals without even evaluating them.

Perspective is everything, isn't it :o). I thought I was being reasonable and accepting in discussing a number of the proposed points, although in my heart of hearts many of the arguments seem rather frivolous.

With what has been put forward so far, that's not even close to justifying a breaking change. If that great better design is just getting back to code-unit iteration, the change will not happen while I work on D. It is possible, however, that a much better idea will come forward, and I'd look forward to that.

I think the above list has enough weight to merit at least considering
*some* breaking changes.

I think a better approach is to figure out what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.; see the sketch below)
- better documentation
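
For example, normalization matters because canonically equivalent
strings can compare unequal; a sketch using std.uni:

import std.uni;

void main()
{
    string precomposed = "caf\u00E9";  // "café", precomposed é
    string decomposed  = "cafe\u0301"; // "café", e + combining acute
    assert(precomposed != decomposed);               // bytewise different
    assert(decomposed.normalize!NFC == precomposed); // equal after NFC
}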

I was thinking of these too:

1. Revisit std.encoding and perhaps confer legitimacy to the character types defined there. The implementation in std.encoding is wanting, but I think the idea is sound. Essentially give more love to various encodings, including ASCII and "bypass encoding, I'll deal with stuff myself".

2. Add byChar that returns a random-access range iterating a string by char (i.e. code unit). Add byWchar that does on-the-fly transcoding to UTF-16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whoever wants to go through a string by code unit can just use str.byChar.
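
For illustration, a usage sketch of the proposed names (std.utf
nowadays provides byChar, byWchar, and byDchar along these lines):

import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

void main()
{
    string s = "häy";
    assert(s.byChar.walkLength == 4);  // UTF-8 code units
    assert(s.byWchar.walkLength == 3); // UTF-16 code units
    assert(s.byDchar.walkLength == 3); // code points
}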


Andrei
