On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
> My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

>> 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
>> both returning integers, yet possibly having different values in
>> circumstances that the developer may not foresee.

> I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

auto pos = s.countUntil(sub); // counts code points...
writeln(s[pos..$]);           // ...but slicing uses code-unit indices

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong. In certain situations, this can have devastating effects: consider, for example, code that extracts a slice from a string which elsewhere contains sensitive data (e.g. a configuration file containing, among other things, a password). An attacker could supply a Unicode string where the developer did not expect one, causing "pos" to have a smaller value than the corresponding indexOf result and revealing a slice of "s" that was not intended to be visible. As things stand, a developer needs to tread very carefully whenever slicing strings, so as not to accidentally use indices obtained from functions that count code points.
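
To make the failure mode concrete, here is a minimal, self-contained sketch (the string contents are invented for illustration; any non-ASCII character before the match point triggers the mismatch):

import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // "é" is one code point but two UTF-8 code units.
    string s = "héllo:password";

    auto byPoint = s.countUntil(':'); // 5 -- counts decoded code points
    auto byUnit  = s.indexOf(':');    // 6 -- counts UTF-8 code units

    writeln(s[byUnit .. $]);  // ":password" -- the intended slice
    writeln(s[byPoint .. $]); // "o:password" -- exposes an extra character
}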

>> 2. Very high complexity of implementations (the ElementEncodingType
>> problem previously mentioned).

I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT of weight if they were fixed to not treat strings specially.
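
The special-casing is easy to demonstrate; ElementType and ElementEncodingType are the existing Phobos templates involved:

import std.range : ElementEncodingType, ElementType;

void main()
{
    // A string is stored as immutable(char)[], but the range primitives
    // present it as a sequence of decoded code points, so generic code
    // has to juggle both element types.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementEncodingType!string == char));
}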

> Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages. Whether having it "by default" is an advantage of the current approach at all is debatable.
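
To be clear about what I mean: something along these lines, where decoding is requested explicitly instead of happening implicitly. This is a rough, hypothetical sketch, not an actual Phobos API:

import std.utf : decode;

// Hypothetical opt-in adapter: strings stay plain char ranges, and
// code-point iteration is spelled out, e.g.:
//     foreach (dchar c; byCodePoint(s)) { ... }
struct ByCodePoint
{
    string s;

    @property bool empty() { return s.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i); // decode the first code point
    }

    void popFront()
    {
        size_t i = 0;
        decode(s, i);   // measure the first code point's length
        s = s[i .. $];  // and drop it
    }
}

ByCodePoint byCodePoint(string s) { return ByCodePoint(s); }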

>> 3. Hidden, difficult-to-detect performance problems. The reason why this
>> thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via profiling (at which point you're wondering why the @#$% the thing is so slow) or by feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist at all.
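
For the record, once such a problem *is* found, the usual fix is to drop down to the raw representation, which sidesteps the implicit decoding (valid here only because the needle is a single ASCII code unit):

import std.algorithm : countUntil;
import std.string : representation;

void main()
{
    string s = "héllo:password";
    // representation yields immutable(ubyte)[], so countUntil compares
    // raw code units instead of decoding each element.
    auto pos = s.representation.countUntil(':'); // 6, a code-unit index
    auto slice = s[pos .. $]; // ":password" -- safe to slice with
}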

> Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.

Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think these are outweighed by the problems that the simpler by-char iteration approach would have.
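
One concrete instance of such a correctness problem: code points are still not "characters", so even fully decoded iteration produces wrong results for decomposed text. A small demonstration:

import std.conv : to;
import std.range : retro;
import std.stdio : writeln;

void main()
{
    // In decomposed form, "ë" is two code points: 'e' plus a combining
    // diaeresis. Reversing by code point attaches the accent to 'l'.
    string s = "noe\u0308l"; // "noël"
    writeln(s.retro.to!string); // prints "l̈eon", not "lëon"
}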

>> 4. Encourage D programmers to write Unicode-capable code that is correct
>> in the full sense of the word.

> I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.

> I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".

> But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first, to see how much they break. I believe that you are too eagerly dismissing all proposals without even evaluating them.

>> I think the above list has enough weight to merit at least considering
>> *some* breaking changes.

> I think a better approach is to figure what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.; see the sketch after this list)
- better documentation
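
A quick sketch of what this direction looks like in practice; normalize and byGrapheme are existing std.uni symbols:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize;

void main()
{
    string s = "noe\u0308l";          // decomposed "noël": 5 code points
    writeln(normalize(s));            // NFC by default: composed "noël"
    writeln(s.byGrapheme.walkLength); // 4 graphemes, in either form
}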
