On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
[snip]

I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.
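To make the code unit versus code point distinction concrete, here is a minimal sketch, assuming current Phobos behaviour where range primitives on `string` autodecode. The helper `countNonAscii` is a hypothetical name used only for illustration.

```d
import std.range.primitives : front;

// Hypothetical helper: counts bytes outside the ASCII range.
// Indexing a string yields raw UTF-8 code units, so nothing is decoded here.
size_t countNonAscii(string s)
{
    size_t n;
    foreach (i; 0 .. s.length)
        if (s[i] >= 0x80)
            ++n;
    return n;
}

unittest
{
    string s = "a\u00E9"; // 'a' plus U+00E9, which takes two UTF-8 code units

    static assert(is(typeof(s[0]) == immutable(char))); // indexing: code unit
    static assert(is(typeof(s.front) == dchar));        // range API: code point

    assert(countNonAscii(s) == 2);
}
```

Compiling with `-unittest -main` runs the checks. The point is only that the element type changes depending on how the string is accessed.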

I disagree. "If used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible, without hidden overhead. To tell the user later that s/he shouldn't _naively_ have used a certain algorithm provided by the library is a bit cynical. Having to redesign a code base because of hidden behavior is a big turn-off, and going through Phobos to determine where the hidden pitfalls are is not the user's job.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.
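For readers unfamiliar with the test being discussed, the following is a rough sketch of an ASCII fast path in front of a full decode. It is not Phobos' actual implementation; `decodeNext` is a hypothetical name, and only `std.utf.decode` is a real library call.

```d
import std.utf : decode;

// Hypothetical helper: returns the code point starting at s[i] and advances i.
dchar decodeNext(const(char)[] s, ref size_t i)
{
    immutable char c = s[i];
    if (c < 0x80)       // ASCII: one code unit, no decoding needed
    {
        ++i;
        return c;
    }
    return decode(s, i); // multi-byte sequence: full UTF-8 decode
}

unittest
{
    size_t i = 0;
    string s = "a\u00E9";
    assert(decodeNext(s, i) == 'a' && i == 1);      // fast path
    assert(decodeNext(s, i) == '\u00E9' && i == 3); // slow path, two code units
}
```

In ASCII-heavy input the branch almost always goes the same way, which is the predictability referred to above. In text dominated by multi-byte sequences the slow path is taken on most iterations.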

And what if you deal with non-ASCII-heavy text? Does the user have to guess and micro-optimize for simple use cases?

5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters
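A small sketch of why the two groups differ, assuming autodecoding Phobos. The predicate `isPunct` is a hypothetical name wrapping the punctuation test from the examples above; `byCodeUnit` is used to opt out of decoding explicitly.

```d
import std.algorithm.searching : canFind, count;
import std.range : walkLength;
import std.utf : byCodeUnit;

// Hypothetical predicate mirroring the punctuation test above.
bool isPunct(dchar c) { return "!()-;:,.?".canFind(c); }

unittest
{
    string s = "h\u00E9llo, w\u00F6rld!"; // two characters need 2 code units each

    assert(s.length == 15);                // code units
    assert(s.walkLength == 13);            // code points (autodecoded)
    assert(s.byCodeUnit.walkLength == 15); // decoding bypassed

    // ASCII punctuation: same answer at either level, so decoding is unnecessary.
    assert(s.count!isPunct == 2);
    assert(s.byCodeUnit.count!isPunct == 2);

    // The negated predicate also matches every byte of a multi-byte
    // sequence, so the code unit answer differs from the code point answer.
    assert(s.count!(c => !isPunct(c)) == 11);
    assert(s.byCodeUnit.count!(c => !isPunct(c)) == 13);
}
```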

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

But how is the user supposed to know without being a core contributor to Phobos? If using a library method that works well in one case can slow down your code in a slightly different case, something is wrong with the language/library design. For simple cases the burden shouldn't be on the user, or, if it is, s/he should be informed about it in order to be able to make well-informed decisions. Personally I wouldn't mind having to decide in each case what I want (provided I have a best practices cheat sheet :)), so I can get the best out of it. But to keep guessing, testing and benchmarking each string handling library function is not good at all.
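As a sketch of what one entry of such a cheat sheet could look like, the level of iteration can be stated explicitly at the call site rather than left to the default. The three helper names below are hypothetical; `byCodeUnit`, `byDchar` and `byGrapheme` are existing Phobos functions.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers that name the chosen level explicitly.
size_t codeUnitCount(string s)  { return s.byCodeUnit.walkLength; }
size_t codePointCount(string s) { return s.byDchar.walkLength; }
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; }

unittest
{
    string s = "noe\u0308l"; // 'e' followed by a combining diaeresis (U+0308)

    assert(codeUnitCount(s) == 6);  // UTF-8 code units
    assert(codePointCount(s) == 5); // code points
    assert(graphemeCount(s) == 4);  // user-perceived characters
}
```

Which level is right depends on the task; the sketch only shows that the choice can be made visible in the code.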

[snip]
