On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
wrote:
[snip]
> I would agree only with the amendment "...if used naively",
> which is important. Knowledge of how autodecoding works is a
> prerequisite for writing fast string code in D. Also, little
> code should deal with one code unit or code point at a time;
> instead, it should use standard library algorithms for
> searching, matching etc. When needed, iterating every code unit
> is trivially done through indexing.
I disagree. "If used naively" shouldn't be the default. A user
(naively) expects string algorithms to work as efficiently as
possible, without hidden overhead. To tell the user later that s/he
shouldn't _naively_ have used a certain algorithm provided by the
library is a bit cynical. Having to redesign a code base because
of hidden behavior is a big turn-off, and having to go through
Phobos to determine where the hidden pitfalls are is not the
user's job.
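To make "hidden behavior" concrete, here is a minimal sketch of what I mean. The sample string is my own; the point is only that indexing and .length see code units, while the range primitives see decoded code points unless you opt out with byCodeUnit.

```d
import std.range : ElementType, walkLength;
import std.utf : byCodeUnit;

void main()
{
    // U+00E9 (e with acute accent) takes two UTF-8 code units.
    string s = "h\u00E9llo";

    // Indexing and .length operate on code units (char).
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 6);

    // Range primitives autodecode: as a range, a string yields dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 5);

    // byCodeUnit opts out of decoding explicitly.
    assert(s.byCodeUnit.walkLength == 6);
}
```

Nothing at the call site of walkLength tells you that decoding happens; you have to know it.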
> Also allow me to point that much of the slowdown can be
> addressed tactically. The test c < 0x80 is highly predictable
> (in ASCII-heavy text) and therefore easily speculated. We can
> and we should arrange code to minimize impact.
And what if you deal with text that is heavy in non-ASCII
characters? Does the user have to guess and micro-optimize for
simple use cases?
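For reference, this is roughly the kind of loop the c < 0x80 test lives in. It is only a sketch under my own assumptions, not the actual Phobos code: countCodePoints is a hypothetical helper, and std.utf.decode stands in for whatever the library does internally.

```d
import std.utf : decode;

/// Hypothetical helper: counts code points with an ASCII fast path.
size_t countCodePoints(string s)
{
    size_t n = 0;
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
            ++i;          // ASCII: one code unit, nothing to decode
        else
            decode(s, i); // multi-byte: decode advances i past the sequence
        ++n;
    }
    return n;
}
```

With mostly ASCII input the branch almost always goes the same way. With text that mixes scripts, it alternates between the two paths, which is exactly the case I am asking about.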
> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the
> right thing instead of having the user wonder separately for
> each case. These uses don't need decoding, and the standard
> library correctly doesn't involve it (or if it currently does
> it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level
> even though inside it may choose to use code units when
> admissible. Leaving such a decision to the library seems like a
> wise thing to do.
But how is the user supposed to know without being a core
contributor to Phobos? If using a library method that works well
in one case can slow down your code in a slightly different case,
something is wrong with the language/library design. For simple
cases the burden shouldn't be on the user, or, if it is, s/he
should be informed about it in order to be able to make
well-informed decisions. Personally I wouldn't mind having to
decide in each case what I want (provided I have a best practices
cheat sheet :)), so I can get the best out of it. But having to
keep guessing, testing and benchmarking each string handling
library function is not good at all.
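As a sketch of what "deciding in each case" could look like, here are the two count examples from above with the choice spelled out at the call site. The helper names and the punctuation constant are mine, purely for illustration; byCodeUnit and byDchar are the existing std.utf adapters.

```d
import std.algorithm.searching : canFind, count;
import std.utf : byCodeUnit, byDchar;

// Hypothetical names, for illustration only.
enum punctuation = "!()-;:,.?";

/// All punctuation here is ASCII, so scanning code units is enough.
size_t countPunctuation(string s)
{
    return s.byCodeUnit.count!(c => punctuation.canFind(c));
}

/// Counting everything else must count code points, so decode explicitly.
size_t countNonPunctuation(string s)
{
    return s.byDchar.count!(c => !punctuation.canFind(c));
}
```

For "No\u00EBl, d\u00E9j\u00E0 vu!" (14 code points in 17 code units) the first returns 2 and the second 12. Running the second predicate over code units instead would give 15, which is why that case needs decoding while the first does not.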
[snip]