On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
>> 4. Autodecoding is slow and has no place in high speed string
>> processing.
> I would agree only with the amendment "...if used naively", which
> is important. Knowledge of how autodecoding works is a prerequisite
> for writing fast string code in D.
It is completely wasted mental effort.
>> 5. Very few algorithms require decoding.
> The key here is leaving it to the standard library to do the right
> thing instead of having the user wonder separately for each case.
> These uses don't need decoding, and the standard library correctly
> doesn't involve it (or if it currently does it has a bug):
>
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
As far as I can see, the language currently does not provide the
facilities to implement the above without autodecoding.
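For reference, here is the quoted expression in runnable form. Note that under the current behavior, `count` autodecodes, so the lambda's parameter is a `dchar` (a sketch, not an endorsement of either position):

```d
import std.algorithm : canFind, count;

void main()
{
    string s = "Hello, world!";
    // count autodecodes the string: each c below is a decoded dchar.
    auto n = s.count!(c => "!()-;:,.?".canFind(c));
    assert(n == 2); // ',' and '!'
}
```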
> However the following do require autodecoding:
>
> s.walkLength
Usage of the result of this expression will be incorrect in many
foreseeable cases.
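One such foreseeable case, as a minimal sketch: when "é" is written as a base letter plus a combining accent, walkLength's code-point count matches neither the code-unit count nor the user-perceived character count (the latter via std.uni.byGrapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "cafe\u0301"; // "café" with a combining acute accent
    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (what autodecoding yields)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```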
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
Ditto.
> s.count!(c => c >= 32) // non-control characters
Ditto, with a big red flag. If you are dealing with control
characters, the code is likely low-level enough that you need to be
explicit about what you are counting, and "code points >= 32" is
likely not what actually needs to be counted. Such confusion can
lead to security risks.
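One concrete instance of that confusion, as a sketch: U+0085 (NEL) is a C1 control character, yet it passes the `c >= 32` test; an explicit Unicode-aware predicate (std.uni.isControl) does not count it:

```d
import std.algorithm : count;
import std.uni : isControl;

void main()
{
    string s = "a\u0085b"; // contains U+0085 (NEL), a C1 control character
    // The naive predicate counts NEL as a "non-control" character...
    assert(s.count!(c => c >= 32) == 3);
    // ...whereas an explicit Unicode-aware predicate does not.
    assert(s.count!(c => !isControl(c)) == 2);
}
```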
> Currently the standard library operates at code point level even
> though inside it may choose to use code units when admissible.
> Leaving such a decision to the library seems like a wise thing to
> do.
It should be explicit.
>> 7. Autodecode cannot be used with Unicode paths/filenames, because
>> it is legal (at least on Linux) to have invalid UTF-8 as filenames.
>> It turns out in the wild that pure Unicode is not universal -
>> there's lots of dirty Unicode that should remain unmolested, and
>> autodecode does not play well with that.
> If paths are not UTF-8, then they shouldn't have string type
> (instead use ubyte[] etc.). More on that below.
This is not practical. Do you really see changing std.file and
std.path to accept ubyte[] for all path arguments?
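The failure mode under discussion is easy to reproduce (a sketch): any autodecoding range operation throws on invalid UTF-8 instead of passing the bytes through, so a legal-but-dirty filename cannot survive a round trip through such code.

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.utf : UTFException;

void main()
{
    // A byte sequence that is a legal Linux filename but not valid UTF-8.
    immutable ubyte[] raw = [0xFF, 'a', 'b'];
    string bad = cast(string) raw;
    // Autodecoding throws rather than yielding the raw bytes unmolested.
    assertThrown!UTFException(bad.walkLength);
}
```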
>> 8. In my work with UTF-8 streams, dealing with autodecode has
>> caused me considerable extra work every time. A convenient
>> timesaver it ain't.
> Objection. Vague.
I can confirm this vague subjective observation. For example,
DustMite reimplements some std.string functions in order to be
able to handle D files with invalid UTF-8 characters.
>> 9. Autodecode cannot be turned off, i.e. it isn't practical to
>> avoid importing std.array one way or another, and then autodecode
>> is there.
> Turning off autodecoding is as easy as inserting .representation
> after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable
string-handling code in D a chore. Because it is difficult to find
all the places where .representation must be added, it is not
possible to do on a program-wide scale, so bugs are only discovered
when this or that component fails because it was never tested with
Unicode strings.
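For concreteness, here is the .representation escape hatch both sides refer to, as a minimal sketch - it yields the string's bytes as immutable(ubyte)[], which no range primitive autodecodes:

```d
import std.algorithm : count;
import std.string : representation;

void main()
{
    string s = "héllo";
    auto bytes = s.representation; // immutable(ubyte)[] - no autodecoding
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 6);                // 'é' occupies two bytes
    assert(bytes.count!(b => b == 'l') == 2); // byte-level algorithms work
}
```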
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's
> a great name because it gives you the type used to represent the
> string - i.e. an array of integers of a specific width.
>
> Second, it's as it should be. The entire scaffolding rests on the
> notion that char[] is distinguished from ubyte[] by having UTF-8
> code units, not arbitrary bytes. It seems that many arguments
> against autodecoding are in fact arguments in favor of eliminating
> virtually all distinctions between char[] and ubyte[]. Then the
> natural question is, what _is_ the difference between char[] and
> ubyte[], and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer.
Why?
> What is the purpose of char, wchar, and dchar? My current
> understanding is that they're justified as pretty much
> indistinguishable in primitives and behavior from ubyte, ushort,
> and uint respectively, but they reflect a loose subjective intent
> from the programmer that they hold actual UTF code units. The core
> language does not enforce such, except it does special things in
> random places like for loops (any others?).
>
> If char is to be distinct from ubyte, and char[] is to be distinct
> from ubyte[], then autodecoding does the right thing: it makes sure
> they are distinguished in behavior and embodies the assumption that
> char is, in fact, a UTF-8 code unit.
I don't follow this line of reasoning at all.
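The concrete consequence of point 10 can be checked with the range traits directly (a small sketch): narrow strings are excluded from random access and even from hasLength by autodecoding, while dstring and the byte representation are not.

```d
import std.range : hasLength, isRandomAccessRange;

void main()
{
    // Autodecoding demotes narrow strings from random access...
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string); // .length is code units, not range length
    // ...while dstring and the ubyte representation keep it.
    static assert(isRandomAccessRange!dstring);
    static assert(isRandomAccessRange!(immutable(ubyte)[]));
}
```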
>> 11. Indexing an array produces different results than
>> autodecoding, another glaring special case.
> This is a direct consequence of the fact that string is
> immutable(char)[] and not a specific type. That error predates
> autodecoding.
There is no convincing argument why indexing and slicing should
not simply operate on code units.
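The special case in question, as a minimal sketch: on the same string, range primitives autodecode to code points while indexing and slicing operate on code units.

```d
import std.range : front;

void main()
{
    string s = "héllo";
    // Range primitives autodecode: front is a dchar, the full code point.
    static assert(is(typeof(s.front) == dchar));
    // Indexing operates on code units: s[1] is the first byte of 'é'.
    static assert(is(typeof(s[1]) == immutable(char)));
    assert(s.length == 6); // code units, not characters
    assert(s[1] == 0xC3);  // first byte of the two-byte encoding of 'é'
}
```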
> Overall, I think the one way to make real steps forward in
> improving string processing in the D language is to give a clear
> answer of what char, wchar, and dchar mean.
I don't follow. Though, making char implicitly convertible to
wchar and dchar has clearly been a mistake.