This might be a good time to discuss this a tad further. I'd appreciate it if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision. The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>> I am as unclear about the problems of autodecoding as I am about the
>> necessity to remove curl. Whenever I ask I hear some arguments that
>> work well emotionally but are scant on reason and engineering. Maybe
>> it's time to rehash them? I just did so about curl, no solid argument
>> seemed to come together. I'd be curious of a crisp list of grievances
>> about autodecoding. -- Andrei

> Here are some that are not matters of opinion.

> 1. Ranges of characters do not autodecode, but arrays of characters do.
> This is a glaring inconsistency.

Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that this shouldn't have been the case.

> 2. Every time one wants an algorithm to work with both strings and
> ranges, you wind up special casing the strings to defeat the
> autodecoding, or to decode the ranges. Having to constantly special case
> it makes for more special cases when plugging together components. These
> issues often escape detection when unittesting because it is convenient
> to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.
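
Here is a minimal sketch of the special-casing pattern (skipOne is a hypothetical name, not a Phobos function): it steps a narrow string forward by one code unit via slicing, where the generic path would decode a full code point.

import std.range.primitives : popFront;
import std.traits : isNarrowString;

// Advance a range by one element; narrow strings are special-cased
// to step one code unit and thereby defeat autodecoding.
void skipOne(R)(ref R r)
{
    static if (isNarrowString!R)
        r = r[1 .. $];     // one code unit, no decode
    else
        r.popFront();      // generic path
}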

> 3. Wrapping an array in a struct with an alias this to an array turns
> off autodecoding, another special case.

This is also a consequence of 1.

> 4. Autodecoding is slow and has no place in high speed string processing.

I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and should arrange code to minimize its impact.
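
To make that concrete, here's a minimal sketch of the tactical arrangement (asciiPrefixLength is a hypothetical name, not Phobos code): scan the leading ASCII run with the predictable c < 0x80 test, and decode only past it.

import std.string : representation;

// Length of the leading all-ASCII run of s; callers can process
// s[0 .. asciiPrefixLength(s)] byte by byte and decode only the rest.
size_t asciiPrefixLength(string s)
{
    size_t i;
    foreach (c; s.representation)   // raw code units, no decoding
    {
        if (c >= 0x80) break;       // the highly predictable branch
        ++i;
    }
    return i;
}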

> 5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately about each case. The following uses don't need decoding, and the standard library correctly doesn't involve it (or, if it currently does, that's a bug):

s.find("abc")
s.findSplit("abc")
s.find('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However, the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level, even though internally it may choose to use code units when admissible. Leaving such decisions to the library seems like a wise thing to do.
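
For example, counting the elements of "café" gives different answers at the two levels (this assumes std.utf.byCodeUnit as the explicit opt-in to code unit iteration):

import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    auto s = "café";                       // 'é' takes two UTF-8 code units
    assert(s.walkLength == 4);             // code points, autodecoded
    assert(s.byCodeUnit.walkLength == 5);  // code units, no decoding
}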

> 6. Autodecoding has two choices when encountering invalid code units -
> throw or produce an error dchar. Currently, it throws, meaning no
> algorithms using autodecode can be made nothrow.

Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
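
A minimal sketch of the current throwing behavior (the \xFF byte is just an illustrative invalid sequence):

import std.range : walkLength;
import std.utf : UTFException;

void main()
{
    auto s = "abc\xFFdef";        // \xFF is not valid UTF-8
    bool threw = false;
    try
        s.walkLength;             // autodecoding hits the bad byte
    catch (UTFException e)
        threw = true;
    assert(threw);                // so no autodecoding algorithm is nothrow
}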

> 7. Autodecode cannot be used with Unicode path/filenames, because it is
> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
> out in the wild that pure Unicode is not universal - there's lots of
> dirty Unicode that should remain unmolested, and autodecode does not
> play well with that.

If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
> considerable extra work every time. A convenient timesaver it ain't.

Objection. Vague.

> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecode is there.

Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
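
Concretely, a minimal example of the .representation escape hatch:

import std.string : representation;

void main()
{
    auto s = "café";
    auto units = s.representation;   // immutable(ubyte)[], no autodecoding
    assert(units.length == 5);       // five UTF-8 code units
    assert(units[4] == 0xA9);        // raw byte access
}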

> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
> benefit of being arrays in the first place.

First off, you always have the option of .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
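
For instance:

import std.range : isRandomAccessRange;
import std.string : representation;

// string is not a random-access range under autodecoding,
// but its representation is.
static assert(!isRandomAccessRange!string);
static assert(isRandomAccessRange!(typeof("".representation)));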

Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by holding UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[], and why do we need char as a separate type from ubyte?

This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce this, except that it does special things in random places, such as foreach loops (any others?).

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.
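
A minimal illustration of that distinction in today's behavior:

import std.range.primitives : front;
import std.string : representation;

void main()
{
    auto s = "é";
    static assert(is(typeof(s.front) == dchar));  // char[]: decoded code point
    static assert(is(typeof(s.representation.front) == immutable(ubyte)));  // ubyte[]: raw byte
}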

> 11. Indexing an array produces different results than autodecoding,
> another glaring special case.

This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
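
The special case in numbers (indexing sees code units, iteration sees code points):

import std.range.primitives : front;

void main()
{
    auto s = "é";
    assert(s.front == 'é');   // autodecoding yields the code point U+00E9
    assert(s[0] == 0xC3);     // indexing yields the first UTF-8 code unit
}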

Overall, I think the one way to make real progress in improving string processing in the D language is to give a clear answer to what char, wchar, and dchar mean.


Andrei
