This might be a good time to discuss this a tad further. I'd appreciate it if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision. The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>> I am as unclear about the problems of autodecoding as I am about the
>> necessity to remove curl. Whenever I ask I hear some arguments that
>> work well emotionally but are scant on reason and engineering. Maybe
>> it's time to rehash them? I just did so about curl, no solid argument
>> seemed to come together. I'd be curious of a crisp list of grievances
>> about autodecoding. -- Andrei

> Here are some that are not matters of opinion.

> 1. Ranges of characters do not autodecode, but arrays of characters do.
> This is a glaring inconsistency.

Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that this shouldn't have been the case.

> 2. Every time one wants an algorithm to work with both strings and
> ranges, you wind up special casing the strings to defeat the
> autodecoding, or to decode the ranges. Having to constantly special case
> it makes for more special cases when plugging together components. These
> issues often escape detection when unittesting because it is convenient
> to unittest only with arrays.

This is a consequence of 1. It is at least partially fixable.
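
Here is a minimal sketch of the special-casing pattern (skipOne is a hypothetical name, not a Phobos function): it steps a narrow string forward by one code unit via slicing, where the generic path would decode a full code point.

import std.range.primitives : popFront;
import std.traits : isNarrowString;

// Advance a range by one element; narrow strings are special-cased
// to step one code unit and thereby defeat autodecoding.
void skipOne(R)(ref R r)
{
    static if (isNarrowString!R)
        r = r[1 .. $];     // one code unit, no decode
    else
        r.popFront();      // generic path
}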

> 3. Wrapping an array in a struct with an alias this to an array turns
> off autodecoding, another special case.

This is also a consequence of 1.

> 4. Autodecoding is slow and has no place in high speed string processing.

I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and should arrange code to minimize its impact.
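
To make that concrete, here's a minimal sketch of the tactical arrangement (asciiPrefixLength is a hypothetical name, not Phobos code): scan the leading ASCII run with the predictable c < 0x80 test, and decode only past it.

import std.string : representation;

// Length of the leading all-ASCII run of s; callers can process
// s[0 .. asciiPrefixLength(s)] byte by byte and decode only the rest.
size_t asciiPrefixLength(string s)
{
    size_t i;
    foreach (c; s.representation)   // raw code units, no decoding
    {
        if (c >= 0x80) break;       // the highly predictable branch
        ++i;
    }
    return i;
}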

> 5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately about each case. The following uses don't need decoding, and the standard library correctly doesn't involve it (or, if it currently does, that's a bug):

s.find("abc")
s.findSplit("abc")
s.find('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However, the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level, even though internally it may choose to use code units when admissible. Leaving such decisions to the library seems like a wise thing to do.
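
For example, counting the elements of "café" gives different answers at the two levels (this assumes std.utf.byCodeUnit as the explicit opt-in to code unit iteration):

import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    auto s = "café";                       // 'é' takes two UTF-8 code units
    assert(s.walkLength == 4);             // code points, autodecoded
    assert(s.byCodeUnit.walkLength == 5);  // code units, no decoding
}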

> 6. Autodecoding has two choices when encountering invalid code units -
> throw or produce an error dchar. Currently, it throws, meaning no
> algorithms using autodecode can be made nothrow.

Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
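
A minimal sketch of the current throwing behavior (the \xFF byte is just an illustrative invalid sequence):

import std.range : walkLength;
import std.utf : UTFException;

void main()
{
    auto s = "abc\xFFdef";        // \xFF is not valid UTF-8
    bool threw = false;
    try
        s.walkLength;             // autodecoding hits the bad byte
    catch (UTFException e)
        threw = true;
    assert(threw);                // so no autodecoding algorithm is nothrow
}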

> 7. Autodecode cannot be used with Unicode path/filenames, because it is
> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
> out in the wild that pure Unicode is not universal - there's lots of
> dirty Unicode that should remain unmolested, and autodecode does not
> play well with that.

If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
> considerable extra work every time. A convenient timesaver it ain't.

Objection. Vague.

> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecode is there.

Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
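
Concretely, a minimal example of the .representation escape hatch:

import std.string : representation;

void main()
{
    auto s = "café";
    auto units = s.representation;   // immutable(ubyte)[], no autodecoding
    assert(units.length == 5);       // five UTF-8 code units
    assert(units[4] == 0xA9);        // raw byte access
}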

> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
> benefit of being arrays in the first place.

First off, you always have the option of .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
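
For instance:

import std.range : isRandomAccessRange;
import std.string : representation;

// string is not a random-access range under autodecoding,
// but its representation is.
static assert(!isRandomAccessRange!string);
static assert(isRandomAccessRange!(typeof("".representation)));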

Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by holding UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[], and why do we need char as a separate type from ubyte?

This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce this, except that it does special things in random places, such as foreach loops (any others?).

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.
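
A minimal illustration of that distinction in today's behavior:

import std.range.primitives : front;
import std.string : representation;

void main()
{
    auto s = "é";
    static assert(is(typeof(s.front) == dchar));  // char[]: decoded code point
    static assert(is(typeof(s.representation.front) == immutable(ubyte)));  // ubyte[]: raw byte
}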

> 11. Indexing an array produces different results than autodecoding,
> another glaring special case.

This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
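
The special case in numbers (indexing sees code units, iteration sees code points):

import std.range.primitives : front;

void main()
{
    auto s = "é";
    assert(s.front == 'é');   // autodecoding yields the code point U+00E9
    assert(s[0] == 0xC3);     // indexing yields the first UTF-8 code unit
}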

Overall, I think the one way to make real progress in improving string processing in the D language is to give a clear answer to what char, wchar, and dchar mean.


Andrei
