On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
4. Autodecoding is slow and has no place in high speed string processing.

I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.

It is completely wasted mental effort.

5. Very few algorithms require decoding.

The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or, if it currently does, that's a bug):

s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

As far as I can see, the language currently does not provide the facilities to implement the above without autodecoding.
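For context, the decoding-free formulation relies on the fact that ASCII bytes never occur inside a multi-byte UTF-8 sequence. A sketch using std.utf.byCodeUnit (whether a library escape hatch counts as "the language providing the facilities" is exactly what is in dispute here):

```d
import std.algorithm : canFind, count;
import std.utf : byCodeUnit;

void main()
{
    string s = "Hello, naïve world!";

    // Every character in the search set is ASCII, and ASCII bytes never
    // appear inside a multi-byte UTF-8 sequence, so matching raw code
    // units gives the same answer as decoding first would.
    assert(s.byCodeUnit.count!(c => "!()-;:,.?".canFind(c)) == 2);
}
```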

However the following do require autodecoding:

s.walkLength

Usage of the result of this expression will be incorrect in many foreseeable cases.
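To make the failure mode concrete, here is a sketch contrasting code units, code points, and graphemes; the combining-accent case is one of those foreseeable cases:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" spelled as 'e' followed by a combining acute accent (U+0301).
    string s = "cafe\u0301";

    assert(s.length == 6);                 // UTF-8 code units
    assert(s.walkLength == 5);             // code points, via autodecoding
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
}
```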

s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation

Ditto.

s.count!(c => c >= 32) // non-control characters

Ditto, with a big red flag. If you are dealing with control characters, the code is likely low-level enough that you need to be explicit in what you are counting. It is likely not what actually needs to be counted. Such confusion can lead to security risks.
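A sketch of the trap being pointed at: C1 control characters (U+0080 through U+009F) pass the `c >= 32` test even though they are controls, so the autodecoded count silently includes them:

```d
import std.algorithm : count;
import std.uni : isControl;

void main()
{
    // U+0085 (NEL) is a C1 control character, yet its code point is >= 32.
    string s = "ab\u0085";

    assert(s.count!(c => c >= 32) == 3);       // NEL slips through
    assert(s.count!(c => !c.isControl) == 2);  // explicit intent catches it
}
```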

Currently the standard library operates at code point level, even though internally it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

It should be explicit.

7. Autodecode cannot be used with Unicode paths/filenames, because it is legal (at least on Linux) to have invalid UTF-8 in filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecoding does not play well with that.

If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

This is not practical. Do you really see changing std.file and std.path to accept ubyte[] for all path arguments?
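For illustration, a sketch of what happens when autodecoding meets such a "dirty" filename (the byte value here is made up, but any sequence that is not valid UTF-8 behaves the same):

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : UTFException;

void main()
{
    // Legal as a Linux filename, but not valid UTF-8.
    string dirty = "report\xFF.txt";

    // Any autodecoding operation throws when it hits the bad byte...
    bool threw;
    try { auto n = dirty.walkLength; }
    catch (UTFException) { threw = true; }
    assert(threw);

    // ...while the code-unit view processes it without complaint.
    assert(dirty.representation.length == 11);
}
```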

8. In my work with UTF-8 streams, dealing with autodecode has caused me considerably extra work every time. A convenient timesaver it ain't.

Objection. Vague.

I can confirm this vague subjective observation. For example, DustMite reimplements some std.string functions in order to be able to handle D files with invalid UTF-8 characters.

9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.

Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)

This is neither easy nor practical. It makes writing reliable string-handling code in D a chore. Because it is difficult to find all the places where this must be done, it is not possible to do on a program-wide scale, so bugs surface only when this or that component fails because it was never tested with Unicode strings.
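For readers following along, the idiom under discussion looks like this; whether sprinkling it through a whole program is acceptable is the point of contention:

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    string s = "naïve";

    assert(s.walkLength == 5);  // autodecoded: counts code points

    // .representation reinterprets the same memory as immutable(ubyte)[],
    // so range algorithms see raw code units and no decoding happens.
    assert(s.representation.length == 6);  // 'ï' occupies two bytes
}
```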

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
benefit of being arrays in the first place.

First off, you always have the option of .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.

Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by holding UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[], and why do we need char as a separate type from ubyte?

This is a fundamental question for which we need a rigorous answer.

Why?

What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any others?).
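One of those special places, sketched out: in a foreach over a string, the declared element type decides whether decoding happens:

```d
void main()
{
    string s = "é";  // one code point, two UTF-8 code units

    int units, points;
    foreach (char c; s)  ++units;   // iterates raw code units
    foreach (dchar c; s) ++points;  // decodes on the fly

    assert(units == 2);
    assert(points == 1);
}
```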

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.

I don't follow this line of reasoning at all.

11. Indexing an array produces different results than autodecoding,
another glaring special case.

This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

There is no convincing argument why indexing and slicing should not simply operate on code units.
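The inconsistency in question, sketched: indexing and slicing already operate on code units, while the range primitives decode:

```d
import std.range : front;

void main()
{
    string s = "déjà";

    // Indexing and slicing count code units.
    assert(s.length == 6);
    assert(s[0] == 'd');
    assert(s[1 .. 3] == "é");  // the two bytes of 'é'

    // The range primitives autodecode to dchar.
    static assert(is(typeof(s.front) == dchar));
}
```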

Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.

I don't follow. Though, making char implicitly convertible to wchar and dchar has clearly been a mistake.
