On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
>> 4. Autodecoding is slow and has no place in high speed string
>> processing.
> I would agree only with the amendment "...if used naively", which
> is important. Knowledge of how autodecoding works is a prerequisite
> for writing fast string code in D.
It is completely wasted mental effort.
>> 5. Very few algorithms require decoding.
> The key here is leaving it to the standard library to do the right
> thing instead of having the user wonder separately for each case.
> These uses don't need decoding, and the standard library correctly
> doesn't involve it (or if it currently does it has a bug):
>
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
As far as I can see, the language currently does not provide the
facilities to implement the above without autodecoding.
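For reference, here is the quoted expression in runnable form. Note that under the current behavior, `count` autodecodes, so the lambda's parameter is a `dchar` (a sketch, not an endorsement of either position):

```d
import std.algorithm : canFind, count;

void main()
{
    string s = "Hello, world!";
    // count autodecodes the string: each c below is a decoded dchar.
    auto n = s.count!(c => "!()-;:,.?".canFind(c));
    assert(n == 2); // ',' and '!'
}
```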
> However the following do require autodecoding:
>
> s.walkLength
Usage of the result of this expression will be incorrect in many
foreseeable cases.
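One such foreseeable case, as a minimal sketch: when "é" is written as a base letter plus a combining accent, walkLength's code-point count matches neither the code-unit count nor the user-perceived character count (the latter via std.uni.byGrapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "cafe\u0301"; // "café" with a combining acute accent
    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (what autodecoding yields)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```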
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
Ditto.
> s.count!(c => c >= 32) // non-control characters
Ditto, with a big red flag. If you are dealing with control
characters, the code is likely low-level enough that you need to be
explicit about what you are counting, and "code points >= 32" is
likely not what actually needs to be counted. Such confusion can
lead to security risks.
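One concrete instance of that confusion, as a sketch: U+0085 (NEL) is a C1 control character, yet it passes the `c >= 32` test; an explicit Unicode-aware predicate (std.uni.isControl) does not count it:

```d
import std.algorithm : count;
import std.uni : isControl;

void main()
{
    string s = "a\u0085b"; // contains U+0085 (NEL), a C1 control character
    // The naive predicate counts NEL as a "non-control" character...
    assert(s.count!(c => c >= 32) == 3);
    // ...whereas an explicit Unicode-aware predicate does not.
    assert(s.count!(c => !isControl(c)) == 2);
}
```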
> Currently the standard library operates at code point level even
> though inside it may choose to use code units when admissible.
> Leaving such a decision to the library seems like a wise thing to
> do.
It should be explicit.
>> 7. Autodecode cannot be used with Unicode paths/filenames, because
>> it is legal (at least on Linux) to have invalid UTF-8 as filenames.
>> It turns out in the wild that pure Unicode is not universal -
>> there's lots of dirty Unicode that should remain unmolested, and
>> autodecode does not play well with that.
> If paths are not UTF-8, then they shouldn't have string type
> (instead use ubyte[] etc.). More on that below.
This is not practical. Do you really see changing std.file and
std.path to accept ubyte[] for all path arguments?
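The failure mode under discussion is easy to reproduce (a sketch): any autodecoding range operation throws on invalid UTF-8 instead of passing the bytes through, so a legal-but-dirty filename cannot survive a round trip through such code.

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.utf : UTFException;

void main()
{
    // A byte sequence that is a legal Linux filename but not valid UTF-8.
    immutable ubyte[] raw = [0xFF, 'a', 'b'];
    string bad = cast(string) raw;
    // Autodecoding throws rather than yielding the raw bytes unmolested.
    assertThrown!UTFException(bad.walkLength);
}
```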
>> 8. In my work with UTF-8 streams, dealing with autodecode has
>> caused me considerable extra work every time. A convenient
>> timesaver it ain't.
> Objection. Vague.
I can confirm this vague subjective observation. For example,
DustMite reimplements some std.string functions in order to be
able to handle D files with invalid UTF-8 characters.
>> 9. Autodecode cannot be turned off, i.e. it isn't practical to
>> avoid importing std.array one way or another, and then autodecode
>> is there.
> Turning off autodecoding is as easy as inserting .representation
> after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable
string-handling code in D a chore. Because it is difficult to find
all the places where .representation must be added, it is not
possible to do on a program-wide scale, so bugs are only discovered
when this or that component fails because it was never tested with
Unicode strings.
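For concreteness, here is the .representation escape hatch both sides refer to, as a minimal sketch - it yields the string's bytes as immutable(ubyte)[], which no range primitive autodecodes:

```d
import std.algorithm : count;
import std.string : representation;

void main()
{
    string s = "héllo";
    auto bytes = s.representation; // immutable(ubyte)[] - no autodecoding
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes.length == 6);                // 'é' occupies two bytes
    assert(bytes.count!(b => b == 'l') == 2); // byte-level algorithms work
}
```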
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's
> a great name because it gives you the type used to represent the
> string - i.e. an array of integers of a specific width.
>
> Second, it's as it should be. The entire scaffolding rests on the
> notion that char[] is distinguished from ubyte[] by having UTF-8
> code units, not arbitrary bytes. It seems that many arguments
> against autodecoding are in fact arguments in favor of eliminating
> virtually all distinctions between char[] and ubyte[]. Then the
> natural question is, what _is_ the difference between char[] and
> ubyte[], and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer.
Why?
> What is the purpose of char, wchar, and dchar? My current
> understanding is that they're justified as pretty much
> indistinguishable in primitives and behavior from ubyte, ushort,
> and uint respectively, but they reflect a loose subjective intent
> from the programmer that they hold actual UTF code units. The core
> language does not enforce such, except it does special things in
> random places like for loops (any others?).
>
> If char is to be distinct from ubyte, and char[] is to be distinct
> from ubyte[], then autodecoding does the right thing: it makes sure
> they are distinguished in behavior and embodies the assumption that
> char is, in fact, a UTF-8 code unit.
I don't follow this line of reasoning at all.
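The concrete consequence of point 10 can be checked with the range traits directly (a small sketch): narrow strings are excluded from random access and even from hasLength by autodecoding, while dstring and the byte representation are not.

```d
import std.range : hasLength, isRandomAccessRange;

void main()
{
    // Autodecoding demotes narrow strings from random access...
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string); // .length is code units, not range length
    // ...while dstring and the ubyte representation keep it.
    static assert(isRandomAccessRange!dstring);
    static assert(isRandomAccessRange!(immutable(ubyte)[]));
}
```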
>> 11. Indexing an array produces different results than
>> autodecoding, another glaring special case.
> This is a direct consequence of the fact that string is
> immutable(char)[] and not a specific type. That error predates
> autodecoding.
There is no convincing argument why indexing and slicing should
not simply operate on code units.
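The special case in question, as a minimal sketch: on the same string, range primitives autodecode to code points while indexing and slicing operate on code units.

```d
import std.range : front;

void main()
{
    string s = "héllo";
    // Range primitives autodecode: front is a dchar, the full code point.
    static assert(is(typeof(s.front) == dchar));
    // Indexing operates on code units: s[1] is the first byte of 'é'.
    static assert(is(typeof(s[1]) == immutable(char)));
    assert(s.length == 6); // code units, not characters
    assert(s[1] == 0xC3);  // first byte of the two-byte encoding of 'é'
}
```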
> Overall, I think the one way to make real steps forward in
> improving string processing in the D language is to give a clear
> answer of what char, wchar, and dchar mean.
I don't follow. Though, making char implicitly convertible to
wchar and dchar has clearly been a mistake.