On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
wrote:
> This might be a good time to discuss this a tad further. I'd appreciate if the debate stayed on point going forward. Thanks!
>
> My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision.
It is not, which has been shown by various posts in this thread.
Iterating by code points is at least as wrong as iterating by
code units; it can be argued it is worse because it sometimes
makes the fact that it's wrong harder to detect.
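To make the "harder to detect" point concrete: code-point iteration still splits user-perceived characters whenever combining marks are involved, it just fails on rarer input than code-unit iteration does. A minimal sketch using standard Phobos ranges:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 (combining acute accent) renders as one
    // character, but code-point iteration still sees two elements.
    string s = "e\u0301";
    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme: what a user sees
}
```

So the auto-decoded answer (2) is just as wrong as the code-unit answer (3) if the caller wanted "characters", while looking more plausible.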
> The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.
While this may be true, it has nothing to do with autodecoding. I assume you would want such a user-defined string type to autodecode as well, right?
> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does, it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
Yes.
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
Ideally yes, but this is a special case that cannot be detected
by `count`.
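Substring search is indeed the classic decoding-free case: because UTF-8 is self-synchronizing, a valid UTF-8 needle can only match at code-point boundaries of a valid UTF-8 haystack, so matching code units gives the same answer as matching code points. A small illustration:

```d
import std.algorithm.searching : canFind, find;

void main()
{
    string s = "caf\u00e9 abc";

    // Substring search needs no decoding: code-unit matching and
    // code-point matching give identical results on valid UTF-8.
    assert(s.find("abc") == "abc");

    // Searching for a multi-byte character works the same way.
    assert(s.canFind('\u00e9'));
}
```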
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
No, they do not need _auto_ decoding; they need a decision _by the user_ about what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?
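Phobos already provides range adapters that make exactly that decision explicit at the call site (hedging that the precise set of adapters has varied between releases):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    string s = "noe\u0301l"; // "noël" spelled with a combining accent

    // The caller states the unit of iteration instead of getting
    // code points implicitly:
    assert(s.byCodeUnit.walkLength == 6);  // UTF-8 code units
    assert(s.byUTF!dchar.walkLength == 5); // code points
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
}
```

Three different, equally defensible answers to "how long is this string" — which is precisely why the choice belongs to the caller rather than to an implicit default.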
> Currently the standard library operates at code point level
Because it auto decodes.
> even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
No one wants to take that second part away. For example, `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.
>> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.
> If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that come from outside the program, like main's arguments, env variables...
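No such type exists in Phobos today; as a purely hypothetical sketch (the name `OsString` and its interface are invented here for illustration), it might look like:

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper for OS-provided byte strings such as file names,
/// which on Linux may legally contain invalid UTF-8. (Invented for
/// illustration; not a Phobos type.)
struct OsString
{
    immutable(ubyte)[] bytes;

    /// Interpret the bytes as UTF-8, refusing dirty input instead of
    /// silently corrupting it.
    bool tryToString(out string result) const
    {
        auto s = cast(string) bytes;
        try
        {
            validate(s); // throws UTFException on invalid sequences
            result = s;
            return true;
        }
        catch (UTFException)
            return false;
    }
}

void main()
{
    auto good = OsString(cast(immutable(ubyte)[]) "hello");
    auto bad = OsString([0xFF, 0xFE]); // invalid UTF-8, yet a legal file name
    string s;
    assert(good.tryToString(s) && s == "hello");
    assert(!bad.tryToString(s));
}
```

The point of the wrapper is that dirty bytes stay `ubyte`-typed and unmolested until the program explicitly asks for a UTF-8 view.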
>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.
> Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This would no longer work if char[] and char ranges were to be
treated identically.
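For reference, this is what the two views look like in current Phobos (a sketch of today's behavior, not an endorsement of it):

```d
import std.range : isRandomAccessRange, walkLength;
import std.string : representation;

void main()
{
    string s = "caf\u00e9";

    // As a range, string auto-decodes to dchar and is not random-access:
    static assert(!isRandomAccessRange!string);

    // .representation exposes the raw code units as immutable(ubyte)[],
    // a plain array with random access and O(1) length:
    auto raw = s.representation;
    static assert(isRandomAccessRange!(typeof(raw)));
    assert(raw.length == 5);   // 'c', 'a', 'f' + 2 bytes for 'é'
    assert(s.walkLength == 4); // 4 code points via auto-decoding
}
```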
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
>
> Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any others?).
Agreed.
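The for-loop special case mentioned above: the language itself (not Phobos) lets the declared element type of `foreach` choose between code units and code points:

```d
void main()
{
    string s = "\u00e9"; // one code point 'é', two UTF-8 code units

    // By default, foreach over a string walks code units...
    size_t units;
    foreach (char c; s) ++units;
    assert(units == 2);

    // ...but declaring the loop variable as dchar makes the same loop
    // decode code points - built into the language, not the library.
    size_t points;
    foreach (dchar c; s) ++points;
    assert(points == 1);
}
```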
> If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.
Distinguishing them is the right thing to do, but autodecoding is not the way to achieve that, see above.