On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
wrote:
> This might be a good time to discuss this a tad further. I'd appreciate if the debate stayed on point going forward. Thanks!
>
> My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision.
It is not, which has been shown by various posts in this thread.
Iterating by code points is at least as wrong as iterating by
code units; it can be argued it is worse because it sometimes
makes the fact that it's wrong harder to detect.
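To make the "harder to detect" point concrete: code-point iteration still splits user-perceived characters whenever combining marks are involved, it just fails on rarer input than code-unit iteration does. A minimal sketch using standard Phobos ranges:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 (combining acute accent) renders as one
    // character, but code-point iteration still sees two elements.
    string s = "e\u0301";
    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme: what a user sees
}
```

So the auto-decoded answer (2) is just as wrong as the code-unit answer (3) if the caller wanted "characters", while looking more plausible.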
> The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.
While this may be true, it has nothing to do with autodecoding. I assume you would want such a user-defined string type to autodecode as well, right?
> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does, it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
Yes.
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
Ideally yes, but this is a special case that cannot be detected
by `count`.
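Substring search is indeed the classic decoding-free case: because UTF-8 is self-synchronizing, a valid UTF-8 needle can only match at code-point boundaries of a valid UTF-8 haystack, so matching code units gives the same answer as matching code points. A small illustration:

```d
import std.algorithm.searching : canFind, find;

void main()
{
    string s = "caf\u00e9 abc";

    // Substring search needs no decoding: code-unit matching and
    // code-point matching give identical results on valid UTF-8.
    assert(s.find("abc") == "abc");

    // Searching for a multi-byte character works the same way.
    assert(s.canFind('\u00e9'));
}
```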
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
No, they do not need _auto_ decoding; they need a decision _by the user_ about what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?
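Phobos already provides range adapters that make exactly that decision explicit at the call site (hedging that the precise set of adapters has varied between releases):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    string s = "noe\u0301l"; // "noël" spelled with a combining accent

    // The caller states the unit of iteration instead of getting
    // code points implicitly:
    assert(s.byCodeUnit.walkLength == 6);  // UTF-8 code units
    assert(s.byUTF!dchar.walkLength == 5); // code points
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
}
```

Three different, equally defensible answers to "how long is this string" — which is precisely why the choice belongs to the caller rather than to an implicit default.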
> Currently the standard library operates at code point level
Because it auto decodes.
> even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
No one wants to take that second part away. For example, `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.
>> 7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.
> If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that come from outside the program, like main's arguments, env variables...
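No such type exists in Phobos today; as a purely hypothetical sketch (the name `OsString` and its interface are invented here for illustration), it might look like:

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper for OS-provided byte strings such as file names,
/// which on Linux may legally contain invalid UTF-8. (Invented for
/// illustration; not a Phobos type.)
struct OsString
{
    immutable(ubyte)[] bytes;

    /// Interpret the bytes as UTF-8, refusing dirty input instead of
    /// silently corrupting it.
    bool tryToString(out string result) const
    {
        auto s = cast(string) bytes;
        try
        {
            validate(s); // throws UTFException on invalid sequences
            result = s;
            return true;
        }
        catch (UTFException)
            return false;
    }
}

void main()
{
    auto good = OsString(cast(immutable(ubyte)[]) "hello");
    auto bad = OsString([0xFF, 0xFE]); // invalid UTF-8, yet a legal file name
    string s;
    assert(good.tryToString(s) && s == "hello");
    assert(!bad.tryToString(s));
}
```

The point of the wrapper is that dirty bytes stay `ubyte`-typed and unmolested until the program explicitly asks for a UTF-8 view.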
>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.
> Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This would no longer work if char[] and char ranges were to be
treated identically.
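For reference, this is what the two views look like in current Phobos (a sketch of today's behavior, not an endorsement of it):

```d
import std.range : isRandomAccessRange, walkLength;
import std.string : representation;

void main()
{
    string s = "caf\u00e9";

    // As a range, string auto-decodes to dchar and is not random-access:
    static assert(!isRandomAccessRange!string);

    // .representation exposes the raw code units as immutable(ubyte)[],
    // a plain array with random access and O(1) length:
    auto raw = s.representation;
    static assert(isRandomAccessRange!(typeof(raw)));
    assert(raw.length == 5);   // 'c', 'a', 'f' + 2 bytes for 'é'
    assert(s.walkLength == 4); // 4 code points via auto-decoding
}
```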
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.
> First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
>
> Second, it's as it should be. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF-8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any others?).
Agreed.
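The for-loop special case mentioned above: the language itself (not Phobos) lets the declared element type of `foreach` choose between code units and code points:

```d
void main()
{
    string s = "\u00e9"; // one code point 'é', two UTF-8 code units

    // By default, foreach over a string walks code units...
    size_t units;
    foreach (char c; s) ++units;
    assert(units == 2);

    // ...but declaring the loop variable as dchar makes the same loop
    // decode code points - built into the language, not the library.
    size_t points;
    foreach (dchar c; s) ++points;
    assert(points == 1);
}
```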
> If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF-8 code unit.
Distinguishing them is the right thing to do, but autodecoding is not the way to achieve that, see above.