On Thursday, May 12, 2016 13:15:45 Walter Bright via Digitalmars-d wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
> > I am as unclear about the problems of autodecoding as I am about the
> > necessity to remove curl. Whenever I ask I hear some arguments that work
> > well emotionally but are scant on reason and engineering. Maybe it's
> > time to rehash them? I just did so about curl, no solid argument seemed
> > to come together. I'd be curious of a crisp list of grievances about
> > autodecoding. -- Andrei
>
> Here are some that are not matters of opinion.
>
> 1. Ranges of characters do not autodecode, but arrays of characters do.
> This is a glaring inconsistency.
>
> 2. Every time one wants an algorithm to work with both strings and ranges,
> you wind up special-casing the strings to defeat the autodecoding, or to
> decode the ranges. Having to constantly special-case makes for more
> special cases when plugging components together. These issues often escape
> detection during unittesting, because it is convenient to unittest only
> with arrays.
>
> 3. Wrapping an array in a struct with an alias this to an array turns off
> autodecoding - another special case.
>
> 4. Autodecoding is slow and has no place in high-speed string processing.
>
> 5. Very few algorithms require decoding.
>
> 6. Autodecoding has two choices when encountering invalid code units:
> throw, or produce an error dchar. Currently, it throws, meaning no
> algorithm using autodecoding can be made nothrow.
>
> 7. Autodecoding cannot be used with Unicode paths/filenames, because it is
> legal (at least on Linux) for a filename to be invalid UTF-8. It turns out
> that in the wild, pure Unicode is not universal - there's lots of dirty
> Unicode that should remain unmolested, and autodecoding does not play well
> with that.
>
> 8. In my work with UTF-8 streams, dealing with autodecoding has caused me
> considerable extra work every time. A convenient timesaver it ain't.
>
> 9. Autodecoding cannot be turned off, i.e. it isn't practical to avoid
> importing std.array one way or another, and then autodecoding is there.
>
> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit
> of being arrays in the first place.
>
> 11. Indexing an array produces different results than autodecoding -
> another glaring special case.
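Several of the inconsistencies above (points 1, 10, and 11) can be demonstrated directly. A minimal sketch - assuming any reasonably recent DMD/Phobos, where `front` on a `string` autodecodes but indexing does not:

```d
import std.range : ElementType, front, isRandomAccessRange;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units

    // Point 1: an array of char is presented as a range of dchar.
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(s.front) == dchar));

    // Point 11: indexing bypasses autodecoding and yields raw code units.
    // s[1] is the first byte of 'é', not the character 'é'.
    static assert(is(typeof(s[1]) == immutable(char)));

    // Point 10: autodecoding strips strings of random access.
    static assert(!isRandomAccessRange!string);

    assert(s.length == 6); // code units, not characters
    assert(s.front == 'h');
}
```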
It also results in constantly special-casing algorithms for narrow strings in order to avoid auto-decoding. Phobos does this all over the place. We have a ridiculous amount of code in Phobos whose only purpose is to avoid auto-decoding, and anyone who wants high performance will have to do the same.

And it's not as if auto-decoding is even correct. It would be one thing if auto-decoding were fully correct but slow, but to be fully correct, it would need to operate at the grapheme level, not the code point level. So, by default, we get slower code without actually getting fully correct code - we're neither fast nor correct. We _are_ correct in more cases than we would be if we simply acted as if ASCII were all there was, but what we end up with is the illusion of correctness. IIRC, Andrei argued in TDPL that Java's choice of UTF-16 was worse than going with UTF-8, because treating a UTF-16 code unit as if it were a character is correct in many more cases, making it harder to realize that what you're doing is wrong, whereas with UTF-8 the breakage becomes obvious very quickly. We now have that same problem with auto-decoding, except that it treats UTF-32 code units as if they were full characters rather than treating UTF-16 code units that way.

Ideally, algorithms would be Unicode-aware as appropriate, but the default would be to operate on code units, with wrappers to handle decoding by code point or by grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants full correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rarely what's needed regardless.
Based on what I've seen in previous conversations on auto-decoding over the past few years (be it on the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But even if we all agree that it was a huge mistake and want to fix it, the question remains how to do so without breaking tons of code - and since, AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.

- Jonathan M Davis