On Saturday, September 8, 2018 8:05:04 AM MDT Laeeth Isharc via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
> >> D makes the code-point case default and hence that becomes the
> >> simplest to use. But unfortunately, the only thing I can think
> >> of that requires code point representations is when dealing
> >> specifically with unicode algorithms (normalization, etc).
> >> Here's a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ -
> >>
> >> tl;dr: application logic does not need or want to deal with
> >> code points. For speed units work, and for correctness,
> >> graphemes work.
> >
> > I think that it's pretty clear that code points are objectively
> > the worst level to be the default. Unfortunately, changing it
> > to _anything_ else is not going to be an easy feat at this
> > point. But if we can first ensure that Phobos in general
> > doesn't rely on it (i.e. in general, it can deal with ranges of
> > char, wchar, dchar, or graphemes correctly rather than assuming
> > that all ranges of characters are ranges of dchar), then maybe
> > we can figure something out. Unfortunately, while some work has
> > been done towards that, what's mostly happened is that folks
> > have complained about auto-decoding without doing much to
> > improve the current situation. There's a lot more to this than
> > simply ripping out auto-decoding even if every D user on the
> > planet agreed that outright breaking almost every existing D
> > program to get rid of auto-decoding was worth it. But as with
> > too many things around here, there's a lot more talking than
> > working. And actually, as such, I should probably stop
> > discussing this and go do something useful.
>
> A tutorial page linked from the front page with some examples
> would go a long way to making it easier for people. If I had
> time and understood strings enough to explain to others I would
> try to make a start, but unfortunately neither are true.

Writing up an article on proper Unicode handling in D is on my todo list, but my todo list of things to do for D is long enough that I don't know when I'm going to get to it.

> And if we are doing things right with RCString, then isn't it
> easier to make the change with that first - which is new so can't
> break code - and in some years when people are used to working
> that way update Phobos (compiler switch in beginning and have big
> transition a few years after that).

Well, I'm not actually convinced that what we have for RCString right now _is_ doing the right thing, but even if it is, that doesn't fix the issue that string doesn't do the right thing, and code needs to take that into account - especially if it's generic code. The better job we do at making Phobos code work with arbitrary ranges of characters, the less of an issue that is, but you're still pretty much forced to deal with it in a number of cases if you want your code to be efficient or if you want a function to be able to accept a string and return a string rather than a wrapper range. Using RCString in your code would reduce how much you had to worry about it, but it doesn't completely solve the problem. And if you're doing stuff like writing a library for other people to use, then you definitely can't just ignore the issue.

So, an RCString that handles Unicode sanely will definitely help, but it's not really a fix. And plenty of code is still going to be written to use strings (especially when -betterC is involved). RCString is going to be another option, but it's not going to replace string. Even if RCString became the most common string type to use (which I doubt is ever going to happen), dynamic arrays of char, wchar, etc. are still going to exist in the language and are still going to have to be handled correctly. Phobos won't be able to assume that all of the code out there is using RCString and not string.

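To make the three levels concrete, here is a minimal sketch of what generic code has to cope with today. The names Counts, countLevels, and countSpaces are hypothetical and made up for illustration; the Phobos pieces used are std.utf.byCodeUnit, std.uni.byGrapheme, and the range primitives from std.range.primitives. It assumes a default Phobos build, where auto-decoding is in effect.

```d
import std.range.primitives : ElementType, isInputRange, walkLength;
import std.traits : isSomeChar;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helper type: element counts of one text at each level.
struct Counts { size_t codeUnits, codePoints, graphemes; }

Counts countLevels(string s)
{
    // Auto-decoding: as a range, a string has dchar elements,
    // even though the array itself stores char code units.
    static assert(is(ElementType!string == dchar));

    return Counts(
        s.byCodeUnit.walkLength,  // code units, no decoding
        s.walkLength,             // code points, via auto-decoding
        s.byGrapheme.walkLength); // user-perceived characters
}

// Hypothetical generic function: accepts any input range of characters,
// whether its elements are char, wchar, or dchar.
size_t countSpaces(R)(R r)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    // For a plain string, foreach iterates code units (no decoding);
    // for wrapper ranges, it goes through front/popFront.
    foreach (c; r)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    import std.stdio : writeln;

    // 'e' followed by U+0301 (combining acute accent).
    auto c = countLevels("e\u0301");
    writeln(c.codeUnits, " ", c.codePoints, " ", c.graphemes); // prints "3 2 1"

    writeln(countSpaces("a b c"));            // prints "2"
    writeln(countSpaces("a b c".byCodeUnit)); // prints "2"
}
```

The same text gives three different lengths depending on the level, and the code point count (what you get by default) is neither the cheap one nor the one a user would call correct. A function constrained like countSpaces works whether the caller hands it a string or a byCodeUnit wrapper, which is the kind of thing Phobos functions need to do in general.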
The combination of improving Phobos so that it works properly with ranges of characters in general (and not just strings or ranges of dchar) and having an alternate string type that does the right thing will definitely help, and both need to be done if we have any hope of actually removing auto-decoding, but even with all of that, I don't see how it would be possible to really deprecate the old behavior. We _might_ be able to do something if we're willing to deprecate std.algorithm and std.range (since std.range gives you the current definitions of the range primitives for arrays, and std.algorithm publicly imports std.range), but you still then have the problem of two different definitions of the range primitives for arrays and all of the problems that that causes (even if it's only for the deprecation period). So, strings would end up behaving drastically differently with range-based functions depending on which module you imported.

I don't know that that problem is insurmountable, but it's not at all clear that there is a path to fixing auto-decoding that doesn't outright break old code. If we're willing to break old code, then we could definitely do it, but if we don't want to risk serious problems, we really need a way to have a more gradual transition, and that's the big problem that no one has a clean solution for.

> Isn't this one of the challenges created by the tension between D
> being both a high-level and low-level language. The higher the
> aim, the more problems you will encounter getting there. That's
> okay.
>
> And isn't the obstacle to breaking auto-decoding because it seems
> to be a monolithic challenge of overwhelming magnitude, whereas
> if we could figure out some steps to eat the elephant one
> mouthful at a time (which might mean start with RCString) then it
> will seem less intimidating. It will take years anyway perhaps -
> but so what?

Well, I think that it's clear at this point that before we can even consider getting rid of auto-decoding, we need to make sure that Phobos in general works with arbitrary ranges of code units, code points, and graphemes. With that done, we would have a standard library that could work with strings as ranges of code units if that's what they were. So, in theory, at that point, the only issue would be how on earth to make strings work as ranges of code units without just pulling the rug out from under everyone. I'm not at all convinced that that's possible, but I am very much convinced that unless we first improve Phobos so that it's fully correct in spite of the auto-decoding issues, we definitely can't remove auto-decoding. And as a group, we haven't done a good enough job with that. Most of us agree that auto-decoding was a huge mistake, but there hasn't been enough work done towards fixing what we have, and there's plenty of work there that needs to be done whether we later try to remove auto-decoding or not.

- Jonathan M Davis