On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 3:56 AM, Walter Bright wrote:
> > If there is an abstraction for strings that is efficient, consistent,
> > useful, and hides the fact that it is UTF, I am not aware of it.
>
> It's been mentioned several times: a string type that does not offer
> range primitives; instead it offers explicit primitives (such as
> byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.

Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that it's UTF. I have to agree with Walter in that there really isn't a way to automatically handle Unicode correctly and efficiently while hiding the fact that it's doing all of the stuff that has to be done for UTF.

That being said, while an array of code units is really what a string should be underneath the hood, having a string type that provides byCodeUnit, byCodePoint, and byGrapheme is an improvement over treating immutable(char)[] as string, even if byCodeUnit returns immutable(char)[], because it forces the programmer to decide what they want to do rather than blindly operate on immutable(char)[] as if a char were a full character. And as long as it provides access to each level of Unicode, then it's possible for programmers who know what they're doing to efficiently operate on Unicode while simultaneously making it much more obvious to those who don't know what they're doing that they don't know what they're doing rather than having them blindly act like char is a full character.

There's really no reason why we couldn't define a string type that operated that way while continuing to treat arrays of char the way that we do now in the language, though transitioning to such a scheme is not at all straightforward in terms of avoiding code breakage. Defining a String type would be simple enough, and any function in Phobos which accepted a string could be changed to accept a String, but we'd have problems with many functions which currently return string, since changing what they return would break code. But even if Phobos were somehow completely changed over to use a new String type, and even if the string alias were deprecated/removed, we'd still have to deal with arrays of char, wchar, and dchar and run the risk of someone using those and having problems, because they didn't treat them as arrays of code units.
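
To make the idea of such a String type concrete, here is a rough sketch of what I have in mind. It is only an illustration: the struct and its layout are hypothetical and nothing like it exists in Phobos. I'm assuming it just forwards to std.utf.byCodeUnit, std.utf.byDchar (standing in for a string-level byCodePoint, since std.uni.byCodePoint operates on a range of graphemes rather than on a string), and std.uni.byGrapheme.

```d
import std.range : isInputRange, walkLength;
// Renamed imports so the member functions below can reuse the natural names.
import std.uni : graphemes = byGrapheme;
import std.utf : codeUnits = byCodeUnit, codePoints = byDchar;

/// Hypothetical String: deliberately neither a range nor indexable.
/// The caller has to name the Unicode level they want to work at.
struct String
{
    private immutable(char)[] data;

    /// UTF-8 code units, no decoding.
    auto byCodeUnit()  { return codeUnits(data); }

    /// Decoded code points (dchar).
    auto byCodePoint() { return codePoints(data); }

    /// Grapheme clusters, i.e. roughly what a user sees as one character.
    auto byGrapheme()  { return graphemes(data); }
}

void main()
{
    // 'e' followed by U+0308 (combining diaeresis) is one grapheme,
    // two code points, and three UTF-8 code units.
    auto s = String("noe\u0308l");

    assert(s.byCodeUnit().walkLength == 6);
    assert(s.byCodePoint().walkLength == 5);
    assert(s.byGrapheme().walkLength == 4);

    // The type itself offers no range primitives and no indexing.
    static assert(!isInputRange!String);
    static assert(!__traits(compiles, s[0]));
}
```

The three different lengths for the same text are the whole point: none of them is "the" length, and a type like this makes the caller say which one they mean.
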
We can't really prevent that. We can just make it so that string/String is something else that makes the Unicode issue obvious so that folks are less likely to blindly treat chars as full characters. But even then, it's not like it would be hard for folks to just use the wrong Unicode level. All we'd really be doing is shoving the issue in their face so that they'd have to acknowledge it on some level and maybe then actually learn enough to operate on Unicode strings correctly.

But then again, since all you're really doing at that point is shoving the Unicode issues in folks' faces by not treating strings as ranges or indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme, etc., I don't know that it actually solves much over treating immutable(char)[] as string. Programmers still have to learn Unicode enough to handle it correctly, just like they do now (whether we have autodecoding or not). And such a string type really doesn't make the Unicode handling any easier. It just makes it harder to ignore the Unicode issues.

The Unicode problem is a lot like the floating point problems that have been discussed recently. Programmers want it to "just work" without them having to worry about the details, but that really doesn't work, and while the average programmer may not understand either floating point operations or Unicode properly, the average programmer does actually have to work with both on a regular basis.

I'm not at all convinced that having string be an alias of immutable(char)[] was a mistake, but having a struct that's not a range may very well be an improvement. It _would_ at least make some of the Unicode issues more obvious, but it doesn't really solve much from what I can see.

- Jonathan M Davis