On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
> > walkLength treats a code point like it's a character.
>
> No, it treats a code point like it's a code point. -- Andrei

Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up, as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances?

Why would I want to be operating at the code point level in my code? A code point is not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters, just as would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point?

I am sure that there is code that is going to want to operate at the code point level, but your average program is operating either on strings as a whole or on individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character.

So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does.

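To make that concrete, here is a minimal sketch, assuming a string that spells "é" as 'e' followed by U+0301 (combining acute accent). The helper names codeUnitCount, codePointCount and graphemeCount are made up for this illustration; walkLength, front, byGrapheme and to are the existing Phobos functions.

```d
import std.conv : to;
import std.range : front, walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this illustration.
size_t codeUnitCount(string s)  { return s.length; }                // UTF-8 code units
size_t codePointCount(string s) { return s.walkLength; }            // autodecoded dchars
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; } // full characters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // two code points, but one character as far as the user is concerned.
    string s = "e\u0301";

    assert(codeUnitCount(s) == 3);   // 1 byte for 'e', 2 bytes for U+0301
    assert(codePointCount(s) == 2);  // walkLength counts code points
    assert(graphemeCount(s) == 1);   // only byGrapheme sees one character

    // front hands back the base letter alone; the accent is left behind.
    assert(s.front == 'e');

    // Slicing dchar[] between the two code points splits the character,
    // just as slicing char[] mid-sequence would.
    dstring d = s.to!dstring;
    assert(d.length == 2);
    assert(d[0 .. 1] == "e"d);
}
```

For a precomposed "é" (U+00E9) the code point and grapheme counts would agree, which is part of why this kind of bug is easy to miss in testing.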
Operating on code points by default does not result in correct string processing.

I honestly don't see how autodecoding is defensible. We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too.

- Jonathan M Davis