On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:
> On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
> > True, but if the default were dchar, then the common case would be have
> > fewer bugs
>
> Is that really the common case? It's certainly the *slow* case. Common
> string operations like searching, copying, etc., do not require decoding.

And why would you iterate over a string with foreach without decoding it
unless you specifically need to operate on code units (which I would expect
to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
does (unless you're specifically looking for a code unit rather than a code
point, which would not be normal). Most anything which needs to operate on the
characters of a string needs to decode them. And iterating over them to do
much of anything would require decoding, since otherwise you're operating on
code units, and how often does anyone do that unless they're specifically
messing around with character encodings?

Sure, if you _know_ that you're dealing with a string with only ASCII, it's
faster to just iterate over chars, and in that case you can explicitly give
the type of the foreach variable as char. But normally what people care about
is iterating over characters, not pieces of characters. So, I would expect the
case where people _want_ to iterate over chars to be rare. In most cases,
iterating over a string as chars is a bug - one which in many cases won't be
quickly caught, because the programmer is English speaking and uses almost
exclusively ASCII for whatever testing that they do.

The default for string handling really should be to treat them as ranges of
dchar but still make it easy for them to be treated as arrays of code units
when necessary. There's plenty of code in Phobos which is able to special case
strings and operate on them more efficiently when it's not necessary to treat
them as ranges of dchar or when it can decode the string explicitly with
functions such as stride. But the default is still to operate on them as
ranges of dchar, because that is what is normally correct. Defaulting to the
guaranteed correct handling of characters and special casing when it's
possible to write code more efficiently than that is definitely the way to go
about it, and it's how Phobos generally does it.
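
To make the difference concrete, here is a minimal sketch of the kind of bug
I mean. The function names are made up for illustration (none of them are
from Phobos), and the string is just an arbitrary example with one non-ASCII
character in it:

```d
import std.stdio : writeln;

// Element type is deduced: for a string that is immutable(char),
// so the loop body runs once per UTF-8 code unit.
size_t countDeduced(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Element type is given as dchar: foreach decodes the UTF-8,
// so the loop body runs once per code point.
size_t countDecoded(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Looks reasonable, but compares single code units against a code point,
// so it can never match a character outside of ASCII.
bool containsByCodeUnit(string s, dchar needle)
{
    foreach (c; s)
        if (c == needle)
            return true;
    return false;
}

// Same search with the iteration type spelled out as dchar.
bool containsByCodePoint(string s, dchar needle)
{
    foreach (dchar c; s)
        if (c == needle)
            return true;
    return false;
}

void main()
{
    // U+00E9 (e with acute accent) takes two code units in UTF-8.
    string s = "h\u00E9llo";

    writeln(countDeduced(s));                    // 6
    writeln(countDecoded(s));                    // 5
    writeln(containsByCodeUnit(s, '\u00E9'));    // false
    writeln(containsByCodePoint(s, '\u00E9'));   // true
}
```

With a pure ASCII test string, both versions of each function give the same
answer, which is exactly why this sort of thing tends to go unnoticed.
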
The fact that foreach doesn't default to dchar is incongruous with how strings
are handled in most other cases.

> > (still allowing you to explicitly use char or wchar when you want to).
> > At minimum, I think that it would be a good idea to implement
> > http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning
> > not to explicitly give the type with foreach for arrays of char or
> > wchar. It would catch bugs without changing the behavior of any
> > existing code, and it still allows you to iterate over either code
> > units or code points.
>
> I like the type deduction feature of foreach, and don't think it should be
> removed for strings. Currently, it's consistent - T[] gets an element type
> of T.

Sure, the type deduction of foreach is great, and it's completely consistent
that iterating over an array of chars would iterate over chars rather than
dchars when you don't give the type. However, in most cases, that is _not_
what the programmer actually wants. They want to iterate over characters, not
pieces of characters. So, the default at this point is _wrong_ in the common
case.

As such, I'm very leery of any code which uses foreach over a string without
specifying the iteration type. And in fact, unless the code is clearly
intended to operate on code units, I would expect a comment indicating that
the use of char instead of dchar was intentional, or I'd still consider it
likely that it's a bug and a mistake on the programmer's part (likely due to a
misunderstanding of Unicode and how D deals with it).

> I want to reiterate that there's no way to program strings in D without
> being cognizant of them being a multibyte representation. D is both a high
> level and a low level language, and you can pick which to use, but you
> still gotta pick.

I fully agree that programmers need to properly understand Unicode to use
strings in D properly.
However, the problem is that the default handling of strings with foreach is
_not_ what programmers are going to normally want, so the default will cause
bugs. If strings defaulted to iterating as ranges of dchar, or if programmers
had to say what type they wanted to iterate over when dealing with strings (or
at least got a warning if they didn't), then there would be fewer bugs.

Pretty much every time that the use of strings with foreach comes up on this
list, most everyone agrees that it's a wart in the language that the default
is to iterate over chars rather than dchars. Not everyone agrees on the best
way to fix the problem, but most everyone agrees that it _is_ a problem.

- Jonathan M Davis