On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.

Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].

I'm glad I'm not the only one who feels this way. Implicit decoding must die.

I strongly believe that implicit decoding of code points in std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These numbers are useless for slicing, and can introduce hard-to-find bugs (see the sketch after this list).

- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.
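
To make both bullets concrete, here's a minimal sketch (std.utf.byCodeUnit is today's non-decoding escape hatch; the string literals are just illustrative):

import std.algorithm : countUntil, startsWith;
import std.exception : assertThrown;
import std.range : front;
import std.utf : UTFException, byCodeUnit;

void main()
{
    // countUntil iterates by decoded code point, so its result
    // cannot be used to slice the underlying char[] (code units).
    string s = "привет world";
    assert(s.countUntil('w') == 7); // 7 code points...
    // ...but s[0 .. 7] would cut the Cyrillic text mid-character,
    // since "привет" alone occupies 12 code units.

    // Auto-decoding also means that merely touching non-UTF data
    // throws, even when no decoding is actually needed:
    char[] raw = [cast(char) 0xFF, 'a', 'b'];
    assertThrown!UTFException(raw.front);
    // Iterating by code unit never decodes and never throws.
    assert(raw.byCodeUnit.startsWith(cast(char) 0xFF));
}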

Furthermore, auto-decoding doesn't actually solve anything completely! It only solves a subset of cases for a subset of languages!

People want to look at a string "character by character". If a Unicode code point is a character in your language and alphabet, I'm really happy for you, but that's not how it is for everyone. Combining marks, complex scripts, etc. make this assumption a fallacy, one that in the end will cause programmers to make mistakes that affect certain users somewhere.
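
For example, one user-perceived character can span several code points, so no simple count agrees with what the user sees. A quick illustration with std.uni.byGrapheme (the string is just an example):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by a combining acute accent (U+0301) renders as
    // one character, yet every level of the encoding disagrees:
    string s = "e\u0301";
    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}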

Why do people want to look at individual characters? There are a lot of misconceptions about Unicode, and I think some of them apply here.

- Do you want to split a string by whitespace? Some languages have no notion of whitespace. What do you need it for? Line wrapping? Employ the Unicode line-breaking algorithm (UAX #14) instead.

- Do you want to uppercase the first letter of a string? Some languages have no notion of letter case, and some use it for different reasons. Furthermore, even languages with a Latin-based alphabet may not have a 1:1 mapping for case, e.g. the German letter ß (see the sketch after this list).

- Do you want to count how wide a string will be in a fixed-width font? Wrong... Combining and control characters, zero-width whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at a point where it won't produce garbage? I believe this is the case in TDPL's mention of the subject. Again, combining characters or complex scripts will still be broken by this approach.
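
To illustrate the case-mapping point above: full case mapping can change a string's length, so it can't even be done in place. A minimal sketch, assuming std.uni's full case mappings:

import std.uni : toUpper;

void main()
{
    // German ß uppercases to two letters, so case mapping is not
    // a 1:1 transformation even within a Latin-based alphabet.
    assert("ß".toUpper == "SS");
    assert("straße".toUpper == "STRASSE");
}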

You need to either go all-out and provide complete implementations of the relevant Unicode algorithms to perform tasks such as the above in a way that works in all locales, or you need to draw a line somewhere for which languages, alphabets, and locales you want to support in your program. D draws its line at the point where it considers that code points == characters; however, this decision is not made clear anywhere in its documentation, and for such an arbitrary decision (from a cultural point of view), it is embedded too deeply into the language itself. With std.ascii, at least, it's clear to the user that the functions there will only work with English or languages using the same alphabet.
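
std.ascii makes that contract obvious: anything outside ASCII simply passes through untouched. A small sketch:

import std.ascii : isAlpha, toUpper;

void main()
{
    // std.ascii is explicit about its scope: only the 128 ASCII
    // characters are affected, everything else is left alone.
    assert(toUpper('a') == 'A');
    assert(toUpper('ж') == 'ж'); // non-ASCII: passed through unchanged
    assert(!isAlpha('ж'));       // and not considered a letter
}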

This doesn't apply universally. There are still cases where operating on code points makes sense, e.g. regular expression character ranges: [a-z] makes sense in English, and [а-я] makes sense in Russian, but I don't think such ranges make sense for all languages. However, for the most part, I think implicit decoding must be axed, and instead we need implementations of the relevant Unicode algorithms, plus documentation instructing users why and how to use them.
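
For instance, std.regex already matches per decoded code point, which is exactly what makes such ranges work (the patterns below are just illustrative):

import std.regex : matchFirst, regex;

void main()
{
    // Character classes match decoded code points, so a Cyrillic
    // range works on a UTF-8 string without any help from the caller.
    assert(!"привет".matchFirst(regex("^[а-я]+$")).empty);
    assert("hello".matchFirst(regex("^[а-я]+$")).empty);
}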
