On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.

Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].

I'm glad I'm not the only one who feels this way. Implicit decoding must die.

I strongly believe that implicit decoding of code points in std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These numbers are useless for slicing, and can introduce hard-to-find bugs (see the sketch after this list).

- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.
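
To make both bullets concrete, here's a minimal sketch (std.utf.byCodeUnit is today's non-decoding escape hatch; the string literals are just illustrative):

import std.algorithm : countUntil, startsWith;
import std.exception : assertThrown;
import std.range : front;
import std.utf : UTFException, byCodeUnit;

void main()
{
    // countUntil iterates by decoded code point, so its result
    // cannot be used to slice the underlying char[] (code units).
    string s = "привет world";
    assert(s.countUntil('w') == 7); // 7 code points...
    // ...but s[0 .. 7] would cut the Cyrillic text mid-character,
    // since "привет" alone occupies 12 code units.

    // Auto-decoding also means that merely touching non-UTF data
    // throws, even when no decoding is actually needed:
    char[] raw = [cast(char) 0xFF, 'a', 'b'];
    assertThrown!UTFException(raw.front);
    // Iterating by code unit never decodes and never throws.
    assert(raw.byCodeUnit.startsWith(cast(char) 0xFF));
}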

Furthermore, auto-decoding doesn't actually solve anything completely! It only solves a subset of cases for a subset of languages!

People want to look at a string "character by character". If a Unicode code point is a character in your language and alphabet, I'm really happy for you, but that's not how it is for everyone. Combining marks, complex scripts, etc. make this assumption a fallacy, one that in the end will cause programmers to make mistakes that affect certain users somewhere.
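
For example, one user-perceived character can span several code points, so no simple count agrees with what the user sees. A quick illustration with std.uni.byGrapheme (the string is just an example):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by a combining acute accent (U+0301) renders as
    // one character, yet every level of the encoding disagrees:
    string s = "e\u0301";
    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}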

Why do people want to look at individual characters? There are a lot of misconceptions about Unicode, and I think some of them apply here.

- Do you want to split a string by whitespace? Some languages have no notion of whitespace. What do you need it for? Line wrapping? Employ the Unicode line-breaking algorithm (UAX #14) instead.

- Do you want to uppercase the first letter of a string? Some languages have no notion of letter case, and some use it for different reasons. Furthermore, even languages with a Latin-based alphabet may not have a 1:1 mapping for case, e.g. the German letter ß (see the sketch after this list).

- Do you want to count how wide a string will be in a fixed-width font? Wrong... Combining and control characters, zero-width whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at a point where it won't produce garbage? I believe this is the case in TDPL's mention of the subject. Again, combining characters or complex scripts will still be broken by this approach.
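
To illustrate the case-mapping point above: full case mapping can change a string's length, so it can't even be done in place. A minimal sketch, assuming std.uni's full case mappings:

import std.uni : toUpper;

void main()
{
    // German ß uppercases to two letters, so case mapping is not
    // a 1:1 transformation even within a Latin-based alphabet.
    assert("ß".toUpper == "SS");
    assert("straße".toUpper == "STRASSE");
}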

You need to either go all-out and provide complete implementations of the relevant Unicode algorithms to perform tasks such as the above in a way that works in all locales, or you need to draw a line somewhere for which languages, alphabets, and locales you want to support in your program. D draws its line at the point where it considers that code points == characters; however, this decision is not made clear anywhere in its documentation, and for such an arbitrary decision (from a cultural point of view), it is embedded too deeply into the language itself. With std.ascii, at least, it's clear to the user that the functions there will only work with English or languages using the same alphabet.
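
std.ascii makes that contract obvious: anything outside ASCII simply passes through untouched. A small sketch:

import std.ascii : isAlpha, toUpper;

void main()
{
    // std.ascii is explicit about its scope: only the 128 ASCII
    // characters are affected, everything else is left alone.
    assert(toUpper('a') == 'A');
    assert(toUpper('ж') == 'ж'); // non-ASCII: passed through unchanged
    assert(!isAlpha('ж'));       // and not considered a letter
}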

This doesn't apply universally. There are still cases where operating on code points makes sense, e.g. regular expression character ranges: [a-z] makes sense in English, and [а-я] makes sense in Russian, but I don't think such ranges make sense for all languages. However, for the most part, I think implicit decoding must be axed, and instead we need implementations of the relevant Unicode algorithms, plus documentation instructing users why and how to use them.
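
For instance, std.regex already matches per decoded code point, which is exactly what makes such ranges work (the patterns below are just illustrative):

import std.regex : matchFirst, regex;

void main()
{
    // Character classes match decoded code points, so a Cyrillic
    // range works on a UTF-8 string without any help from the caller.
    assert(!"привет".matchFirst(regex("^[а-я]+$")).empty);
    assert("hello".matchFirst(regex("^[а-я]+$")).empty);
}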
