On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileh...@lycos.com> said:

Walter Bright:

I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span on multiple code points -- because of combined diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must work with depends on the things you want to do. If you want to count the characters then you need graphemes; if you want to parse XML then you'll need to work with code points (in theory, in practice you might still want direct access to code units for performance reasons); and if you want to slice or copy a string then you need to deal with code units. Because of this multiple-representation-for-different-purpose thing, generic algorithms for arrays don't map very well to string.

From my experience, I'd suggest these basic operations for a "string
range" instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all the three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).

If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variant. Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done.

--
Michel Fortin
michel.for...@michelf.ca
http://michelf.ca

Reply via email to