Re: Major performance problem with std.array.front()

Michel Fortin Fri, 07 Mar 2014 05:46:28 -0800

On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileh...@lycos.com> said:

> Walter Bright:
>> I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
>
> On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combined diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must work with depends on the things you want to do. If you want to count the characters then you need graphemes; if you want to parse XML then you'll need to work with code points (in theory; in practice you might still want direct access to code units for performance reasons); and if you want to slice or copy a string then you need to deal with code units. Because of this multiple-representation-for-different-purpose thing, generic algorithms for arrays don't map very well to strings.
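To make the three levels concrete, here is a minimal sketch, assuming a recent Phobos that provides `std.uni.byGrapheme`. The sample string is an illustration only: "noël" written with a plain 'e' followed by a combining diaeresis.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS (two UTF-8 code units).
    string s = "noe\u0308l";

    writeln(s.length);                // code units:  6
    writeln(s.walkLength);            // code points: 5 (walkLength decodes strings)
    writeln(s.byGrapheme.walkLength); // graphemes:   4

    // Slicing is expressed in code units: this keeps "noe" plus the
    // two-unit combining mark, i.e. the first three graphemes.
    writeln(s[0 .. 5].byGrapheme.walkLength); // 3
}
```

The same string gives three different "lengths", and only the grapheme count matches what a reader would call the number of characters.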

From my experience, I'd suggest these basic operations for a "string range" instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear
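
One way such an interface could look over a UTF-8 `string` is sketched below. This is only an illustration built on existing Phobos functions (`std.utf.decode`, `std.utf.stride`, `std.uni.graphemeStride`, `std.uni.byGrapheme`); the struct name `StringRange` is hypothetical, `frontGrapheme` returning a slice is an assumption, and `.codePointLength` is left out because it only makes sense for `dchar[]`.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, graphemeStride;
import std.utf : decode, stride;

// Hypothetical sketch of the proposed interface for UTF-8 strings.
struct StringRange
{
    string data;

    @property bool empty() { return data.length == 0; }

    // Front element at each of the three levels.
    @property char frontCodeUnit() { return data[0]; }
    @property dchar frontCodePoint()
    {
        size_t i = 0;
        return decode(data, i);
    }
    @property string frontGrapheme()
    {
        // Assumption: a grapheme is returned as a slice of code units.
        return data[0 .. graphemeStride(data, 0)];
    }

    // Advance by one element at the chosen level.
    void popFrontCodeUnit()  { data = data[1 .. $]; }
    void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }
    void popFrontGrapheme()  { data = data[graphemeStride(data, 0) .. $]; }

    // O(1) length in code units; the other two must walk the string.
    @property size_t codeUnitLength() { return data.length; }
    @property size_t codePointLengthLinear() { return data.walkLength; }
    @property size_t graphemeLengthLinear() { return data.byGrapheme.walkLength; }
}

void main()
{
    // 'e' + U+0301 COMBINING ACUTE ACCENT, then '<'.
    auto r = StringRange("e\u0301<");
    writeln(r.codeUnitLength);        // 4
    writeln(r.codePointLengthLinear); // 3
    writeln(r.graphemeLengthLinear);  // 2

    r.popFrontGrapheme();             // skips 'e' and its accent together
    writeln(r.frontCodeUnit == '<');  // true
}
```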

Someone should be able to mix all the three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).
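
The code-unit fast path described above can be approximated with today's Phobos by peeking at the raw `char` and calling `std.utf.decode` only for non-ASCII input. This is a rough sketch, not the actual XML parser; the names `Tally` and `scan` are made up for the example.

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical result type for the example.
struct Tally
{
    size_t markup;   // number of '<' and '&' seen
    size_t nonAscii; // number of non-ASCII code points seen
}

Tally scan(string s)
{
    Tally t;
    size_t i = 0;
    while (i < s.length)
    {
        immutable char unit = s[i]; // code-unit peek, no decoding
        if (unit < 0x80)
        {
            // ASCII: the code unit is the whole code point.
            if (unit == '<' || unit == '&')
                ++t.markup;
            ++i;                    // pop one code unit
        }
        else
        {
            decode(s, i);           // decodes one code point and advances i
            ++t.nonAscii;
        }
    }
    return t;
}

void main()
{
    auto t = scan("<a>caf\u00e9 &amp; th\u00e9</a>");
    writeln(t.markup);   // 3
    writeln(t.nonAscii); // 2
}
```

This works because in UTF-8 every code unit below 0x80 is a complete ASCII character, so comparing against '<' or '&' never needs a decode.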

If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants. Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done.
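
Such a wrapper could be as small as the following sketch, which maps `front` and `popFront` to the code-point variants. The name `ByCodePoint` is hypothetical; the decoding itself uses `std.utf.decode` and `std.utf.stride`.

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.range.primitives : ElementType, isInputRange;
import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical adapter: presents a UTF-8 string as a range of code points.
struct ByCodePoint
{
    string data;

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decoding cost is paid on every call
    }

    void popFront() { data = data[stride(data, 0) .. $]; }
}

// The adapter satisfies the regular range interface.
static assert(isInputRange!ByCodePoint);
static assert(is(ElementType!ByCodePoint == dchar));

void main()
{
    // The wrapping is explicit at the call site, so the decoding is visible.
    writeln(ByCodePoint("caf\u00e9").walkLength);                // 4
    writeln(ByCodePoint("\u00e9t\u00e9").count(cast(dchar) 0xE9)); // 2
}
```

Analogous adapters could be written for the code-unit and grapheme variants; the point is that the caller picks the level by name.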


--
Michel Fortin
michel.for...@michelf.ca
http://michelf.ca
