On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev wrote:
On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu wrote:
Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.

Grapheme segmentation and normalization are distinct Unicode algorithms:

http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr29/

There are also several normalization algorithms.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
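To make the distinction concrete, here's a minimal sketch using byGrapheme and normalize, which std.uni already provides: segmentation groups code points into user-perceived characters without rewriting them, while normalization rewrites the code point sequence itself.

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // Decomposed "é": base 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string dec = "e\u0301";

    // Segmentation (UAX #29): one user-perceived character...
    assert(dec.byGrapheme.walkLength == 1);

    // ...but the underlying code points are untouched: still two of them.
    assert(dec.walkLength == 2); // string iterates by code point here

    // Normalization (UAX #15) is what rewrites the sequence, here
    // composing the pair into the precomposed U+00E9.
    assert(normalize!NFC(dec) == "\u00E9");
}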

How about this?

s.normalize!NFKD

to return a range of normalized code points?
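For what it's worth, std.uni already has an eager normalize that allocates and returns a whole normalized string; what's suggested above would presumably be a lazy, range-returning counterpart. A sketch of today's eager behaviour (the ligature below is U+FB01):

import std.uni : normalize, NFKD;

void main()
{
    // NFKD compatibility decomposition splits the "fi" ligature apart.
    assert(normalize!NFKD("\uFB01le") == "file");
    // The result is a freshly allocated string, not a lazy range.
}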

Clearly, no definition of string can handle this natively. As you say, there are multiple algorithms, so there is no one 'right' answer. byGrapheme is useful, but doesn't and cannot solve the normalization issue.
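Here's a sketch of the problem: two strings that are canonically equivalent (the user sees "café" in both cases) still differ code point by code point, so no amount of grapheme segmentation makes them compare equal; only normalization does.

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string pre = "caf\u00E9";  // precomposed: é is the single code point U+00E9
    string dec = "cafe\u0301"; // decomposed: 'e' plus U+0301 combining acute

    // Both segment into four graphemes, i.e. four user-perceived characters.
    assert(pre.byGrapheme.walkLength == 4);
    assert(dec.byGrapheme.walkLength == 4);

    // But the raw sequences differ, so equality fails without normalization.
    assert(pre != dec);
    assert(normalize!NFC(pre) == normalize!NFC(dec));
}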

I feel this discussion is tangential to the main debate: whether strings should be ranges of code points or code units. By code unit is faster by default and simpler to implement in Phobos (no more special-casing of strings). By code point works better when searching for individual code points, but as you rightly point out, that may not be useful in practice, since you rarely search for individual non-ASCII code points, and it isn't a complete solution anyway because of normalization.
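For concreteness, D already exposes both iteration modes through the foreach element type; this sketch just shows the difference in what each step yields:

import std.stdio : writeln;

void main()
{
    string s = "héllo";

    // By code unit: raw UTF-8 bytes; 'é' appears as two elements
    // (0xC3, 0xA9) and no decoding work is done per step.
    foreach (char c; s)
        writeln(cast(ubyte) c);

    // By code point: each step runs UTF-8 decoding and yields a
    // whole dchar; 'é' is a single element (U+00E9).
    foreach (dchar d; s)
        writeln(d);
}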

There are a few problems with going by code unit:

1. Searching a string/wstring for a dchar fails silently (see the sketch after this list). You have suggested making this a compilation error, but Andrei argues that would break lots of code. You counter that people may rarely search for a dchar anyway, so it might not matter.

2. It's a fundamental change. Regardless of which is better, we need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, so they will be accepted by algorithms such as sort, which will just produce garbage strings (also shown in the sketch after this list). Maybe this isn't an issue.
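Points 1 and 3 can be demonstrated in a few lines. This sketch uses std.utf.byCodeUnit as a stand-in for a by-code-unit string default; since 'é' occupies two UTF-8 code units, a code-unit search can never match it, and sort will cheerfully reorder code units into invalid UTF-8:

import std.algorithm : canFind, sort;
import std.stdio : writeln;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";

    // Point 1: 'é' (U+00E9) never equals any single UTF-8 code unit of s,
    // because it's encoded as the pair 0xC3 0xA9, so this is silently false.
    writeln(s.byCodeUnit.canFind('é'));

    // Point 3: code units are random access and sliceable, so sort
    // accepts them and tears the multi-unit sequences apart.
    auto units = cast(ubyte[]) s.dup;
    sort(units);
    writeln(units); // the bytes no longer form valid UTF-8
}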
