On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev wrote:
On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu wrote:
Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.

Grapheme segmentation and normalization are distinct Unicode algorithms:

http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr29/

There are also several normalization algorithms.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
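To make the distinction concrete, here's a minimal sketch using byGrapheme and normalize, which std.uni already provides: segmentation groups code points into user-perceived characters without rewriting them, while normalization rewrites the code point sequence itself.

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // Decomposed "é": base 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string dec = "e\u0301";

    // Segmentation (UAX #29): one user-perceived character...
    assert(dec.byGrapheme.walkLength == 1);

    // ...but the underlying code points are untouched: still two of them.
    assert(dec.walkLength == 2); // string iterates by code point here

    // Normalization (UAX #15) is what rewrites the sequence, here
    // composing the pair into the precomposed U+00E9.
    assert(normalize!NFC(dec) == "\u00E9");
}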

How about this?

s.normalize!NFKD

to return a range of normalized code points?
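For what it's worth, std.uni already has an eager normalize that allocates and returns a whole normalized string; what's suggested above would presumably be a lazy, range-returning counterpart. A sketch of today's eager behaviour (the ligature below is U+FB01):

import std.uni : normalize, NFKD;

void main()
{
    // NFKD compatibility decomposition splits the "fi" ligature apart.
    assert(normalize!NFKD("\uFB01le") == "file");
    // The result is a freshly allocated string, not a lazy range.
}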

Clearly, no definition of string can handle this natively. As you say, there are multiple algorithms, so there is no one 'right' answer. byGrapheme is useful, but doesn't and cannot solve the normalization issue.
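Here's a sketch of the problem: two strings that are canonically equivalent (the user sees "café" in both cases) still differ code point by code point, so no amount of grapheme segmentation makes them compare equal; only normalization does.

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string pre = "caf\u00E9";  // precomposed: é is the single code point U+00E9
    string dec = "cafe\u0301"; // decomposed: 'e' plus U+0301 combining acute

    // Both segment into four graphemes, i.e. four user-perceived characters.
    assert(pre.byGrapheme.walkLength == 4);
    assert(dec.byGrapheme.walkLength == 4);

    // But the raw sequences differ, so equality fails without normalization.
    assert(pre != dec);
    assert(normalize!NFC(pre) == normalize!NFC(dec));
}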

I feel this discussion is tangential to the main debate: whether strings should be ranges of code points or code units. By code unit is faster by default and simpler to implement in Phobos (no more special-casing of strings). By code point works better when searching for individual code points, but as you rightly point out, that may not be useful in practice, since you rarely search for individual non-ASCII code points, and it isn't a complete solution anyway because of normalization.
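For concreteness, D already exposes both iteration modes through the foreach element type; this sketch just shows the difference in what each step yields:

import std.stdio : writeln;

void main()
{
    string s = "héllo";

    // By code unit: raw UTF-8 bytes; 'é' appears as two elements
    // (0xC3, 0xA9) and no decoding work is done per step.
    foreach (char c; s)
        writeln(cast(ubyte) c);

    // By code point: each step runs UTF-8 decoding and yields a
    // whole dchar; 'é' is a single element (U+00E9).
    foreach (dchar d; s)
        writeln(d);
}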

There are a few problems with going by code unit:

1. Searching a string/wstring for a dchar fails silently (see the sketch after this list). You have suggested making this a compilation error, but Andrei argues that would break lots of code. You counter that people may rarely search for a dchar anyway, so it might not matter.

2. It's a fundamental change. Regardless of which is better, we need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, so they will be accepted by algorithms such as sort, which will just produce garbage strings (also shown in the sketch after this list). Maybe this isn't an issue.
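Points 1 and 3 can be demonstrated in a few lines. This sketch uses std.utf.byCodeUnit as a stand-in for a by-code-unit string default; since 'é' occupies two UTF-8 code units, a code-unit search can never match it, and sort will cheerfully reorder code units into invalid UTF-8:

import std.algorithm : canFind, sort;
import std.stdio : writeln;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";

    // Point 1: 'é' (U+00E9) never equals any single UTF-8 code unit of s,
    // because it's encoded as the pair 0xC3 0xA9, so this is silently false.
    writeln(s.byCodeUnit.canFind('é'));

    // Point 3: code units are random access and sliceable, so sort
    // accepts them and tears the multi-unit sequences apart.
    auto units = cast(ubyte[]) s.dup;
    sort(units);
    writeln(units); // the bytes no longer form valid UTF-8
}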
