On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>No, it doesn't.
>>
>>import std.algorithm;
>>
>>void main()
>>{
>>   auto s = "cassé";
>>   assert(s.canFind('é'));
>>}
>>
>
>Hm, I'm not following? Works perfectly fine on my system?

Probably because your browser is normalizing the unicode string when you
copy-n-paste Vladimir's message? See below:


Something's messing with your Unicode. Try downloading and compiling
this file:
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+0065, U+0301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+00E9 (precomposed small letter e with acute accent).

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of Unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a Unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.


T
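
To make the problem concrete, here's a small sketch of my own (not from Teoh's post) using std.uni.normalize: the two encodings of é compare unequal code point for code point until both sides are normalized.

import std.algorithm : canFind, equal;
import std.uni : NFC, normalize;

void main()
{
    string composed   = "cass\u00E9";  // é stored as c3 a9 (precomposed U+00E9)
    string decomposed = "casse\u0301"; // é stored as 65 cc 81 (e + combining U+0301)

    // Auto-decoding compares code point for code point, so these differ:
    assert(!equal(composed, decomposed));
    assert(!decomposed.canFind('\u00E9'));

    // Normalizing both sides to NFC reconciles them:
    assert(equal(normalize!NFC(composed), normalize!NFC(decomposed)));
    assert(normalize!NFC(decomposed).canFind('\u00E9'));
}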

To me, the status quo feels like an OK compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct; but working at the code unit level, as Walter proposes, is correct even less often, so that's not really an argument for moving to it. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct.
The current behaviour (code point level) is somewhat fast and somewhat correct.
The next level, graphemes, would be slowest of all but most correct.
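
For example (my own illustration, not from the discussion above; byGrapheme lives in std.uni in newer Phobos), the same string measures differently at each level:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "casse\u0301"; // "cassé" with a decomposed é

    // Code units: 5 ASCII bytes plus 2 bytes for U+0301.
    assert(s.length == 7);

    // Code points: auto-decoding sees 'e' and U+0301 separately.
    assert(s.walkLength == 6);

    // Graphemes: 'e' + U+0301 form a single cluster.
    assert(s.byGrapheme.walkLength == 5);
}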

It seems there is just no way to avoid the tradeoff between speed and correctness, so we shouldn't try to avoid it; we should only try to force the user to make a decision.

Maybe some more string types are in order (hrm). Ordered from fastest to most correct:

 string, wstring (code units)
 dstring         (code points)
+gstring         (graphemes)

(does grapheme-level comparison fully handle normalization? If not, we'd probably need another level, say, nstring)

Then if a user needs correctness over performance, they just work with gstrings. If they need performance over correctness, they work with strings (assuming some of Walter's idea happens; otherwise they'd work with string.representation).
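
Until such types exist, here's roughly what picking a level looks like (a sketch of my own under today's Phobos; representation is std.string.representation):

import std.algorithm : canFind;
import std.string : representation;
import std.uni : NFC, normalize;

void main()
{
    string s = "casse\u0301"; // non-normalized "cassé"

    // Performance over correctness: scan raw code units (immutable(ubyte)[]),
    // no auto-decoding, but only byte-exact matches are found.
    assert(s.representation.canFind(cast(ubyte) 's'));

    // Correctness over performance: normalize first, then search code points.
    assert(normalize!NFC(s).canFind('\u00E9'));
}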
