On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>> No, it doesn't.
>>>>
>>>> import std.algorithm;
>>>>
>>>> void main()
>>>> {
>>>>     auto s = "cassé";
>>>>     assert(s.canFind('é'));
>>>> }
>>>>
>>> Hm, I'm not following? Works perfectly fine on my system?
>
> Probably because your browser is normalizing the unicode string when
> you copy-n-paste Vladimir's message? See below:
>
>> Something's messing with your Unicode. Try downloading and compiling
>> this file:
>>
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> I downloaded the file and looked at it through `od -ctx1`: the first é
> is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small
> letter e + combining diacritic acute accent), whereas the second é is
> encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute
> accent).
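
(As a side note, the difference Teoh describes is easy to check from D itself, without trusting any copy-and-paste. A quick sketch, spelling the two forms as escapes and looking at their bytes via std.string.representation:

import std.string : representation;

void main()
{
    string combining   = "e\u0301"; // U+0065 plus U+0301 combining acute
    string precomposed = "\u00E9";  // U+00E9, precomposed é

    immutable ubyte[] nfdBytes = [0x65, 0xCC, 0x81]; // what od showed for the first é
    immutable ubyte[] nfcBytes = [0xC3, 0xA9];       // what od showed for the second é

    assert(combining.representation   == nfdBytes);
    assert(precomposed.representation == nfcBytes);

    // Same "character" to a human, different code units and code points.
    assert(combining != precomposed);
}
)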

> This illustrates one of my objections to Andrei's post: by auto-decoding
> behind the user's back and hiding the intricacies of unicode from him,
> it has masked the fact that codepoint-for-codepoint comparison of a
> unicode string is not guaranteed to always return the correct results,
> due to the possibility of non-normalized strings.
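
(To make the non-normalized case concrete: if the haystack happens to use the decomposed spelling, today's code-point-level canFind misses the precomposed é, and you'd have to run both sides through something like std.uni.normalize first. A rough sketch, assuming the default NFC form:

import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    // "cassé" spelled with a combining accent, like the first é in Vladimir's file.
    auto s = "casse\u0301";

    assert(!s.canFind('\u00E9'));           // code-point search misses it
    assert(normalize(s).canFind('\u00E9')); // NFC-normalized search finds it
}
)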

> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard. What we have in std.algorithm right
> now is an incomplete implementation with non-working edge cases (like
> Vladimir's example) that has poor performance to start with. Its only
> redeeming factor is that the auto-decoding hack has given it the
> illusion of being correct, when actually it's not correct according to
> the Unicode standard. I don't see how this is necessarily superior to
> Walter's proposal.
>
> T

To me, the status quo feels like an OK compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct, but working at the code unit level, as Walter proposes, is correct even less often, so that's not really an argument for moving to it. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct.
The current approach (code point level) is somewhat fast and somewhat correct.
The next level, graphemes, would be slowest of all but most correct.

It seems like there is just no way to avoid the tradeoff between speed and correctness, so we shouldn't try to avoid it; we should only try to force the user to make a decision.

Maybe some more string types are in order (hrm). Ordered from most performant to most correct:

string, wstring (code units)
dstring (code points)
+gstring (graphemes)

(Do graphemes completely normalize? If not, we'd probably need yet another level, say, nstring.)

Then, if a user needs correctness over performance, they just work with gstrings. If they need performance over correctness, they work with strings (assuming some of Walter's idea happens; otherwise they'd work with string.representation).
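
Here's a rough sketch of what those levels already look like with today's Phobos, using representation for the raw code-unit view and std.uni.byGrapheme as a stand-in for the hypothetical gstring (the exact counts assume the decomposed spelling of é):

import std.algorithm : canFind;
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    // "cassé" with the accent in decomposed form (e + U+0301).
    auto s = "casse\u0301";

    // Code units: fastest, no decoding; the raw UTF-8 bytes.
    assert(s.representation.length == 7);

    // Code points: today's auto-decoded default.
    assert(s.walkLength == 6);

    // Graphemes: slowest, but e + U+0301 counts as one user-perceived character.
    assert(s.byGrapheme.walkLength == 5);

    // Grapheme segmentation alone doesn't normalize, though: the last
    // grapheme still carries two code points, so comparisons would still
    // need normalization (or the nstring level suggested above).
    assert(s.byGrapheme.canFind!(g => g.length == 2
                                   && g[0] == 'e' && g[1] == '\u0301'));
}

The point being that every level is already reachable today; it's just that only the middle one is the silent default.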