On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
Yes, again and again I encountered length related bugs with
Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable; it just doesn't make the problems
go away. It only guarantees that two strings normalized to the
same form are binary-equal iff they are equal in the Unicode
sense. It says nothing about columns, string length, or grapheme count.
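To illustrate that guarantee (a minimal sketch using std.uni.normalize): two differently-composed spellings of the same character are binary-unequal as-is, but compare equal once both are normalized to the same form:

```d
import std.uni : normalize, NFC;

void main() {
    string composed   = "é";        // U+00E9, precomposed
    string decomposed = "e\u0301";  // 'e' followed by combining acute accent
    assert(composed != decomposed);                // different byte sequences
    assert(normalize!NFC(composed) ==
           normalize!NFC(decomposed));             // equal after NFC
}
```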
The problem is not normalization as such, the problem is with
string (as opposed to dstring):
import std.uni : normalize, NFC;

void main() {
    dstring de_one = "é";
    dstring de_two = "e\u0301";
    assert(de_one.length == 1);
    assert(de_two.length == 2);

    string e_one = "é";
    string e_two = "e\u0301";
    string random = "ab";
    assert(e_one.length == 2);
    assert(e_two.length == 3);
    assert(e_one.length == random.length);
    assert(normalize!NFC(e_one).length == 2);
    assert(normalize!NFC(e_two).length == 2);
}
This can lead to subtle bugs, cf. the equal lengths of random and
e_one. You have to convert everything to dstring to get the
"expected" result. However, this is not always desirable.
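If what you actually want is the number of user-perceived characters, std.uni.byGrapheme (combined with std.range.walkLength) counts graphemes directly on a string, with no dstring conversion needed; a minimal sketch:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main() {
    string e_one = "é";        // precomposed
    string e_two = "e\u0301";  // decomposed
    // byGrapheme walks user-perceived characters,
    // so both spellings count as a single grapheme
    assert(e_one.byGrapheme.walkLength == 1);
    assert(e_two.byGrapheme.walkLength == 1);
}
```

Note that grapheme iteration is O(n) rather than the O(1) .length lookup, which is the usual trade-off for correctness here.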