On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:

> Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.

I think it is 100% reliable; it just doesn't make the problems go away. It guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. It says nothing about columns, string length, or grapheme count.
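A minimal sketch of that guarantee: the two canonical spellings of "é" compare unequal byte-for-byte until both are normalized to the same form (variable names here are mine):

```d
import std.uni : normalize, NFC;

void main() {
    // Two canonically equivalent spellings of "é":
    dstring precomposed = "\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    dstring decomposed  = "e\u0301";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Not binary equal as-is...
    assert(precomposed != decomposed);

    // ...but binary equal once both are normalized to the same form.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}
```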

The problem is not normalization as such; the problem is with string (UTF-8, where .length counts code units) as opposed to dstring (UTF-32, where it counts code points):

import std.uni : normalize, NFC;
void main() {

  // dstring is UTF-32, so .length counts code points.
  dstring de_one = "é";        // precomposed U+00E9
  dstring de_two = "e\u0301";  // 'e' plus combining acute accent

  assert(de_one.length == 1);
  assert(de_two.length == 2);

  // string is UTF-8, so .length counts code units (bytes).
  string e_one = "é";          // 2 bytes in UTF-8
  string e_two = "e\u0301";    // 1 + 2 bytes in UTF-8

  string random = "ab";

  assert(e_one.length == 2);
  assert(e_two.length == 3);
  assert(e_one.length == random.length); // "é" is as "long" as "ab"!

  // NFC composes 'e' + U+0301 into U+00E9, but .length is still a byte count.
  assert(normalize!NFC(e_one).length == 2);
  assert(normalize!NFC(e_two).length == 2);
}

This can lead to subtle bugs; compare the lengths of random and e_one. You have to convert everything to dstring to get the "expected" (code-point) result. However, this is not always desirable.
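For what it's worth, if the user-perceived character count is what you're after, neither .length nor a dstring conversion gives it; counting graphemes does. A sketch using std.uni.byGrapheme (variable names are mine):

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main() {
    string e_one  = "\u00E9";   // precomposed é: 1 grapheme, 2 UTF-8 code units
    string e_two  = "e\u0301";  // decomposed é:  1 grapheme, 3 UTF-8 code units
    string random = "ab";       // 2 graphemes, 2 code units

    // .length disagrees across the two spellings of the same character...
    assert(e_one.length != e_two.length);

    // ...while counting graphemes gives the user-perceived count either way.
    assert(e_one.byGrapheme.walkLength == 1);
    assert(e_two.byGrapheme.walkLength == 1);
    assert(random.byGrapheme.walkLength == 2);
}
```

This works on string directly, so there is no need to round-trip through dstring, though grapheme iteration is of course slower than reading .length.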
