https://bugzilla.wikimedia.org/show_bug.cgi?id=36439
--- Comment #7 from jeb...@gmail.com 2012-06-29 10:06:52 UTC ---

Note
* the vast majority of input data is already in form C, using precomposed characters
* Form C is supposed to be relatively lossless, with the only changes being invisible transformations between base character + combining character sequences and precomposed chars. In theory text should never change appearance because it's been normalized to form C.
* and further, the W3C recommends it
http://www.mediawiki.org/wiki/Unicode_normalization_considerations#What_is_it.3F

This means that an accented character works if it can be normalized into a precomposed character. For example, O₂ and O² work because the subscript and superscript digits are already single code points (U+2082 and U+00B2) that form C leaves unchanged.

The code sequence U+030A COMBINING RING ABOVE preceded by an a might be interpreted as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, but it can also be interpreted as an a followed by a small ring. The same thing happens with a lot of accented letters.

There is also the problem of similar-looking characters, as the following shows:

    package main

    import "fmt"

    func main() {
        // UTF-8 bytes of U+212B ANGSTROM SIGN
        a1 := string([]byte{0xe2, 0x84, 0xab})
        // UTF-8 bytes of U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
        a2 := string([]byte{0xc3, 0x85})
        fmt.Println(a1, a2, a1 == a2)
    }

Prints:

    Å Å false

One character is the Angstrom sign, while the other is an A with a ring above, which is the usual character in Danish and Norwegian.

For now the aliases, labels and descriptions will be normalized into form C. The text will then be trimmed of leading and trailing whitespace, and internal whitespace will be compressed. Whitespace will only be handled for a limited set of whitespace characters.
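
The trimming and compression of whitespace described for aliases, labels and descriptions could look roughly like the sketch below. It is only an illustration, not the actual implementation: the names cleanText and limitedSpace are hypothetical, and the set of handled characters (space, tab, newline, carriage return) is an assumption, since the exact set is not stated here. Normalization into form C is assumed to have been done beforehand and is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// limitedSpace reports whether r belongs to the small set of whitespace
// characters that are handled. The exact set is an assumption made for
// this sketch.
func limitedSpace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\r':
		return true
	}
	return false
}

// cleanText (hypothetical name) trims leading and trailing whitespace and
// compresses each internal run of whitespace into a single space.
func cleanText(s string) string {
	// FieldsFunc splits around runs of matching runes and never returns
	// empty fields, so joining with one space does both steps at once.
	return strings.Join(strings.FieldsFunc(s, limitedSpace), " ")
}

func main() {
	fmt.Printf("%q\n", cleanText("  Ole \t  Rømer \n")) // prints "Ole Rømer"
}
```

A character outside the assumed set, such as U+00A0 NO-BREAK SPACE, is left untouched by this sketch, which matches the idea that only a limited set of whitespace characters is handled.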