https://bugzilla.wikimedia.org/show_bug.cgi?id=36439

--- Comment #7 from jeb...@gmail.com 2012-06-29 10:06:52 UTC ---
Note
* the vast majority of input data is already in form C, using precomposed
  characters
* Form C is supposed to be relatively lossless, with the only changes being
  invisible transformations between base character + combining character
  sequences and precomposed chars. In theory text should never change
  appearance because it's been normalized to form C.
* and further, the W3C recommends it

http://www.mediawiki.org/wiki/Unicode_normalization_considerations#What_is_it.3F

This means that an accented character works if it can be normalized into a
precomposed character. O₂ and O² also work, because the subscript and
superscript digits are already single code points that form C leaves
unchanged. The sequence a followed by U+030A COMBINING RING ABOVE might be
interpreted as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, but it can also be
interpreted as an a followed by a small ring. The same thing happens with a
lot of accented letters.
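
To make the two representations concrete, here is a minimal sketch using only
the Go standard library. It does not perform normalization; it only shows that
the base letter plus combining mark and the precomposed letter are different
code point sequences until something normalizes them. The helper name
codePoints is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints lists the code points of s in U+XXXX notation.
// (Hypothetical helper, not part of any MediaWiki or Wikibase code.)
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	decomposed := "a\u030a" // a followed by COMBINING RING ABOVE
	precomposed := "\u00e5" // LATIN SMALL LETTER A WITH RING ABOVE

	fmt.Println(codePoints(decomposed), utf8.RuneCountInString(decomposed))
	fmt.Println(codePoints(precomposed), utf8.RuneCountInString(precomposed))
	// Without normalization the two strings do not compare equal.
	fmt.Println(decomposed == precomposed)
}
```

Prints:

[U+0061 U+030A] 2
[U+00E5] 1
false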

There is also the problem of similar-looking characters, as the following
shows:

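package main

import "fmt"

func main() {
    // UTF-8 bytes for U+212B ANGSTROM SIGN
    a1 := string([]byte{0xe2, 0x84, 0xab})
    // UTF-8 bytes for U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    a2 := string([]byte{0xc3, 0x85})
    // The two look the same but compare as different strings
    fmt.Println(a1, a2, a1 == a2)
}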

Prints:

Å Å false

The first character is U+212B ANGSTROM SIGN while the second is U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE, which is the usual letter in Danish and
Norwegian.

For now the aliases, labels and descriptions will be normalized to form C.
The text will then be trimmed of leading and trailing whitespace, and
internal whitespace will be compressed. Whitespace handling will cover only a
limited set of whitespace characters.
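
The trimming and compressing step could look roughly like the sketch below.
It is an assumption-laden illustration, not the actual implementation: the
set of whitespace characters (space, tab, LF, CR) is a guess, since the
comment does not say which characters are covered, and the function names
are made up. The form C step is left out because the Go standard library has
no normalizer; the golang.org/x/text/unicode/norm package could be applied
before this step.

```go
package main

import (
	"fmt"
	"strings"
)

// isLimitedSpace reports whether r belongs to the small whitespace set
// assumed for this sketch: space, tab, line feed, carriage return.
func isLimitedSpace(r rune) bool {
	switch r {
	case ' ', '\t', '\n', '\r':
		return true
	}
	return false
}

// cleanWhitespace drops leading and trailing whitespace and replaces each
// internal run of whitespace with a single space.
func cleanWhitespace(s string) string {
	return strings.Join(strings.FieldsFunc(s, isLimitedSpace), " ")
}

func main() {
	fmt.Printf("%q\n", cleanWhitespace("  Douglas \t\n Adams  "))
}
```

Prints:

"Douglas Adams"

Characters outside the assumed set, such as U+00A0 NO-BREAK SPACE, are left
untouched by this sketch.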
