A better example is:

Å (U+212B: ANGSTROM SIGN)
Å (U+00C5: LATIN CAPITAL LETTER A WITH RING ABOVE)
Å (U+0041: LATIN CAPITAL LETTER A and U+030A: COMBINING RING ABOVE)

These all look the same, and pretty much mean the same thing. Luckily, we don't have to argue about what the characters "mean"; that's a job for the Unicode Consortium. For example, they have decided that:

А (U+0410: CYRILLIC CAPITAL LETTER A)

does *not* map onto U+0041.  Whatever.

The important thing is that all three of the Å's above canonicalize (in NFKC) to the same code point, U+00C5, whose UTF-8 bytes are:

C3 85 (hex)

or

C3 A5 (hex)

if you've got case folding turned on (that's U+00E5, the lowercase å). The decomposed sequence 41 CC 8A is what NFD or NFKD would give you instead.

This way you can compare them for equality.
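A minimal sketch of that comparison, using Python's standard `unicodedata` module. The helper name `canonical` and its `fold_case` flag are made up for illustration. This is plain NFKC plus Python's own case folding, not a full stringprep profile:

```python
import unicodedata

ANGSTROM_SIGN = "\u212B"          # Å ANGSTROM SIGN
A_WITH_RING = "\u00C5"            # Å LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_COMBINING_RING = "A\u030A" # A followed by COMBINING RING ABOVE
CYRILLIC_A = "\u0410"             # А CYRILLIC CAPITAL LETTER A


def canonical(text, fold_case=False):
    """Return the NFKC form of text, optionally case-folded (hypothetical helper)."""
    normalized = unicodedata.normalize("NFKC", text)
    if fold_case:
        # Fold case, then normalize again in case folding changed the form.
        normalized = unicodedata.normalize("NFKC", normalized.casefold())
    return normalized


forms = [ANGSTROM_SIGN, A_WITH_RING, A_PLUS_COMBINING_RING]

# The raw strings differ, but their normalized forms are identical.
print(len(set(forms)))                      # three distinct inputs
print({canonical(f) for f in forms})        # a single normalized form
print(canonical(ANGSTROM_SIGN).encode("utf-8").hex())
print(canonical(ANGSTROM_SIGN, fold_case=True).encode("utf-8").hex())

# The Cyrillic look-alike stays distinct from the Latin letter.
print(canonical(CYRILLIC_A) == canonical("A"))
```

The point of normalizing both sides before comparing is that equality becomes an ordinary string (or byte) comparison afterwards.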

Oh, another favorite example of mine is Ⅷ (U+2167: ROMAN NUMERAL EIGHT). NFKC turns this into VIII, or viii with case folding turned on. There are some more examples here:

http://jabberstudio.org/cgi-bin/viewcvs.cgi/cvs/jabber-net/test/stringprep/
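To try the compatibility mappings yourself, here is a rough sketch in Python. It approximates the stringprep mapping step as "NFKC, then case fold, then NFKC again" using only `unicodedata` and `str.casefold()`. That is an assumption made for illustration, not the actual RFC 3454 tables, and the name `nfkc_casefold` is invented here. The sample characters are the roman numeral above plus the ones quoted later in this message:

```python
import unicodedata


def nfkc_casefold(text):
    """Approximate a stringprep-style mapping: NFKC, case fold, NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


# Source character -> mapping claimed in this thread.
EXAMPLES = {
    "\u2167": "viii",     # Ⅷ ROMAN NUMERAL EIGHT
    "\u2109": "\u00B0f",  # ℉ DEGREE FAHRENHEIT -> °f
    "\u2122": "tm",       # ™ TRADE MARK SIGN
    "\u2102": "c",        # ℂ DOUBLE-STRUCK CAPITAL C
    "\u2139": "i",        # ℹ INFORMATION SOURCE
    "\u2116": "no",       # № NUMERO SIGN
    "\u00B2": "2",        # ² SUPERSCRIPT TWO
}

for source, expected in EXAMPLES.items():
    actual = nfkc_casefold(source)
    print(f"U+{ord(source):04X} -> {actual!r} (matches: {actual == expected})")
```

A real implementation would use the mapping and prohibition tables of the relevant stringprep profile instead of this shortcut.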

On Dec 10, 2005, at 1:49 PM, Yves Goergen wrote:

On 10.12.2005 12:28 (+0100), Matthias Wimmer wrote:
Examples of mapped characters are:

“℉” (U+2109, single character!) is mapped to “°f” (two characters),
“™” (U+2122, single character!) is mapped to “tm” (two characters),
“ℂ” (U+2102) is mapped to “c”,
“ℹ” (U+2139) is mapped to “i”,
“№” (U+2116, single character!) is mapped to “no” (two characters),
“²” (U+00B2) is mapped to “2”.

What's the point of mapping similar-looking characters onto one another?
They are simply not the same, and mapping a character from one language's
set to a character of some arbitrary other language can badly disturb sorting.
very much. Imagine our alphabet was A,B,D,F,G,H,...,C,E only because C
and E were mapped to the greerillew language characters that look
similar (or vice versa). Well anyway, I don't think I need this for now.
I'll simply make sure it's Unicode-capable, plugging in a string
converter later is still possible.

--
Yves Goergen "LonelyPixel" <[EMAIL PROTECTED]>
"Does the movement of the trees make the wind blow?"
http://newsboard.unclassified.de - Unclassified NewsBoard Forum

