On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.for...@michelf.com> said:
A funny exercise to make a fool of an algorithm working only with code
points would be to replace the word "fortune" in a text containing the
word "fortuné". If the last "é" is expressed as two code points, as "e"
followed by a combining acute accent (this: é), replacing occurrences
of "fortune" with "expose" would also turn "fortuné" into "exposé",
because the combining acute accent is left behind as the code point
right after the replaced word and attaches to the new final "e". Quite
amusing, but it doesn't really make sense for it to work like that.
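
A minimal sketch of that exercise, in Python purely for illustration (the
sample sentence is made up). It assumes the final "é" is stored as "e"
followed by U+0301, and relies on the built-in str.replace, which matches
code point by code point:

```python
# Assumption: "fortuné" is spelled with a decomposed final letter,
# i.e. "e" followed by U+0301 COMBINING ACUTE ACCENT.
text = "Il est fortune\u0301."

# str.replace compares code points, so it matches the "fortune" prefix
# and leaves the combining accent behind.
result = text.replace("fortune", "expose")

# The orphaned accent now follows the final "e" of "expose".
print(result == "Il est expose\u0301.")
```

The comparison is written with escapes so the outcome does not depend on
how a news client or editor re-encodes the accented characters.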
In the case of "é", we're lucky enough to also have a precomposed
character that encodes it as a single code point, so encountering "é"
written as two code points is quite rare. But not all combinations of
marks and characters can be represented as a single code point. The
correct thing to do is to treat "é" (single code point) and "é" ("e" +
combining acute accent) as equivalent.
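
One way to get that equivalence is Unicode normalization. The sketch below
uses Python's standard unicodedata module; choosing NFC as the common form
is an assumption made for the example, and NFD would serve equally well
for the comparison:

```python
import unicodedata

precomposed = "fortun\u00e9"    # "é" as the single code point U+00E9
decomposed = "fortune\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ.
print(precomposed == decomposed)

# After normalizing both to the same form they compare equal.
same = (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))
print(same)

# Normalizing to NFC before replacing also avoids the false match on
# "fortune" here, but only because "é" happens to have a precomposed form.
replaced = unicodedata.normalize("NFC", decomposed).replace("fortune", "expose")
print(replaced == precomposed)
```

Normalization alone does not help for sequences that have no precomposed
form; those still need matching that respects combining marks.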
Crap, I meant to send this as UTF-8 with combining characters in it,
but my news client converted everything to ISO-8859-1.
I'm not sure it'll work, but here's my second attempt at posting real
combining marks:
Single code point: é
e with combining mark: é
t with combining mark: t̂
t with two combining marks: t̂̃
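
To check what actually arrived, the four samples can be rebuilt from
explicit escapes and inspected. This is only a sketch with Python's
unicodedata module; the code points U+0302 (circumflex) and U+0303 (tilde)
are my assumption about which marks were meant on the "t":

```python
import unicodedata

# Assumed spellings of the four samples, written as escapes so that no
# client or editor can silently recompose them.
samples = {
    "single code point": "\u00e9",
    "e with combining mark": "e\u0301",
    "t with combining mark": "t\u0302",
    "t with two combining marks": "t\u0302\u0303",
}

nfc_lengths = {}
for label, s in samples.items():
    names = [unicodedata.name(c) for c in s]
    nfc = unicodedata.normalize("NFC", s)
    nfc_lengths[label] = len(nfc)
    print(f"{label}: {len(s)} code point(s), {len(nfc)} after NFC, {names}")
```

The "e" pair collapses to one code point under NFC, while both "t"
samples keep their combining marks because no precomposed character
exists for them.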
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/