On 2011-01-12 14:57:58 -0500, spir <denis.s...@gmail.com> said:
On 01/12/2011 08:28 PM, Don wrote:
I think the only problem that we really have is that "char[]" and
"dchar[]" imply that code points are always the appropriate level of
abstraction.

I'd like to know when a code point actually is the appropriate
level of abstraction.

I agree with you. I don't see many uses for code points.

One such use is writing a parser for a format defined in terms of
code points (XML, for instance). But beyond that, I don't see one.

* If pieces of text are not manipulated, meaning they are just used in
the application, or just passed through it as is (from file / input /
literal to any kind of output), then any kind of encoding just
works. One can even concatenate, provided all pieces use the same
encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, comparing, searching,
counting, replacing, not to mention regex/parsing) requires operating at
the _higher_ level of characters (in the common sense), just like with
historic character sets, in which codes represented characters
(not lower-level thingies as in UCS). Otherwise, one reads, compares,
and changes meaningless bits of text.

Very true. In the same way that code points can span multiple code
units, user-perceived characters (graphemes) can span multiple code
points.
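
To make the three levels concrete, here is a rough sketch in Python (chosen only for illustration, since it ships with a Unicode database). The function names are made up for this example, and the grapheme count is a deliberate simplification: it only skips code points with a non-zero canonical combining class, which is not the full grapheme cluster algorithm.

```python
import unicodedata

def utf8_code_units(text: str) -> int:
    """Number of 8-bit code units when the text is encoded as UTF-8."""
    return len(text.encode("utf-8"))

def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units when the text is encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def code_points(text: str) -> int:
    """Number of code points (a Python str is a sequence of code points)."""
    return len(text)

def rough_graphemes(text: str) -> int:
    """Approximate count of user-perceived characters.

    Assumption: a grapheme is a base code point plus any following
    code points with a non-zero combining class. Real segmentation
    has more rules than this.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one perceived character.
decomposed_e_acute = "e\u0301"

# U+1D11E MUSICAL SYMBOL G CLEF: one code point outside the BMP.
g_clef = "\U0001D11E"

for sample in (decomposed_e_acute, g_clef):
    print(
        utf8_code_units(sample),
        utf16_code_units(sample),
        code_points(sample),
        rough_graphemes(sample),
    )
```

The first sample is one character to the reader but two code points; the second is one code point but several code units in both UTF-8 and UTF-16.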
A funny exercise to make a fool of an algorithm working only with code
points would be to replace the word "fortune" in a text containing the
word "fortuné". If the last "é" is expressed as two code points, as "e"
followed by a combining acute accent (this: é), replacing occurrences
of "fortune" with "expose" would also turn "fortuné" into "exposé",
because the combining acute accent remains as the code point following
the word. Quite amusing, but it doesn't really make sense for it to work
like that.
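
The exercise can be reproduced in a few lines. This is only a sketch in Python: `replace_respecting_marks` is a hypothetical helper written for this example, and it assumes that "followed by a code point with a non-zero combining class" is a good enough test for "the match ends in the middle of a character", which does not cover every case.

```python
import unicodedata

def replace_respecting_marks(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, but leave alone any match that is
    immediately followed by a combining mark, since that match ends
    in the middle of a user-perceived character."""
    if not old:
        raise ValueError("old must be a non-empty string")
    pieces = []
    pos = 0
    while True:
        start = text.find(old, pos)
        if start == -1:
            pieces.append(text[pos:])
            break
        end = start + len(old)
        splits_character = (
            end < len(text) and unicodedata.combining(text[end]) != 0
        )
        if splits_character:
            pieces.append(text[pos:end])      # keep the match as it was
        else:
            pieces.append(text[pos:start])    # text before the match
            pieces.append(new)
        pos = end
    return "".join(pieces)

# "fortuné" written as "fortune" + U+0301 COMBINING ACUTE ACCENT.
text = "fortune and fortune\u0301"

# Code-point-level replacement: the accent lands on the new last letter.
naive = text.replace("fortune", "expose")

# Mark-aware replacement: the accented word is left untouched.
careful = replace_respecting_marks(text, "fortune", "expose")
```

With the naive version the second word comes out as "expose" plus the leftover accent, which renders as "exposé"; the mark-aware version only replaces the first word.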

In the case of "é", we're lucky enough to also have a precomposed
character that encodes it as a single code point, so encountering "é"
written as two code points is quite rare. But not all combinations of
marks and characters can be represented as a single code point. The
correct thing to do is to treat "é" (single code point) and "é" ("e" +
combining acute accent) as equivalent.
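
One way to get that equivalence is to normalize both strings before comparing them. A minimal sketch in Python using the standard `unicodedata.normalize` function; `canonically_equal` is a name invented for this example, and NFC is just one of the normalization forms that would work here.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# A plain code point comparison sees two different strings.
plain_equal = precomposed == decomposed

# After normalization they compare equal.
normalized_equal = canonically_equal(precomposed, decomposed)

# Not every combination has a precomposed form: "q" + acute accent
# stays two code points even after composing normalization.
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(plain_equal, normalized_equal, len(q_acute))
```

The last case is why normalization alone does not let an algorithm pretend that one code point equals one character.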
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/