On 2011-01-12 14:57:58 -0500, spir <denis.s...@gmail.com> said:

On 01/12/2011 08:28 PM, Don wrote:
I think the only problem that we really have is that "char[]" and
"dchar[]" imply that code points are always the appropriate level of
abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of code points (XML for instance). But beyond that, I don't see one.


* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text.

Very true. In the same way that code points can span multiple code units, user-perceived characters (graphemes) can span multiple code points.
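
To make the three levels concrete, here is a rough sketch in Python (not D, just because its standard library makes this short). The helper name describe is made up for illustration, and counting "characters" by skipping combining marks is only an approximation of real grapheme segmentation, which handles more cases than this.

```python
import unicodedata

def describe(text):
    """Return (UTF-8 code units, code points, approximate user-perceived characters)."""
    code_units = len(text.encode("utf-8"))   # bytes in the UTF-8 encoding
    code_points = len(text)                  # a Python str is indexed by code point
    # Assumption: treat every non-combining code point as the start of one
    # user-perceived character. Proper grapheme clustering is more involved.
    approx_chars = sum(1 for c in text if unicodedata.combining(c) == 0)
    return code_units, code_points, approx_chars

precomposed = describe("\u00e9")   # "é" as a single code point
decomposed = describe("e\u0301")   # "e" followed by a combining acute accent

print(precomposed)
print(decomposed)
```

Both strings look like one character to a reader, yet they differ at the code unit level and at the code point level.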

A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that.
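
Here is a small sketch of that exercise, again in Python since its str.replace works on code points. The variable names are mine and nothing here is specific to D's string types; it only illustrates the behaviour described above.

```python
import unicodedata

# "fortuné" written as "fortune" + U+0301 (combining acute accent)
decomposed_word = "fortune\u0301"
# "fortuné" written with the pre-combined U+00E9
precomposed_word = "fortun\u00e9"

# A replacement that only sees code points matches the first seven code
# points of the decomposed form and leaves the accent dangling behind.
replaced_decomposed = decomposed_word.replace("fortune", "expose")
replaced_precomposed = precomposed_word.replace("fortune", "expose")

# Normalizing to NFC shows what a reader would see after the replacement.
visible_result = unicodedata.normalize("NFC", replaced_decomposed)

print(replaced_decomposed == "expose\u0301")
print(visible_result == "expos\u00e9")
print(replaced_precomposed == precomposed_word)
```

The same word is replaced or left alone depending only on how the "é" happened to be encoded, which is the inconsistency I'm pointing at.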

In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.
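
One way to get that equivalence, sketched in Python with the standard unicodedata module, is to normalize both sides before comparing. The function name canonically_equal is hypothetical, and normalizing to NFC is one choice among several (NFD would work as well for this purpose).

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normal form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"    # "é" as one code point
decomposed = "e\u0301"    # "e" + combining acute accent

plain_equal = precomposed == decomposed              # code point comparison
normalized_equal = canonically_equal(precomposed, decomposed)

# Not every combination has a pre-combined form: "q" + acute accent stays
# as two code points even after composition.
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(plain_equal, normalized_equal, len(q_acute))
```

Normalization handles equality, but since some combinations never collapse to one code point, searching and replacing still needs to work on whole graphemes.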

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
