Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin Wed, 12 Jan 2011 16:50:54 -0800

On 2011-01-12 14:57:58 -0500, spir <denis.s...@gmail.com> said:

On 01/12/2011 08:28 PM, Don wrote:
I think the only problem that we really have is that "char[]" and
"dchar[]" imply that the code point is always the appropriate level of
abstraction.
I'd like to know when it happens that codepoint is the appropriate level of abstraction.


I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of code points (XML for instance). But beyond that, I don't see one.

* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text.

Very true. In the same way that code points can span multiple code units, user-perceived characters (graphemes) can span multiple code points.
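To make the three levels concrete, here is a small sketch in Python (not D, purely for illustration). It assumes only that Python's `len()` on a `str` counts code points, and it uses "é" in its two spellings as the sample. Both spellings are one user-perceived character, yet they differ in code point count and in code unit count.

```python
precomposed = "\u00e9"   # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Code points: len() on a Python str counts code points.
code_points = (len(precomposed), len(decomposed))

# Code units: a UTF-8 code unit is 1 byte, a UTF-16 code unit is 2 bytes.
utf8_units = (len(precomposed.encode("utf-8")),
              len(decomposed.encode("utf-8")))
utf16_units = (len(precomposed.encode("utf-16-le")) // 2,
               len(decomposed.encode("utf-16-le")) // 2)

print("code points :", code_points)
print("UTF-8 units :", utf8_units)
print("UTF-16 units:", utf16_units)
```

None of these counts says "one character" for both spellings, which is the point: the grapheme level sits above all of them.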

A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that.
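The exercise is easy to reproduce. The sketch below uses Python (chosen only for illustration), whose `str.replace` matches code point sequences. The sample sentence is made up. The accent in the decomposed spelling is written explicitly as U+0301.

```python
# "fortuné" spelled with "e" + U+0301 COMBINING ACUTE ACCENT
decomposed_text = "fortune, fortune\u0301"
# The same words with a precomposed "é" (U+00E9)
precomposed_text = "fortune, fortun\u00e9"

# str.replace works on code points, so it matches the "fortune" prefix of
# the decomposed word and leaves the combining accent behind.
decomposed_result = decomposed_text.replace("fortune", "expose")
precomposed_result = precomposed_text.replace("fortune", "expose")

print(decomposed_result)    # second word now renders as "exposé"
print(precomposed_result)   # second word is left alone
```

The two inputs look identical on screen, but the replacement treats them differently.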

In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.
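One common way to get that equivalence is Unicode normalization: bring both strings to the same normalization form before comparing. Here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `equivalent` is hypothetical, and picking NFC rather than NFD is an arbitrary choice for this example.

```python
import unicodedata

def equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "fortun\u00e9"    # "é" as a single code point
decomposed = "fortune\u0301"    # "e" + combining acute accent

print(precomposed == decomposed)            # raw code point comparison
print(equivalent(precomposed, decomposed))  # comparison after normalization

# "q" + combining acute accent has no precomposed form, so it stays
# two code points even after NFC.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))
```

Normalization only settles equivalence. It does not say where one user-perceived character ends and the next begins, as the "q" case shows: that still takes grapheme-aware iteration.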


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
