On 01/12/2011 08:28 PM, Don wrote:
> I think the only problem that we really have, is that "char[]",
> "dchar[]" implies that code points is always the appropriate level of
> abstraction.
I'd like to know when the code point is ever the appropriate level of
abstraction.
* If pieces of text are not manipulated, meaning they are only used as is
by the application, or just passed through it (from file / input /
literal to any kind of output), then any encoding just works. One can
even concatenate, provided all pieces use the same encoding.
--> a _lower_ level than code point is OK.
* But any kind of manipulation (indexing, slicing, comparing, searching,
counting, replacing, not to mention regex/parsing) requires operating at
the _higher_ level of characters (in the common sense of the word), just
as with historic character sets, in which each code represented a
character (not a lower-level thingy as in UCS). Otherwise, one reads,
compares, and changes meaningless bits of text.
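
To make the second point concrete, here is a small sketch. It is in Python only because its str type happens to be a sequence of code points, so it shows exactly what code-point-level operations do; the variable names are mine, chosen for illustration.

```python
# The same visible character "é", encoded two ways.
precomposed = "\u00e9"    # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # two code points: "e" + COMBINING ACUTE ACCENT

# Counting: the same character has two different lengths.
count_precomposed = len(precomposed)
count_decomposed = len(decomposed)

# Comparing: the two forms are unequal at the code point level.
equal_as_code_points = (precomposed == decomposed)

# Slicing: cutting after 4 code points separates the accent from its base.
word = "cafe\u0301"       # displays as "café", but holds 5 code points
sliced = word[:4]         # plain "cafe", the accent is gone

# Searching: the precomposed "é" is not found in the decomposed word.
found = precomposed in word

print(count_precomposed, count_decomposed, equal_as_code_points, found)
```

Every one of these operations runs without error, and every one gives an answer that is wrong from the point of view of a reader who sees a single "é".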
As I see it now, we need 2 types:
* One plain string similar to the good old ones (a bytestring would do the
job, since most Unicode text is UTF-8 encoded) for the first kind of use
above, with an optional validity check when it is supposed to be Unicode
text.
* One higher-level type abstracting away from code point (not code unit)
issues and restoring the necessary properties: (1) each character is one
element in the sequence; (2) each character is always represented the
same way.
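
A rough sketch of what these two types could do, again in Python. The helpers is_valid_utf8 and to_characters are hypothetical names of mine, not an existing API. The grouping rule is a deliberate simplification, attaching combining marks to the preceding base; real character segmentation (grapheme clusters, UAX #29) covers more cases.

```python
import unicodedata
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Optional validity check for the plain bytestring type."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_characters(s: str) -> List[str]:
    """Hypothetical higher-level view: one list element per character.

    Property (2): NFC normalization gives each character one representation.
    Property (1): remaining combining marks are attached to their base.
    Simplified: this is not full grapheme cluster segmentation.
    """
    normalized = unicodedata.normalize("NFC", s)
    chars: List[str] = []
    for cp in normalized:
        if chars and unicodedata.combining(cp) != 0:
            chars[-1] += cp      # combining mark: extend the previous character
        else:
            chars.append(cp)     # base code point: start a new character
    return chars


# First type: bytes pass through and concatenate without being decoded.
joined = "caf".encode("utf-8") + "\u00e9".encode("utf-8")

# Second type: both spellings of "é" become the same single element.
same = to_characters("\u00e9") == to_characters("e\u0301")

# "q" + combining acute has no precomposed form, yet stays one element.
q_acute = to_characters("q\u0301x")
```

With such a view, indexing, slicing, comparing and counting operate on elements that each stand for one character, which is what the manipulation case needs.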
Denis
_________________
vita es estrany
spir.wikidot.com