On 01/12/2011 08:28 PM, Don wrote:
I think the only problem that we really have, is that "char[]",
"dchar[]" implies that code points is always the appropriate level of
abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction. * If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK. * But any of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Else, one reads, compares, changes meaningless bits of text.

As I see it now, we need 2 types:
* One plain string similar to good old ones (bytestring would do the job, since most unicode is utf8 encoded) for the first kind of use above. With optional validity check when it's supposed to be unicode text. * One hiher-level type abstracting from codepoint (not code unit) issues, restoring the necessary properties: (1) each character is one element in the sequence (2) each character is always represented the same way.


Denis
_________________
vita es estrany
spir.wikidot.com

Reply via email to