spir wrote:
> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
>
> I'd like to know when it happens that codepoint is the appropriate level
> of abstraction.

When working on a document that describes code points... :)

> * If pieces of text are not manipulated, meaning just used in the
> application, or just transferred via the application as is (from file /
> input / literal to any kind of output), then any kind of encoding just
> works. One can even concatenate, provided all pieces use the same
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... (I may be alone in this, but ordering is tied to an alphabet (or writing system), not a locale.)

I try to solve that issue with my trileri library:

  http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of collation at all; it has its own simplistic view of text, where every character is an entity that can be lower/upper cased to a single character.
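
To illustrate the simplistic view I mean, here is a minimal sketch in Python. It is not the trileri code; the names TURKISH_ALPHABET, turkish_lower, turkish_upper and turkish_key are made up for this example, and it assumes every letter is a single code point that maps to a single code point when cased.

```python
# Hypothetical helpers for illustration only; not taken from trileri.
# Assumption: each letter is one code point and cases to one code point.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET)}

def turkish_lower(text):
    # Handle dotless/dotted i explicitly, then lowercase the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()

def turkish_upper(text):
    # Handle dotless/dotted i explicitly, then uppercase the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()

def turkish_key(word):
    # Rank letters by their position in the alphabet; anything outside
    # the alphabet sorts after it, in code point order.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in turkish_lower(word)]

words = ["zil", "çay", "ılık", "can", "iz", "şal", "su"]
by_code_point = sorted(words)                  # plain code point order
by_alphabet = sorted(words, key=turkish_key)   # Turkish alphabet order

print(by_code_point)
print(by_alphabet)
print(turkish_upper("iz"), "iz".upper())
```

Sorting by code point pushes every word that starts with ç, ı or ş past "zil", while the alphabet-based key puts them where a Turkish reader expects them. The same goes for casing: the default upper() turns "i" into "I", which is the wrong letter in Turkish.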

> search, count,
> replace, not to speak about regex/parsing) requires operating at the
> _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) always be collated? If so, wouldn't it be impossible to put those two in that order, say, in a textbook? (Perhaps Unicode defines a way to stop collation.)
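
If "collated" here is read as "combined into one character", the following sketch may help frame the question. It uses Python's standard unicodedata module and rests on my understanding that Unicode has two different acute accents: the combining one (U+0301), which attaches to the preceding letter, and the spacing one (U+00B4), which stands on its own. The variable names are only for this example.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
spacing = "e\u00b4"      # 'e' followed by the spacing ACUTE ACCENT

# The two spellings of the accented letter differ code point by code point...
print(decomposed == precomposed)

# ...but canonical composition (NFC) maps one onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# The spacing accent is left alone by NFC, so 'e' and the accent stay separate.
print(unicodedata.normalize("NFC", spacing) == spacing)
```

So, under that reading, a textbook could show the letter and then the accent by using the spacing form, and the two would not merge.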

> Just like with
> historic character sets in which codes used to represent characters (not
> lower-level thingies as in UCS). Else, one reads, compares, changes
> meaningless bits of text.
>
> As I see it now, we need 2 types:

I think we need more than 2 types...

> * One plain string similar to good old ones (bytestring would do the
> job, since most unicode is utf8 encoded) for the first kind of use
> above. With optional validity check when it's supposed to be unicode text.

Agreed. D gives us three UTF encodings, but I am not sure that there is only one abstraction above that.
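
As a rough illustration of the three encodings (sketched in Python rather than D, with variable names made up for the example), the same text has a different length in code units under each one, corresponding to what char[], wchar[] and dchar[] would hold:

```python
# 'a', 'ğ' (U+011F) and MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP)
text = "a\u011f\U0001D11E"

utf8_units = len(text.encode("utf-8"))             # 8-bit code units
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(text.encode("utf-32-le")) // 4   # 32-bit code units
code_points = len(text)                            # Python counts code points

print(utf8_units, utf16_units, utf32_units, code_points)
```

Only the 32-bit form has one code unit per code point, and even that says nothing yet about how many characters a reader would see.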

> * One higher-level type abstracting from codepoint (not code unit)
> issues, restoring the necessary properties: (1) each character is one
> element in the sequence (2) each character is always represented the
> same way.

I think VLERange should solve only the variable-length-encoding issue. It should not get into higher abstractions.
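
To make that boundary concrete, here is a minimal sketch in Python of a range that handles only the variable-length part. It is not a proposal for VLERange's actual interface; code_points is a hypothetical name, and the decoding itself is delegated to Python's standard incremental UTF-8 decoder.

```python
import codecs

def code_points(data):
    """Lazily turn UTF-8 code units (bytes) into code points, and nothing more."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for i in range(len(data)):
        # Feed one code unit at a time; the decoder returns an empty
        # string until a complete sequence has been seen.
        for ch in decoder.decode(data[i:i + 1]):
            yield ord(ch)
    # Raises UnicodeDecodeError if the input ends in a truncated sequence.
    decoder.decode(b"", final=True)

units = "e\u0301\u011f".encode("utf-8")   # 'e', COMBINING ACUTE ACCENT, 'ğ'
points = list(code_points(units))

print(len(units), [hex(p) for p in points])
```

The five code units become three code points. That a reader sees only two characters (an accented e and a ğ) is exactly the higher abstraction this layer would leave to something else.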

Ali
