spir wrote:

> What I have in mind is a "UText" type that provides the right abstraction
> for text processing / string manipulation as one has when dealing with ASCII
> (in fact any legacy character set). All that is needed is having a true
> one-to-one mapping between characters (in the common sense) and elements of
> strings (what I call "code stacks"); one given stack unambiguously denotes
> one character. To reach this point, in addition to decoding (e.g. from utf8 to
> code points), we must:

> * group codes into stacks
> * normalize (into 'NFD')

Are those operations independent of the context? Is stacking always desired?

I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.

> * sort points in stacks

Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with the other Unicode normalization forms. (Judging from your posts, I know that you know all of this. :) )
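
To make the quoted steps concrete for myself, here is a minimal sketch in Python (chosen only because its standard library ships NFD; it is not D and not spir's UText). The name to_stacks is hypothetical, and the grouping rule is an assumption: a stack is a base code point plus the combining marks that follow it, which is simpler than full Unicode grapheme segmentation.

```python
import unicodedata

def to_stacks(text):
    """Decompose to NFD, then group each base code point with the
    combining marks that follow it (a simplified notion of "stack")."""
    stacks = []
    for cp in unicodedata.normalize("NFD", text):
        if stacks and unicodedata.combining(cp) != 0:
            stacks[-1] += cp      # combining mark: attach to the current stack
        else:
            stacks.append(cp)     # base code point: start a new stack
    return stacks

composed = "\u00e7a"       # precomposed c-cedilla, then 'a'
decomposed = "c\u0327a"    # 'c' + COMBINING CEDILLA, then 'a'

# Both spellings yield the same two stacks, so index 0 is the whole character.
assert to_stacks(composed) == to_stacks(decomposed) == ["c\u0327", "a"]

# NFD also puts the marks inside a stack into canonical order,
# which covers the "sort points in stacks" step.
assert to_stacks("a\u0301\u0323") == to_stacks("a\u0323\u0301")
```

Once the stacks are in a list (or an array in D), indexing and slicing work per character rather than per code point.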

> Then, we can for instance index or slice in O(1) as usual, and get a
> consistent substring of _characters_ [...] I do not want to deal with
> anything related to script-, language-, locale-specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text.

For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I am arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)
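
A small sketch of what "alphabet-aware" means for these two operations, written in Python and using the Turkish alphabet as the example. The names TR_LOWER, TR_UPPER, tr_upper, tr_lower and tr_sort_key are made up for this illustration; this is not the code from our module, and letters outside the alphabet are simply left unchanged and sorted last.

```python
# The 29 letters of the Turkish alphabet, in alphabet order.
TR_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TR_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

TO_UPPER = str.maketrans(TR_LOWER, TR_UPPER)
TO_LOWER = str.maketrans(TR_UPPER, TR_LOWER)
RANK = {ch: i for i, ch in enumerate(TR_LOWER)}

def tr_upper(s):
    """Uppercase by position in the alphabet: i -> İ and ı -> I."""
    return s.translate(TO_UPPER)

def tr_lower(s):
    """Lowercase by position in the alphabet: İ -> i and I -> ı."""
    return s.translate(TO_LOWER)

def tr_sort_key(word):
    """Sort key that follows alphabet order instead of code point order."""
    return [RANK.get(ch, len(RANK)) for ch in tr_lower(word)]

assert tr_upper("ali") == "ALİ"      # the alphabet-blind "ali".upper() gives "ALI"
assert tr_lower("IŞIK") == "ışık"
assert sorted(["dal", "çay", "can"], key=tr_sort_key) == ["can", "çay", "dal"]
```

Sorting by code point would put "çay" after "dal", because ç has a higher code point than d.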

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

    http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread.

It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Stepping outside the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".
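
The idea can be sketched in Python as a list of pieces, each tagged with its alphabet. The representation and the names UPPER_RULES, upper_in and upper_text are hypothetical and much simpler than our actual type; the sketch only assumes that "jim" and "&" are tagged as English and "ali" as Turkish.

```python
# Per-alphabet exceptions to the default uppercasing rules.
UPPER_RULES = {
    "en": {},                      # default str.upper() is already right
    "tr": {"i": "İ", "ı": "I"},    # Turkish dotted and dotless i
}

def upper_in(alphabet, piece):
    """Apply the alphabet's exceptions first, then the default rules."""
    table = str.maketrans(UPPER_RULES[alphabet])
    return piece.translate(table).upper()

def upper_text(segments):
    """Uppercase a text made of (alphabet, piece) pairs, piece by piece."""
    return "".join(upper_in(alphabet, piece) for alphabet, piece in segments)

text = [("en", "jim"), ("en", " & "), ("tr", "ali")]
assert upper_text(text) == "JIM & ALİ"
```

Without the tags there is no way to know that the "i" in "jim" and the "i" in "ali" need different capitals.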

I am thinking of something more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

Thanks,
Ali
