Re: [review] new string type

Ali Çehreli Fri, 03 Dec 2010 14:15:47 -0800

spir wrote:

> What I have in mind is a "UText" type that provides the right abstraction
> for text processing / string manipulation as one has when dealing with ASCII
> (in fact any legacy character set). All what is needed is having a true
> one-to-one mapping between characters (in the common sense) and elements of
> strings (what I call "code stacks"); one given stack unambiguously denotes
> one character. To reach this point, in addition to decoding (e.g. from utf8 to
> code points), we must:


> * group codes into stacks
> * normalize (into 'NFD')

Are those operations independent of the context? Is stacking always desired?

I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.


> * sorts points in stacks

Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all of this. :) )
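
To make sure I understand the three steps (normalize, group, sort), here is a minimal sketch. It is in Python rather than D only because its standard `unicodedata` module already provides NFD; the function name `to_stacks` is made up for illustration and is not from the proposal. It assumes a stack is a base code point followed by its combining marks (those with a nonzero canonical combining class), which is simpler than the full Unicode grapheme cluster rules. NFD already puts the marks inside a stack in canonical order, so no separate sorting step is needed here.

```python
import unicodedata


def to_stacks(text):
    """Normalize to NFD, then group each base code point with the
    combining marks that follow it. Simplified: only code points with a
    nonzero canonical combining class are treated as marks."""
    stacks = []
    for cp in unicodedata.normalize("NFD", text):
        if stacks and unicodedata.combining(cp) != 0:
            stacks[-1] += cp      # attach the mark to the current stack
        else:
            stacks.append(cp)     # start a new stack with a base code point
    return stacks


# Precomposed and decomposed input end up as the same stack.
precomposed = to_stacks("\u00e9")      # U+00E9
decomposed = to_stacks("e\u0301")      # 'e' + combining acute accent

# Marks given in a different order end up in the same canonical order.
reordered = to_stacks("a\u0301\u0323")  # acute (class 230), dot below (class 220)
```

Once the stacks live in an array, indexing and slicing by position operate on whole characters, which I take to be the point of the proposal.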


> Then, we can for instance index or slice in O(1) as usual, and get a
> consistent substring of _characters_ [...] I do not want to deal with
> anything related to script-, language-, locale- specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text.

For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter English alphabet.

I hope I don't sound like I am arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:


- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)
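
Both operations depend on the alphabet rather than on code point values. As a minimal sketch of that idea (in Python for brevity; the names `TR_LOWER`, `TR_UPPER`, `tr_upper`, `tr_lower` and `tr_key` are hypothetical and are not taken from our D module), an alphabet can be described by two parallel strings listing its letters in order:

```python
# The 29 letters of the Turkish alphabet, in alphabetical order.
TR_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TR_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

TO_UPPER = str.maketrans(TR_LOWER, TR_UPPER)
TO_LOWER = str.maketrans(TR_UPPER, TR_LOWER)
RANK = {ch: i for i, ch in enumerate(TR_LOWER)}


def tr_upper(s):
    """Uppercase using the Turkish pairs (i -> İ, ı -> I)."""
    return s.translate(TO_UPPER)


def tr_lower(s):
    """Lowercase using the Turkish pairs (İ -> i, I -> ı)."""
    return s.translate(TO_LOWER)


def tr_key(s):
    """Sort key: position in the alphabet; anything else sorts after
    all letters, ordered by code point."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in tr_lower(s)]


words = sorted(["dal", "çam", "cam"], key=tr_key)  # ç lands between c and d
```

Sorting the same list by code point would put "çam" last, because ç (U+00E7) comes after every ASCII letter.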

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:


    http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread.

It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".
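
A rough sketch of how such a mixed text could work, assuming each piece carries a tag naming its alphabet. This is in Python for brevity, and the tags, table names and `upper_text` are hypothetical; our D code is organized differently.

```python
# One uppercase table per alphabet (only two alphabets shown here).
EN_TABLE = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
TR_TABLE = str.maketrans("abcçdefgğhıijklmnoöprsştuüvyz",
                         "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ")
UPPER = {"en": EN_TABLE, "tr": TR_TABLE}


def upper_text(pieces):
    """Uppercase every (alphabet, piece) pair with its own alphabet's
    rules and join the results."""
    return "".join(piece.translate(UPPER[alphabet])
                   for alphabet, piece in pieces)


# "jim & " is tagged as English and "ali" as Turkish.
mixed = upper_text([("en", "jim & "), ("tr", "ali")])
```

The same letters tagged as English would come out as "ALI", which is why the alphabet has to travel with each piece.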

I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.


Thanks,
Ali
