On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a@a.a> wrote:

"Andrei Alexandrescu" <seewebsiteforem...@erdani.org> wrote in message
news:igoqrm$1n5r$1...@digitalmars.com...
On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]
[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same
four-letter sequence. In the second example, the 'u' and the {umlaut
combining character} combine to form one grapheme. The f's and n's just
happen to be single-code-point graphemes.

Note that while some characters exist in pre-combined form (such as the
{u with the umlaut} above), legend has it there are others that can only be
represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple
combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut,
there is one code point that corresponds to the entire combination. Are
there combinations that do not have a unique code point?


My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old
letter or number (like maybe a 7 with an umlaut), though I'm not certain.
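To make the "7 with an umlaut" case concrete, here is a minimal sketch in Python (not D, purely for illustration) using the standard unicodedata module. The helper name nfc_len is made up for this example. It shows that u plus the combining diaeresis composes to one code point, while 7 plus the same combining character has no pre-combined form and stays as two:

```python
import unicodedata

def nfc_len(s):
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", s))

u_umlaut = "u\u0308"      # 'u' + COMBINING DIAERESIS
seven_umlaut = "7\u0308"  # '7' + COMBINING DIAERESIS

print(nfc_len(u_umlaut))      # 1: composes to U+00FC
print(nfc_len(seven_umlaut))  # 2: no pre-combined code point exists
```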

FWIW, the Wikipedia article might help, or at least link to other things
that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we need to look at. Using decomposed canonical form would mean we need more state than just which code unit we are on, plus it increases the likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier). So I think the correct choice is to use composed canonical form. This is after just reading that page, so maybe I'm missing something.
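As a rough illustration of the partial-match concern, here is a small Python sketch (Python only because its unicodedata module makes this easy to try; it says nothing about how Phobos would do it). A plain code-point search for "fu" succeeds against the decomposed form of the word from Nick's example, but not against the composed form:

```python
import unicodedata

word = "f\u00fcnf"  # the four-letter word from the example, with pre-combined u-umlaut
nfd = unicodedata.normalize("NFD", word)  # decomposed: f, u, U+0308, n, f
nfc = unicodedata.normalize("NFC", word)  # composed:   f, U+00FC, n, f

print(len(nfd), len(nfc))  # 5 4
print("fu" in nfd)         # True: matches only part of the u-umlaut grapheme
print("fu" in nfc)         # False
```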

Non-composable combinations would be a problem. The string range is formed on the basis that the element type is a dchar. If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info). The other option is to simply leave them decomposed. Then you risk things like partial matches.
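For comparison, this is roughly what the "element type is an array" alternative looks like, sketched in Python. The function name clusters is hypothetical, and the rule used here (a base character followed by any code points with a nonzero canonical combining class) is only an approximation of real grapheme boundaries, which are more involved:

```python
import unicodedata

def clusters(s):
    """Yield a base character together with the combining marks that follow it.

    Approximation only: a code point counts as "combining" here when
    unicodedata.combining() returns a nonzero class.
    """
    current = ""
    for ch in unicodedata.normalize("NFC", s):
        # A non-combining code point starts a new cluster.
        if current and unicodedata.combining(ch) == 0:
            yield current
            current = ""
        current += ch
    if current:
        yield current

print(list(clusters("fu\u0308nf")))  # four elements, each a single code point
print(list(clusters("7\u0308!")))    # two elements; the first holds two code points
```

Each element is a short string rather than a single code point, which is the extra cost being weighed here.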

I'm leaning towards a solution like this: while iterating a string, it should output dchars in normalized composed form. But a specialized comparison function should be used when doing things like searches or regex, because some base-plus-combining-character sequences cannot be composed into a single dchar.
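A minimal sketch of what such a specialized search might look like, again in Python for illustration. The name find_whole is hypothetical, and the boundary test (reject a match if the next code point has a nonzero combining class) is an assumed simplification: it only checks the end of the match and ignores needles that themselves begin with a combining character.

```python
import unicodedata

def find_whole(haystack, needle):
    """Index (in the NFC-normalized haystack) of the first match of needle
    that does not stop in the middle of a base + combining sequence, or -1."""
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # Accept only if the match is not followed by a combining mark.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return start
        start = h.find(n, start + 1)
    return -1

print(find_whole("7\u0308 up", "7"))          # -1: the only '7' carries an umlaut
print(find_whole("7\u0308 up", "7\u0308"))    # 0
print(find_whole("fu\u0308nf", "f\u00fc"))    # 0: both sides normalized first
```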

The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar.

Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir.

Does that sound reasonable?

-Steve
