There are a lot of good questions here. Some comments:

At 11:38 AM 3/6/02 -0600, [EMAIL PROTECTED] wrote:

> >The sole change required would be for the CGJ to be Me instead of Mn.
> >If we made this change, it would provide for a mechanism for
> >representing diacritics over multiple characters, without the addition
> >of any other characters -- or the wait for them to be encoded.
>
>Let me make sure I have this straight: Say you had ää and wanted to draw a 
>breve over the whole pair of characters.  How would you express that?
>
>I'm guessing that the answer is:
>
>a<umlaut><CGJ>a<umlaut><CGJ><breve>
>
>What it looks like is that you're taking the INVISIBLE ENCLOSING MARK 
>which someone proposed a while back and giving this semantic to the 
>CGJ.  Seems like this gives the CGJ at least two distinct jobs:
>
>1) It causes the grapheme clusters on either side to be treated as a 
>single grapheme cluster.
>2) It causes the preceding grapheme cluster to be treated as a single unit 
>for the purpose of applying non-spacing marks (i.e., a non-spacing mark 
>normally applies to the preceding base character; a CGJ causes it to be 
>applied to the preceding grapheme cluster instead).
>
>The name "combining grapheme joiner" only suggests job #1 to me, and that 
>makes me a little dubious about extending its charter to include job 
>#2.  Can we be completely confident that situations won't arise where the 
>semantics of CGJ won't be ambiguous, where you don't know for sure whether 
>meaning #1 or meaning #2 is intended?  Even if we can, will the double 
>usage be confusing to people?

This is a tough question and, like you, I suspect that we don't have the answer.
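
To make the quoted example concrete, here is a minimal Python sketch that spells out the guessed sequence in code points and prints each character's general category. The sequence is only the guess quoted above, not an established encoding; the variable names are mine. The categories come from whatever Unicode data ships with the Python build, which reports CGJ as Mn, the category the proposal would change to Me.

```python
import unicodedata

# Code points are from the Unicode character database; the sequence itself
# is only the quoted guess at how the proposal would encode the example.
UMLAUT = "\u0308"  # COMBINING DIAERESIS
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER
BREVE = "\u0306"   # COMBINING BREVE

sequence = "a" + UMLAUT + CGJ + "a" + UMLAUT + CGJ + BREVE

# Print code point, general category, and character name for each element.
for ch in sequence:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```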

>I think you can disambiguate them by specifying the following rules...
>
>1) If CGJ is followed by a non-combining character, meaning #1 (the 
>original CGJ meaning) is intended.
>2) If CGJ is followed by a combining character, meaning #2 (IEM) is intended.
>
>...but I don't know that this is a good idea.
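
The two quoted rules can be sketched as a small classifier. This is only an illustration of the rules as stated, not a proposal for how an implementation should work; the function names are hypothetical. Two assumptions that the rules do not cover: a CGJ at the very end of the text is given meaning #1, and a CGJ followed by another CGJ is given meaning #2, because CGJ is itself a combining mark.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def is_combining(ch):
    # General categories Mn, Mc, and Me are the combining marks.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def classify_cgj(text):
    """Yield (index, meaning) for each CGJ under the two quoted rules."""
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt is not None and is_combining(nxt):
            yield i, 2  # followed by a combining character: IEM-like meaning
        else:
            yield i, 1  # followed by a non-combining character (or nothing,
                        # by assumption): original joiner meaning

# The guessed sequence: a, diaeresis, CGJ, a, diaeresis, CGJ, breve.
example = "a\u0308" + CGJ + "a\u0308" + CGJ + "\u0306"
print(list(classify_cgj(example)))
```

In the guessed sequence the first CGJ is followed by a base letter and the second by the breve, so the two occurrences get different meanings.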
>
>[start of off-topic rambling]
>
>I haven't read the most recent draft of Unicode 3.2 yet, but this whole 
>grapheme-cluster thing has always felt rather ill-defined to me, 
>especially when it comes to how grapheme clusters and combining marks 
>behave.  As I see it, grapheme clusters have the following purposes:

Be careful - Base character + combining marks are also 'clusters' (even 
those not using CGJ) and many of these 'rules' do not apply to them.

>1) In a text-editing application, arrow keys generally move forward and 
>back an entire grapheme cluster at a time.

Not true for clusters containing Mc (spacing combining marks).

>2) In a text-editing application, the backspace and delete keys generally 
>delete whole grapheme clusters.

Not true for clusters containing Mc (spacing combining marks).

>3) Grapheme clusters are always kept together on a single line, even in 
>cases where words aren't.
>4) A search on a piece of text shouldn't report a hit if the matching text 
>doesn't begin and end on grapheme-cluster boundaries.

I suspect that this may not be true for clusters containing Mc (spacing 
combining marks), but I may be wrong about this one. It depends on whether 
it makes sense to allow searches for a common prefix regardless of whether 
it is continued with an Mc.
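
The open question about Mc can be sketched as a search that takes the boundary policy as a parameter. This is a simplified illustration, assuming that a boundary is decided purely by the general category of the following character; it is not the full grapheme-cluster definition, and the function names are hypothetical.

```python
import unicodedata

def starts_cluster(text, i, fused):
    # Position i is a boundary if it is at either end of the text or the
    # character there is not a mark that fuses with what precedes it.
    if i <= 0 or i >= len(text):
        return True
    return unicodedata.category(text[i]) not in fused

def find_hits(text, pattern, fused):
    """Return start offsets of matches that begin and end on boundaries."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        if starts_cluster(text, start, fused) and starts_cluster(text, end, fused):
            hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

KA, VOWEL_I = "\u0915", "\u093F"  # DEVANAGARI LETTER KA, VOWEL SIGN I (Mc)
text = KA + VOWEL_I

print(find_hits(text, KA, {"Mn", "Me", "Mc"}))  # Mc fused into the cluster
print(find_hits(text, KA, {"Mn", "Me"}))        # Mc treated as separable
```

With Mc fused, a search for the bare consonant reports nothing; with Mc separable, it reports the prefix match. Which behavior users expect is exactly the question.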

>5) Language-sensitive comparison should generally treat grapheme clusters 
>as single units (i.e., a grapheme cluster maps to a single collation 
>element, not to one collation element for each component part).

Not true: o-umlaut may be collated as if it were oe under some tailorings. 
Similar things may happen with other clusters. [I realize that this is not 
the same as sorting o and umlaut as two units, but the simplistic 'one 
cluster-one unit' rule is deceptive.]
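
A toy sort key shows the kind of tailoring I mean. This is a minimal sketch with a made-up expansion table, not the Unicode Collation Algorithm and not any real locale's rules; it only demonstrates one cluster mapping to more than one collation unit.

```python
import unicodedata

# Hypothetical toy tailoring: each umlauted vowel sorts as vowel + "e".
EXPANSIONS = {
    "\u00E4": "ae",  # LATIN SMALL LETTER A WITH DIAERESIS
    "\u00F6": "oe",  # LATIN SMALL LETTER O WITH DIAERESIS
    "\u00FC": "ue",  # LATIN SMALL LETTER U WITH DIAERESIS
}

def sort_key(s):
    # Compose first so "o" + COMBINING DIAERESIS and precomposed
    # o-umlaut produce the same key.
    composed = unicodedata.normalize("NFC", s).lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in composed)

words = ["of", "o\u0308", "od"]
print(sorted(words, key=sort_key))
```

Here the single cluster o + umlaut lands between "od" and "of" because its key is the two units "oe".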

>6) Enclosing marks apply to the preceding grapheme cluster.
>7) Sometimes, the other combining marks also apply to the preceding 
>grapheme cluster.
>
>Leaving aside for a moment the fact that I'm not sure the same sequences 
>of characters should be considered "grapheme clusters" for all of the 
>above purposes, 6 and 7 bother me.
>
>The big problem with 6 is that we've stated that a combining character 
>sequence is a grapheme cluster.  An enclosing mark, being a combining 
>character, would thus be part of a combining character sequence.  So 
>you've got some sequence of code points being treated as a "grapheme 
>cluster" solely for the purpose of figuring out how to draw the enclosing 
>mark.  The enclosing mark gets treated as part of the same "grapheme 
>cluster" as the characters it encloses (and, for that matter, any 
>following combining marks) for all other purposes.  You've got grapheme 
>clusters inside grapheme clusters.  This seems confusing and weird.
>
>7 is even more problematic.  Unicode 3.2 says explicitly that a 
>non-spacing mark applies to an entire Hangul syllable, and not just to the 
>last jamo, when the syllable is spelled out in jamo, and that it does this 
>because a Hangul syllable is a grapheme cluster.  But when a "grapheme 
>cluster" is formed with a CGJ, a non-spacing mark only applies to the last 
>character in the grapheme cluster (unless, if we adopt this new rule, the 
>last character happens to be another CGJ).  It's not clear whether a 
>generic non-spacing mark (such as a tilde or macron) would apply to an 
>entire Indic syllable cluster (following the Hangul-syllable precedent) or 
>just to the last component (following the CGJ precedent).  And, of course, 
>if a non-spacing mark follows a normal combining character sequence, it's 
>just considered a component part of a grapheme cluster and applies to the 
>preceding base character.
>
>Enclosing marks, on the other hand, always apply to the immediately 
>preceding grapheme cluster, so they interact with grapheme clusters (and 
>particularly with CGJ) differently from non-spacing marks.  I'm guessing 
>combining spacing marks interact with grapheme clusters the same way 
>non-spacing marks do, but this isn't clear either.  In any event, you've 
>now got different types of combining marks behaving differently in ways 
>that you didn't have before grapheme clusters were introduced, and this 
>seems questionable.

No, Me always behaved differently. Take applying a circumflex to a sequence 
ending in combining (enclosing) circle. Clearly the circumflex needs to be 
positioned relative to the circle, i.e. centered, even if the base 
character would have had a non-centered accent (e.g. accent on top of a J 
or L might not be centered on the glyph, but centered on the stem). In 
other words, Me always created an unanalyzable cluster.

>I think I'm coming around to the idea that there should be one concept of 
>"a group of code points" that's used to affect analysis algorithms such as 
>searching, sorting, line breaking, and arrow-key movement, and another 
>"group of code points" that's used to affect mark positioning, and that 
>different formatting characters be used to control the partitioning of 
>code points into the different types of groups.  In any event, I feel that 
>the grapheme-cluster concept in its current state (or at least in its 
>state as of about six weeks ago) isn't as well-thought-out as it should be.

Since Mc (the spacing combining marks) are typically edited one at a time 
and are not, like Mn (the non-spacing marks), fused into the cluster for 
editing, it's tough to come up with a *single* set of rules that works for 
all 7 of your contexts.
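
A minimal Python sketch of why one rule set is hard: the same text splits into different units depending on which mark categories are fused to the preceding character. The splitting rule here is deliberately simplistic (general category only, no Hangul or CGJ handling), and the function name and the two policy sets are my own assumptions for illustration.

```python
import unicodedata

def clusters(text, fused):
    """Split text into units: a character whose general category is in
    `fused` is attached to the unit before it."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) in fused:
            units[-1] += ch
        else:
            units.append(ch)
    return units

# a + COMBINING DIAERESIS (Mn), then DEVANAGARI KA + VOWEL SIGN I (Mc)
text = "a\u0308" + "\u0915\u093F"

editing = clusters(text, {"Mn", "Me"})          # Mc edited separately
rendering = clusters(text, {"Mn", "Me", "Mc"})  # Mc kept with its base
print(len(editing), len(rendering))
```

The editing policy yields three units and the other policy yields two, so an arrow-key rule and a line-breaking rule built on one shared "cluster" definition would disagree on this text.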
