There are a lot of good questions here. Some comments:

At 11:38 AM 3/6/02 -0600, [EMAIL PROTECTED] wrote:
> >The sole change required would be for the CGJ to be Me instead of Mn.
> >If we made this change, it would provide for a mechanism for
> >representing diacritics over multiple characters, without the addition
> >of any other characters -- or the wait for them to be encoded.
>
>Let me make sure I have this straight: Say you had ää and wanted to draw a
>breve over the whole pair of characters. How would you express that?
>
>I'm guessing that the answer is:
>
>a<umlaut><CGJ>a<umlaut><CGJ><breve>
>
>What it looks like is that you're taking the INVISIBLE ENCLOSING MARK
>which someone proposed a while back and giving this semantic to the
>CGJ. Seems like this gives the CGJ at least two distinct jobs:
>
>1) It causes the grapheme clusters on either side to be treated as a
>single grapheme cluster.
>2) It causes the preceding grapheme cluster to be treated as a single unit
>for the purpose of applying non-spacing marks (i.e., a non-spacing mark
>normally applies to the preceding base character; a CGJ causes it to be
>applied to the preceding grapheme cluster instead).
>
>The name "combining grapheme joiner" only suggests job #1 to me, and that
>makes me a little dubious about extending its charter to include job
>#2. Can we be completely confident that situations won't arise where the
>semantics of CGJ will be ambiguous, where you don't know for sure whether
>meaning #1 or meaning #2 is intended? Even if we can, will the double
>usage be confusing to people?

This is a tough question and, like you, I suspect that we don't have the answer.

>I think you can disambiguate them by specifying the following rules...
>
>1) If CGJ is followed by a non-combining character, meaning #1 (the
>original CGJ meaning) is intended.
>2) If CGJ is followed by a combining character, meaning #2 (IEM) is intended.
>
>...but I don't know that this is a good idea.
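
For what it's worth, the two disambiguation rules quoted above can be sketched in a few lines of Python. This is only an illustration, not a proposal: `cgj_reading` is a hypothetical helper, and it assumes that "combining character" means any code point whose general category starts with M (Mn, Mc or Me). A CGJ at the very end of the text is not covered by either rule, so the sketch returns None there.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def cgj_reading(text, i):
    """Return 1 or 2 for the CGJ at index i, following the two quoted rules.

    Hypothetical helper for illustration; returns None when the CGJ is the
    last code point, since neither rule says what to do in that case.
    """
    if text[i] != CGJ:
        raise ValueError("no CGJ at this index")
    if i + 1 >= len(text):
        return None
    following = text[i + 1]
    # Assumption: "combining character" = general category Mn, Mc or Me.
    if unicodedata.category(following).startswith("M"):
        return 2  # rule 2: the following mark applies to the preceding cluster
    return 1      # rule 1: the clusters on either side are joined


# a<umlaut><CGJ>a<umlaut><CGJ><breve>
seq = "a\u0308" + CGJ + "a\u0308" + CGJ + "\u0306"
readings = [cgj_reading(seq, i) for i, ch in enumerate(seq) if ch == CGJ]
```

On the example sequence the first CGJ is followed by a letter and the second by the breve, so the two joiners get different readings. Note that a CGJ followed by another CGJ would fall under rule 2 as long as CGJ itself is a combining mark.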
>
>[start of off-topic rambling]
>
>I haven't read the most recent draft of Unicode 3.2 yet, but this whole
>grapheme-cluster thing has always felt rather ill-defined to me,
>especially when it comes to how grapheme clusters and combining marks
>behave. As I see it, grapheme clusters have the following purposes:

Be careful: base character + combining marks are also 'clusters' (even those not using CGJ), and many of these 'rules' do not apply to them.

>1) In a text-editing application, arrow keys generally move forward and
>back an entire grapheme cluster at a time.

Not true for clusters containing Mc (spacing combining marks).

>2) In a text-editing application, the backspace and delete keys generally
>delete whole grapheme clusters.

Not true for clusters containing Mc (spacing combining marks).

>3) Grapheme clusters are always kept together on a single line, even in
>cases where words aren't.
>4) A search on a piece of text shouldn't report a hit if the matching text
>doesn't begin and end on grapheme-cluster boundaries.

I suspect that this may not be true for clusters containing Mc (spacing combining marks), but I may be wrong about this one. It depends on whether it makes sense to allow searches for common prefixes regardless of whether they are continued with an Mc.

>5) Language-sensitive comparison should generally treat grapheme clusters
>as single units (i.e., a grapheme cluster maps to a single collation
>element, not to one collation element for each component part).

Not true: o-umlaut may be collated as if it were oe under some tailorings. Similar things may happen with other clusters. [I realize that this is not the same as sorting o and umlaut as two units, but the simplistic 'one cluster, one unit' rule is deceptive.]

>6) Enclosing marks apply to the preceding grapheme cluster.
>7) Sometimes, the other combining marks also apply to the preceding
>grapheme cluster.
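
To make the remark on point 5 concrete, here is a toy sketch in Python of a tailoring that expands o-umlaut to oe before comparing. It is not the real collation algorithm: `toy_sort_key` is a made-up name, and the single replacement rule is an assumption chosen only to show one cluster turning into two comparison units.

```python
import unicodedata


def toy_sort_key(word):
    """Toy tailoring (illustrative assumption): o-umlaut compares as 'oe'."""
    # Compose first so that o + COMBINING DIAERESIS becomes a single U+00F6.
    composed = unicodedata.normalize("NFC", word).lower()
    # One cluster in, two comparison units out.
    return composed.replace("\u00f6", "oe")


decomposed = "Go\u0308the"  # o followed by COMBINING DIAERESIS
same_key = toy_sort_key(decomposed) == toy_sort_key("Goethe")
```

Under this toy rule the cluster o + diaeresis contributes two units to the key, so the 'one cluster, one unit' picture does not hold even though o and the umlaut are never sorted separately.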
>
>Leaving aside for a moment the fact that I'm not sure the same sequences
>of characters should be considered "grapheme clusters" for all of the
>above purposes, 6 and 7 bother me.
>
>The big problem with 6 is that we've stated that a combining character
>sequence is a grapheme cluster. An enclosing mark, being a combining
>character, would thus be part of a combining character sequence. So
>you've got some sequence of code points being treated as a "grapheme
>cluster" solely for the purpose of figuring out how to draw the enclosing
>mark. The enclosing mark gets treated as part of the same "grapheme
>cluster" as the characters it encloses (and, for that matter, any
>following combining marks) for all other purposes. You've got grapheme
>clusters inside grapheme clusters. This seems confusing and weird.
>
>7 is even more problematic. Unicode 3.2 says explicitly that a
>non-spacing mark applies to an entire Hangul syllable, and not just to the
>last jamo, when the syllable is spelled out in jamo, and that it does this
>because a Hangul syllable is a grapheme cluster. But when a "grapheme
>cluster" is formed with a CGJ, a non-spacing mark only applies to the last
>character in the grapheme cluster (unless, if we adopt this new rule, the
>last character happens to be another CGJ). It's not clear whether a
>generic non-spacing mark (such as a tilde or macron) would apply to an
>entire Indic syllable cluster (following the Hangul-syllable precedent) or
>just to the last component (following the CGJ precedent). And, of course,
>if a non-spacing mark follows a normal combining character sequence, it's
>just considered a component part of a grapheme cluster and applies to the
>preceding base character.
>
>Enclosing marks, on the other hand, always apply to the immediately
>preceding grapheme cluster, so they interact with grapheme clusters (and
>particularly with CGJ) differently from non-spacing marks. I'm guessing
>combining spacing marks interact with grapheme clusters the same way
>non-spacing marks do, but this isn't clear either. In any event, you've
>now got different types of combining marks behaving differently in ways
>that you didn't have before grapheme clusters were introduced, and this
>seems questionable.

No, Me always behaved differently. Take applying a circumflex to a sequence ending in a combining (enclosing) circle. Clearly the circumflex needs to be positioned relative to the circle, i.e. centered, even if the base character would have had a non-centered accent (e.g. an accent on top of a J or L might not be centered on the glyph, but centered on the stem). In other words, Me always created an unanalyzable cluster.

>I think I'm coming around to the idea that there should be one concept of
>"a group of code points" that's used to affect analysis algorithms such as
>searching, sorting, line breaking, and arrow-key movement, and another
>"group of code points" that's used to affect mark positioning, and that
>different formatting characters be used to control the partitioning of
>code points into the different types of groups. In any event, I feel that
>the grapheme-cluster concept in its current state (or at least in its
>state as of about six weeks ago) isn't as well-thought-out as it should be.

Since Mc (the spacing combining marks) are typically edited one at a time, and are not, like Mn (the non-spacing marks), fused into the cluster for editing, it's tough to come up with a *single* set of rules that works for all 7 of your contexts.
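
As a rough illustration of that last point, here is a minimal Python sketch of how cursor movement might split text if Mn and Me marks are fused with what precedes them while Mc marks are left as separate editing units. The function name `editing_units` and the exact rule are assumptions made for the example; real editors differ, and this ignores CGJ, Hangul jamo and everything else discussed above.

```python
import unicodedata


def editing_units(text):
    """Split text into cursor-movement units (illustrative assumption only).

    Mn and Me marks are fused onto the preceding unit; Mc marks and all
    other code points start a unit of their own.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Me"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


latin = editing_units("e\u0301")                  # e + COMBINING ACUTE ACCENT (Mn)
devanagari = editing_units("\u0915\u093e")        # KA + VOWEL SIGN AA (Mc)
enclosed = editing_units("a\u20dd\u0302")         # a + ENCLOSING CIRCLE (Me) + CIRCUMFLEX (Mn)
```

With this rule the accented e and the encircled a each come out as one unit, while the Devanagari consonant and its spacing vowel sign come out as two, which is exactly where a single definition of "cluster" stops fitting every context.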