Re: GRAPHEME JOINER vs. double diacritics

2002-01-08 Thread Kenneth Whistler

O.k., o.k., as Kent and Mark have pointed out, I've
already managed to make my first significant error of the new year.

The intent and wording of the PDUTR #28 text on the CGJ is best
stated in the Article II.3.9 Application of Combining Marks --
a section I overlooked in responding previously to Eric Muller's
query.

The problem, of course, is that if you start to apply ordinary
combining marks to entire grapheme clusters comprised of sequences
with the CGJ, you run afoul of canonical equivalences involving
those combining marks. The same thing does not apply for the
enclosing combining marks, since there are no canonical equivalences
involving those combining marks.

So, taking that into consideration, here is my restatement of
what I think ought to happen for the three possible cases for
the ng-tilde:


1. <U+006E, U+0360, U+0067>
2. <U+006E, U+FE22, U+0067, U+FE23>
3. <U+006E, U+034F, U+0067, U+0303>

1. uses the double-diacritic tilde, which nominally applies merely to
   the U+006E, but would be designed to lie over the top of a following
   base character on display.

2. uses the compatibility combining double-tilde halves. These occur
   in legacy bibliographic data records. In principle, 2 should display
   in the same way as 1, but would be recommended only for interoperating
   with the legacy data.

3. uses the grapheme joiner to create a "grapheme cluster", which in
   this case would be the digraph "ng". Unlike 1 and 2, the tilde would
   apply only to the "g", so that 3 would not display the same as 1 or 2.
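
For readers who want to inspect the three spellings directly, here is a
minimal sketch that lists the code points in each. It assumes the sequences
implied by the descriptions above: U+0360 for the double tilde, U+FE22 and
U+FE23 for the left and right tilde halves, and U+034F followed by U+0303
for the CGJ case. The dictionary labels and the helper name `describe` are
made up for this illustration.

```python
import unicodedata

# Assumed spellings of the ng-tilde, following cases 1-3 above.
# The labels are illustrative only.
NG_TILDE = {
    "1 double tilde": "n\u0360g",
    "2 tilde halves": "n\uFE22g\uFE23",
    "3 CGJ + tilde": "n\u034Fg\u0303",
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, sequence in NG_TILDE.items():
    print(label)
    for line in describe(sequence):
        print("   ", line)
```

The listing only shows what is stored; how each sequence is drawn still
depends on the font and the rendering engine.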


To illustrate the canonical equivalence question, consider:

1a. <U+0061, U+0061, U+0301> ==> aá
1b. <U+0061, U+00E1> ==> aá

1a and 1b are canonically equivalent sequences, and should display
the same.

2a. <U+0061, U+034F, U+0061, U+0301>
2b. <U+0061, U+034F, U+00E1>

Now if we insert a CGJ between the two a's, the resulting
sequences 2a and 2b are still canonically equivalent, and should display
the same. If, however, we say that creating an "aa" grapheme cluster
changes the context over which the following acute accent will display,
then we have a situation where canonically equivalent sequences have
consistently different display (and possibly interpretation). That
wouldn't be a good thing -- hence the wording in PDUTR #28 to preclude
the application of combining marks to other than the base character
they follow (except for enclosing combining marks or other
specified exceptions).
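
The equivalence claim can be checked mechanically. The following is a
minimal sketch, assuming the decomposed and precomposed spellings of
"a, a-acute" for the first pair and the same two spellings with U+034F
inserted between the a's for the second pair. The helper name
`canonically_equivalent` is made up for this illustration; it compares
the NFD forms, which is one standard way to test canonical equivalence.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def canonically_equivalent(left, right):
    """Two strings are canonically equivalent if their NFD forms match."""
    nfd_left = unicodedata.normalize("NFD", left)
    nfd_right = unicodedata.normalize("NFD", right)
    return nfd_left == nfd_right


# Pair 1: combining acute after the second "a" vs. precomposed U+00E1.
pair_1 = ("aa\u0301", "a\u00E1")

# Pair 2: the same two spellings with a CGJ between the two a's.
pair_2 = ("a" + CGJ + "a\u0301", "a" + CGJ + "\u00E1")

print(canonically_equivalent(*pair_1))
print(canonically_equivalent(*pair_2))
```

Both calls print True: the CGJ does not stop the acute from composing
with the "a" it follows, so a renderer that treated the two members of
pair 2 differently would be displaying equivalent data inconsistently.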

--Ken







Re: GRAPHEME JOINER vs. double diacritics

2002-01-04 Thread Vladimir Ivanov

Eric Muller wrote:

> Is it correct that the sequences U+x U+0360 U+y and U+x U+034F U+y
> U+0303 should display the same? Would it be worth putting some words
> about those situations in section 13.2 of PDUTR #28?

In what font can we find U+034F? In Arial Unicode MS, U+0345 is followed
immediately by U+0360.
Can we have the full path to section 13.2 of PDUTR #28?

Thank you,
Vladimir Ivanov





Re: GRAPHEME JOINER vs. double diacritics

2002-01-03 Thread Kenneth Whistler

Eric Muller asked:

> Is it correct that the sequences U+x U+0360 U+y and U+x U+034F U+y
> U+0303 should display the same? Would it be worth putting some words
> about those situations in section 13.2 of PDUTR #28?

I think that that should be the case, given the current definitions.

In particular, if U+x = U+006E "n" and U+y = U+0067 "g", you would
get the following three possibilities for writing the Tagalog ng-tilde:

1. <U+006E, U+0360, U+0067>
2. <U+006E, U+FE22, U+0067, U+FE23>
3. <U+006E, U+034F, U+0067, U+0303>

1. uses the double-diacritic tilde, which nominally applies merely to
   the U+006E, but would be designed to lie over the top of a following
   base character on display.

2. uses the compatibility combining double-tilde halves. These occur
   in legacy bibliographic data records. In principle, 2 should display
   in the same way as 1, but would be recommended only for interoperating
   with the legacy data.

3. uses the grapheme joiner to create a "grapheme cluster", which in
   this case would be the digraph "ng". A rendering engine savvy to
   grapheme cluster status should then attempt to apply a following
   combining mark, in this case a regular combining tilde, to the entire 
   grapheme cluster, rather than simply to the preceding base character.

While these are three alternative ways of representing the "same thing",
we aren't talking about canonical equivalences here. 3 creates a
grapheme cluster (which could have implications for other processing),
while 1 and 2 do not. For example, if I added U+0301 (combining acute)
after each of the above sequences, 1 would put the acute on the "g"
(and might result in overlap with the right half of the double tilde);
2 would put the acute over the right-half tilde on the "g"; 3 should
put the acute midships over the stretched tilde applying to the digraph.
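
That these are three distinct encodings, not canonical equivalents, can
be confirmed with a short check. This is only a sketch: the list name
`SEQUENCES` and the helper `nfd` are made up here, and the code points
assume U+FE22/U+FE23 for the tilde halves. It normalizes each spelling
and verifies that no two collapse to the same string, with and without
a trailing U+0301.

```python
import unicodedata

# Assumed spellings for cases 1, 2 and 3 of the ng-tilde.
SEQUENCES = [
    "n\u0360g",        # 1: double tilde
    "n\uFE22g\uFE23",  # 2: left and right tilde halves
    "n\u034Fg\u0303",  # 3: CGJ, then an ordinary combining tilde
]

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def nfd(text):
    """Canonical decomposition; equivalent strings share one NFD form."""
    return unicodedata.normalize("NFD", text)


plain = {nfd(s) for s in SEQUENCES}
with_acute = {nfd(s + ACUTE) for s in SEQUENCES}

# Three distinct results in each set means no canonical equivalence.
print(len(plain), len(with_acute))

# The CGJ has combining class 0, so marks are never reordered across it.
print(unicodedata.combining("\u034F"))
```

Normalization says nothing about where the acute is drawn; that part
remains a matter for the rendering engine.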

2 is used for interoperating with legacy
bibliographic data, while 1 and 3 are not. And there are quite likely
to be other small formatting differences between the three options. In the
real world it is unlikely that you will run into a "perfect" rendering
engine that would produce exactly the same image from each of the
sequences.

The combining grapheme joiner is the best answer that Unicode currently
has for the extensibility problem for unusual accent placements over
(or under) groups of letters, where the existing compatibility answers
(U+0360..U+0362 for double diacritics; U+FE20..U+FE23 for diacritic halves)
aren't sufficient. For example, it makes it possible to represent a
double breve or a double macron over (as seen in some American dictionary
orthographies) or a double (or triple) underline under (as seen in some
transliterations).

--Ken