Re: Umlaut and Tréma, was: Variation selectors and vowel marks

Kenneth Whistler Wed, 14 Jul 2004 11:07:33 -0700

Peter Kirk wrote:

> > At 11:02 AM 7/13/2004, Peter Kirk wrote:
> >
> >> I was surprised to see that WG2 has accepted a proposal made by the 
> >> US National Body to use CGJ to distinguish between Umlaut and Tréma 
> >> in German bibliographic data.


And Asmus responded:

> > You raise some interesting questions. However, note that CGJ is 
> > intended for sorting-related distinctions, which are at 
> > issue here. This is different from variation selectors, which are 
> > intended to be used for displayed variations.

Note that the problem for German bibliographic records of
distinguishing umlaut from tréma was a longstanding issue for
the German national body, and was blocking them from cutover of
German bibliographic systems from ISO 5426 implementations to
Unicode-based implementations.

The proposal that the U.S. national body made met the technical
requirements that the German national body had, breaking this
logjam. And unlike the original German proposal, it did not
have massive consequences for the representation of umlaut in
other data and for interoperating with German bibliographic systems.

So the fact that the proposal was acceptable and accepted by WG2
should not be too surprising. It solved a data representation
problem in a manner acceptable to all parties involved.
 
> OK. But this is not a unique case. For example, in Hebrew Silluq and 
> Meteg, Dagesh and Shuruq are pairs of different marks which share a 
> glyph and so a Unicode character but may need to be distinguished for 
> certain processes. 

Can you show a pre-existing ISO character encoding standard, such
as ISO 5426, for which there are bibliographic implementations
whose conversion to Unicode is blocked by an encoding distinction
not maintained in Unicode for these particular cases? If so, then
you would have an analogous situation. If not, then you are simply
talking about functional distinctions for the same encoded diacritic,
which might need to be maintained for some kinds of processing,
for which people can use whatever kinds of conventions they see
fit to deal with the issue -- but the issue doesn't rise to the
level of an encoding issue requiring formal intervention by WG2,
in my opinion.

This is a little like noting that U+0301 COMBINING ACUTE ACCENT,
when applied to Latin letters, might under some circumstances
represent a stress, under others a pitch accent, under others a
formal tonemic distinction, under others a vocalic length
distinction, and under others a change in vowel quality. Such
distinctions might be relevant to many different kinds of
textual processing concerned with linguistic effects, but it
is not a character encoding issue.

> Should similar encodings with CGJ be proposed to make 
> these distinctions? 

If formal maintenance of a collation distinction between two
otherwise identically *appearing* pieces of text -- based on
whatever analytic status of the text is relevant -- is at issue,
then representation of one sequence with CGJ and one without
is a way recommended by the Unicode Standard to introduce a
distinction which a tailored collation can then weight differently
to get the required collation difference.

> So I must agree with Doug that 
> "CGJ + COMBINING DIAERESIS is a hack".

It is simply a way to maintain a distinction needed for German
bibliographic data to behave as required, while representing
their data in Unicode. Call it a hack if you like, but it
satisfied the relevant parties as an appropriate means for
representing the data in question.


> 256 variation selectors won't do if they have all been defined 
> unchangeably with the wrong properties, e.g. combining class. On the other 
> hand, if the UTC is prepared to ignore the combining class and 
> normalisation problems involved in using one combining class zero 
> character, CGJ, to modify a combining mark, 

This completely misconstrues the solution in question for the
German umlaut and tréma in bibliographic records. The CGJ is
not introduced "to modify a combining mark". Instead, two
text elements required to be distinguished in German bibliographic
data are represented by two distinct sequences:

X + COMBINING DIAERESIS
X + CGJ + COMBINING DIAERESIS

This is completely in keeping with the intent of the CGJ in the
standard, and the proposal did not, in any way, "ignore the
combining class and normalisation problems" in this case.
... Which, by the way, is why the solution met with unanimous
approval in WG2, without objection from the UTC liaison.
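
The behavior of the two sequences under normalization can be checked
with Python's standard unicodedata module. This is only a small sketch,
using "u" as an arbitrary stand-in for the base letter X; the variable
names are illustrative.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

plain = "u" + DIAERESIS           # X + COMBINING DIAERESIS
with_cgj = "u" + CGJ + DIAERESIS  # X + CGJ + COMBINING DIAERESIS

# NFC composes the plain sequence into the precomposed letter U+00FC.
nfc_plain = unicodedata.normalize("NFC", plain)

# CGJ sits between the base and the mark, so nothing composes and
# nothing is reordered; the sequence survives NFC and NFD unchanged.
nfc_with_cgj = unicodedata.normalize("NFC", with_cgj)
nfd_roundtrip = unicodedata.normalize("NFD", nfc_with_cgj)

# Canonical combining classes of the two marks involved.
cgj_class = unicodedata.combining(CGJ)
diaeresis_class = unicodedata.combining(DIAERESIS)

print(nfc_plain == "\u00fc", nfc_with_cgj == with_cgj)
print(cgj_class, diaeresis_class)
```

The two sequences therefore stay distinct from each other through
normalization, which is what lets a later process such as a tailored
collation tell them apart.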

> it may as well ignore the 
> identical problems involved in using variation selectors, also combining 
> class zero, with combining marks.

What you have been suggesting to do, however, *does* advocate
ignoring the problems involved in attempting to use variation
selectors to formally distinguish variants of combining marks.

--Ken
