Peter Kirk wrote:

> > At 11:02 AM 7/13/2004, Peter Kirk wrote:
> >
> >> I was surprised to see that WG2 has accepted a proposal made by the
> >> US National Body to use CGJ to distinguish between Umlaut and Tréma
> >> in German bibliographic data.

And Asmus responded:

> > You raise some interesting questions. However, note that the purpose
> > of CGJ is intended for sorting related distinctions, which are at
> > issue here. This is different from variation selectors which are
> > intended to be used for displayed variations.

Note that the problem for German bibliographic records of distinguishing
umlaut from tréma was a longstanding issue for the German national body,
and was blocking them from cutover of German bibliographic systems from
ISO 5426 implementations to Unicode-based implementations.

The proposal that the U.S. national body made met the technical
requirements that the German national body had, breaking this logjam.
And unlike the original German proposal, it did not have massive
consequences for the representation of umlaut in other data and for
interoperating with German bibliographic systems.

So the fact that the proposal was acceptable and accepted by WG2
should not be too surprising. It solved a data representation problem
in a manner acceptable to all parties involved.

> OK. But this is not a unique case. For example, in Hebrew Silluq and
> Meteg, Dagesh and Shuruq are pairs of different marks which share a
> glyph and so a Unicode character but may need to be distinguished for
> certain processes.

Can you show a pre-existing ISO character encoding standard, such as
ISO 5426, for which there are bibliographic implementations whose
conversion to Unicode is blocked by an encoding distinction not
maintained in Unicode for these particular cases?

If so, then you would have an analogous situation.

If not, then you are simply talking about functional distinctions
for the same encoded diacritic, which might need to be maintained
for some kinds of processing, and for which people can use whatever
kinds of conventions they see fit to deal with the issue -- but the
issue doesn't rise to the level of an encoding issue requiring formal
intervention by WG2, in my opinion.

This is a little like noting that U+0301 COMBINING ACUTE ACCENT,
when applied to Latin letters, might under some circumstances
represent a stress, under others a pitch accent, under others a
formal tonemic distinction, under others a vocalic length distinction,
and under others a change in vowel quality. Such distinctions might
be relevant to many different kinds of textual processing concerned
with linguistic effects, but they are not a character encoding issue.

> Should similar encodings with CGJ be proposed to make
> these distinctions?

If formal maintenance of a collation distinction between two
otherwise identically *appearing* pieces of text -- based on
whatever analytic status of the text is relevant -- is at issue,
then representing one sequence with CGJ and one without is a way
recommended by the Unicode Standard to introduce a distinction
which a tailored collation can then weight differently to get the
required collation difference.

> So I must agree with Doug that
> "CGJ + COMBINING DIAERESIS is a hack".

It is simply a way to maintain a distinction needed for German
bibliographic data to behave as required, while representing
their data in Unicode. Call it a hack if you like, but it
satisfied the relevant parties as an appropriate means for
representing the data in question.

> 256 variation selectors won't do if they have all been defined
> unchangeably with the wrong properties e.g combining class. On the other
> hand, if the UTC is prepared to ignore the combining class and
> normalisation problems involved in using one combining class zero
> character, CGJ, to modify a combining mark,

This completely misconstrues the solution in question for the
German umlaut and tréma in bibliographic records. The CGJ is
not introduced "to modify a combining mark".

Instead, two text elements required to be distinguished in German
bibliographic data are represented by two distinct sequences:

   X + COMBINING DIAERESIS

   X + CGJ + COMBINING DIAERESIS

This is completely in keeping with the intent of the CGJ in
the standard, and the proposal did not, in any way, "ignore
the combining class and normalisation problems" in this case.

...

Which, by the way, is why the solution met with unanimous approval
in WG2, without objection from the UTC liaison.

> it may as well ignore the
> identical problems involved in using variation selectors, also combining
> class zero, with combining marks.

What you have been suggesting to do, however, *does* advocate ignoring
the problems involved in attempting to use variation selectors
to formally distinguish variants of combining marks.

--Ken
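
The two sequences discussed in this message can be checked with a minimal Python sketch using the standard unicodedata module. It shows that CGJ (U+034F) has combining class zero, that the sequence with CGJ stays distinct from the plain one under all four normalization forms, and how a tailored collation might weight the two differently. The helper names, the numeric weights, and the toy sort key are all hypothetical; they are not taken from the message, from WG2 documents, or from any real collation tailoring. The message also does not say which sequence stands for umlaut and which for tréma, so the code does not label them.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def plain_mark(base):
    """X + COMBINING DIAERESIS (hypothetical helper name)."""
    return base + DIAERESIS


def cgj_mark(base):
    """X + CGJ + COMBINING DIAERESIS (hypothetical helper name)."""
    return base + CGJ + DIAERESIS


def stays_distinct(base):
    """True if the two sequences differ in every normalization form."""
    return all(
        unicodedata.normalize(form, plain_mark(base))
        != unicodedata.normalize(form, cgj_mark(base))
        for form in FORMS
    )


def toy_sort_key(text):
    """Toy two-level key, NOT a real collation algorithm.

    Primary level: base letters, case-folded.
    Secondary level: 0 per base letter, 1 for a plain diaeresis,
    2 for CGJ + diaeresis. The weights 1 and 2 are arbitrary assumptions.
    """
    decomposed = unicodedata.normalize("NFD", text)
    primary, secondary = [], []
    i = 0
    while i < len(decomposed):
        if decomposed.startswith(CGJ + DIAERESIS, i):
            secondary.append(2)
            i += 2
        elif decomposed[i] == DIAERESIS:
            secondary.append(1)
            i += 1
        elif decomposed[i] == CGJ or unicodedata.combining(decomposed[i]):
            i += 1  # other marks and stray CGJs are ignored in this toy
        else:
            primary.append(decomposed[i].casefold())
            secondary.append(0)
            i += 1
    return ("".join(primary), secondary)
```

For example, sorted(names, key=toy_sort_key) with names built from plain_mark("Mu") + "ller" and cgj_mark("Mu") + "ller" keeps the two spellings adjacent but in a fixed order, which is the kind of difference a tailored collation could supply. A production system would use a real collation implementation rather than this key.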