William Overington asked:

> In the discussion about romanization of Cyrillic ligatures I asked how one
> expresses in Unicode the ts ligature with a dot above.
>
> Regarding Ken's response to the Byzantine legal codes matter, it would
> appear possible that the way that the ts ligature with a dot above for
> romanization of Cyrillic could be represented in Unicode would be by the
> following sequence.
>
> t U+FE20 s U+FE21 U+0307
>
> The ordinary ts ligature for romanization of Cyrillic being expressed as
> follows.
>
> t U+FE20 s U+FE21
>
As Peter indicated, the preferred way to represent this graphic
ligature tie in Unicode is with the double diacritics, i.e.:

t U+0361 s

U+FE20 and U+FE21 are compatibility characters, for interoperation,
in particular, with the USMARC catalog records using the Extended
Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). See:

http://lcweb.loc.gov/catdir/cpso/romanization/charsets/pdf

> It appears to me that the ts ligature with a dot above, and a similar ng
> ligature with a dot above, are already needed for the Library of Congress
> romanization of Cyrillic system.
>
> The following directory contains a lot of pdf files.
>
> http://lcweb.loc.gov/catdir/cpso/romanization
>
> The ts ligature with a dot above can be found on page 2 of the nonslav.pdf
> file. The ng ligature with a dot above can be found on page 13 of the same
> file.

And, in particular, the ts ligature with a dot above is for an
Abkhaz romanization, and the ng ligature with a dot above is for
an obsolete Mansi (related to Khanty) romanization. I suspect their
actual use is pretty limited.

>
> Capital letter versions of the two ligatures are needed as well.

Well, this is interesting, since these were *added*, systematically,
to the 1997 version of the ALA-LC non-Slavic romanization systems.
The 1990 version did not have them. That raises the question of
whether these were simply editorial extensions, or were actually
*needed* for some bibliographical data. I consider it unlikely that
all of the capital forms were suddenly discovered between 1990 and
1997 and that a whole bunch of USMARC bibliographical records making
use of the capital forms were created during that interval.

In this regard, one should *read* the ALA-LC document. See charsets.pdf:

"The transliterations produced by applying ALA-LC Romanization Tables
are encoded in machine-readable form into USMARC records.
Encoding of the basic Latin alphabet, special characters, and character
modifiers listed in this publication is done in USMARC records following
two American National Standards; the Code for Information Interchange
(ASCII) (ANSI X3.4), and the Extended Latin Alphabet Coded Character Set
for Bibliographic Use (ANSEL) (ANSI Z39.47). Each character is assigned
a unique hexadecimal (base-16) code which identifies it unambiguously
for computer processing."

The current version of how that is done is the "MARC 21 Specifications
for Record Structure, Character Sets, and Exchange Media." Among other
things, that specification spells out how the combining marks are used
with base characters in USMARC records.

I don't know, however, whether any provision was actually made in
MARC 21 for these instances of ligature ties with dots above. Perhaps
someone familiar with the details of USMARC can answer that.

The USMARC records (using ANSEL) *would*, however, be making use of
the half ligature characters:

0xEB LIGATURE, FIRST HALF
0xEC LIGATURE, SECOND HALF

as well as:

0xE7 SUPERIOD [sic] DOT (s.b. "SUPERIOR DOT")

It just isn't clear exactly what order these would occur in, in any
hypothetical USMARC record actually using either the Abkhaz or Mansi
romanizations in question.

> I wonder if consideration could please be given as to whether this matter
> should be left unregulated or whether some level of regulation should be
> used.

I think this should depend first on a determination of whether there
is a demonstrated need for an actual representation of these
sequences -- which ought to be determined by the people responsible
for the data stores which might contain them, namely the online
bibliographic community.

The ALA-LC conventions are not the only alternatives available for
representation of Abkhaz and/or Khanty/Mansi data in romanization.
In fact, you can find such data on the web using alternative
romanizations.
So it isn't as if the current gap in figuring out precisely
how, in Unicode, to represent a double diacritic with another
diacritic applied outside the visible double diacritic on a digraph
is preventing anyone from using romanized Abkhaz or Khanty/Mansi
data in interchange.

--Ken

>
> William Overington
>
> 18 September 2002
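
For anyone who wants to look at the sequences discussed in this
message, here is a minimal sketch, assuming Python's standard
unicodedata module. The labels and the helper name describe() are
illustrative only and come from neither Unicode nor MARC 21. It just
lists the character name and canonical combining class of each code
point; it does not settle how a dot above outside the ligature tie
ought to be represented.

```python
import unicodedata

# Candidate sequences mentioned in the message. The third one is only the
# proposal quoted from William; whether it is appropriate is the open question.
SEQUENCES = {
    "ts with double diacritic (preferred)": "t\u0361s",
    "ts with half marks (compatibility)": "t\uFE20s\uFE21",
    "ts with half marks plus dot above (proposed)": "t\uFE20s\uFE21\u0307",
}


def describe(text):
    """Return (code point, character name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for label, text in SEQUENCES.items():
    print(label)
    for code_point, name, ccc in describe(text):
        print(f"  {code_point}  {name}  (ccc={ccc})")
```

In the output, the double diacritic U+0361 shows a different combining
class from the half marks and the dot above, which is one reason the
two spellings are not interchangeable under normalization.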