Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

Kenneth Whistler Wed, 18 Sep 2002 14:32:56 -0700

William Overington asked:

> In the discussion about romanization of Cyrillic ligatures I asked how one
> expresses in Unicode the ts ligature with a dot above.
> 
> Regarding Ken's response to the Byzantine legal codes matter, it would
> appear possible that the way that the ts ligature with a dot above for
> romanization of Cyrillic could be represented in Unicode would be by the
> following sequence.
> 
> t U+FE20 s U+FE21 U+0307
> 
> The ordinary ts ligature for romanization of Cyrillic being expressed as
> follows.
> 
> t U+FE20 s U+FE21
>


As Peter indicated, the preferred way to represent this graphic ligature
tie in Unicode is with the double diacritics, i.e.:

t U+0361 s
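
For anyone who wants to inspect the two spellings side by side, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration; the names and combining classes it reports come from whatever Unicode data ships with the interpreter.

```python
import unicodedata

def describe(text):
    """Return (code point, character name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# Preferred spelling: one double diacritic between the two letters.
preferred = "t\u0361s"
# Compatibility spelling: one half mark after each letter.
compat = "t\uFE20s\uFE21"

for label, seq in (("preferred", preferred), ("compat", compat)):
    print(label)
    for code, name, ccc in describe(seq):
        print(f"  {code}  {name}  ccc={ccc}")
```

The double diacritic is a single combining mark placed between the two base letters, whereas the half marks each follow their own base letter.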

U+FE20 and U+FE21 are compatibility characters, provided for
interoperation with, in particular, USMARC catalog records that use the
Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). See:

http://lcweb.loc.gov/catdir/cpso/romanization/charsets/pdf
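
Note that the half marks carry no decomposition mapping, so normalization will not turn them into U+0361 for you. If data converted from ANSEL needs to be brought into the preferred form, something has to do it explicitly. The following is only a sketch under that assumption: `halves_to_double` is a hypothetical helper, and it handles only the simple pattern of one character under each half.

```python
import re
import unicodedata

HALF_LEFT = "\uFE20"    # COMBINING LIGATURE LEFT HALF
HALF_RIGHT = "\uFE21"   # COMBINING LIGATURE RIGHT HALF
DOUBLE_TIE = "\u0361"   # COMBINING DOUBLE INVERTED BREVE

# Matches: <any char> LEFT HALF <any char> RIGHT HALF
_HALVES = re.compile("(.)" + HALF_LEFT + "(.)" + HALF_RIGHT)

def halves_to_double(text):
    """Rewrite 'x U+FE20 y U+FE21' as 'x U+0361 y' (hypothetical helper)."""
    return _HALVES.sub(lambda m: m.group(1) + DOUBLE_TIE + m.group(2), text)

compat = "t" + HALF_LEFT + "s" + HALF_RIGHT

# Normalization leaves the half marks alone ...
print(unicodedata.normalize("NFKC", compat) == compat)
# ... so the conversion has to be done by hand.
print(halves_to_double(compat) == "t" + DOUBLE_TIE + "s")
```

Any further marks that follow the second half mark are simply left in place after the second letter; whether that is the right treatment for them is a separate question.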

> It appears to me that the ts ligature with a dot above, and a similar ng
> ligature with a dot above, are already needed for the Library of Congress
> romanization of Cyrillic system.
> 
> The following directory contains a lot of pdf files.
> 
> http://lcweb.loc.gov/catdir/cpso/romanization
> 
> The ts ligature with a dot above can be found on page 2 of the nonslav.pdf
> file.  The ng ligature with a dot above can be found on page 13 of the same
> file.

And, in particular, the ts ligature with a dot above is for an Abkhaz
romanization, and the ng ligature with a dot above is for an obsolete
Mansi (related to Khanty) romanization. I suspect their actual use
is pretty limited.

> 
> Capital letter versions of the two ligatures are needed as well.

Well, this is interesting, since these were *added*, systematically,
to the 1997 version of the ALA-LC non-Slavic romanization systems. The
1990 version did not have them.

That raises the question of whether these were simply editorial
extensions, or were actually *needed* for some bibliographical
data. I consider it unlikely that all of the capital forms were
suddenly discovered between 1990 and 1997 and that a whole bunch
of USMARC bibliographical records making use of the capital forms
were created during that interval.

In this regard, one should *read* the ALA-LC document. See charsets.pdf:

"The transliterations produced by applying ALA-LC Romanization Tables
are encoded in machine-readable form into USMARC records. Encoding of
the basic Latin alphabet, special characters, and character modifiers
listed in this publication is done in USMARC records following two
American National Standards; the Code for Information Interchange
(ASCII) (ANSI X3.4), and the Extended Latin Alphabet Coded Character
Set for Bibliographic Use (ANSEL) (ANSI Z39.47). Each character
is assigned a unique hexadecimal (base-16) code which identifies it
unambiguously for computer processing."

The current version of how that is done is the "MARC 21 Specifications
for Record Structure, Character Sets, and Exchange Media." Among other
things, that specification spells out how the combining marks are used with base
characters in USMARC records. 

I don't know, however, whether any provision was actually made in MARC 21
for these instances of ligature ties with dots above. Perhaps
someone familiar with the details of USMARC can answer that.

The USMARC records (using ANSEL) *would*, however, be making use
of the half ligature characters:

0xEB LIGATURE, FIRST HALF
0xEC LIGATURE, SECOND HALF

as well as:

0xE7 SUPERIOD [sic] DOT   (s.b. "SUPERIOR DOT")

It just isn't clear in exactly what order these would occur in any
hypothetical USMARC record actually using either the Abkhaz or
Mansi romanizations in question.
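
To make the ordering issue concrete, here is a rough Python sketch of decoding such a fragment. It rests on two assumptions that should be checked against the MARC 21 specification: that these three ANSEL bytes map to U+FE20, U+FE21 and U+0307 respectively, and that ANSEL combining marks precede the base character they modify (the reverse of Unicode). The function name `decode_ansel_fragment` is invented for this example, and it covers only ASCII base letters plus these three marks.

```python
# Assumed mapping for the three ANSEL combining bytes discussed above.
ANSEL_COMBINING = {
    0xEB: "\uFE20",  # LIGATURE, FIRST HALF  -> COMBINING LIGATURE LEFT HALF
    0xEC: "\uFE21",  # LIGATURE, SECOND HALF -> COMBINING LIGATURE RIGHT HALF
    0xE7: "\u0307",  # SUPERIOR DOT          -> COMBINING DOT ABOVE
}

def decode_ansel_fragment(data):
    """Decode ASCII plus the three marks above, moving each run of marks
    from before its base character (assumed ANSEL order) to after it."""
    out = []
    pending = []  # marks seen but not yet attached to a base character
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} not covered by this sketch")
    if pending:
        raise ValueError("combining mark with no following base character")
    return "".join(out)

# Plain ts ligature: first half, 't', second half, 's'.
plain = decode_ansel_fragment(bytes([0xEB, 0x74, 0xEC, 0x73]))
print([f"U+{ord(ch):04X}" for ch in plain])
```

Where the 0xE7 byte would sit relative to 0xEB and 0xEC in a real record is exactly the open question, so the sketch does not guess at a dotted example.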

> I wonder if consideration could please be given as to whether this matter
> should be left unregulated or whether some level of regulation should be
> used.

I think this should depend first on a determination of whether there
is a demonstrated need for an actual representation of these sequences --
which ought to be determined by the people responsible for the
data stores which might contain them, namely the online bibliographic
community.

The ALA-LC conventions are not the only alternatives available for
representation of Abkhaz and/or Khanty/Mansi data in romanization.
In fact, you can find such data on the web using alternative
romanizations. So it isn't as if the current gap in figuring out
precisely how, in Unicode, to represent a double diacritic with
another diacritic applied outside the visible double diacritic
on a digraph is preventing anyone from using romanized Abkhaz or
Khanty/Mansi data in interchange.
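
The gap can be seen by running two candidate spellings through normalization. This is only an illustration in Python using the standard `unicodedata` module; the two candidates are sequences picked for the example, not recommended representations.

```python
import unicodedata

def show(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Candidate A: dot above placed after the second letter.
candidate_a = "t\u0361s\u0307"
# Candidate B: dot above placed directly after the double diacritic.
candidate_b = "t\u0361\u0307s"

for label, seq in (("A", candidate_a), ("B", candidate_b)):
    nfc = unicodedata.normalize("NFC", seq)
    print(label, show(seq), "->", show(nfc))
```

Under NFC the dot composes with a single letter in both cases: with the s in candidate A, and (after canonical reordering) with the t in candidate B. Neither sequence, as plain text, says that the dot belongs over the digraph as a whole.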

--Ken

> 
> William Overington
> 
> 18 September 2002
