Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

Kenneth Whistler Wed, 18 Sep 2002 18:59:43 -0700

> >The ALA-LC conventions are not the only alternatives available for
> >representation of Abkhaz and/or Khanty/Mansi data in romanization.
> >In fact, you can find such data on the web using alternative
> >romanizations. So it isn't as if the current gap in figuring out
> >precisely how, in Unicode, to represent a double diacritic with
> >another diacritic applied outside the visible double diacritic
> >on a digraph is preventing anyone from using romanized Abkhaz or
> >Khanty/Mansi data in interchange.
> 
> By the same argument, Unicode might as well stop taking new characters; 
> surely, between the 500 Latin characters and dozens of punctuation marks 
> and combining characters and the other 70,000 characters, you can find 
> a way to communicate whatever language or data you need communicated.


Of course. Let them use ASCII, for that matter.

But that wasn't my point. There is no particular evidence
that the ALA-LC conventions with the dot above the graphic
ligature ties are in widespread use for romanizations of these
particular languages, that I can see. So the *urgency* of
solving this problem isn't there, unless the LC/library/bibliographic
community comes to the UTC and indicates that they have a data interchange
problem with USMARC records using ANSEL that requires a clear
representation solution in Unicode. And before we go there, I'd
like to have a clear specification of how it works in USMARC
records, so we would know how to do the following conversion:

    USMARC <--> Unicode

for the two forms in question.

The 1990 version of the LC romanizations for this non-Slavic stuff
used all kinds of hand-drawn forms. And even the 1997 version of
the ALA-LC document is photo-offset from pages that include various
kinds of pasteup from who-knows-what sources, including some
hand-drawn, with at least one of these dots above being added by
hand. So it isn't clear that there is any connection between the
ALA-LC document text and the ANSEL character encoding actually used
in the USMARC records; this could be arbitrary markup with some
system like TeX for publication.

BTW, if we are blueskying about this topic, the *elegant* way
to resolve this would be to recategorize all the double
diacritics as *enclosing* combining marks (Me), rather than
Mn, and then to rewrite the conventions for their use to
match those of the enclosing circle and such. Then they
would subtend (or supertend) any grapheme cluster, including
arbitrary digraphs indicated with a COMBINING GRAPHEME JOINER
character. And a dot above would neatly apply to the entire
subtended cluster, as for circled characters, and so on.
Of course, that would invalidate anybody's current
usage of the characters. Oh well, you can't win 'em all.
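
For anyone who wants to poke at the current state of affairs, here is a
minimal sketch in Python using the standard unicodedata module. It prints
the general category and canonical combining class of the characters
involved, then shows what canonical reordering does to a dot above typed
after a ligature tie. The choice of U+0361 as the tie, the t/s digraph,
and the helper name properties() are assumptions made for illustration
only; nothing here is taken from ALA-LC or ANSEL.

```python
import unicodedata

# Characters relevant to the discussion (illustrative choices).
DOUBLE_INVERTED_BREVE = "\u0361"  # ligature tie, a double diacritic
DOT_ABOVE = "\u0307"              # COMBINING DOT ABOVE
ENCLOSING_CIRCLE = "\u20DD"       # an existing enclosing mark
GRAPHEME_JOINER = "\u034F"        # COMBINING GRAPHEME JOINER


def properties(ch):
    """Return (name, general category, canonical combining class)."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)


for ch in (DOUBLE_INVERTED_BREVE, DOT_ABOVE, ENCLOSING_CIRCLE, GRAPHEME_JOINER):
    name, category, ccc = properties(ch)
    print(f"U+{ord(ch):04X} {name}: gc={category} ccc={ccc}")

# Typed order: t, ligature tie, dot above, s.
typed = "t" + DOUBLE_INVERTED_BREVE + DOT_ABOVE + "s"

# NFD applies canonical ordering: marks with nonzero combining classes
# are sorted by class, so the dot above (230) moves ahead of the tie (234).
normalized = unicodedata.normalize("NFD", typed)
print([f"U+{ord(c):04X}" for c in normalized])
```

The reordering in the last step is why simply typing the dot after the
tie does not survive normalization as "a dot outside the tie". An
enclosing mark has combining class 0, so it would not be reordered this
way.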

--Ken
