John Cowan asked:

> I would like to ask the old farts^W^Wrespected elders of the UTC
> which principle they consider more important, abstractly speaking:
> the principle that combining marks always follow their base characters
> (a typographical principle), or that text is stored, with a few minor
> exceptions, in phonetic order (a lexicographical principle).

As is often the case with such hypothetical questions, I
think a false dichotomy is being presumed here.

The principle of the order of combining marks results from the
need to resolve the following architectural question for the
standard:

   Does a combining mark apply to the base character that
   precedes it or to the base character that follows it?
   
   In other words, does á = <0061, 0301> or does á = <0301, 0061>?
   
There can only be one right answer to that question, while having
a coherent, interoperable character encoding standard.

The choice that the Unicode architects made on this principle in
1989 is sacrosanct and inviolable.
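
For anyone who wants to see the consequence of that choice in
running code, here is a minimal sketch using Python's standard
unicodedata module. The variable names are just for illustration.
It assumes nothing beyond the fact that NFC normalization composes
a base letter with a combining mark that follows it, and has
nothing to compose when the mark comes first:

```python
import unicodedata

# Base character first, combining mark second: the order the standard chose.
base_then_mark = "\u0061\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
# The rejected order: the mark has no preceding base to attach to.
mark_then_base = "\u0301\u0061"

composed = unicodedata.normalize("NFC", base_then_mark)
not_composed = unicodedata.normalize("NFC", mark_then_base)

print([f"{ord(c):04X}" for c in composed])      # a single precomposed character
print([f"{ord(c):04X}" for c in not_composed])  # still two separate characters
```

Only the first sequence is canonically equivalent to the
precomposed U+00E1; the second stays as an isolated mark followed
by a plain "a".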

The principle of logical order of encoding results from the
need to resolve the following architectural question for the
standard:

   Is a right-to-left script encoded in visual order in
   the backing store or in phonetic (= logical) order?
   
   In other words, is "tsava" spelled <05E6, 05D1, 05D0> or
   <05D0, 05D1, 05E6>?
   
There can only be one right answer to that question, while having
a coherent, interoperable character encoding standard.

The choice that the Unicode architects made on this principle in
1989 is sacrosanct and inviolable.
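
As a small illustration (a sketch, not anything normative), the
logical-order spelling of "tsava" can be inspected with Python's
unicodedata module. The variable names are hypothetical. The
string is stored tsadi, bet, alef, in the order the letters are
read, and the right-to-left display is left to the bidirectional
algorithm at rendering time:

```python
import unicodedata

# "tsava" in logical (reading) order in the backing store.
tsava = "\u05E6\u05D1\u05D0"

# Character names, in storage order.
names = [unicodedata.name(c) for c in tsava]
# Bidirectional class of each character; "R" means strong right-to-left.
bidi_classes = [unicodedata.bidirectional(c) for c in tsava]

for c, name, bidi in zip(tsava, names, bidi_classes):
    print(f"U+{ord(c):04X}  {bidi}  {name}")
```

Nothing in the stored sequence is reversed; the directionality
lives in the character properties, not in the order of the code
points.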

Everything else is just working out the details for making actual
script encodings consistent in the context of those overarching
principles. The status of a character as combining or not is
up for grabs, depending on the analysis of a script's behavior
and how it should be represented. And the layout for actual
display of rendered texts does not, and never has, slavishly
followed logical order in lockstep.
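
A familiar Devanagari case makes the last point concrete. The
sketch below (Python, standard unicodedata module, illustrative
variable names) looks at the syllable "ki": the consonant KA is
stored first and the vowel sign I second, even though a conforming
renderer draws the vowel sign to the left of the consonant:

```python
import unicodedata

# Syllable "ki": consonant first, dependent vowel sign second (logical order).
ki = "\u0915\u093F"

names = [unicodedata.name(c) for c in ki]
# General category of the vowel sign: "Mc" is a spacing combining mark.
vowel_category = unicodedata.category(ki[1])
# Canonical combining class of the vowel sign.
vowel_ccc = unicodedata.combining(ki[1])

print(names)
print(vowel_category, vowel_ccc)
```

The vowel sign is a combining mark, so it follows its base in the
backing store. Where its glyph ends up on the page is a rendering
matter.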

Again, everyone, if you haven't already, go back and meditate
some more on the fundamental mandala of Unicode: Figure 2-3,
Unicode Character Code to Rendered Glyphs, which illustrates
both issues of combining mark order with respect to base
character and general logical order of characters as applied
to a particular script encoding (Devanagari).

And don't miss the following piece of text associated with that
figure:

  "The Unicode Standard documents the default relationship
   between character sequences and glyphic appearance for the
   purpose of ensuring that the same text content can be
   stored with the same, and therefore interchangeable,
   sequence of character codes."
   
This should, IMO, be put up on a pedestal and have the spotlights
shined on it. This is the *fundamental* obligation of a character
encoding standard. If you cannot accomplish this, then you just
have a bunch of charts full of pretty pictures, and everyone is
on their own for trying to figure out how to communicate with
anybody else using them.

> As someone or other said, "I believe that hitherto -- *hitherto,* mark
> you -- [we have] entirely overlooked the existence of", well, scripts
> that might cause a conflict between these esteemed principles.

The reason why the UTC should tackle the encoding of Tengwar
is not so much because it would help in the publication of Elvish
poetry, but because confronting the architectural issues
it poses for encoding would make an excellent tutorial case
for how the two principles of combining mark order and
logical order impact the task of coming up with an appropriate
encoding for a complex script. And it would starkly illustrate
the fact that an appropriate character encoding does not
necessarily directly reflect the phonological structure of
a language as represented by that script.

--Ken


