John Cowan asked:

> I would like to ask the old farts^W^Wrespected elders of the UTC
> which principle they consider more important, abstractly speaking:
> the principle that combining marks always follow their base characters
> (a typographical principle), or that text is stored, with a few minor
> exceptions, in phonetic order (a lexicographical principle).

As may often be the case in such hypothetical questions, I think there is a false dichotomy presumed here.

The principle of the order of combining marks results from the need to resolve the following architectural question for the standard:

Does a combining mark apply to the base character that precedes it or to the base character that follows it? In other words, does á = <0061, 0301> or does á = <0301, 0061>?

There can be only one right answer to that question if we are to have a coherent, interoperable character encoding standard. The choice that the Unicode architects made on this principle in 1989 is sacrosanct and inviolable.

The principle of logical order of encoding results from the need to resolve the following architectural question for the standard:

Is a right-to-left script encoded in visual order in the backing store or in phonetic (= logical) order? In other words, is "tsava" spelled <05E6, 05D1, 05D0> or <05D0, 05D1, 05E6>?

There can be only one right answer to that question if we are to have a coherent, interoperable character encoding standard. The choice that the Unicode architects made on this principle in 1989 is sacrosanct and inviolable.

Everything else is just working out the details for making actual script encodings consistent in the context of those overarching principles. The status of a character as combining or not is up for grabs, depending on the analysis of a script's behavior and how it should be represented. And the layout for actual display of rendered texts does not, and never has, slavishly followed logical order in lockstep.

Again, everyone, if you haven't already, go back and meditate some more on the fundamental mandala of Unicode: Figure 2-3, Unicode Character Code to Rendered Glyphs, which illustrates both issues of combining mark order with respect to base character and general logical order of characters as applied to a particular script encoding (Devanagari).

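For anyone who wants to poke at these two principles from a keyboard, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration, not anything taken from the standard itself; the variable names are made up for the example, and the Devanagari sequence <0915, 093F> (KA + VOWEL SIGN I) is my own choice of a case where the stored order and the rendered order differ.

```python
import unicodedata

# Combining mark order: the mark applies to the base that precedes it.
# <0061, 0301> composes to U+00E1 under NFC; <0301, 0061> does not.
base_then_mark = "\u0061\u0301"
mark_then_base = "\u0301\u0061"
composed = unicodedata.normalize("NFC", base_then_mark)
not_composed = unicodedata.normalize("NFC", mark_then_base)
print(len(composed), len(not_composed))

# Logical order: "tsava" is stored tsadi, bet, alef, i.e. in the order
# the letters are read, even though they are displayed right to left.
tsava = "\u05E6\u05D1\u05D0"
names = [unicodedata.name(c) for c in tsava]
bidi = [unicodedata.bidirectional(c) for c in tsava]
print(names, bidi)

# Devanagari "ki": the consonant KA is stored first and the vowel sign I
# follows it in the backing store, although the vowel sign is drawn to the
# left of the consonant. Display order is the renderer's business.
ki = "\u0915\u093F"
ki_names = [unicodedata.name(c) for c in ki]
ki_categories = [unicodedata.category(c) for c in ki]
print(ki_names, ki_categories)
```

The stored sequence is the same no matter how a given renderer lays out the glyphs, which is the point of keeping the backing store in logical order.
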
And don't miss the following piece of text associated with that figure:

"The Unicode Standard documents the default relationship between character sequences and glyphic appearance for the purpose of ensuring that the same text content can be stored with the same, and therefore interchangeable, sequence of character codes."

This should, IMO, be put up on a pedestal and have the spotlights shined on it. This is the *fundamental* obligation of a character encoding standard. If you cannot accomplish this, then you just have a bunch of charts full of pretty pictures, and everyone is on their own for trying to figure out how to communicate with anybody else using them.

> As someone or other said, "I believe that hitherto -- *hitherto,* mark
> you -- [we have] entirely overlooked the existence of", well, scripts
> that might cause a conflict between these esteemed principles.

The reason why the UTC should tackle the encoding of Tengwar is not so much because it would help in the publication of Elvish poetry, but because confronting the architectural issues it poses for encoding would make an excellent tutorial case for how the two principles of combining mark order and logical order impact the task of coming up with an appropriate encoding for a complex script. And it would starkly illustrate the fact that an appropriate character encoding does not necessarily directly reflect the phonological structure of a language as represented by that script.

--Ken