Roozbeh asked:

> > Expecting the compatibility decompositions to serve this purpose
> > effectively is overvaluing what they can actually do.
>
> I would love to hear your opinion about what compatibility decompositions
> *are* for, then. I feel a little confused here.
They are helpful annotations to an earlier version of the standard that got
swept up first by changing expectations and then were caught in a normative
stasis trap by the normalization specification.

Originally, they were a shorthand way of saying things like: "This character
is not really a 'good' Unicode character -- it should be thought of as a font
variant of X." "This character is not really a 'good' Unicode character -- it
should be thought of as effectively representing the sequence of X, Y, and
Z." And so on.

The terminology of "compatibility character" confused everyone, including the
people writing the standard. On the one hand, it meant characters that didn't
really fit the Unicode text model, but which were encoded for compatibility
with important standards -- mostly for ease of round-trip conversions. On the
other hand, it came to mean characters that had compatibility decompositions,
once those were officially specified in the Unicode 2.0 publication, since
most "compatibility characters" had "compatibility decompositions". The
situation was further confused by the abortive early attempt to encode
"compatibility characters" in a "compatibility zone", which led people to
assume that if a character was in that zone it automatically *was* a
compatibility character and (later) that it should also have a compatibility
decomposition.

However, compatibility decompositions were originally assigned pretty much by
a seat-of-the-pants method, without a clear implementation model to guide all
of the decisions. As the UTC approached the critical milestone of Unicode 3.0
(and normalization), many of the earlier decompositions were refined and
further rationalized, but they still retained some of the helter-skelter
flavor of their annotational origins. The intuition was that the
compatibility decompositions "sort of" made sense for such things as
fallback, loose comparison (e.g. for collation and searching), normalization,
and the like.
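The "annotation" flavor described above is still visible in the character
database: each compatibility decomposition carries a formatting tag such as
`<font>` or `<compat>` that records the original shorthand. A minimal sketch
using Python's stdlib `unicodedata`, which exposes that field directly:

```python
# Compatibility decompositions as annotations: the tag records whether the
# character was considered a font variant, a ligature-like sequence, etc.
import unicodedata

for ch in ['\u2102',   # DOUBLE-STRUCK CAPITAL C: "a font variant of C"
           '\uFB01',   # LATIN SMALL LIGATURE FI: "effectively f + i"
           '\u00BD']:  # VULGAR FRACTION ONE HALF: "effectively 1/2"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.decomposition(ch)!r}")
# U+2102 ...: '<font> 0043'
# U+FB01 ...: '<compat> 0066 0069'
# U+00BD ...: '<fraction> 0031 2044 0032'
```

Characters with no decomposition at all return an empty string here, which is
the database-level form of "no compatibility mapping".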
However, when detailed specifications started to be written for such things,
guided by implementation experience, it turned out that the compatibility
decompositions were typically in the ballpark, as it were, but not correct in
detail for any one purpose, let alone all purposes. And the publication of
UAX #15 Normalization drastically turned things on their head. Instead of
being annotational, and "fixable", compatibility decompositions became part
of the normative definition of NFKD and NFKC, and became "unfixable", because
of the requirements of normalization stability.

So post-Unicode 3.0, the right way to think of the compatibility
decomposition mappings is as the normative data used to define NFKD and NFKC.
They bear some resemblance to relationships between characters and character
sequences that may be useful in other processes, but in *all* cases they
should not be taken as a sufficiently precise set of classifications and
equivalences for other processes -- there will always be exceptions,
particularly since compatibility decompositions can no longer be "fixed" as a
result of tuning based on implementation experience.

> > > providing backup rendering when they lack the glyph,
> >
> > This seems unlikely to be particularly helpful in this *particular*
> > case.
>
> Believe me, it really is. I'm implementing char-cell rendering for Arabic
> terminals, and when it comes to Arabic ligatures, since I don't want to
> get into a mess of double-width things, I just decompose that ligature
> and render the equivalent string. It's not as genuine as it may be, but
> it's automatic, simple, clean, and conformant.

For this kind of application, then, you simply add on decompositions for
whatever else cannot be conveniently rendered in a char-cell. Arabic terminal
applications have often already departed from what the Unicode Standard
specifies in the way of compatibility decompositions by doing special
handling of character "tails" in a separate cell, for example.
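The fallback approach Roozbeh describes can be sketched in a few lines: map a
presentation-form ligature back to its constituent letters via NFKD before
laying out the cells. This is only the mapping step; a real terminal would
still have to shape the resulting letters.

```python
# Sketch: expand Arabic presentation-form ligatures for char-cell rendering
# by applying NFKD, which includes the compatibility decompositions.
import unicodedata

def decompose_for_cells(text: str) -> str:
    """Replace ligatures with their constituent letters, one per cell."""
    return unicodedata.normalize('NFKD', text)

# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM -> LAM + ALEF
expanded = decompose_for_cells('\uFEFB')
print([f"U+{ord(c):04X}" for c in expanded])  # ['U+0644', 'U+0627']
```

As the surrounding discussion notes, this is conformant -- but the NFKD data
is "in the ballpark" rather than tuned for this purpose, so a production
renderer will need its own exceptions on top of it.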
Note that there isn't any compatibility mapping of U+FEB1 (isolated seen) -->
U+FEB3 (initial seen) + U+FE73 (tail fragment), even though that might be
what an Arabic terminal would do for display. It isn't non-conformant with
the Unicode Standard to transform Unicode characters into alternate
representations -- such as a glyph stream for terminal rendering -- it would
only be non-conformant to *claim* that such a glyph stream is NFKD data when
it departs from that specification.

> Some other point: We like to discourage the usage of Arabic Presentation
> Forms, don't we?

Of course. They are compatibility characters for working with the existing
legacy code pages that encoded Arabic that way.

> That is mentioned in TUS 3.0 at the end of the chapter about Arabic. All
> the characters in the Arabic Presentation Forms blocks have these
> decompositions, exactly for this. Only three lack one: U+FD3E and U+FD3F,
> the Ornate Parentheses, which got there by mistake (and are mentioned in
> the text), and U+FE73, which is a half-character (and could not have one
> in any way).
>
> By not providing a compatibility decomposition, we are making the proposed
> character a healthy and normal character, just like Arabic letters or
> symbols.

Nope. See my discussion above for the distinctions. Presence or absence of a
compatibility decomposition is not criterial for "this is a 'good' Unicode
character" or "this is a 'bad' Unicode character." There are plenty of waaaay
worse Unicode characters, encoded for a variety of legacy or even political
reasons, but which have no compatibility decompositions. And some of the
characters with compatibility decompositions, such as U+00A0 NO-BREAK SPACE,
are considered essential parts of many Unicode applications -- and nobody
seriously considers them to be 'bad' characters.
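The U+FEB1 --> U+FEB3 + U+FE73 split mentioned above is a good example of a
conformant transformation that is *not* NFKD. A minimal sketch of how a
terminal might carry such private mappings; the table and function names are
illustrative only, and the single entry is the hypothetical split from the
discussion, not Unicode data:

```python
# A terminal-private glyph mapping that goes beyond the compatibility
# decompositions. Conformant, provided the output is never claimed to be
# NFKD data.
TERMINAL_GLYPH_MAP = {
    # hypothetical: isolated seen -> initial seen + tail fragment,
    # so the "tail" can occupy its own cell
    '\uFEB1': '\uFEB3\uFE73',
}

def to_glyph_stream(text: str) -> str:
    """Apply private display mappings, passing other characters through."""
    return ''.join(TERMINAL_GLYPH_MAP.get(ch, ch) for ch in text)

print([f"U+{ord(c):04X}" for c in to_glyph_stream('\uFEB1')])
# ['U+FEB3', 'U+FE73']
```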
It is a good idea for people to stop thinking of the presence of a
compatibility mapping as the mark of Cain -- it is now, more correctly, just
a piece of normative data used in the definition of normalization in UAX #15.

> It won't be a compatibility character like the Chinese and Japanese ones,
> or other Arabic ligatures, but a new beast encouraged to be used.

Correct. It isn't a duplicate of something already encoded, and it has a
reasonable implementation rationale, so it isn't born "predeprecated", like
some of the junk that gets into the standard.

> Why don't we encode it in the 06xx block then?

Because it is another word ligature symbol like the others in the FDFX
column, and because there no longer are officially good neighborhoods and bad
neighborhoods in the BMP. Putting like things with like in the increasingly
crowded BMP is, for the most part, a favor to font implementers and builders
of character property tables, and it also simplifies the task of structuring
the explanations needed in the documentation of the standard.

> > > reading a text stream aloud, and things like that,
> >
> > And this requires much more than just some raw access to an
> > NFKD normalization of the text stream to make any sense, for
> > any real application.
>
> Of course, but look how nice it is now: In whatever encoding it is, just
> pass it through a converter to Unicode NFKC, and then you will have
> something very clean and consistent to work with. Why bother with the
> difficulties of the various character encodings? This applies to almost
> every similar application which is not rendering-oriented.

NFKC is not "Cleanicode". It has all kinds of problems when you study it in
detail: with respect to compatibility with markup, with respect to format
distinctions which should or should not be maintained under various
circumstances, and with respect to the various incompatible kinds of foldings
that get applied under the same process.
One uses NFKC as a raw processing form only with great trepidation, since it
is easy to destroy a distinction that you (or the consumer of your data) may
have assumed was important to preserve.

> > Implementation practice since then has suggested that compatibility
> > decompositions for these Arabic word ligatures used symbolically are not
> > much help -- and if anything just provoke edge case failures for
> > implementations.
>
> I don't get you. They have definitely been a help to me. What are the
> other difficulties (other than the decomposition buffer size you just
> mentioned)?

As guidelines, sure. I'm not suggesting it was a bad idea in the first place
to indicate all these kinds of character equivalences that got associated
with various of the compatibility characters in the standard. You just
cannot assume that compatibility mappings can be used without discrimination
and refinement for particular processes.

Another example of a complication for the decompositions of the Arabic word
ligatures would come from assuming that compatibility decompositions should
be mapped onto input methods. That would be *correct* in the case of a
two-character ligature -- say something like U+FCA6 THEH WITH MEEM -- one
could expect to type THEH ... MEEM ... and then have automatic ligature
formation under certain circumstances.

But the word ligature symbols are different. Nobody really expects to have
to type:

0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648
0633 0644 0645

and then have ligature formation scoop up the entire sequence to create the
SALLALLAHOU ALAYHE WASALLAM symbol, do they? No, if you are using such a
special word ligature symbolically, as in a regular header for documents, or
the like, then you expect to have the symbol on its own key (or the moral
equivalent thereof).
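Both points above are easy to demonstrate with stdlib `unicodedata`: NFKC
silently folds away distinctions that may matter to a consumer, and the word
ligature symbol U+FDFA expands to exactly the 18-character phrase listed in
the discussion.

```python
# NFKC is not "Cleanicode": it destroys distinctions and expands word
# ligatures into long letter sequences.
import unicodedata

# Distinctions quietly destroyed: superscript two vs. digit two,
# no-break space vs. ordinary space.
print(unicodedata.normalize('NFKC', '\u00B2'))         # '2'
print(unicodedata.normalize('NFKC', '\u00A0') == ' ')  # True

# U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM expands under NFKC
# to the full 18-character sequence 0635 0644 0649 0020 ... 0645.
expanded = unicodedata.normalize('NFKC', '\uFDFA')
print(len(expanded))  # 18
```

A one-character buffer growing to eighteen characters is exactly the sort of
edge case a normalization or input-method implementation has to budget for.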
And the implementers of fonts and rendering systems can't reasonably be
expected to search out these few extraordinary edge cases and deal with them
automatically, when instead they should be focused on the more regular 2- and
3-element ligatures.

> Also, please note that we cannot remove all those decompositions; we can
> only make the implementer's job a little harder by breaking the model (and
> encouraging the use of the Arabic Presentation Forms block), or we can
> help him a little.
>
> > Nope. The UTC wouldn't do it, nor have the Pakistani delegates
> > working with the UTC and WG2 asked for it.
>
> It was in the first proposals, IIRC. They were not formal, of course.

The first "proposals" were actually simply background documents about UZT,
rather than well-formed proposals for actual encodings. Those came later.

> > Not everything that gets into a national standard gets into Unicode.
>
> Undoubtedly, but being in a national standard helps a lot. So OK, are you
> telling me that it is not just for compatibility -- that it is a
> legitimate character that could have been accepted by WG2 even if it had
> not been in UZT?

I expect so, actually, given its usage. A similar (but not identical)
BISMALLAH was requested early for the Thaana script, and I expect that
decision will eventually be revisited as well.

> BTW, whose suggestion was it to not provide a compatibility decomposition
> for the character?

At this point, who can recall who was the first to raise their hand? ;-)
Essentially it was a consensus decision by the committee, with little dissent
that I can recall.

> > But I think you may be overestimating the caving in going on here.
> > The UTC is still pushing back on another proposal to disunify Urdu
> > digits, for example -- those did *not* get accepted by WG2, nor do I
> > expect they will pass muster in future UTC meetings.
>
> Yes, I'm doing that on purpose. I was talking with a Pakistani expert
> about UZT before the Dublin meeting.
> He told me that his colleagues will propose the character to WG2, and I
> told him it's impossible -- WG2 has already passed something about not
> encoding more Arabic ligatures unless they are in a pre-90s standard. He
> told me: "You only need to push hard. Just insist enough, and they will
> surrender."

No one doubts that there is a political aspect to character encoding. After
all, this work happens in an international context, with lots of competing
interests and individuals in two different large committees. But it isn't as
simple as just pushing and insisting. In the end, you have to convince two
committees that there is *some* technical merit to the proposal. Totally
off-the-wall stuff doesn't get in, no matter how hard you push.

You haven't seen a *real* game of political character-encoding hardball if
you haven't seen the 7-member North Korean delegation at the Beijing WG2
meeting insisting on the complete re-encoding and renaming of all Korean
characters in 10646!

> I'm just a geek who prefers technical excellence to political reasons. I
> do try my best for implementing this in our national standard committees,
> and I somehow expect this from the UTC. I just like to see more of that
> resistance against grass radicals. :-)

I pushed *very* hard to avoid the proliferation of grass radicals in the
standard. In the end, I lost on that one -- and we ended up with more grass
radicals anyway. I consider that one more chapter in the sorry history of
mistakes in de jure and de facto Japanese encoding standards. But you win
some and you lose some.

--Ken