"Kent Karlsson" <kent.karlsso...@telia.com> wrote:
> On 2010-07-24 at 10:07, "Philippe Verdy" <verd...@wanadoo.fr> wrote:
> >
> > Double diacritics have a combining property equal to zero, so they
>
> No, they don't. The above ones have combining class 234 and the below
> ones have combining class 233 (other characters with the word DOUBLE
> in them are 'double' in some other way):
>
> 035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;;;;;N;;;;;
> ...
Aren't they using the maximum values of the combining class? If so, you can still use double diacritics between two sequences each containing a base character and any "simple" diacritic, and be sure that the double diacritic will be rendered above them, as it will remain in the last position of the normalized form.

Anyway, I also said that a character with combining class 0 was needed to add other diacritics on top of double diacritics, after encoding the two sequences joined with the double diacritic. Why such a bogus non-zero combining class was assigned to double diacritics is a mystery to me, as it was really not needed for compatibility with legacy encodings. These combining classes 233 and 234 serve absolutely no purpose; they only complicate things for no benefit (including the fact that now an additional character with combining class 0, such as CGJ, is always needed to stack anything else on top of double diacritics). I did not realize that before (yes, I should have looked in the UCD to verify). And given their existing behavior, this has prevented other, simpler encodings of texts.

Also, I have NEVER found any occurrence where the fact that they have combining class 233/234 instead of 0 makes any difference, because double diacritics were ALWAYS encoded between the two base graphemes, encoded separately, and the canonical order preserves this encoding position between the two base graphemes in all cases.

Note that I'm not even sure that CGJ is the right choice for stacking more diacritics on top of double diacritics, because it would mean that the additional diacritic would need to be encoded just after the double diacritic and CGJ, but before the second grapheme, and this does not really match with double diacritics used between triplets of graphemes: where should the additional diacritics be placed, on the first or on the second double diacritic?
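The two normalization facts I'm relying on here (double diacritics sort last among marks on the same base, and keep their position between two bases) can be checked directly with Python's stdlib unicodedata module; this is just a sketch using U+0361 COMBINING DOUBLE INVERTED BREVE (ccc 234) and U+0301 COMBINING ACUTE ACCENT (ccc 230) as examples:

```python
import unicodedata

# Combining classes as quoted above from UnicodeData.txt:
assert unicodedata.combining('\u035C') == 233  # COMBINING DOUBLE BREVE BELOW
assert unicodedata.combining('\u0361') == 234  # COMBINING DOUBLE INVERTED BREVE

# Canonical reordering is a stable sort of marks by combining class,
# so a double diacritic (ccc 234) always ends up AFTER a "simple"
# above diacritic (ccc 230) attached to the same base:
s = 'a\u0361\u0301'  # a + double inverted breve + acute
assert unicodedata.normalize('NFD', s) == 'a\u0301\u0361'

# Encoded between two bases, the double diacritic keeps its position:
# the second base (ccc 0) blocks any reordering across it.
t = 'o\u0361o'
assert unicodedata.normalize('NFD', t) == t
```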
For me the logical ordering would require encoding first the base graphemes, separated by the double diacritic, then encoding the additional diacritics applicable to the whole previous group (and so it requires adding a new virtual base to block the reordering).

(1) If using CGJ at the end of the sequence containing the two bases and the double diacritic, it will still attach the additional diacritics, logically and visually, to the last base grapheme, and so they will still stack on it, below the double macron for example, even if their relative order is preserved. It's needless (or logically wrong), in this order, to use CGJ instead of ZWJ in a sequence like:

<base-1, double-diacritic, base-2, CGJ, additional-diacritics>

because in that position CGJ has no effect other than blocking the reordering of the additional diacritics, which are already blocked by base-2, so it would still be interpreted as:

<base-1, double-diacritic, base-2, additional-diacritics>

and so the additional diacritics will be linked to base-2, and the double diacritic will cover the full group containing <base-1> and <base-2, additional-diacritics>.

(2) The only alternative is to encode the additional diacritics in the middle of the group linked by CGJ, in this order:

<base-1, double-diacritic, CGJ, additional-diacritics..., base-2>

and then it will be impossible to have longer groups applying the double diacritic to more than 2 bases. This encoding using CGJ clearly breaks the logical assumption that the additional diacritics applying to a group should all be encoded AFTER the full group has been encoded. Here the additional diacritics need to be inserted at a specific position in the middle of the sequence (and in practice, input editors would have to scan back before base-2, through the additional diacritics and CGJ, just to find the double diacritic and see that any further diacritics need to be inserted there...)
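Both claims about CGJ can be verified against the normalization algorithm itself; a small sketch (U+034F is CGJ, U+0302 stands in for an arbitrary additional diacritic, and "a", "b" for the two bases):

```python
import unicodedata

CGJ = '\u034F'  # COMBINING GRAPHEME JOINER, ccc 0

# Order (1): a CGJ placed after base-2 is redundant for normalization,
# because base-2 (ccc 0) already blocks reordering of the additional
# diacritics -- both sequences are already normalization-stable:
with_cgj    = 'a\u0361b' + CGJ + '\u0302'
without_cgj = 'a\u0361b\u0302'
assert unicodedata.normalize('NFD', with_cgj) == with_cgj
assert unicodedata.normalize('NFD', without_cgj) == without_cgj

# Order (2): with the additional diacritic in the middle, CGJ really is
# needed -- without it, canonical reordering moves the additional
# diacritic (ccc 230) in front of the double diacritic (ccc 234):
middle = 'a\u0361' + CGJ + '\u0302b'
assert unicodedata.normalize('NFD', middle) == middle
assert unicodedata.normalize('NFD', 'a\u0361\u0302b') == 'a\u0302\u0361b'
```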
CGJ was not intended to apply to more than one character, but only as a way to block some normalized reordering of combining characters occurring after a single base character (which always has combining class 0). In that position, it should only occur between two combining characters with non-zero combining class, and only if the second one has a lower combining class than the first one, and only if this creates a semantic or visual difference in rendered documents (for example because of the variable positions of the cedilla, which the combining classes unify as if it were unique).

(3) Using ZWJ, this terminates the last base grapheme, so you can safely append other diacritics applying to the whole group joined by the double diacritic, and this becomes encoded very logically in this order:

<base-1, double-diacritic, base-2, ZWJ, additional-diacritics>

This will have a more consistent behavior if double diacritics or ZWJ are not supported by the renderer to create long groupings. In that position, if the renderer can only draw the double diacritic with nothing else on top of it, the additional diacritics will be drawn after the sequence of the two bases and the double diacritic, and only the additional diacritics will be drawn like a defective sequence (with a dotted circle, for example).

(4) With ZWJ as the base separator with combining class 0 (just like CGJ, which has a more "local" usage, to force the relative order of simple diacritics above only one base grapheme, when it has to be semantically different from the canonical order) between the last base grapheme and the additional diacritics (which I think is logically better than CGJ), we could *also* have longer sequences such as:

<base-1, double-diacritic, base-2, double-diacritic, base-3, ZWJ, additional-diacritics...>

without any ambiguity about which double diacritic should "support" the additional diacritics.
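At least as far as normalization is concerned, nothing prevents this: ZWJ (U+200D) has combining class 0 just like CGJ, so a three-base sequence in the order proposed in (4) is already in canonical order and survives both NFD and NFC intact. A quick check (note this only shows normalization stability; no renderer is required to interpret ZWJ as grouping the diacritics over the whole span, that is the proposal here, not current standard behavior):

```python
import unicodedata

ZWJ = '\u200D'  # ZERO WIDTH JOINER
CGJ = '\u034F'  # COMBINING GRAPHEME JOINER

# Both joiners have combining class 0, so both block reordering:
assert unicodedata.combining(ZWJ) == 0
assert unicodedata.combining(CGJ) == 0

# Proposed order (4): the whole group first, then ZWJ, then the
# diacritics applying to the group (U+0304 as an example macron).
# The sequence is normalization-stable in both NFD and NFC:
seq = 'a\u0361b\u0361c' + ZWJ + '\u0304'
assert unicodedata.normalize('NFD', seq) == seq
assert unicodedata.normalize('NFC', seq) == seq
```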
The occurrences of double diacritics should be treated uniformly wherever they occur; by default, in a simple renderer, they will overlap in the middle, except above the first and last base graphemes, but a smarter engine will avoid this overlap (when they are identical) and will draw a longer diacritic covering all the base graphemes over which the double diacritic is encoded.

I've still not seen encoded texts needing that, but such groupings with more than two base graphemes are common in the literature (for example when emphasizing trigrams like "sch" in German, or even "str" in English, or finals appended to conjugated verbs or declined nouns, or in phonetic notations needing longer ties to group complex clusters of consonants or diphthongs). In some cases they act like interlinear annotations (such as emphasized trigrams, where the tie acts like an alternate underlining), but in others they have a semantic value within the encoded text itself from which they can't be safely detached (such as in phonetic notations, or in mathematical notations and other scientific and technical formulas).

Anyway, I still think that double diacritics are a "hack" inserted in the UCS, and they now clearly appear as an unjustified disunification of the diacritics: we should be able to encode the NORMAL (non-double) diacritics (from any Unicode block where they are already encoded) and apply them to an arbitrarily long group of characters, encoding the normal diacritics in the logical order after encoding the group, because:

- most of them were added in the UCS before ZWJ was encoded;
- this is the natural order in which they are perceived and drawn;
- this is the natural way of interpreting the diacritics (and they are not necessarily "elongated");
- the concept of groupings is inherent to the logical semantics of the text, and should be preserved by its encoding.
Adding the explicit encoding of semantically significant groupings (which are still missing) was certainly more important than adding these disunified "double" diacritics (which also have their own distinct combining classes). Not only did this encoding of double diacritics fail to solve the problem completely within a general character model, it added new exceptions and problems for automated text parsers and renderers.

Philippe.