On Monday, August 11, 2003 2:05 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:
> Um, no. Precisely because it would introduce *another* way
> to do what is already specified in the standard. It would, I
> predict, lead to nothing but more trouble.
>
> You might, perhaps, find it satisfying, but I can guarantee
> that there would then be a future critic complaining about
> an unnecessary distinction introduced into the standard. And
> then there would be *more* text in different places of the
> standard to try to correct and change, in an attempt to
> try to make consistent distinctions between the behavior
> of <SPACE, NSM> and <ACCENT_ANCHOR, NSM>.

I don't think so: texts that are already coded with SPACE+NSM would need no changes, as long as the applications using them are satisfied with the existing behavior, even if that behavior is ambiguous or causes problems in other applications. The rule would be not to change anything, but to offer writers a way to create new texts without those ambiguities and problems, and to correct existing texts if their authors wish.

For me, the "ACCENT ANCHOR", if you call it that, solves the use of isolated diacritics as plain letters (such as the implied missing y in Hebrew Yerushala(y)im), and so it would behave like an alphabetic character (whose directionality remains to be defined...).

Existing coded spacing diacritics are coded as symbols (Sk), mostly for accents used in LTR scripts, so the confusion of these symbols with letter behavior in some UAXes, which give them the AL property (including for one case of SPACE+NSM), is not a problem. Use as symbols is mostly correct where a text speaks about a diacritic as an isolated symbol and not within words (this is correct for most languages). Use within words (for an implied missing base letter, including when that letter is the initial) leaves a distinct hole, for example when trying to encode a word like "(Y)erushala(y)im", where a missing base letter is the initial.

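To make the property differences concrete, here is a minimal Python sketch that prints what the Unicode Character Database reports for SPACE, an existing spacing diacritic, and two non-spacing marks. The helper name `describe` and the choice of sample characters are mine, for illustration only, and the values shown depend on the UCD version bundled with the Python interpreter.

```python
import unicodedata

def describe(ch):
    """Return (general category, bidi class, decomposition) for one character."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.decomposition(ch),
    )

SPACE = "\u0020"
ACUTE_SPACING = "\u00B4"    # ACUTE ACCENT, an existing spacing diacritic
ACUTE_COMBINING = "\u0301"  # COMBINING ACUTE ACCENT, a non-spacing mark
HIRIQ = "\u05B4"            # HEBREW POINT HIRIQ, a non-spacing mark

# Sample characters chosen for illustration; any NSM could be substituted.
for ch in (SPACE, ACUTE_SPACING, ACUTE_COMBINING, HIRIQ):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), describe(ch))
```

The spacing ACUTE ACCENT is reported as a symbol (Sk) with a compatibility decomposition to SPACE plus the combining mark, while SPACE itself remains a whitespace separator (Zs); nothing in these properties gives an isolated diacritic letter behavior.
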
For languages like Arabic and for South Asian scripts, there's no problem, as there is already a base letter to hold initial combining vowel signs, which also works for the case of multiple combining vowels that should not stack but be written on this base letter. In fact, in those languages the missing consonantal base letter is actually written with a visible glyph. But for Latin, Cyrillic, Greek, Hebrew, and probably other scripts, isolated diacritics lack an explicit coded form. And even for Arabic and Brahmic scripts there is still a need to be able to speak about the diacritic itself, without an explicit base letter, so the SPACE+NSM combining sequence is for now the only solution, with the problems caused by its undocumented properties.

Reread some UAXes to see the problematic impact of SPACE+NSM in areas that are NOT related to rendering, notably when extracting word sequences (for search and indexing), managing keyboard selections, computing line breaks, and handling directionality. Now consider the even greater impact of the legacy use of SPACE as normalizable padding whitespace (a key feature of SGML, HTML and XML): the legacy use of SPACE+NSM causes too many problems to satisfy authors, who in some cases will not be able to use it at all because it will not work as expected.

Because of these problems, authors then resort to even worse hacks, like putting a control before the NSM, even though this creates "defective" combining sequences, the dotted circle is sometimes displayed, and the text is parsed with an invisible but still additional grapheme cluster for the control itself, whose presence is pollution. Instead of forcing authors to use defective combining sequences like control+NSM, why not designate a clean and pure invisible base character with the required properties, so that it creates a pure combining sequence for the isolated diacritic(s)?

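The whitespace problem is easy to reproduce. The sketch below is only a rough stand-in for SGML/HTML/XML whitespace normalization, not any parser's actual algorithm, and the function name `collapse_whitespace` and the sample sentence are assumptions of mine. It shows the base SPACE of a SPACE+NSM sequence being treated as ordinary padding.

```python
def collapse_whitespace(text):
    """Collapse runs of whitespace to one SPACE and trim both ends
    (a rough approximation of markup whitespace normalization)."""
    return " ".join(text.split())

NSM = "\u0301"        # COMBINING ACUTE ACCENT
isolated = " " + NSM  # SPACE + NSM, the legacy isolated diacritic

# A separator SPACE followed by the SPACE that serves as the base.
text = "the accent " + isolated + " is called acute"

# Naive word extraction: the mark becomes a token with no base at all.
tokens = text.split()
print(tokens)

# Normalization: the base SPACE is stripped or merged with the separator.
print(repr(collapse_whitespace(isolated)))
print(repr(collapse_whitespace(text)))
```

After normalization the isolated sequence is reduced to the bare combining mark, and inside running text the mark ends up attached to what is really the word separator, which is the kind of breakage described above.
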
So the question is which invisible base character(s) to define, and with which properties:

- an invisible symbolic base character (Sk) with neutral directionality (I called it an INVISIBLE SYMBOL);
- an invisible letter base character (Lo) with neutral directionality (you call it an ACCENT ANCHOR, and I called it an INVISIBLE LETTER); or
- an invisible letter base character (Lo) with LTR directionality; and
- an invisible letter base character (Lo) with RTL directionality.

Personally, I find the term ACCENT ANCHOR ambiguous: it does not indicate the usage precisely (it fits better the current ambiguous use of SPACE as this anchor for accents), and it seems restrictive as to the kind of diacritic or other combining mark that may (should?) be applied to it. In addition, nothing would forbid combining several diacritics or marks on this base character.

Consider, then, that these new characters are better base characters than SPACE, and define them with only a compatibility decomposition to SPACE, to match the previous encoding. If these new base characters are used without diacritics, they will be shown like the glyph for NBSP, and not necessarily as zero-width (there's no requirement for these invisible symbols to be zero-width in all cases, as they are a more precise substitute for the legacy SPACE, but without the associated whitespace properties).

With these new characters, there is no need to change the rules in the various UAXes and other Unicode algorithms.

-- Philippe.