Peter Kirk said:

> I will say again as I have said before - but the above (and what I
> snipped) is extra evidence for it - that what is broke ... is
> the rule that the isolated (generally spacing) form of a combining mark
> should be formed by SPACE or NBSP followed by the combining mark.

This has been the *intent* of the standard since its inception in 1989.

> There
> are many good reasons for not using SPACE for this, including default
> behaviour like inserting line breaks immediately after SPACE.

Nope. UAX #14 specifies the following regarding SPACE followed by
combining marks: "If U+0020 SPACE is used as a base character, it is
treated as AL instead of SP." This means that a combining character
sequence of this type is treated as a unit for the purposes of line
breaking, and this overrides the usual behavior of SPACE as a line
break opportunity.

Of course UAX #14 only spells out default behavior, but then "default
behaviour" is what was claimed just above.

> Using NBSP rather than SPACE has several advantages, and has long been
> specified in Unicode, although not widely implemented. It is less likely
> to occur accidentally. But it has disadvantages, especially that it will
> always be a spacing character, whereas for display of isolated Indic
> vowels no extra spacing is required.

NBSP is not a fixed-width space.

> I would like to repeat my earlier proposal for a new character ISOLATED
> COMBINING MARK BASE. This character would have no glyph, and the general
> properties of a letter. Its spacing would be just as much as required
> for proper display of the combining mark - which would be zero for
> combining marks which have their own width.

And after 15 years' presence in the standard (or its earlier drafts) of
the SP + CM recommendation, what makes you think that introduction of a
*new* convention using a *new*, special purpose format control character
sorta like a space only different, would lead to any better situation in
actual practice?

Use of such a character would *NOT* resolve the differences regarding
how to display such a combination, by the way.

> I realise that for backward compatibility reasons the old encoding
> cannot be made illegal. But it can be deprecated, and a note can be
> added that this sequence may not always be displayed as preferred.

This is a recipe for prolonging the confusion and inconsistency in
implementations of this feature.

> You can't get away with it that easily. If the standard specifies that
> <space, combining mark> should be displayed as an isolated combining
> mark, then it would be conformant for a partial implementation to
> display this sequence as nothing or as an illegal sequence. But if the
> system attempts to display the sequence in a meaningful manner, it must
> do so according to the standard, i.e. not as dotted circle plus
> combining mark.

The standard does not *require* this rendering or anything else.

For the most part, the Unicode Standard is *NOT* a text rendering
standard -- it is a character encoding standard. All kinds of
recommendations are put in regarding how to handle one kind or another
of rendering problem, precisely so that every implementer doesn't start
from scratch to reinvent the wheel here, and so as to provide some basis
for people to represent the same text content with the same "spellings"
for complex scripts.

There are reasons why such recommendations are found in Chapters 7 (and
5 and 2) of the standard, and are not nailed down with conformance
clauses in Chapter 3. The UTC has, over the years, not found it
appropriate to try to make normative requirements on the details of text
display, except insofar (as in the Bidirectional Algorithm) as they have
a direct bearing on the interpretation of the logical content of the
text itself.

> Well, as I understand it NBSP is often expected to be a fixed-width
> space, and it is in many implementations. In fact I think it ought to
> be, whether or not this is actually specified. But there ought to be a
> character which is explicitly NOT fixed width to carry NSMs.

There are *two* such characters: SPACE and NBSP.

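For anyone who wants to see the encoding convention concretely, here is a minimal Python sketch. It assumes only the rule as quoted above from UAX #14 (a SPACE that serves as the base of a combining mark is classed AL rather than SP) and is not a line breaking implementation, nor checked against any particular revision of UAX #14. The helper names `is_combining_mark`, `isolated_form`, and `space_break_classes` are hypothetical, made up for this illustration; it says nothing about rendering.

```python
import unicodedata

SPACE = "\u0020"  # U+0020 SPACE
NBSP = "\u00A0"   # U+00A0 NO-BREAK SPACE


def is_combining_mark(ch):
    """True if ch has general category Mn, Mc, or Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def isolated_form(mark, base=SPACE):
    """Encode an isolated combining mark as <base, mark> (hypothetical helper)."""
    if base not in (SPACE, NBSP):
        raise ValueError("base must be SPACE or NBSP")
    if not is_combining_mark(mark):
        raise ValueError("not a combining mark")
    return base + mark


def space_break_classes(text):
    """Classify each U+0020 in text per the quoted rule only.

    A SPACE immediately followed by a combining mark is reported as 'AL'
    (it acts as a base character); any other SPACE is reported as 'SP'.
    """
    classes = []
    for i, ch in enumerate(text):
        if ch == SPACE:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            classes.append("AL" if nxt and is_combining_mark(nxt) else "SP")
    return classes
```

With these definitions, `isolated_form("\u0BC6")` yields SPACE followed by TAMIL VOWEL SIGN E, and `space_break_classes` reports that SPACE as AL, so under the quoted default rule the pair is not split by a line break opportunity after the SPACE.
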
John Cowan noted:

> Well, it depends on what the equivoque "combining marks" in the title of
> Section 7.7 means.

and then quoted the relevant text from p. 187.

By the way, the first part of that text has survived almost verbatim
from Unicode 1.0, where it was printed on p. 40 in what was then
Chapter 3, Character Blocks. It was written there as part of the section
"Generic Diacritical Marks U+0300 --> U+036F", as that was the most
obviously apropos point in the text at the time.

The text of the standard has since been morphed, restructured, and
extensively added to, but some of its quirks result from the fact that
the text has a *history*, and it isn't completely rewritten every time a
new book is published.

The intent of the UTC and the editors has always seemed clear to me on
this particular point -- and the fact that the text in question has
survived 3 major reeditings of the entire standard without significant
change indicates to me that this has not been a problematical part of
the standard for the UTC.

> So assuming that "combining mark" means "combining character" rather than
> "non-spacing mark" (the term does not appear in the Glossary), it seems that
> combining vowels should work fine with SP or NBSP.

This, however, is a textual problem which should be addressed. As it
stands, Section 7.7, Combining Marks deals with various types of
combining characters, including non-spacing combining marks and
enclosing combining characters. It does not say anything explicit about
Indic dependent vowels, in part because of its textual history.

Peter Kirk continued:

> But it is a source of great confusion to
> everyone when a widely used application does something clearly different
> from what the standard intends, and yet claims "conformance" even if
> technically this is correct.

What the standard intends is that the textual representation (encoding)
of an isolated combining mark be done via the sequence <SP, CM>.
It does not *require* or *not require* that the visual rendering of such
a sequence be done with or without a dotted circle indicating the
absence of an expected normal base letter. In fact, the standard itself
makes widespread and explicit use of the convention to display such
combinations *with* a dotted circle.

> It seems, from what Srivas (Avarangal) wrote, to be part of the
> requirement for correct display of Tamil, and perhaps other Indic
> languages, to be able to display isolated forms of such characters as
> U+0BC6. If Uniscribe does not support this, even if it is technically
> Unicode conformant, Microsoft cannot claim to support Tamil and other
> languages.

It is a *meta*requirement, required for text *about* the writing system.
That may be an important requirement, but it is a specialized
requirement, and it is silly to turn that into a claim that "Microsoft
cannot claim to support Tamil and other languages."

That's as silly as claiming that a JIS X 0208 conformant computer system
does not support Japanese because it doesn't have a specified way to
write stroke-order writing learning books that show Japanese characters
written one stroke at a time. Yes, you can show a genuine need to
produce such publications in Japan, but that doesn't mean that the
character encoding standard has to spell out how to produce them.

> But a claim to support particular scripts or languages
> surely implies that all characters in that script (or at least in its
> modern form) are supported. That is not perhaps a Unicode requirement,
> but at least in the UK a failure here might be a breach of laws on
> truthful advertising and description of products.

Puh-leeez.

--Ken