Re: Printing and Displaying Dependent Vowels

Peter Kirk Tue, 30 Mar 2004 04:13:52 -0800

On 29/03/2004 16:28, Kenneth Whistler wrote:

...

Using NBSP rather than SPACE has several advantages, and has long been specified in Unicode, although not widely implemented. It is less likely to occur accidentally. But it has disadvantages, especially that it will always be a spacing character, whereas for display of isolated Indic vowels no extra spacing is required.

NBSP is not a fixed-width space.

Yes it is, in Unicode 4.0.0. Ernest quoted from UAX #14 "All other space characters have fixed width." This may be in the standard by mistake, but it is in the standard. Asmus says that this will be changed in 4.0.1, but that has not yet been released. If a statement is written in a standard, even in the introduction to a different section, that is normative.

I would like to repeat my earlier proposal for a new character ISOLATED COMBINING MARK BASE. This character would have no glyph, and the general properties of a letter. Its spacing would be just as much as required for proper display of the combining mark - which would be zero for combining marks which have their own width.

And after 15 years presence in the standard (or its earlier drafts)
of the SP + CM recommendation, what makes you think that introduction
of a *new* convention using a *new*, special purpose format control
character sorta like a space only different, would lead to any
better situation in actual practice? Use of such a character would
*NOT* resolve the differences regarding how to display such a
combination, by the way.

I would be happy for NBSP to be used in this way, now that it has been clarified that this should not be considered fixed width when followed by a combining mark. I would like to see a clear recommendation (not a conformance requirement, I agree) that the sequence <NBSP, non-spacing combining mark> should be rendered as a spacing version of the mark with just enough space for the mark and no added glyph. My reason for preferring NBSP to SPACE is that it is unambiguously non-breaking and (I think) not a word boundary.
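To make the sequences under discussion concrete, here is a minimal Python sketch that builds <NBSP, CM> or <SP, CM> for a given combining mark. The helper name `isolated_mark` is hypothetical, and the sketch only shows the encoding; how a renderer spaces or draws the result is exactly the open question in this thread.

```python
import unicodedata

NBSP = "\u00A0"   # NO-BREAK SPACE
SPACE = "\u0020"  # SPACE


def isolated_mark(mark: str, base: str = NBSP) -> str:
    """Encode an isolated combining mark as the sequence <base, mark>.

    Hypothetical helper: it checks only that `mark` has a general
    category starting with "M" (Mn, Mc or Me) and prepends the base.
    """
    if len(mark) != 1 or not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a single combining mark")
    return base + mark


# <NBSP, COMBINING ACUTE ACCENT> and <SP, TAMIL VOWEL SIGN E>
for seq in (isolated_mark("\u0301"), isolated_mark("\u0BC6", SPACE)):
    print(" ".join(f"U+{ord(c):04X}" for c in seq))
```

Defaulting to NBSP reflects the preference stated above; passing `SPACE` gives the long-standing SP + CM form instead.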

But this doesn't solve the problem for Tamil and similar scripts, because what is needed there is a non-spacing, non-breaking base character which allows the vowel to display without the dotted circle. Perhaps ZWJ would be suitable.

...

Well, as I understand it NBSP is often expected to be a fixed-width space, and it is in many implementations. In fact I think it ought to be, whether or not this is actually specified. But there ought to be a character which is explicitly NOT fixed width to carry NSMs.

There are *two* such characters: SPACE and NBSP.

You mean, there will be in 4.0.1. The problem with SPACE is a different one.

...

The intent of the UTC and the editors has always seemed clear to
me on this particular point -- and the fact that the text in
question has survived 3 major reeditings of the entire standard
without significant change indicates to me that this has not been
a problematical part of the standard for the UTC.

Well, a text needs to be clear to its readers, not just to its authors. Obviously this text is not clear to readers, even ones as experienced as John Cowan, and so needs clarification.

So assuming that "combining mark" means "combining character" rather than "non-spacing mark" (the term does not appear in the Glossary), it seems that combining vowels should work fine with SP or NBSP.

This, however, is a textual problem which should be addressed. As it stands, Section 7.7, Combining Marks deals with various types of combining characters, including non-spacing combining marks and enclosing combining characters. It does not say anything explicit about Indic dependent vowels, in part because of its textual history.
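The distinction between "combining character" and "non-spacing mark" can be checked against the general categories in the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module; the function name `mark_kind` and the choice of sample characters are illustrative assumptions, and the categories reported are those of the Unicode version bundled with the Python build in use, not necessarily 4.0.

```python
import unicodedata

# General categories that together make up "combining marks"
KINDS = {
    "Mn": "non-spacing mark",
    "Mc": "spacing combining mark",
    "Me": "enclosing mark",
}


def mark_kind(ch: str) -> str:
    """Describe a character by its general category (hypothetical helper)."""
    return KINDS.get(unicodedata.category(ch), "not a combining mark")


samples = [
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u0BC6",  # TAMIL VOWEL SIGN E, an Indic dependent vowel
    "\u20DD",  # COMBINING ENCLOSING CIRCLE
    "\u00A0",  # NO-BREAK SPACE
]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mark_kind(ch)}")
```

Note that U+0BC6 falls under "spacing combining mark", so any wording that covers only non-spacing marks would leave it out.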

In that case something clear and sensible needs to be added about Indic dependent vowels.

Peter Kirk continued:

But it is a source of great confusion to everyone when a widely used application does something clearly different from what the standard intends, and yet claims "conformance" even if technically this is correct.

What the standard intends is that the textual representation (encoding) of an isolated combining mark be done via the sequence <SP, CM>. It does not *require* or *not require* that the visual rendering of such a sequence be done with or without a dotted circle indicating the absence of an expected normal base letter. In fact, the standard itself makes widespread and explicit use of the convention to display such combinations *with* a dotted circle.

Well, the standard clearly intends that the character for "a" is rendered with the glyph "a" and not the glyph "b". It may not formally require this, but a system which breaks this rule, while possibly formally conformant, can hardly claim to support Unicode properly.

One convention for display of isolated combining marks is to use a dotted circle. But this convention is far from universal across all writing systems. It is wrong to impose it on all systems - except perhaps in such a context as the Unicode standard text and character charts where different systems are compared. It is clear that there is sometimes (and even in Latin script) a requirement to display isolated combining marks without dotted circles.
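As a rough illustration of the dotted-circle convention, the following Python sketch substitutes U+25CC DOTTED CIRCLE for a SPACE or NBSP that carries a combining mark. This is a toy model of one display choice, not a description of how Uniscribe or any other renderer actually works; the names `is_mark` and `with_dotted_circle` are hypothetical.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"
DOTTED_CIRCLE = "\u25CC"


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def with_dotted_circle(text: str) -> str:
    """Replace a SPACE or NBSP that is followed by a combining mark
    with a dotted circle, leaving every other character untouched."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in (SPACE, NBSP) and nxt and is_mark(nxt):
            out.append(DOTTED_CIRCLE)
        else:
            out.append(ch)
    return "".join(out)


shown = with_dotted_circle(NBSP + "\u0BC6")
print(" ".join(f"U+{ord(c):04X}" for c in shown))
```

A system that needs isolated marks without the circle would simply skip this substitution and leave the spacing decision to the font and layout engine.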

It seems, from what Srivas (Avarangal) wrote, to be part of the requirement for correct display of Tamil, and perhaps other Indic languages, to be able to display isolated forms of such characters as U+0BC6. If Uniscribe does not support this, even if it is technically Unicode conformant, Microsoft cannot claim to support Tamil and other languages.

It is a *meta*requirement, required for text *about* the writing system. That may be an important requirement, but it is a specialized requirement, and it is silly to turn that into a claim that "Microsoft cannot claim to support Tamil and other languages."

I don't accept that this is a specialised requirement or "*meta*requirement". Any text which includes a list of characters in the language or script is likely, at least for certain scripts, to include isolated dependent vowels. Such texts include all dictionaries, encyclopedias, language learning and literacy materials, and even all books with indexes. There are also cases of isolated dependent vowels being used in variant spellings, abbreviations, etc. in other texts. Such texts counted together are likely to constitute a high proportion of the total corpus in many languages.

I would say that if specific products do not support dictionaries, indexes or literacy primers in Tamil, they cannot claim to support Tamil.


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
