On 02/08/2004 13:12, Antoine Leca wrote:

...

However, if I can agree with you about the area being fuzzy when it comes to
*ZWJ* and its numerous uses and some abuses (like Devanagari half-forms),
the verdict is not anywhere near as bad about ZWNJ.
The behaviour of ZWNJ is consistent almost everywhere, and the correct
explanation is the one given, among other places, in chapter 15: namely that
ZWNJ restricts rendering to unconnected and unligatured forms (or prevents
use of any connected form or ligature, if you prefer), where possible.



I agree that the situation with ZWJ is more complex than that with ZWNJ. But ZWNJ is not free of uncertainty either, because it is unclear what actually counts as a "ligature", and so what exactly may be broken by ZWNJ.

In discussions on my Holam proposal, John Hudson wrote:

[Note that on Unicode lists I tend to use the term ligature in a purely technical sense: a single glyph representing two or more characters. This says nothing about the form of that glyph. Discussion of ligatures in complex scripts can become confusing unless this strictly technical definition is kept in mind. It helps to remember that when you are looking at rendered text, what *looks* like a ligature -- i.e. two or more conjoined forms -- may or may not in fact be a single glyph.]

He clarified later that it is irrelevant to him whether a glyph consists of a single continuous block or is graphically equivalent to a base character plus a diacritical mark; all that matters is how it is implemented.


But other experts use a very different definition of "ligature", one apparently restricted to glyphs with a particular *form*, perhaps John Hudson's "two or more conjoined forms". This definition apparently excludes combinations of a base character with a diacritical mark, even when these are represented as two Unicode characters (i.e. not precomposed) but are implemented with a single glyph, e.g. by substitution of a presentation form. On this latter definition, the glyph for the alphabetic presentation form U+FB4B HEBREW LETTER VAV WITH HOLAM cannot be considered a ligature. Yet in every normalisation form U+FB4B decomposes to the two characters <VAV, HOLAM>, so the only use of its glyph is to represent that two-character combination, and rendering engines such as Uniscribe substitute it for the sequence automatically.

The situation is even more confused in that some Unicode characters, e.g. U+0152 LATIN CAPITAL LIGATURE OE, are called LIGATUREs in their character names but are unambiguously single Unicode characters (e.g. they have no decomposition even for compatibility). (These are in addition to the characters named LIGATURE in the Alphabetic Presentation Forms block, which mostly have compatibility decompositions.)
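The character properties mentioned in the last two paragraphs can be checked against the Unicode Character Database. Here is a minimal sketch using Python's standard unicodedata module. The helper name describe is mine, and the results depend on the Unicode version bundled with the interpreter, so treat it as an illustration rather than as authoritative:

```python
import unicodedata

def describe(ch):
    """Return (character name, decomposition field) from the UCD."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)

VAV_HOLAM = "\uFB4B"  # HEBREW LETTER VAV WITH HOLAM (presentation form)
OE = "\u0152"         # LATIN CAPITAL LIGATURE OE
FI = "\uFB01"         # LATIN SMALL LIGATURE FI (presentation form)

for ch in (VAV_HOLAM, OE, FI):
    name, decomp = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition = {decomp or '(none)'}")

# U+FB4B has a canonical decomposition and is excluded from composition,
# so every normalisation form yields the two-character sequence.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")
vav_holam_forms = {f: unicodedata.normalize(f, VAV_HOLAM) for f in FORMS}
for form, value in vav_holam_forms.items():
    print(form, [f"U+{ord(c):04X}" for c in value])
```

The point of the comparison is that three characters which all look like "ligatures" in some sense behave quite differently: one disappears under any normalisation, one has only a compatibility decomposition, and one has no decomposition at all.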

The Unicode definition in the TUS glossary (http://www.unicode.org/versions/Unicode4.0.0/b1.pdf) seems ambiguous. Here it is:

Ligature. A glyph representing a combination of two or more characters. In the Latin script,
there are only a few in modern use, such as the ligatures between “f” and “i” (= fi) or “f”
and “l” (= fl). Other scripts make use of many ligatures, depending on the font and style.


The first sentence would seem to confirm John Hudson's definition, for a "glyph" is defined in terms of rendering engine implementation rather than graphical identity or continuity. But the comment that there are only a few ligatures in modern use in Latin script seems to restrict the concept to certain graphical forms without making a proper definition.

So the uncertain point is, what exactly are the "ligatures" whose formation ZWNJ should inhibit? Are they the technical ligatures as understood by John Hudson, or are they the undefined formal ligatures or conjoined forms?

Which brings me back to the specific debate over the Holam proposals: Is it a proper use of ZWNJ to block the mapping of the character sequence <VAV, HOLAM> onto the glyph for the alphabetic presentation form U+FB4B HEBREW LETTER VAV WITH HOLAM, so that the HOLAM dot is positioned in its regular top left position relative to the base character, rather than the irregular (top centre or top right) place in the alphabetic presentation form?
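To make the question concrete, here is a toy model of "ligature" in the purely technical sense: a table that maps a character sequence to a single glyph, with ZWNJ emitting no glyph of its own but preventing a match across it. This is only a sketch. The table, the glyph names and the function shape are hypothetical, it assumes the ZWNJ sits between the base character and the combining mark, and it makes no claim about how OpenType or Uniscribe actually work:

```python
import unicodedata

VAV = "\u05D5"    # HEBREW LETTER VAV
HOLAM = "\u05B9"  # HEBREW POINT HOLAM
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER

# Hypothetical substitution table: character pair -> single glyph name.
LIGATURES = {(VAV, HOLAM): "vav_holam.FB4B"}

def shape(text):
    """Toy shaper: return the list of glyph names for text.

    A pair listed in LIGATURES becomes one glyph. ZWNJ produces no glyph,
    but because it sits between the two characters the pair never matches.
    """
    glyphs = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            glyphs.append(LIGATURES[pair])
            i += 2
        elif text[i] == ZWNJ:
            i += 1  # invisible, but it has already blocked the pair match
        else:
            glyphs.append(f"u{ord(text[i]):04X}")
            i += 1
    return glyphs

plain = VAV + HOLAM
blocked = VAV + ZWNJ + HOLAM

print(shape(plain))    # one glyph: the presentation form
print(shape(blocked))  # two glyphs: base plus separately positioned mark

# The distinction is plain text: normalisation does not remove the ZWNJ.
blocked_survives = all(
    unicodedata.normalize(form, blocked) == blocked
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(blocked_survives)
```

On John Hudson's definition the first result is a ligature and the second is not, whatever the dot looks like on the page. On the formal definition neither is obviously a ligature, which is exactly the uncertainty.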



Another argument against our proposal is that by defining
ZWNJ as breaking a ligature I am specifying implementation.



This is a dubious argument. Unicode specifies encodings. When two different "meanings" are identified, different encodings are requested, so it is a task for Unicode.

OTOH, if there is no underlying difference and the matter is purely of
presentation (like the aspect of a, like a reversed e or like a o with left
stem), then Unicode is not to be involved.

I know the border is fuzzy. ;-) or :-(.

Here, what should be examined is whether the fact that it ligates or not does
mean something (and this is the hard part of the demonstration). How it is
implemented is largely irrelevant (in fact, it is relevant when the result is
*not* implementable!)



There is a separate issue of whether it is proper to use ZWNJ or ZWJ for a semantically significant distinction. It is arguable whether the Holam distinction is actually semantic, although it does need to be made in plain text for proper exact typography. But then there are other distinctions made by ZWNJ e.g. in Persian which are certainly semantically significant.


My proposal was criticised at one point for restricting how something could be implemented. I had demonstrated that there was at least one feasible implementation strategy, i.e. that the proposal is *not* something *not* implementable. Is it really necessary to demonstrate that there is more than one feasible strategy so that implementers have a choice? In any case, the restriction to one strategy was not imposed by the proposal or by TUS, but by the rendering system (OpenType) and particular implementations of it, which had the effect of restricting the font implementer's options.


OTOH, regarding your problem, I should point out that the Bengali's precedent is anything but something that should be taken as example: it appears to me as an ad-hoc solution built in a hurry, that happened to fit well with certain technical implementations; it is a nightmare to handle for others; and now there is on the table a proposal, PR-37, which among other things will (try to) remove this hack and replace it with another, more orthogonal (using ZWJ).



Thanks for your advice about PR-37. I realised after including this example in the draft Holam proposal that it is in fact controversial. However, it seems that the controversy is over whether to use ZWJ or ZWNJ; the principle seems to be accepted that one or other may be used, and in this position between a base character and a combining mark. The UTC obviously needs to decide this issue once and for all, and then implementers will need to adjust their implementations to fit. Any adjustments are likely to make things easier also for implementation of my Holam proposal.

No one, as far as I know, has proposed a resolution of the Bengali ligature issue by defining a new Unicode character. Why not? Presumably because this would be a breach of the character/glyph model. Very similar principles apply to the Holam case. Use of ZWNJ has been proposed because it seems to fit Unicode definitions better. But I would not object if the UTC preferred a representation with ZWJ for continued compatibility with the Bengali case, especially if this solves actual implementation difficulties. My objection to a new character solution is basically that it breaks the character/glyph model by defining a new character for what is no more than a glyph variant.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




