> I have been doing a little research into the defined properties of CGJ.
> I note also that according to
> http://www.unicode.org/book/preview/ch03.pdf it is defined in Unicode
> 4.0 as a "Default Ignorable". Well, I am not surprised that some people
> are confused ...
Yes, I'm not surprised, either, because the whole philosophical area of character "nothingness" is fraught with difficulties. Particularly with Unicode, which has introduced many more kinds of characters which aren't really there, or characters which disappear when you look at them in a mirror ;-), it is rather complex. Consider all the following categories of "nothingness":

  ISO Control (gc=Cc)
  Unicode Format Control (gc=Cf)
  Layout Control (gc=Cf, Zl, Zp, some Cc, and arguably, spaces)
  Space (gc=Zs)
  White_Space
  Blank (of glyph)
  Placeholder (e.g. U+FFFC OBJECT REPLACEMENT CHARACTER)
  Default_Ignorable_Code_Point

They don't all define the same classes, and they sometimes overlap in funny ways.

> According to this,
> "Default ignorable code points are those that should be ignored by
> default in rendering (unless explicitly supported)... An implementation
> should ignore default ignorable characters in rendering whenever it does
> /not/ support the characters." So my suggestion that a renderer should
> simply ignore CGJ is far from twisting the requirements of Unicode, it
> is in fact a requirement of Unicode 4.0, though one that I am hardly
> surprised that some people have missed.

Here is the wording from Unicode 4.0:

====================================================================
Default ignorable code points are those that should be ignored by
default in rendering unless explicitly supported. They have no visible
glyph or advance width in and of themselves, although they may affect
the display, positioning, or adornment of adjacent or surrounding
characters. ...

An implementation should ignore default ignorable characters in
rendering whenever it does *not* support the characters. ...

With default ignorable characters, such as U+200D ZERO WIDTH JOINER,
the situation is different [from the normal case where an unsupported
character would be displayed with a black box, for example].
If the program does not support that character, the best practice is
to ignore it completely without displaying a last-resort glyph or a
visible box, because the normal display of the character is invisible:
its effects are on other characters. Because the character is not
supported, those effects cannot be shown. -- TUS 4.0, p. 142.
=====================================================================

This wording was, of course, written with such format controls as ZWJ and ZWNJ in mind, which *do* have formatting effects on adjacent characters. But the CGJ is also given the Default_Ignorable_Code_Point property. In fact, in order to get that (derived) property, it has to be *explicitly* given the Other_Default_Ignorable_Code_Point property in PropList.txt, since it (along with the variation selectors) is gc=Mn (a non-spacing combining mark), and those aren't automatically defined to be default ignorable.

Where the CGJ differs from the format controls (and the variation selectors, for that matter) is that it is defined to have *no* formatting effect on neighboring characters. So even if you don't formally support it, you know that it shouldn't be having any effect on the formatting of neighboring characters. However, making it default ignorable is the right thing to do, because it is itself always invisible in display. (Unless you are doing a Show Hidden display, of course.)

> The internal process by which a particular renderer implements ignoring
> a glyph is a matter for a particular implementation. John Hudson and I
> have suggested a mechanism for doing this with Uniscribe by treating the
> character internally as a normal character with a blank glyph and always
> ligating it with the preceding character. There may be other mechanisms
> which are cleaner.
> But in any case it seems to be a requirement not just
> for fixing this Hebrew problem but for conformance with Unicode as a
> whole that some such mechanism is implemented, so that CGJ is ignored by
> the renderer unless some specific behaviour is defined.

Correct. And the difficulty seems to be in the interpretation of what "ignored by the renderer" means and what obligations it places on implementations.

If "ignored by the renderer" is taken to mean swallowed internally in the script logic and never presented to the actual glyph display mechanism (i.e., never "paint" it), then we run into the trouble that John Hudson has been talking about for use of format controls.

But if "ignored by the renderer" is taken to mean doing no processing in the script logic and instead just presenting it blindly to the actual glyph display mechanism, where the fonts then deal with its default ignorable status by rendering it with a non-advancing, blank glyph rather than the missing-glyph box, then we are in a position to have both the text processing requirements and the display requirements for Biblical Hebrew neatly met.

And the bonus is this: any other case of mismatch between required distinctions for ordering of combining marks for any script, where normalization of the text would result in collapse of distinctions or unexpected order, can *also* be dealt with by the same use of CGJ. No special cases are required, no new characters are required, and no change of any properties is required.

> In the case of
> rendering Hebrew, there seems to be no pressing need to define specific
> behaviour as the default is at least close to what is required.

Exactly. And frankly, I am finding it difficult to understand why people are characterizing the CGJ proposal as a kludge or an ugly hack. It strikes me as a rather elegant way of resolving the problem -- using existing encoded characters and existing defined behavior.
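To make the second reading of "ignored by the renderer" concrete, here is a minimal sketch of the glyph-fallback decision in Python. The function name and the hardcoded set are my own illustrative choices (Python's standard library does not expose the Default_Ignorable_Code_Point property, so only a small sample of such code points is listed); it is not anyone's actual rendering engine, just the decision rule described above.

```python
# Illustrative subset of Default_Ignorable_Code_Point characters.
# (Hardcoded here because the Python stdlib does not expose this property;
# a real implementation would consult the full derived property data.)
DEFAULT_IGNORABLE_SAMPLE = {
    0x034F,  # COMBINING GRAPHEME JOINER
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x2060,  # WORD JOINER
}

def display_fallback(cp: int, supported: bool) -> str:
    """Decide how a code point should be shown by the glyph display layer.

    A supported character gets its normal glyph.  An unsupported
    default ignorable character is rendered as a non-advancing blank
    ("invisible"), per TUS 4.0.  Any other unsupported character gets
    the last-resort (missing glyph) box.
    """
    if supported:
        return "glyph"
    if cp in DEFAULT_IGNORABLE_SAMPLE:
        return "invisible"
    return "last-resort box"
```

On this model the script logic never needs a special case for CGJ at all: the character flows through to the font, and the font's default-ignorable handling keeps it blank and zero-width.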
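The "bonus" about blocking normalization can be demonstrated directly with Python's `unicodedata` module. The sketch below uses lamed with qamats and hiriq as a stand-in for the Biblical Hebrew ordering problem: canonical ordering under NFC sorts adjacent marks by combining class, collapsing the two orderings, but an intervening CGJ (combining class 0) blocks the reordering.

```python
import unicodedata

LAMED  = "\u05DC"  # HEBREW LETTER LAMED
QAMATS = "\u05B8"  # HEBREW POINT QAMATS, ccc=18
HIRIQ  = "\u05B4"  # HEBREW POINT HIRIQ,  ccc=14
CGJ    = "\u034F"  # COMBINING GRAPHEME JOINER, ccc=0, gc=Mn

# The property values the discussion above relies on:
assert unicodedata.combining(QAMATS) == 18
assert unicodedata.combining(HIRIQ) == 14
assert unicodedata.combining(CGJ) == 0
assert unicodedata.category(CGJ) == "Mn"

# Without CGJ, canonical ordering reorders the marks into ascending
# combining-class order, destroying the encoded mark order.
plain = LAMED + QAMATS + HIRIQ
assert unicodedata.normalize("NFC", plain) == LAMED + HIRIQ + QAMATS

# With CGJ between them, the ccc=0 character blocks reordering:
# the sequence survives both NFC and NFD unchanged.
fixed = LAMED + QAMATS + CGJ + HIRIQ
assert unicodedata.normalize("NFC", fixed) == fixed
assert unicodedata.normalize("NFD", fixed) == fixed
```

The same trick works for any script where normalization would collapse a required distinction in mark order, which is exactly why no special cases or new characters are needed.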
And as Peter Kirk pointed out, in the main Unicode electronic corpus in question, the *data* fix involved for this is insertion of CGJ in 367 instances of Yerushala(y)im plus a smattering of other places. That is *way* less disruptive than the proposal to replace all of the Hebrew points with cloned code points. It is *way* *way* *way* less disruptive than the impact of destabilizing normalization by trying to change the combining classes. And it is far more elegant than trying to catalog and encode Hebrew point combinations as separate characters.

--Ken