Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter responded to Mark: On 05/08/2003 14:40, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). Mark __ http://www.macchiato.com ► “Eppur si muove” ◄ I got this from the Unicode Standard 4.0, as quoted by Jim Allan: *Mis*quoted by Jim Allan. In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs: Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. That piece of text is *NOT* a quotation from Chapter 3 of Unicode 4.0. Go to that URL and search for it yourself. It is quoted from Chapter 4 of Unicode *3.0*, p. 88, in the discussion of General Category in Section 4.5, General Category -- Normative in Part. The corresponding paragraph has been deleted from the relevant section in Unicode 4.0, precisely because the standard now precisely defines format control characters as {Cf, Zl, Zp} but *ex*cluding Zs. See p. 25 in: http://www.unicode.org/book/preview/ch02.pdf So the various space characters (class Zs) are also classified as format characters. From http://www.unicode.org/book/ch04.pdf: _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character. Accordingly, by definition, spaces are not base characters. This conclusion is false. 
As Mark indicated, SPACE (and NBSP) are base characters, and have been treated as such in terms of diacritic application since Unicode 1.0 was published: By convention, diacritical marks used by the Unicode encoding scheme may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This might be done, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. -- Unicode 1.0, p. 19 [1991] And that *is* an accurate quote from the standard. In Unicode 4.0 that text survives as: By convention, diacritical marks used by the Unicode Standard may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This tactic might be employed, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. -- Unicode 4.0, p. 46 [2003] I'd say the intent of the UTC and the Unicode Standard in this regard has always been rather clear and has stayed unchanged for quite some time. --Ken
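The convention quoted above can be tried out directly. The following is a minimal Python sketch using only the standard `unicodedata` module. The choice of U+0301 COMBINING ACUTE ACCENT is an illustrative assumption, and the property values reported are those of the Unicode version bundled with the Python interpreter, which is not necessarily Unicode 4.0.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, chosen only as an example mark

# General Category: both spaces are Zs (space separator), the mark is Mn.
for ch in (SPACE, NBSP, ACUTE):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# Exhibit the mark in (apparent) isolation by applying it to a space.
isolated = SPACE + ACUTE
isolated_nobreak = NBSP + ACUTE

# Canonical composition leaves the pair alone: no precomposed character is
# canonically equivalent to SPACE + COMBINING ACUTE ACCENT.
unchanged_by_nfc = unicodedata.normalize("NFC", isolated) == isolated
print(unchanged_by_nfc)
```

How wide the resulting sequence is drawn is not something this code can show; that is left to the font and the rendering engine.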
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote: Point taken. But when different fonts and rendering engines give different results because the standard is unclear or ambiguous, that is a matter for the discussion here. And when conforming fonts and rendering engines fail to give the required results, that may also be because of a deficiency in the standard. Or it may not. It may be a deficiency in the level of Unicode support afforded by the fonts and rendering engines. It may simply reflect a difference between your requirements and what the standard promises, and doesn't promise. It seems that many rendering engines give to the sequence space, combining mark the width normally assigned to a space. Is this actually what the standard suggests? The standard doesn't say anything about width in this case. It leaves it up to the display engine, which is as it should be. I have identified a need to display combining marks with no extra width, only the width required by the mark. Should the sequence space, combining mark do what I want, or shouldn't it? If so, this needs to be spelled out so that rendering engines know what they are supposed to do. If not, there may be a need for a new character. This is a deficiency in the standard, not in the rendering engines. When the specific alignment of isolated glyphs is important to me, I use markup. I'm a big supporter of plain text, as many members of this list know, but the exact spacing of isolated combining marks seems like a layout issue to me. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
there is no such thing as NFD decompositions.

Sorry for the confusion. Still even with a NFKD decomposition,

And there is no such thing as NFKD decomposition either. It goes as follows, in steps:

1. Canonical and compatibility decomposition mappings (one-step), and canonical classes.
2. Canonical and compatibility full/recursive decompositions and canonical reordering. The compatibility (full) decompositions make use of both the canonical and compatibility decomposition mappings.
3. Canonical and compatibility equivalences.
4. The four Unicode normal forms (NFD, NFC, NFKD, and NFKC).

Please don't turn it upside down, that's only confusing! OK, the formal definitions of equivalences and normal forms are a bit backwards in the Unicode Standard, which defines NFD (in practice, though not by name) before the equivalences. Normally, a normal form is defined as a particular representative element in an equivalence class... But there is no need to aggravate the backwardness into cyclicity.

... It's true that not all (only most) combining non-spacing characters have a non-combining spacing counterpart.

Only a *few* g.c. Mn characters have spacing counterparts!

/kent k
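The difference between one-step decomposition mappings and the normal forms built on top of them can be illustrated with a short Python sketch. It uses the standard `unicodedata` module; the three sample characters are assumptions picked for illustration, and the data comes from the Unicode version shipped with the interpreter.

```python
import unicodedata

DIAERESIS = "\u00A8"  # DIAERESIS, a spacing mark
E_ACUTE = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
S_DOTS = "\u1E69"     # LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE

# Step 1: one-step decomposition mappings, straight from the character data.
print(unicodedata.decomposition(DIAERESIS))  # tagged <compat>: compatibility mapping
print(unicodedata.decomposition(E_ACUTE))    # untagged: canonical mapping
print(unicodedata.decomposition(S_DOTS))     # one step only; the result decomposes further

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

# Steps 2 to 4: the normal forms apply the mappings recursively and reorder marks.
nfd = unicodedata.normalize("NFD", DIAERESIS)    # canonical mappings only
nfkd = unicodedata.normalize("NFKD", DIAERESIS)  # canonical and compatibility mappings
print(code_points(nfd))
print(code_points(nfkd))
print(code_points(unicodedata.normalize("NFD", S_DOTS)))
```

The spacing diaeresis is left untouched by NFD and becomes SPACE plus COMBINING DIAERESIS only under NFKD, which is the point being made above about compatibility versus canonical decompositions.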
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
According to the docs at http://www.microsoft.com/typography/otfntdev/indicot/other.htm, Uniscribe renders combining marks in isolation when they are applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted circle.) Perhaps this is an acceptable solution for the people calling for a new character. Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle. Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of the Unicode Standard). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign. Noah
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
on 2003-08-06 15:24 Doug Ewell wrote: I'm not a typographer (intelligent or otherwise), but I'm having a tough time seeing how Section 2.10 *requires* fonts and rendering engines to give a space-plus-combining-diacritic combination the exact minimum width of the diacritic alone, or to leave equal space before and after such a combination. All I think it is saying is that, for example, the combination i-plus-tilde may be wider than i alone, because tilde is wider than i. Considering that one approach is to use OpenType to map a letter plus diacritic to a single glyph, an obvious solution would be to include space + diacritic combos in that table. An important font issue, but a font issue nonetheless. -- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Sunday, August 10, 2003 9:30 AM, Mark Davis [EMAIL PROTECTED] wrote: As for oe-ligature, the French representative to WG3 (or its predecessor) said that France could live without it. Even worse; the story I heard was that the committee had planned from the start to have Œ and œ in positions D7 and F7, but that late in the process the representative from France objected, so they replaced them by × and ÷. That would certainly explain why these symbols are in the middle of a batch of letters... It's true that in French these are really ligatures, and not plain letters, meaning that this is mostly a standard typographic convention rather than an orthographic one. The national body, AFNOR, may have opted for this solution thinking that these holes would benefit other languages commonly used in Europe, and there were probably other candidate characters that finally got encoded in a separate ISO-8859-* set. I don't know which compromise was reached, but the original DEC VT set also had holes at those positions. It's just strange that the ISO working group opted for those two characters at D7 and F7, when there could have been a pair of characters coded for Finnish, or Catalan (like the dotted L, which is still coded with a separate middle dot symbol instead of a true diacritic, and which renders quite poorly with ISO-8859-1 and even with Windows 1252). Well, French and Catalan writers have lived with those encoded sequences, and fixed the rendering using ligating rules in their renderers or fonts (or used the oe/OE ligatures in Windows 1252). I just suspect that the French objection to oe/OE was related to the fear of modifying keyboards that had been created based on the French version of ISO646, where such a ligature could not be coded.
Since then, the AFNOR version of ISO646-FR has been simplified to remove the tricky combining sequences built with BACKSPACE, like C+BACKSPACE+COMMA to code a C WITH CEDILLA, as they were no longer necessary with a more universally used 8-bit set (7-bit sets have survived only within Teletex/Videotex standards, built in accordance with ISO646 with SS2 sequences to encode non-spacing diacritics *before* the base character with which they combine, to match the keyboard input order based on dead keys for combining diacritics, and this 7-bit set is probably the only one remaining in large use today for French, with ISO646-FR now nearly extinct in favor of ISO646-US/ASCII). -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
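The code positions discussed in this message can be inspected with Python's built-in codecs. This is only a sketch of the end result as standardized (the `latin-1` and `cp1252` codec names are Python's, and the variable names are illustrative); it says nothing about how the committee arrived there.

```python
import unicodedata

# ISO-8859-1 (Latin-1) positions D7 and F7 as finally standardized.
latin1_pair = bytes([0xD7, 0xF7]).decode("latin-1")
for ch in latin1_pair:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Windows-1252 places the oe ligatures at 0x8C and 0x9C instead.
cp1252_pair = bytes([0x8C, 0x9C]).decode("cp1252")
for ch in cp1252_pair:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The ligature cannot be encoded in Latin-1 at all.
try:
    "\u0153".encode("latin-1")
    oe_in_latin1 = True
except UnicodeEncodeError:
    oe_in_latin1 = False
print(oe_in_latin1)

# Catalan "l·l" is spelled with a separate MIDDLE DOT (0xB7) in Latin-1.
catalan_bytes = "l\u00B7l".encode("latin-1")
print(catalan_bytes)
```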
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 14:40, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). Mark __ http://www.macchiato.com Eppur si muove I got this from the Unicode Standard 4.0, as quoted by Jim Allan: In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs: Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. So the various space characters (class Zs) are also classified as format characters. From http://www.unicode.org/book/ch04.pdf: _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character. Accordingly, by definition, spaces are not base characters. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 08/08/2003 09:54, Jim Allan wrote: ... It certainly makes sense that in the case of space characters that have a defined width, this width is innate to the definition of the character and in such a case should take precedence over the width of the normally non-spacing combining character. I would welcome clear instructions by Unicode on this point, where either result would be useful, in order that applications may be expected to produce results that are consistent with each other. :-) Agreed! I would think it would be consistent with Unicode for an application to shrink the width of a normal space followed by a diacritic such as a single overdot, as exact formatting behavior is not defined in such cases. Well, is a space followed by a diacritic actually a space, or is it the same code point reused or overloaded "by convention" (to quote the standard) for a logically distinct purpose? Some of the discussions here have implied the latter. Either way, the best clarification would be to add a character whose explicit function is to form non-spacing variants of diacritics. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Ted Hopp asked: I believe that reasonable people might reasonably conclude from factoids 1 and 2 that SPACE is indeed a format character. Reasonable, but evidently wrong. Explanation, please? I provided the text deconstruction in my last email, but to continue, the confusion arises from the strange nature of SPACE in the history of character encoding. SPACE, for a long time now in the history of character encodings, has been classified as a *graphic* character. Certainly, in the general SC2 character encoding context of ISO 2022, SPACE always shows up in the G0 set, with other graphic characters, instead of in the various control functions encoded in C0 or C1 sets. But looked at from the legacy of device control, SPACE could just as well been categorized as a control function: MOVE PRINT HEAD ONE UNIT RIGHT, comparable to BACKSPACE. And in the context of the Unicode Standard, people often loosely talk about space characters as being format characters, since they are a) more akin to punctuation than normal letters, b) have no glyph associated with them, and c) impact line-breaking and other aspects of the formatting of characters in their vicinity. But the *formal* categorization of Unicode characters, defined by the UTC to help eliminate this kind of ambiguity in talk about the character types, is spelled out in Figure 2.5 of Unicode 4.0 now: http://www.unicode.org/book/preview/ch02.pdf and the *formal* meaning of format control character (Basic type = Format) in Unicode is now any character with the General Category of {Cf, Zl, Zp}. The space characters are all lumped in with graphic characters. So while there are still some ambiguities to be worked out in the definition of base character in the Unicode Standard, neither the status of SPACE as a graphic character nor the recommendation of the standard that non-spacing marks be applied to SPACE as a means of showing them in isolation is in question. --Ken
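The formal rule stated above, that a format control character is any character whose General Category is in {Cf, Zl, Zp}, can be written down as a small predicate. This is a sketch under that stated rule only; the function name `is_format_control` and the sample characters are hypothetical choices, and the categories come from the Unicode version bundled with Python rather than from Unicode 4.0 itself.

```python
import unicodedata

# The set given in the message above for "Basic type = Format".
FORMAT_CATEGORIES = {"Cf", "Zl", "Zp"}

def is_format_control(ch):
    """True if ch falls under the {Cf, Zl, Zp} rule quoted above."""
    return unicodedata.category(ch) in FORMAT_CATEGORIES

samples = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "ZERO WIDTH JOINER": "\u200D",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
}

for label, ch in samples.items():
    print(f"{label}: {unicodedata.category(ch)}, format control: {is_format_control(ch)}")
```

Under this rule SPACE and NBSP (both Zs) come out as graphic rather than format characters, which matches the conclusion of the message.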
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 09:42, Jim Allan wrote: Peter Kirk posted: If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does? I think that the practice of a font or application automatically inserting a dotted circle under an orphaned combining character is dubiously compliant with Unicode specifications. ... Thanks, Jim, for all this data, but now I am totally confused. Well, at least it seems clear that if I want a dotted circle I should explicitly encode it. But if I don't... Suppose for example I want to write a sentence like "In this language the diacritic ^ may appear above the letters ...", but instead of ^ I want to use a combining character, a diacritic regularly positioned centred above the letter, which does not have a defined spacing variant. I don't want a dotted circle. And I want it to be spaced as here, i.e. with one space before the diacritic and one after it. It seems to me that at one place in the standard I am told to encode space - combining mark - space, for the combining mark will not combine with the space because the space is not a base character; and in another place I am implicitly told to encode space - space - combining mark - space, because the second space acts as a carrier for the combining mark. I hope that wanting to display this correctly is not another place where I have "stepped over the boundaries of what is reasonable to expect plain text to convey", but that this too can be "grist for the Unicode 5.0 mill to grind very finely" - both quotes from Ken Whistler earlier today. And I think that if this issue is clarified it will also become clear what should be done about string initial holam and alef etc.
Perhaps a simple way ahead would be to define a new character something like COMBINING MARK HOLDER with no glyph, which is defined specifically for this purpose, is a base character and not a format character, and is expected to be just as wide as is necessary to display the combining mark. Then we could say that a spacing accent is equivalent (possibly even canonically if made a composition exclusion?) to COMBINING MARK HOLDER plus a non-spacing accent, and remove the misleading compatibility equivalences to SPACE plus a non-spacing accent. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
The NFD decompositions of spacing marks are already defined as a SPACE plus a non-spacing combining character. Philippe, please! Those are *compatibility* decompositions. The normal form NFD only uses *canonical* decompositions. And there is no such thing as NFD decompositions. /kent k
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote: OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it? In plain text, I think that this encoding: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... is what you need, as it creates the following combining sequences: ...[endOfWord1], [SPACE], [SPACE, diacritic], [SPACE], [startOfWord2]... If you don't want any space around the diacritic, which must be displayed isolated but in the middle of a word, the following would work: ...endOfWord1, SPACE, diacritic, startOfWord2... Here the SPACE is not a break opportunity, but just the base character for the inserted diacritic. What is missing in the standard is a definition of the properties of such a SPACE+diacritic sequence: normally a combining sequence inherits the properties of its base character, and the properties of the diacritics are ignored. But when a SPACE or NBSP is used as the base character, new properties may be needed. If there's still a break opportunity on the base SPACE of a combining sequence, it is not clear where the break occurs: before the SPACE (i.e. before the combining sequence), or after the diacritic (i.e. after the combining sequence)? I think that the second option applies here, i.e. the base SPACE would create a break opportunity at the end of the whole combining sequence made of a SPACE and the following combining characters (including CGJ if needed to fix canonical ordering). Another similar case would be the use of an isolated nukta (which normally modifies a following base character): the sequence nukta, SPACE is a single combining sequence with a break opportunity. So a sequence like nukta, SPACE, acute accent would be unbreakable but would include a break opportunity at its end, unless it is followed by a NBSP. And the sequence nukta, NBSP, acute accent would also be unbreakable, both in the middle and at both ends. -- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
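The grouping into combining sequences described in this message can be sketched in a few lines of Python. This is a deliberately simplified model (each character with a General Category starting with M is attached to whatever precedes it); it is not the grapheme cluster algorithm of UAX #29, and the function name and the sample circumflex U+0302 are assumptions made for illustration.

```python
import unicodedata

def combining_sequences(text):
    """Split text into base characters, each followed by its combining marks.

    Simplified model: any character whose General Category starts with "M"
    is appended to the preceding sequence. A mark at the very start of the
    text has no base and becomes a sequence on its own.
    """
    sequences = []
    for ch in text:
        if sequences and unicodedata.category(ch).startswith("M"):
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT, as an example diacritic

# endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2
spaced = combining_sequences("a  " + CIRCUMFLEX + " b")
print([[f"U+{ord(c):04X}" for c in seq] for seq in spaced])

# endOfWord1, SPACE, diacritic, startOfWord2
unspaced = combining_sequences("a " + CIRCUMFLEX + "b")
print([[f"U+{ord(c):04X}" for c in seq] for seq in unspaced])
```

In the first case the second SPACE carries the diacritic, leaving one plain SPACE on each side. Where line breaks may fall around such a sequence is a separate question that this sketch does not address.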
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
I would like to point out that with all due respect, how particular fonts or rendering engines behave is only marginally relevant to the Unicode list. I think that we should deal only with the Unicode specification. A particular implementation or many implementations may not behave as expected, and then may be either conformant or non-conformant, or may behave as expected and still be either conformant or non-conformant. Messages such as the attached help the discussion of the specification only as illustrations and as a basis for discussing conformity. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk Sent: Wednesday, August 06, 2003 12:11 PM To: Curtis Clark Cc: Unicode List Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) On 05/08/2003 16:59, Curtis Clark wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. Thank you. Well, this sort of works. I looked in various fonts. In some of them the diacritic is centred in the space between the words diacritic and may, but in others it is offset to the left or the right. The problem is that the space is wider than the diacritic, which confuses things, and all the more so no doubt if it expands for justification. NBSP would probably be a better choice in that it is less likely to expand. But what I am looking for is a diacritic holder which is defined to be only as wide as the diacritic. 
On the principle that base characters expand to fit the width of the diacritic, ZWSP or, better, a real (rather than misnamed) zero width no break space would seem to have the right properties for that. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Philippe Verdy posted: Could ZWSP + combining diacritic be the best solution for isolated diacritics in text? From http://www.unicode.org/book/ch04.pdf: * Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an i may affect spacing (î), as might the character U+20DD COMBINING ENCLOSING CIRCLE. Unless Unicode 4.0 has changed this, the words may and might here would indicate that ZWSP is not *necessarily* the best solution. There is no specification of what an application *must* do to be conforming in this circumstance, merely an indication that an application that does expand spacing for the sake of appearance is not non-conforming. It is *probably* implied that this is the right way to go. But I would guess that it would also be conforming for an application not to expand spacing at all on ZWSP, so that coding _o_ + ZWSP + COMBINING CIRCUMFLEX + _o_ would place the circumflex centered over _oo_ with its center point between the two letters. Either result would be useful for different purposes. It certainly makes sense that in the case of space characters that have a defined width, this width is innate to the definition of the character and in such a case should take precedence over the width of the normally non-spacing combining character. I would welcome clear instructions by Unicode on this point, where either result would be useful, in order that applications may be expected to produce results that are consistent with each other. :-) I would think it would be consistent with Unicode for an application to shrink the width of a normal space followed by a diacritic such as a single overdot, as exact formatting behavior is not defined in such cases. Jim Allan
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Thursday, August 07, 2003 8:06 PM, Peter Kirk [EMAIL PROTECTED] wrote: On 06/08/2003 15:47, Philippe Verdy wrote: On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote: OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it? In plain text, I think that this encoding: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... is what you need, as it creates the following combining sequences: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... Thank you, Philippe. This is where we started. But I noted that some current implementations render the space diacritic combination as a full width space with the diacritic not centred over it. I suggested that this was wrong, that the diacritic should be centred. Doug suggested I used markup outside the scope of Unicode. ... Another similar case would be the use of a isolated nukta (which normally modifies a following base character): the sequence nukta, SPACE is a single combining sequence with a break opportunity. So a sequence like nukta, SPACE, acute accent would be unbreakable but would include a break opportunity at its end, unless it is followed by a NBSP. And the sequence nukta, NBSP, acute accent would also be unbreakable either in the middle or on both ends. Tell me more about these nuktas which modify a FOLLOWING base character. This is just what I have been told is illegal, non-conformant or something. But if this is allowed for nuktas, why shouldn't it be allowed for Hebrew holam? Sorry, I should have checked my code to see which character exactly has a combining feature with the following base character. 
In fact there's already a special treatment for nukta, which gets internally swapped in front of its base character for glyph processing, and this was a source of confusion for me (yes, nuktas have CC=7 and combine with the previous base character, but only in the standard Unicode encoding sequence, not in all legacy code pages, and not in some other text processing that puts them in front). In fact, I may have been talking about the Candrabindu, which is combining with CC=230 (above?), except in the Devanagari, Bengali, Gujarati, and Oriya scripts where it is combining but classed like a base character (CC=0), and in Telugu and Gurmukhi (Adak Bindi) where it is Mc instead of Mn and is not combining. This reflects a different usage of the Candrabindu in ISCII, and this is a source of difficulty when transcoding from ISCII to Unicode... And I'm not sure if the CC=230 for the Tibetan Candrabindu is really accurate given its specific combining model. The treatment of Anusvara (or Tibetan JeSuNgaRo or Gurmukhi Bindi or Sinhala Anusvaraya) as a combining character with CC=0 is also script specific, as it is either Mc or Mn. The same thing may be said about Visarga signs (or Sinhala Visargaya). Such special treatment is not needed for the Viramas (CC=9), as they more or less behave like a standard vowel sign, i.e. a regular diacritic. Indian scripts have a lot of legacy text resources coded in ISCII, whose unified model Unicode treats more or less specially, but with its own difficulties (we can ignore the ISCII font controls, or we can handle the other ISCII control signs as ISO 2022 does, with script-switch controls, which are not encoded in Unicode). Despite what the Unicode reference documents in the specific chapter for Brahmic scripts, there's little help here to avoid the confusions, notably because the same chapter covers scripts that have been encoded with distinct character models (notably Thai and Lao).
For now the current text in Unicode 3 does not seem very helpful in disambiguating things, and I hope that this chapter about Indic scripts will be greatly enhanced to cover the actual usages, and that Thai and Lao will be discussed separately from other Indic scripts. For now, I think that the ISCII or TIS620 standards are much more precise and helpful than the Unicode reference for the scripts they cover in a different way, with lots of conversion caveats not explained (at first read this chapter seems to make prominent reference to ISCII and TIS620, but there are some quirks where both references seem to contradict the actual usage of combining sequences, for which new Unicode properties should be added and specified, even if combining classes cannot be changed for stability reasons, nor can the normalized forms that are considered canonically equivalent, or distinct, when in reality they combine the same way and one form is considered normal and the others are non-standard or defective according to the original ISCII or TIS620 standard). -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
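The General Category and canonical combining class values mentioned in this message can be looked up rather than recalled from memory. The sketch below uses Python's standard `unicodedata` module; the list of code points and the helper name `describe` are illustrative assumptions, and the values printed are those of the Unicode version bundled with the interpreter, which may differ from the Unicode 3.x data under discussion and from some of the claims made above.

```python
import unicodedata

# A few of the signs discussed above (illustrative selection).
CODE_POINTS = [
    0x093C,  # Devanagari nukta
    0x094D,  # Devanagari virama
    0x0901,  # Devanagari candrabindu
    0x0902,  # Devanagari anusvara
    0x0903,  # Devanagari visarga
    0x0310,  # generic combining candrabindu
]

def describe(cp):
    """Return (code point, name, General Category, combining class) for cp."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

for row in map(describe, CODE_POINTS):
    print(*row, sep="\t")
```

Other code points, such as the Telugu, Gurmukhi, or Tibetan signs named in the message, can be added to `CODE_POINTS` to check each claim individually.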
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Moreover, as I wrote before, the wording in that one paragraph in 3.0 is not clearly stated, but it is clear from a reading of the rest of the standard -- with numerous examples -- and from the UCD 3.0 properties, that space *is not* a format character, and *is* a suitable base for combining marks. So the little coy remark below is not warranted with respect to combining marks on space. OK, understood now. As the previous version is obsolete, and the new one is unavailable, we can all take a break from conforming to Unicode at Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, August 06, 2003 15:48 Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) Peter Kirk responded to my plea for everyone to relax a bit: If everyone would just go off for a week or two on their August vacation, like they should be, we could all come back about Labor Day and we wouldn't have to be having these discussions. ;-) --Ken OK, understood now. As the previous version is obsolete, and the new one is unavailable, we can all take a break from conforming to Unicode at all and take a vacation! Sounds a good idea to me ;-) Just in the interest of truth in advertising, the previous version(s) are not obsolete, but are superseded by Unicode 4.0. ^^^ Applications claiming conformance to Unicode 3.0 will continue to claim conformance to that version, and that version is relevant to their claim. And so on for Unicode 3.1 and Unicode 3.2. But if and when people move on to claiming conformance to Unicode 4.0, then it is the text of *that* version which becomes relevant to their claim. We are simply in the inconvenient transition state where people are building Unicode 4.0 implementations, but the final, final text of the *book* (as opposed to the various UAX's and all the data files) is not available. 
There were similar transition periods for Unicode 1.0, Unicode 2.0, and Unicode 3.0, and nearly everyone understands that is the nature of things. So yes, please, it's time to take a vacation! :) --Ken
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 15:53, Ted Hopp wrote: On Tuesday, August 05, 2003 5:40 PM, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. Well, I think Jim Allan pointed to the source of this notion in his email of a few hours ago. 1) From the UCD: 0020;SPACE;Zs;... 2) From Unicode 3, Section 4.5, third paragraph (in its entirety): Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because General Category assigns only a single value to each character. I believe that reasonable people might reasonably conclude from factoids 1 and 2 that SPACE is indeed a format character. Reasonable, but evidently wrong. Explanation, please? Ted Ted Hopp, Ph.D. ZigZag, Inc. [EMAIL PROTECTED] +1-301-990-7453 newSLATE is your personal learning workspace ...on the web at http://www.newSLATE.com/ From what Ken says, it sounds like it will be wrong from whenever Unicode 4.0 is officially issued because this paragraph has been excised from that standard. But until then it seems to be correct, SPACE is indeed considered a format character. I was misled by Jim's reference to the URL of the final draft (as clearly stamped on the first page) of 4.0, but since in fact he was quoting from 3.0 what he says can hardly be considered obsolete yet. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk scripsit: This is a clear demonstration that Microsoft also has problems with the mechanism which has been defined in the standard for ten years, This is a clear demonstration that Uniscribe fails to implement a standard correctly, a property unique neither to Microsoft nor to the Unicode Standard. -- Knowledge studies others / Wisdom is self-known; John Cowan Muscle masters brothers / Self-mastery is bone; [EMAIL PROTECTED] Content need never borrow / Ambition wanders blind; www.ccil.org/~cowan Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 06/08/2003 05:58, Jony Rosenne wrote: I would like to point out that with all due respect, how particular fonts or rendering engines behave is only marginally relevant to the Unicode list. I think that we should deal only with the Unicode specification. A particular implementation or many implementations may not behave as expected, and then may be either conformant or non-conformant, or may behave as expected and still be either conformant or non-conformant. Messages such as the attached help the discussion of the specification only as illustrations and as a basis for discussing conformity. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk Sent: Wednesday, August 06, 2003 12:11 PM To: Curtis Clark Cc: Unicode List Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) On 05/08/2003 16:59, Curtis Clark wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. Thank you. Well, this sort of works. I looked in various fonts. In some of them the diacritic is centred in the space between the words diacritic and may, but in others it is offset to the left or the right. The problem is that the space is wider than the diacritic, which confuses things, and all the more so no doubt if it expands for justification. NBSP would probably be a better choice in that it is less likely to expand. But what I am looking for is a diacritic holder which is defined to be only as wide as the diacritic. 
On the principle that base characters expand to fit the width of the diacritic, ZWSP or, better, a real (rather than misnamed) zero width no break space would seem to have the right properties for that. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/ Point taken. But when different fonts and rendering engines give different results because the standard is unclear or ambiguous, that is a matter for the discussion here. And when conforming fonts and rendering engines fail to give the required results, that may also be because of a deficiency in the standard. It seems that many rendering engines give to the sequence space, combining mark the width normally assigned to a space. Is this actually what the standard suggests? I have identified a need to display combining marks with no extra width, only the width required by the mark. Should the sequence space, combining mark do what I want, or shouldn't it? If so, this needs to be spelled out so that rendering engines know what they are supposed to do. If not, there may be a need for a new character. This is a deficiency in the standard, not in the rendering engines. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 09/08/2003 13:23, Noah Levitt wrote: According to the docs at http://www.microsoft.com/typography/otfntdev/indicot/other.htm, uniscribe renders combining marks in isolation when they are applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted circle.) Perhaps this is an acceptable solution to the people calling for a new character. Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle. Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of the Unicode Standard). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign. Noah This is a clear demonstration that Microsoft also has problems with the mechanism which has been defined in the standard for ten years, that space followed by diacritic is legal and should be rendered as the isolated diacritic. But the alternative mechanism which they have implemented is non-standard and apparently a defective combining sequence, as ZWJ (if I remember correctly) is not a base character. The best way to fix this situation is to define a new character with the correct properties. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
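The objection above, that SPACE + ZWJ + mark puts a non-base character directly before the mark, can be checked against the character properties. The following is only a minimal sketch using Python's standard `unicodedata` module, which reports the UCD version bundled with the interpreter rather than Unicode 3.x or 4.0; the names `standard_form`, `uniscribe_form` and `describe` are made up for illustration.

```python
import unicodedata

SPACE, ZWJ, ACUTE = "\u0020", "\u200D", "\u0301"

# The sequence the standard describes for an isolated mark.
standard_form = SPACE + ACUTE
# The sequence the quoted Microsoft documentation asks for.
uniscribe_form = SPACE + ZWJ + ACUTE

def describe(text):
    """Return (code point, General Category) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

print(describe(standard_form))
print(describe(uniscribe_form))
```

In the second sequence the character immediately before the mark has General Category Cf (format), which is what the "not a base character" remark refers to.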
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
(provided that the whitespace normalization algorithm will not include ZWSP in the whitespaces sequence and treat it isolately, something that a conforming HTML or XML processor should not do, as it should unify only sequences of SPACE, TAB, CR, LF, and only according to the context of the containing element whitespace properties controlling the normalization of XML whitespace sequences (leading, trailing, line break preservation, tabulator)... ZWSP being normalised would be quite a bizarre bug, I can see it happening only if someone relied on an isWhiteSpace function provided by a non-XML aware library and that function considered ZWSP to be whitespace. I've never seen this, although I have seen similar assumptions made about how characters act in XML, and some deeply incorrect ones about how octets act in XML (that is they made incorrect assumptions about encodings, or even had no thoughts about encodings at all, an error which some environments and languages can lead the naïve to). NEL and LSEP are added to your list of characters affected by whitespace normalisation for XML 1.1. Possibly some people implemented the suggestion in http://www.w3.org/TR/newline before 1.1.
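The kind of bug described above, relying on a general-purpose white-space test instead of the four XML 1.0 white-space characters, can be illustrated with a short sketch. The function names are hypothetical, and the collapsing rule here is a simplification assumed for the example; real processors apply normalization according to context, as the message says.

```python
# The four XML 1.0 white-space characters: SPACE, TAB, CR, LF.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}

def is_xml_whitespace(ch):
    """True only for the characters XML 1.0 treats as white space."""
    return ch in XML_WHITESPACE

def normalize_xml_space(text):
    """Collapse runs of XML white space to one SPACE and trim both ends."""
    out = []
    in_run = False
    for ch in text:
        if is_xml_whitespace(ch):
            in_run = True
            continue
        if in_run and out:
            out.append(" ")
        in_run = False
        out.append(ch)
    return "".join(out)

# ZWSP (U+200B) is not XML white space, so it survives untouched.
sample = "a\u200B \t\n \u200Bb"
print(repr(normalize_xml_space(sample)))

# A generic test such as str.isspace() answers a different question:
# it accepts NBSP, which XML does not treat as white space.
print("\u00A0".isspace(), is_xml_whitespace("\u00A0"))
```

The point of the last line is only that a library's notion of "white space" and XML's notion need not coincide.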
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk scripsit: Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only compatibility equivalences to space plus diacritic, and U+005E and U+0060 don't even have compatibility equivalences. Indeed. The last two, BTW, are because the ASCII repertoire has taken on a life of its own: ^ is not merely a spacing clone of COMBINING CIRCUMFLEX, but has become a fully distinct character with many functions. In particular, none of the Unicode canonical forms will affect text written solely in the ASCII repertoire. Every character has its own story. Someone asked about whether XML documents SHOULD or MUST be in NFC. The answer is SHOULD, and this is formally applied only to the not-yet-promulgated XML 1.1. XML documents *on the Web* SHOULD be in NFC by reason of the W3C Character Model. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com Not to know The Smiths is not to know K.X.U. --K.X.U.
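The decompositions mentioned above can be listed directly. This is a small sketch with Python's `unicodedata` module, which uses the UCD version shipped with the interpreter; the variable names are illustrative.

```python
import unicodedata

# U+005E, U+0060, U+00B4 and the range U+02D8..U+02DD named in the message.
spacing_marks = [0x005E, 0x0060, 0x00B4] + list(range(0x02D8, 0x02DE))

for cp in spacing_marks:
    ch = chr(cp)
    # An empty string means the character has no decomposition mapping.
    decomp = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {decomp}")

# Text written solely in ASCII is left alone by every normalization form.
ascii_text = "a^b `c`"
unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(unchanged)
```

A `<compat>` tag in the output marks a compatibility (not canonical) mapping, so only NFKC and NFKD would replace, say, U+00B4 with SPACE + U+0301.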
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 06/08/2003 15:24, Doug Ewell wrote: Like Freud's cigar, sometimes a may is just a may. And I suspect the phrase any intelligent typographer MAY generate some flak from typographers on this list who consider themselves intelligent enough yet have a different opinion. I'm not a typographer (intelligent or otherwise), but I'm having a tough time seeing how Section 2.10 *requires* fonts and rendering engines to give a space-plus-combining-diacritic combination the exact minimum width of the diacritic alone, or to leave equal space before and after such a combination. All I think it is saying is that, for example, the combination i-plus-tilde may be wider than i alone, because tilde is wider than i. OK, Doug, I accept that a may is a may and an implementation in which the tilde on an i collides with neighbouring characters is Unicode compliant. It's just bad typography (unless some special effect is intended). Any typographers on the list care to disagree? I would suggest that it is also bad typography for a space, diacritic combination to be wider than the diacritic, as long as the typographer realises that space is being used here as a convention and, according to the standard, does not have the usual properties of a space. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 06/08/2003 15:47, Philippe Verdy wrote: On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote: OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it? In plain text, I think that this encoding: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... is what you need, as it creates the following combining sequences: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... Thank you, Philippe. This is where we started. But I noted that some current implementations render the space diacritic combination as a full width space with the diacritic not centred over it. I suggested that this was wrong, that the diacritic should be centred. Doug suggested I use markup outside the scope of Unicode. ... Another similar case would be the use of an isolated nukta (which normally modifies a following base character): the sequence nukta, SPACE is a single combining sequence with a break opportunity. So a sequence like nukta, SPACE, acute accent would be unbreakable but would include a break opportunity at its end, unless it is followed by a NBSP. And the sequence nukta, NBSP, acute accent would also be unbreakable either in the middle or on both ends. Tell me more about these nuktas which modify a FOLLOWING base character. This is just what I have been told is illegal, non-conformant or something. But if this is allowed for nuktas, why shouldn't it be allowed for Hebrew holam? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
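How the proposed encoding groups into combining sequences can be sketched in a few lines. This is a simplification assumed for illustration (each character is grouped with the marks of category Mn, Mc or Me that follow it); it is not the full grapheme cluster algorithm, and `combining_sequences` is a made-up helper name. U+0302 stands in for "the diacritic".

```python
import unicodedata

def combining_sequences(text):
    """Group each character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            groups[-1] += ch      # attach the mark to the preceding character
        else:
            groups.append(ch)     # start a new sequence
    return groups

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

one_space = "diacritic " + CIRCUMFLEX + " may"
two_spaces = "diacritic  " + CIRCUMFLEX + " may"

# With one SPACE, the only SPACE before the mark is consumed as its carrier.
print([g.encode("unicode_escape").decode() for g in combining_sequences(one_space)])
# With two, a free-standing SPACE remains before the carrier SPACE.
print([g.encode("unicode_escape").decode() for g in combining_sequences(two_spaces)])
```

Under this grouping the two-space encoding leaves one stand-alone SPACE on each side of the SPACE-plus-mark sequence, which matches the encoding suggested above; how wide a renderer makes that middle sequence is a separate question.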
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Jim Allan [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, August 05, 2003 13:47 Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) On 05/08/2003 09:42, Jim Allan wrote: Peter Kirk posted: If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does? I think that the practice of a font or application automatically inserting a dotted circle under an orphaned combining character is dubiously compliant with Unicode specifications. ... Thanks, Jim, for all this data, but now I am totally confused. Well, at least it seems clear that if I want a dotted circle I should explicitly encode it. But if I don't... Suppose for example I want to write a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character, a regularly positioned diacritic centred above the letter, which does not have a defined spacing variant. I don't want a dotted circle. And I want it to be spaced as here, i.e. with one space before the diacritic and one after it. It seems to me that at one place in the standard I am told to encode space - combining mark - space, for the combining mark will not combine with the space because the space is not a base character; and in another place I am implicitly told to encode space - space - combining mark - space, because the second space acts as a carrier for the combining mark. 
I hope that wanting to display this correctly is not another place where I have stepped over the boundaries of what is reasonable to expect plain text to convey, but that this too can be grist for the Unicode 5.0 mill to grind very finely - both quotes from Ken Whistler earlier today. And I think that if this issue is clarified it will also become clear what should be done about string initial holam and alef etc. Perhaps a simple way ahead would be to define a new character something like COMBINING MARK HOLDER with no glyph, which is defined specifically for this purpose, is a base character and not a format character, and is expected to be just as wide as is necessary to display the combining mark. Then we could say that a spacing accent is equivalent (possibly even canonically if made a composition exclusion?) to COMBINING MARK HOLDER plus a non-spacing accent, and remove the misleading compatibility equivalences to SPACE plus a non-spacing accent. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 17:13, Kenneth Whistler wrote: Peter Kirk said: From what Ken says, it sounds like it will be wrong from whenever Unicode 4.0 is officially issued Actually Unicode 4.0 was officially issued on April 17, 2003. What we are waiting on now is for the publication of the text of the book to catch up to that fact. ;-) ... I was misled by Jim's reference to the URL of the final draft (as clearly stamped on the first page) of 4.0, but since in fact he was quoting from 3.0 what he says can hardly be considered obsolete yet. Actually it can. And that would have been obvious to everyone if a preview version of Chapter 4 had also been posted. Once again, I appeal to people to stop trying to second-guess the text of the standard. The final pdf for the online version is in preparation even as I write this. The final final proofs for the book itself have already been produced by the printer -- all they need to do now is turn on the press and start the binder. If everyone would just go off for a week or two on their August vacation, like they should be, we could all come back about Labor Day and we wouldn't have to be having these discussions. ;-) --Ken OK, understood now. As the previous version is obsolete, and the new one is unavailable, we can all take a break from conforming to Unicode at all and take a vacation! Sounds a good idea to me ;-) -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
It *is* part of the Unicode Standard. You want a stand-alone diacritic? Use SP or NBSP followed by the combining diacritic. It says so, right there. Yes. But it is not quite clear how this should interact with combining characters that aren't purely 'above' or 'below' a single character (in horizontal writing): in particular double diacritics (SPACE, dbl diacritic or SPACE, dbl diacritic, SPACE to get an isolated one?), and left-side or right-side combining characters (SPACE, rightside comb. char does that give unwanted space on the left or not?). /kent k
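The marks in question can at least be told apart by their canonical combining class. The sketch below uses Python's `unicodedata` module (UCD version as bundled with the interpreter); the choice of sample marks, including U+0315, U+031B and U+0322 mentioned elsewhere in this thread, is only for illustration.

```python
import unicodedata

# An ordinary above mark, three side/attached marks, and a double diacritic.
marks = ["\u0301", "\u0315", "\u031B", "\u0322", "\u0360"]

for ch in marks:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} "
        f"ccc={unicodedata.combining(ch)}"
    )
```

The combining class records roughly where a mark attaches (for example 230 for above, 232 for above right, 234 for double above), but it says nothing about advance width, so it does not by itself answer the question of how much space SPACE plus such a mark should occupy.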
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
As for oe-ligature, the French representative to WG3 (or its predecessor) said that France could live without it. Even worse; the story I heard was that the committee had planned from the start to have Œ and œ in positions D7 and F7, but that late in the process the representative from France objected, so they replaced them by × and ÷. That would certainly explain why these symbols are in the middle of a batch of letters... Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: John Cowan [EMAIL PROTECTED] To: Philippe Verdy [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Saturday, August 09, 2003 20:13 Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) Philippe Verdy scripsit: Except that in that case, we are not speaking about something that has already been standardized, but only used as a legacy means to achieve some results with more or less success. It *is* part of the Unicode Standard. You want a stand-alone diacritic? Use SP or NBSP followed by the combining diacritic. It says so, right there. Your implementation doesn't work? Complain to the implementor, switch to another implementation, fix the implementation yourself, or pay someone to fix it. SPACE+diacritic is still a hack, and certainly not a canonical equivalent (including for its properties), of the existing spacing diacritics, which also do not fit all usages because they are symbols. It's the spacing diacritics that are a hack, for the most part. The ASCII ones have, as I said, taken on a life of their own. * [OT] This was a shame when ISO adapted the DEC VT charset to create ISO-8859-1, but forgot important characters needed for the languages that this charset was supposed to cover (like the French oe and OE ligatures, and a few characters missing for Baltic languages, Icelandic, and Catalan.) ISO-8859-1 was not meant to cover the whole of Europe; it was part of a quartet, parts 1 to 4. 
The fact that parts 3 and 4 didn't work out was not ISO's fault: it didn't foresee how important European as opposed to merely regional data interchange would be. As for oe-ligature, the French representative to WG3 (or its predecessor) said that France could live without it. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If I have seen farther than others, it is because I am surrounded by dwarves. --Murray Gell-Mann
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 04/08/2003 17:36, Kenneth Whistler wrote: Peter Kirk asked: A similar issue which is not Hebrew related would be a (mythical) requirement to display a diacritic like 0315, 031B or 0322 in isolation. It would not always be appropriate to use a space or NBSP as a base character as this would indent the glyph from the beginning of a line in a way which might not be wanted. What would be the recommended encoding if one wanted to display one of these characters with no leading white space? If you want to display some character like U+0315 COMBINING COMMA ABOVE RIGHT *and* you want to do it in isolation *and* you want it to occur at the beginning of a line *and* you want there to be no display width between the margin and the left edge of the display bits of the glyph, then you have stepped over the boundaries of what is reasonable to expect plain text to convey. Feel free to make use of the higher-level capabilities of your word processor or page layout program to individually adjust the positioning of particular glyphs displayed in particular fonts. That's true for such defective sequences that may be used temporarily during text handling operations (where the combining mark should be rendered in editors with the dotted circle glyph). But one can still represent an isolated combining character in a non defective way by putting it after a Zero-Width Space, without creating any margin. This can be done due to the Zs category of this character which qualifies it the same way as an ASCII SPACE would: 0020;SPACE;Zs;0;WS;N; 200B;ZERO WIDTH SPACE;Zs;0;BN;N; In fact, using ZWS may even be more accurate than using SPACE in bidirectional contexts, as it is bidirectionally neutral, and does not break directionality clusters for display reordering (so such encoded isolated diacritic can appear even in a RTL sequence, as if it was a single character with the current directionality). 
I just wonder what would be the width of the combination of ZWS plus a diacritic: logically the ZWS has width 0, but diacritics are supposed to expand, if needed, the width of the base character, unless kerning is used to reduce the interletter spacing. But I doubt that any font would define a kerning pair for a preceding grapheme cluster plus this isolated diacritic (ZWS+combining), or for that isolated diacritic and the next grapheme cluster, so in absence of such kerning pair, most programs will just use the default combined width. I just tried to see how Windows XP represents the sequences: A, SPACE, ZWS, COMBINING MACRON, SPACE, B A, ZWS, COMBINING MACRON, B And it shows the spaces correctly even in HTML with IE6, with Arial, Arial Unicode MS, Times New Roman, Courier New... On the opposite, the sequence SPACE, COMBINING MACRON is incorrectly rendered with a too large width (larger than a single space or a single non-combining macron). Could ZWS+combining diacritic be the best solution for isolated diacritics in text?
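The property comparison between SPACE, NBSP and ZWSP made in this message can be reproduced as a sketch with Python's `unicodedata` module. One caveat: the module reports the UCD version bundled with the interpreter, and later UCD versions list U+200B with General Category Cf rather than the Zs quoted above, so the output will not match the Unicode 3.x record shown in the message. The variable names are illustrative.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON

# Candidate carriers for an isolated mark: SPACE, NBSP, ZWSP.
for base in ("\u0020", "\u00A0", "\u200B"):
    print(
        f"U+{ord(base):04X}",
        unicodedata.name(base),
        unicodedata.category(base),
        unicodedata.bidirectional(base),
    )

# The two test sequences described in the message.
with_spaces = "A " + "\u200B" + MACRON + " B"
without_spaces = "A" + "\u200B" + MACRON + "B"
print([f"U+{ord(c):04X}" for c in with_spaces])
print([f"U+{ord(c):04X}" for c in without_spaces])
```

The bidirectional class difference (WS for SPACE, BN for ZWSP) is the point raised about bidirectional contexts; rendered width cannot be inspected this way, since it depends on the font and rendering engine.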
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 10/08/2003 10:09, Michael Everson wrote: At 01:30 +0200 2003-08-10, Philippe Verdy wrote: Whatever you think, the SPACE+diacritic is still a hack, and certainly not a canonical equivalent (including for its properties), of the existing spacing diacritics, which also do not fit all usages because they are symbols. It is the formally specified way to represent what you say you want to represent. If an implementation doesn't do that nicely enough, complain to the implementors. (This has already been suggested to you.) As has already been clearly pointed out by Philippe, Kent and myself (and ignored by those opposed to any change), the combination SPACE + diacritic does not have the required categories, properties and specification for the function it is supposed to perform. Either these categories etc need to be adjusted (and I don't expect the general category of SPACE to be changed!), or some exceptional mechanism needs to be clearly defined, or, by far the simplest solution, a new base character can be defined which, when combined with the diacritic, has the required categories and properties. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Sunday, August 10, 2003 12:32 AM, John Cowan [EMAIL PROTECTED] wrote: Peter Kirk scripsit: This is a clear demonstration that Microsoft also has problems with the mechanism which has been defined in the standard for ten years, This is a clear demonstration that Uniscribe fails to implement a standard correctly, a property unique neither to Microsoft nor to the Unicode Standard. Except that in that case, we are not speaking about something that has already been standardized, but only used as a legacy means to achieve some results with more or less success. Whatever you think, the SPACE+diacritic is still a hack, and certainly not a canonical equivalent (including for its properties), of the existing spacing diacritics, which also do not fit all usages because they are symbols. The fact that there are compatibility decompositions of these spacing diacritics is just to match those legacy uses, but it is not a solution. It just resembles the way many keyboard drivers allow users to enter those spacing diacritics, but input methods and keyboard drivers prove nothing with regard to Unicode, as the keyboard driver will still only return a combined spacing diacritic, but not the sequence SPACE+diacritics (whose real usage in text seems to occur only in old texts where non-spacing combining diacritics were not encodable or renderable, or just to allow speaking in full text about the individual diacritics themselves, a more rare case). 
Maybe I'm wrong in this assertion, but this is my feeling and experience about these characters, which were merely symbols or hacks to represent non English text with a restricted ASCII alphabet as an approximate representation (the inclusion of other spacing diacritics in the high range of an 8-bit ISO-8859-1 encoding was very strange for me, as if they were there only to allow approximating other missing precombined characters which could not fit in the table, but produced poor results so that most texts were never encoded with this charset but with other more appropriate charsets when needed). * * [OT] This was a shame when ISO adapted the DEC VT charset to create ISO-8859-1, but forgot important characters needed for the languages that this charset was supposed to cover (like the French oe and OE ligatures, and a few characters missing for Baltic languages, Icelandic, and Catalan.) ISO-8859-15 is certainly better now than ISO-8859-1 for the same languages and for even more than initially defined, and in practice it was Microsoft that filled the gap with Windows1252 when dropping the unnecessary C1 controls (forgetting the legacy roundtrip compatibility of controls with the dying EBCDIC).
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Sunday, August 10, 2003 9:17 PM, Peter Kirk [EMAIL PROTECTED] wrote: On 10/08/2003 10:09, Michael Everson wrote: It is the formally specified way to represent what you say you want to represent. If an implementation doesn't do that nicely enough, complain to the implementors. (This has already been suggested to you.) As has already been clearly pointed out by Philippe, Kent and myself (and ignored by those opposed to any change), the combination SPACE + diacritic does not have the required categories, properties and specification for the function it is supposed to perform. Either these categories etc need to be adjusted (and I don't expect the general category of SPACE to be changed!), or some exceptional mechanism needs to be clearly defined, or, by far the simplest solution, a new base character can be defined which, when combined with the diacritic, has the required categories and properties. That's exactly what I suggested (and I used the word suggest, and wanted to show the inaccuracy of the SPACE or NBSP to represent spacing diacritics as a normal symbol, due to the undocumented properties for that combination). Due to the lack of formal documentation (no one here demonstrated that such sequence with SPACE was really documented as such somewhere in the Unicode specs), such legacy usage is still just a hack which only works sometimes, but not always as intended because it contradicts some other principles like the inheritance of the base character properties to the whole combining sequence using it. And still, even if SPACE+diacritics is documented now as producing officially a symbol, its properties are still not defined (not interoperable as varying among implementations), and it still gives problems with the huge legacy use of SPACE as a padding character or with space normalizations like in XML, HTML and SGML. 
In addition, it still does not solve the problem of its insertion within words, and of its directionality for BiDi, its parsing for breaking (line breaking, word breaking, ...) where distinct base character(s) for the correct interpretation would be needed. Yes I have read your comment, and Yes I know that SPACE+diacritics is widely used. But this is with many unsolved problems that one could legitimately want to solve with more precise: - definition of such combining sequence with SPACE - definition of its properties - documentation within the Unicode breaking algorithms - adjustments to the BiDi specs - etc... If all these adjustments are made, there will be many, all of them handled like exceptions to the normal rules, when a much simpler approach (which would not require all these changes in specs), would consist in defining other(s) more explicit base character(s) for the appropriate function. If Ken, Michael, Kent and other respectable UTC members can't see the problem, who will? Please consider the problem itself and don't be too much focused on the exact terminology that you would have used yourself to better describe the problem and its solutions. I am not discussing the terminology itself, but the lack of documentation and support for what seems a true interoperability problem. So please don't flame me with sarcasm, that's not the subject of my messages which do not want to comment about the respective Unicode expertise of respectable UTC members... Sorry if this message seems still too long for you. But each time I want to be short, I am flamed for inaccuracies, or imprecisions, or suspected of claiming something about the standard when in fact I am not discussing what is currently in the standard itself, but what is not there now and causes problems. It's easy to be short if you only refer to the standard itself, and only respond as if this list was just a FAQ. -- Philippe. 
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk asked: If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does? If you want to represent the text content of a dotted circle with an accent on it, the recommended representation would be, for example: 25CC, 0301 A compliant renderer that supports those characters should always then display a dotted circle with an acute accent over it. If you just leave a 0301 in isolation, then you are at the mercy of what a renderer might do in a fallback situation for a defective combining character sequence. It *might* show it on a dotted circle, or it might show it in some other way. And if that combining character is in any other context, it may end up being misapplied to the wrong preceding character -- wrong in the sense that that was not your intention. --Ken
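The difference between the two representations described above can be made concrete with a short sketch. `starts_with_mark` is a hypothetical helper written for this example; it only checks whether a string begins with a combining mark, which is the situation the message calls a defective combining character sequence.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Explicit text content: a dotted circle carrying an acute accent.
explicit = DOTTED_CIRCLE + ACUTE
# A mark with no base at all; rendering is left to renderer fallback.
isolated = ACUTE

def starts_with_mark(text):
    """True if the text begins with a combining mark (categories M*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

for label, text in (("explicit", explicit), ("isolated", isolated)):
    codes = " ".join(f"{ord(ch):04X}" for ch in text)
    print(label, codes, "defective start:", starts_with_mark(text))
```

Only the first form states in the text itself that a dotted circle is wanted; whether the second is drawn on a dotted circle depends entirely on the renderer.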
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 16:59, Curtis Clark wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. Thank you. Well, this sort of works. I looked in various fonts. In some of them the diacritic is centred in the space between the words diacritic and may, but in others it is offset to the left or the right. The problem is that the space is wider than the diacritic, which confuses things, and all the more so no doubt if it expands for justification. NBSP would probably be a better choice in that it is less likely to expand. But what I am looking for is a diacritic holder which is defined to be only as wide as the diacritic. On the principle that base characters expand to fit the width of the diacritic, ZWSP or, better, a real (rather than misnamed) zero width no break space would seem to have the right properties for that. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. Whenever you have a question about the status of a character, you need to look it up in the UCD. You can either do that by going through the unicode website, or if you want a more readable interface, use the ICU character browser, which formats that data. Look at space, U+0020. http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?go=0020ch.x=4ch.y=7 The general category is Space_Separator, *not* a format character. Now wording there could definitely be clearer, but the operative phrase is: ...but their membership in the Z (separator) class *takes precedence* over their membership in the Cf class... So it would be clearer to say something like: In many ways the characters, Zs, Zl, and Zp, are similar to format characters, but because their general usage is significantly different they are broken out into a separate General Category, as Separator characters. Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: Unicode List [EMAIL PROTECTED] Sent: Tuesday, August 05, 2003 14:50 Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) On 05/08/2003 14:40, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). 
Mark __ http://www.macchiato.com Eppur si muove I got this from the Unicode Standard 4.0, as quoted by Jim Allan: In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs: Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. So the various space characters (class Zs) are also classified as format characters. From http://www.unicode.org/book/ch04.pdf: _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character. Accordingly, by definition, spaces are not base characters. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
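The UCD lookup Mark recommends can also be done from a script. The following is only a minimal sketch, assuming a Python interpreter whose standard `unicodedata` module reflects a recent version of the Unicode Character Database (not necessarily the 3.x or 4.0 data under discussion in this thread); the helper name `general_category` is invented for illustration.

```python
import unicodedata

def general_category(cp: int) -> str:
    """Return the UCD General Category of a code point, e.g. 'Zs' or 'Cf'."""
    return unicodedata.category(chr(cp))

# SPACE, NO-BREAK SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR,
# ZERO WIDTH NO-BREAK SPACE, COMBINING CIRCUMFLEX ACCENT
for cp in (0x0020, 0x00A0, 0x2028, 0x2029, 0xFEFF, 0x0302):
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}: {general_category(cp)}")
```

On such an interpreter SPACE and NO-BREAK SPACE report Zs (Space_Separator), not Cf, which is the point Mark is making: each character carries exactly one General Category value.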
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk responded to my plea for everyone to relax a bit:

> > If everyone would just go off for a week or two on their August vacation, like they should be, we could all come back about Labor Day and we wouldn't have to be having these discussions. ;-)
> >
> > --Ken
>
> OK, understood now. As the previous version is obsolete, and the new one is unavailable, we can all take a break from conforming to Unicode at all and take a vacation! Sounds a good idea to me ;-)

Just in the interest of truth in advertising, the previous version(s) are not obsolete, but are *superseded* by Unicode 4.0.

Applications claiming conformance to Unicode 3.0 will continue to claim conformance to that version, and that version is relevant to their claim. And so on for Unicode 3.1 and Unicode 3.2.

But if and when people move on to claiming conformance to Unicode 4.0, then it is the text of *that* version which becomes relevant to their claim.

We are simply in the inconvenient transition state where people are building Unicode 4.0 implementations, but the final, final text of the *book* (as opposed to the various UAX's and all the data files) is not available. There were similar transition periods for Unicode 1.0, Unicode 2.0, and Unicode 3.0, and nearly everyone understands that is the nature of things.

So yes, please, it's time to take a vacation! :)

--Ken
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Mark Davis scripsit:

> Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one.

Unfortunately, p. 88 of TUS3.0 (section 4.5, paragraph 3) says "Zs, Zl, and Zp [characters] are considered format characters". This is obviously wrong, but there it is.

--
John Cowan [EMAIL PROTECTED]
http://www.reutershealth.com http://www.ccil.org/~cowan
"Kill Gorgûn! Kill orc-folk! No other words please Wild Men. Drive away bad air and darkness with bright iron!" --Ghân-buri-Ghân
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
I was so glad that you got things so nearly right for once, and then you go and spoil it with: Another similar case would be the use of a isolated nukta (which normally modifies a following base character): the sequence nukta, SPACE Like all other combining characters, NUKTA follows the base character (the consonant) in the character stream. But I'm not sure if consonant, nukta, vowel *should* be any different from consonant, vowel, nukta, but maybe they should be different since they are not canonically equivalent. (But...) /kent k
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 04/08/2003 17:36, Kenneth Whistler wrote:

> Peter Kirk asked:
>
> > A similar issue which is not Hebrew related would be a (mythical) requirement to display a diacritic like 0315, 031B or 0322 in isolation. It would not always be appropriate to use a space or NBSP as a base character as this would indent the glyph from the beginning of a line in a way which might not be wanted. What would be the recommended encoding if one wanted to display one of these characters with no leading white space?
>
> If you just want to display a nonspacing mark in isolation, then you apply it to a SPACE (or NO-BREAK SPACE) and typically let the metrics of the font then handle how the mark is going to appear floating in space as it were.
>
> If you want to display some character like U+0315 COMBINING COMMA ABOVE RIGHT *and* you want to do it in isolation *and* you want it to occur at the beginning of a line *and* you want there to be no display width between the margin and the left edge of the display bits of the glyph, then you have stepped over the boundaries of what is reasonable to expect plain text to convey. Feel free to make use of the higher-level capabilities of your word processor or page layout program to individually adjust the positioning of particular glyphs displayed in particular fonts.

Thank you. Understood.

> More generally, however, when the issue of the relative position of a non-spacing mark with respect to its base glyph is what is in question, the standard recommends (and uses) the convention of displaying the non-spacing mark on a dotted circle as a base. This makes it clear that we are talking about the non-spacing mark itself, but also makes clear the positional differences between left, centered, and right forms, for example.

If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does?

> --Ken

--
Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:

> on 2003-08-05 15:31 Peter Kirk wrote:
>
> > Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like "In this language the diacritic ^ may appear above the letters ...", but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one?
>
> In this language the diacritic may appear above the letters...
>
> Two spaces, at least in Thunderbird Mail.

The NFD decomposition of spacing marks is already defined as a SPACE plus a non-spacing combining character. This officially documents the usage of SPACE as a base character, and its use in combining sequences.

In the context of XML processing, where strings should (must?) be presented in NFC form, this extra SPACE will be invisible, hidden within the precomposed sequence, so this space does not have the line-breaking property.

Breaking properties apply only to combining sequences, not to isolated encoded characters. It's illegal to break in the middle of a combining sequence. So as soon as a SPACE is followed by a combining character, it loses its breaking properties, as those properties are only defined for the combining sequence containing only a SPACE.

So I don't think there's any ambiguity: parsers and renderers must correctly identify combining sequences before applying any algorithm. This means that an algorithm like normalization of whitespace sequences in XML or HTML should not include SPACEs that are used as base characters in a combining sequence, and so it should keep two spaces if the intent is to encode a logical space followed by a logical spacing diacritic. (This is not a problem for XML which processes strings in their NFC form).

-- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
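Philippe's suggestion that whitespace normalization should leave alone a SPACE that serves as the base of a combining sequence can be sketched roughly as follows. This is an illustration only, not the behaviour of any XML or HTML processor: the function name `collapse_spaces` is hypothetical, only U+0020 is treated as whitespace, and "combining character" is approximated as any character whose General Category starts with M.

```python
import unicodedata

def collapse_spaces(text: str) -> str:
    """Collapse runs of U+0020, but never drop a SPACE that is
    immediately followed by a combining mark (a 'carrier' space)."""
    out = []
    prev_plain_space = False
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Assumption: a mark is any character in categories Mn, Mc or Me.
        is_carrier = (
            ch == " "
            and nxt != ""
            and unicodedata.category(nxt).startswith("M")
        )
        if ch == " " and not is_carrier:
            if prev_plain_space:
                continue  # drop the repeated plain space
            prev_plain_space = True
        else:
            prev_plain_space = False
        out.append(ch)
    return "".join(out)

# One separating space, then SPACE + COMBINING CIRCUMFLEX ACCENT, then a space.
sample = "the diacritic  \u0302 may appear"
print(collapse_spaces(sample) == sample)
```

With this approach the two-space encoding discussed in the thread survives collapsing, while ordinary runs of spaces are still reduced to one.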
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 06/08/2003 03:54, Philippe Verdy wrote:

> On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:
>
> > on 2003-08-05 15:31 Peter Kirk wrote:
> >
> > > Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like "In this language the diacritic ^ may appear above the letters ...", but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one?
> >
> > In this language the diacritic may appear above the letters...
> >
> > Two spaces, at least in Thunderbird Mail.
>
> The NFD decomposition of spacing marks is already defined as a SPACE plus a non-spacing combining character. ...

Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only compatibility equivalences to space plus diacritic, and U+005E and U+0060 don't even have compatibility equivalences.

> ... This means that an algorithm like normalization of whitespace sequences in XML or HTML should not include SPACEs that are used as base characters in a combining sequence, and so it should keep two spaces if the intent is to encode a logical space followed by a logical spacing diacritic. (This is not a problem for XML which processes strings in their NFC form).

It is, because there are very many combining marks which do not have spacing equivalents (even for compatibility), and so with these the NFC form will certainly be space plus diacritic.

--
Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
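Peter's reading of the decomposition data can be checked mechanically. The sketch below assumes Python's standard `unicodedata` module, whose data comes from a later Unicode version than the one discussed here; the helper `code_points` is made up for readability.

```python
import unicodedata

def code_points(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

# ACUTE ACCENT, BREVE, CIRCUMFLEX ACCENT, GRAVE ACCENT
for cp in (0x00B4, 0x02D8, 0x005E, 0x0060):
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        repr(unicodedata.decomposition(ch)),          # '<compat> ...' or ''
        code_points(unicodedata.normalize("NFD", ch)),
        code_points(unicodedata.normalize("NFKD", ch)),
    )

# SPACE + COMBINING CIRCUMFLEX ACCENT has no precomposed form,
# so NFC leaves it as two code points.
print(code_points(unicodedata.normalize("NFC", " \u0302")))
```

The `<compat>` tag in the decomposition field marks a compatibility mapping, so only NFKD and NFKC (not NFD or NFC) turn U+00B4 into SPACE plus U+0301, and U+005E and U+0060 have no decomposition at all.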
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk said: From what Ken says, it sounds like it will be wrong from whenever Unicode 4.0 is officially issued Actually Unicode 4.0 was officially issued on April 17, 2003. What we are waiting on now is for the publication of the text of the book to catch up to that fact. ;-) because this paragraph has been excised from that standard. But until then it seems to be correct, SPACE is indeed considered a format character. Nope. It is incorrect to try to mix and match between versions of the standard. In Unicode 3.0 this was an ambiguity in the meaning and usage of the term format character, and for Unicode 3.0, we can all see how people who ran into section 4.5 of the standard could be a little confused about the status of SPACE. The actual intent of that offending paragraph was to attempt to explain the somewhat procrustean nature of the General Category classes, which may not do justice to the complicated behavior of some of the characters in Unicode, rather than to explain the status of SPACE in particular. I was misled by Jim's reference to the URL of the final draft (as clearly stamped on the first page) of 4.0, but since in fact he was quoting from 3.0 what he says can hardly be considered obsolete yet. Actually it can. And that would have been obvious to everyone if a preview version of Chapter 4 had also been posted. Once again, I appeal to people to stop trying to second-guess the text of the standard. The final pdf for the online version is in preparation even as I write this. The final final proofs for the book itself have already been produced by the printer -- all they need to do now is turn on the press and start the binder. If everyone would just go off for a week or two on their August vacation, like they should be, we could all come back about Labor Day and we wouldn't have to be having these discussions. ;-) --Ken
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

> > Or it may not. It may be a deficiency in the level of Unicode support afforded by the fonts and rendering engines. ...
>
> If there are such deficiencies in fonts and rendering engines which purport to be Unicode compliant, that suggests a lack of clarity in the standard which should be rectified.

I wish I had a dollar for every "Unicode-compliant" font, rendering engine, or other software that was in some way less compliant than advertised. Only a fraction of the non-compliances are traceable to ambiguities or deficiencies in the Unicode Standard.

> > ... It may simply reflect a difference between your requirements and what the standard promises, and doesn't promise.
>
> If Unicode doesn't promise what I require, surely it is at least reasonable for me to ask on this list whether it ought to be extended or clarified to do so. The UTC may choose not to make any changes, but I don't see why they shouldn't even be asked to.

Absolutely, you are allowed to ask. Go ahead. I wasn't trying to prevent questions from being asked, only trying to state why I think the problem is out of scope for Unicode.

> > The standard doesn't say anything about width in this case. It leaves it up to the display engine, which is as it should be.
>
> The standard does say, section 2.10 of 4.0, that "In rendering, the combination of a base character and a nonspacing character may have a different advance width than the base character itself."

I apologize for missing this reference.

> And any intelligent typographer will realise that this "may" is a "must", with regular character designs but not of course in monospace, in some cases like the example given of i with circumflex. This sentence applies to spaces with diacritics as space is a base character, as we have been informed. The subsection of 2.10 entitled "Spacing Clones of European Diacritical Marks" (by the way, why "European" when the text appears to apply to all diacritical marks?) should suggest to any intelligent typographer that the sequence space, diacritic is intended to be spaced as the diacritic and not as a space, but it would help for this to be clarified as not all typographers are very intelligent and some may not be aware that this space has actually lost most of the properties of a space e.g. line breaking and is being used only "By convention".

Like Freud's cigar, sometimes a "may" is just a "may". And I suspect the phrase "any intelligent typographer" MAY generate some flak from typographers on this list who consider themselves intelligent enough yet have a different opinion.

I'm not a typographer (intelligent or otherwise), but I'm having a tough time seeing how Section 2.10 *requires* fonts and rendering engines to give a space-plus-combining-diacritic combination the exact minimum width of the diacritic alone, or to leave equal space before and after such a combination. All I think it is saying is that, for example, the combination i-plus-tilde may be wider than i alone, because tilde is wider than i.

> > When the specific alignment of isolated glyphs is important to me, I use markup. I'm a big supporter of plain text, as many members of this list know, but the exact spacing of isolated combining marks seems like a layout issue to me.
>
> OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it?

All right, you've got me there. I'll have to think about it. But I still think this is a layout problem, a problem having to do with glyphs and not characters.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk posted:

> If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does?

I think that the practice of a font or application automatically inserting a dotted circle under an orphaned combining character is dubiously compliant with the Unicode specifications.

In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs:

    Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character.

So the various space characters (class Zs) are also classified as format characters.

From http://www.unicode.org/book/ch04.pdf:

    _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character.

Accordingly, by definition, spaces are not base characters.

Also from http://www.unicode.org/book/ch04.pdf:

    _D14 Combining character:_ a character that graphically combines with a preceding base character. The combining character is said to _apply_ to the base character.

So we know what happens when a combining character follows a base character. It combines with it.

What happens when a combining character follows a character that is not a base character or appears initially? The same source explains:

    o Even though a combining character is intended to be presented in graphical combination with a base character, circumstances may arise where either (1) no base character precedes the combining character or (2) a process is unable to perform graphical combination. In both cases it may present a combining character without graphical combination; that is, it may present it as if it were a base character.

    o The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in a graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.

So a display device *may* present an orphaned combining character as suggested. But the word "may" is weak. Are there other things it may do that would still be compliant with Unicode? May it ignore the character altogether? May it display the character as U+FFFD REPLACEMENT CHARACTER? May it display the mark over some other character altogether, perhaps even U+25CC DOTTED CIRCLE?

This is the only way I can justify the display of U+25CC DOTTED CIRCLE in such cases by the Unicode specifications. But then is there any display that is not acceptable according to these specifications?

Note that even if an application takes the suggestion made here, the combination of the non-base character SPACE followed by a combining character would be rendered as the non-base character SPACE followed by the combining character rendered as a base character. They would not be combined.

From the same source:

    _D17a Defective combining character sequence:_ a combining character sequence that does not start with a base character.

    o Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are not _ill-formed_. (See D30.)

Accordingly any space character followed by a combining character is a defective combining character sequence.

From http://unicode.org/book/ch07.pdf:

    *Marks as Spacing Characters.* By convention, combining marks may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NO-BREAK SPACE. This approach might be taken, for example, when referring to the diacritical mark itself as a mark, rather than by using it in its normal way in text. The use of U+0020 SPACE versus U+00A0 NO-BREAK SPACE affects line-break behavior.

The words "by convention" are odd. They perhaps acknowledge that this shouldn't work according to the other general Unicode rules and definitions.

This passage, however, does not even hint that by convention a dotted circle should appear under the diacritic. Presumably if someone wanted a combining character applied to a dotted circle that person would code U+25CC followed by the combining character.

One could fix this messiness by changing the definition of base character to specifically include U+0020 SPACE and U+00A0 NO-BREAK SPACE. That in effect is exactly what the above passage does. So do it in a structured manner by making it part of the rule instead of burying it in the text as an odd exception to the rule.

But it does seem philosophically odd that U+0020 and U+00A0 alone of the category Zs characters should be especially singled out. It would be more intuitive if all Zs characters could be included in the
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Tuesday, August 05, 2003 5:40 PM, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. Well, I think Jim Allan pointed to the source of this notion in his email of a few hours ago. 1) From the UCD: 0020;SPACE;Zs;... 2) From Unicode 3, Section 4.5, third paragraph (in its entirety): Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because General Category assigns only a single value to each character. I believe that reasonable people might reasonably conclude from factoids 1 and 2 that SPACE is indeed a format character. Reasonable, but evidently wrong. Explanation, please? Ted Ted Hopp, Ph.D. ZigZag, Inc. [EMAIL PROTECTED] +1-301-990-7453 newSLATE is your personal learning workspace ...on the web at http://www.newSLATE.com/
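The disagreement here turns on which General Category values count as "format" for the purposes of D13. The sketch below applies the reading Ken gives elsewhere in this thread, where format control characters are {Cf, Zl, Zp} and Zs is excluded. It is a simplification under stated assumptions: "graphically combines" is approximated by General Category M*, the function names are hypothetical, and Python's `unicodedata` reflects a newer UCD than the versions being argued about.

```python
import unicodedata

# Assumption taken from the thread: Cc is control, {Cf, Zl, Zp} are format.
CONTROL_OR_FORMAT = {"Cc", "Cf", "Zl", "Zp"}

def is_base_character(ch: str) -> bool:
    """Approximate D13: not a combining mark, not control, not format."""
    cat = unicodedata.category(ch)
    return not cat.startswith("M") and cat not in CONTROL_OR_FORMAT

def is_defective(seq: str) -> bool:
    """Approximate D17a: the sequence does not start with a base character."""
    return seq == "" or not is_base_character(seq[0])

print(is_base_character(" "))        # SPACE is Zs
print(is_defective(" \u0302"))       # SPACE + COMBINING CIRCUMFLEX ACCENT
print(is_defective("\ufeff\u0302"))  # mark after a Cf character
```

Under this reading SPACE qualifies as a base character, so SPACE followed by a combining mark is not a defective sequence, whereas a mark that follows U+FEFF (category Cf) is.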
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. -- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 15:09, Mark Davis wrote:

> > Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character.
>
> Whenever you have a question about the status of a character, you need to look it up in the UCD. You can either do that by going through the Unicode website, or if you want a more readable interface, use the ICU character browser, which formats that data. Look at space, U+0020.
>
> http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?go=0020ch.x=4ch.y=7
>
> The general category is Space_Separator, *not* a format character.
>
> Now the wording there could definitely be clearer, but the operant phrase is: "...but their membership in the Z (separator) class *takes precedence* over their membership in the Cf class..."
>
> So it would be clearer to say something like: "In many ways the characters Zs, Zl, and Zp are similar to format characters, but because their general usage is significantly different they are broken out into a separate General Category, as Separator characters."
>
> Mark
> __
> http://www.macchiato.com
> Eppur si muove

Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like "In this language the diacritic ^ may appear above the letters ...", but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one?

--
Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

> Suppose for example I want to write a sentence like "In this language the diacritic ^ may appear above the letters ...", but instead of ^ I want to use a combining character, a regularly positioned "centred above the letter" diacritic, which does not have a defined spacing variant. I don't want a dotted circle. And I want it to be spaced as here, i.e. with one space before the diacritic and one after it. It seems to me that at one place in the standard I am told to encode space - combining mark - space, for the combining mark will not combine with the space because the space is not a base character; and in another place I am implicitly told to encode space - space - combining mark - space, because the second space acts as a carrier for the combining mark.

space + (space + combining character) + space

> Perhaps a simple way ahead would be to define a new character something like COMBINING MARK HOLDER...

Uhh, no.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/