Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kenneth Whistler

Peter responded to Mark:

 On 05/08/2003 14:40, Mark Davis wrote:
 
 Where did you get the notion that space is not a base character? And
 base characters include those that are not control or format
 characters. Space is neither one.
 
 The standard specifically states in a number of places that to exhibit
 a combining mark in isolation you use a space (or NBSP).
 
 Mark
 __
 http://www.macchiato.com
 ►  “Eppur si muove” ◄
 
   
 
 I got this from the Unicode Standard 4.0, as quoted by Jim Allan:

*Mis*quoted by Jim Allan.

 
  In http://www.unicode.org/book/preview/ch03.pdf the space characters 
  in general are given class Zs:
 
   Zs, Zl, and Zp are considered format characters, but their 
  membership in the Z (separator) class takes precedence over their 
  membership in the Cf class, because the General Category assigns only 
  a single value to each character. 

That piece of text is *NOT* a quotation from Chapter 3 of Unicode
4.0. Go to that URL and search for it yourself.

It is quoted from Chapter 4 of Unicode *3.0*, p. 88, in the
discussion of General Category in Section 4.5, General Category --
Normative in Part. The corresponding paragraph has been deleted
from the relevant section in Unicode 4.0, precisely because the
standard now precisely defines format control characters as
{Cf, Zl, Zp} but *ex*cluding Zs. See p. 25 in:

http://www.unicode.org/book/preview/ch02.pdf

 
  So the various space characters (class Zs) are also classified as 
  format characters.
 
  From http://www.unicode.org/book/ch04.pdf:
 
   _D13  Base character:_ a character that does not graphically 
  combine with preceding character, and that is neither control nor a 
  format character. 
 
  Accordingly, by definition, spaces are not base characters.

This conclusion is false. As Mark indicated, SPACE (and NBSP) are
base characters, and have been treated as such in terms of
diacritic application since Unicode 1.0 was published:

By convention, diacritical marks used by the Unicode encoding
scheme may be exhibited in (apparent) isolation by applying
them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This
might be done, for example, when talking about the diacritical 
mark itself as a mark, rather than using it in its normal way
in text.
 -- Unicode 1.0, p. 19 [1991]
 
And that *is* an accurate quote from the standard. In Unicode 4.0
that text survives as:

By convention, diacritical marks used by the Unicode Standard
may be exhibited in (apparent) isolation by applying
them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This tactic
might be employed, for example, when talking about the diacritical 
mark itself as a mark, rather than using it in its normal way
in text.
 -- Unicode 4.0, p. 46 [2003]

I'd say the intent of the UTC and the Unicode Standard in this
regard has always been rather clear and has stayed
unchanged for quite some time.
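
A minimal sketch of the reading above, assuming Python's standard unicodedata module (which reflects a later UCD than 4.0). The helper is_base_character is a hypothetical name, and it only approximates D13 by treating every General Category M* character as "graphically combining":

```python
import unicodedata

# Per the message above: format control characters are the General
# Categories Cf, Zl and Zp. Zs (space separators) is excluded.
FORMAT_CATEGORIES = {"Cf", "Zl", "Zp"}


def is_base_character(ch: str) -> bool:
    """Rough approximation of D13 (hypothetical helper, not from the standard)."""
    gc = unicodedata.category(ch)
    if gc.startswith("M"):  # Mn, Mc, Me: treated here as graphically combining
        return False
    if gc == "Cc":          # control characters
        return False
    return gc not in FORMAT_CATEGORIES


SPACE, NBSP, COMBINING_ACUTE = "\u0020", "\u00A0", "\u0301"

# The convention quoted above: apply the mark to SPACE or NBSP.
isolated_on_space = SPACE + COMBINING_ACUTE
isolated_on_nbsp = NBSP + COMBINING_ACUTE

for ch in (SPACE, NBSP, COMBINING_ACUTE, "\u2028"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_base_character(ch))
```

Under these assumptions SPACE and NBSP both come out as base characters, while the combining acute and LINE SEPARATOR do not.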

--Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Doug Ewell

Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

 Point taken. But when different fonts and rendering engines give
 different results because the standard is unclear or ambiguous, that
 is a matter for the discussion here. And when conforming fonts and
 rendering engines fail to give the required results, that may also be
 because of a deficiency in the standard.

Or it may not.  It may be a deficiency in the level of Unicode support
afforded by the fonts and rendering engines.  It may simply reflect a
difference between your requirements and what the standard promises,
and doesn't promise.

 It seems that many rendering engines give to the sequence space,
 combining mark the width normally assigned to a space. Is this
 actually what the standard suggests?

The standard doesn't say anything about width in this case.  It leaves
it up to the display engine, which is as it should be.

 I have identified a need to display combining marks with no extra
 width, only the width required by the mark. Should the sequence space,
 combining mark do what I want, or shouldn't it? If so, this needs to
 be spelled out so that rendering engines know what they are supposed
 to do. If not, there may be a need for a new character. This is a
 deficiency in the standard, not in the rendering engines.

When the specific alignment of isolated glyphs is important to me, I use
markup.  I'm a big supporter of plain text, as many members of this list
know, but the exact spacing of isolated combining marks seems like a
layout issue to me.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kent Karlsson


  there is no such thing as NFD decompositions.
 
 Sorry for the confusion. Still even with a NFKD decomposition, 

And there is no such thing as NFKD decomposition either.
It goes as follows, in steps:

1. Canonical and compatibility decomposition mappings (one-step),
   and canonical classes.

2. Canonical and compatibility full/recursive decompositions
   and canonical reordering. The compatibility (full) decompositions
   make use of both the canonical and compatibility
   decomposition mappings.

3. Canonical and compatibility equivalences.

4. The four Unicode normal forms (NFD, NFC, NFKD, and NFKC).

Please don't turn it upside down, that's only confusing!

Ok, the formal definitions of equivalences and normal forms
are a bit backwards in the Unicode Standard, defining NFD
(in practice, though not by name) before the equivalences.
Normally, a normal form is defined as a particular representative
element in an equivalence class...

But there is no need to aggravate the backwardsness into
cyclicity.
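
The four layers listed above can be walked through with Python's standard unicodedata module. This is only an illustrative sketch; the choice of U+1E69 as the example character is mine, not the poster's:

```python
import unicodedata

ch = "\u1E69"  # LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE

# Step 1: the one-step decomposition mapping, and the canonical classes.
one_step = unicodedata.decomposition(ch)
classes = [unicodedata.combining(c) for c in "\u0323\u0307"]

# Step 2: the full (recursive) decomposition with canonical reordering.
full = unicodedata.normalize("NFD", ch)

# Step 3: canonical equivalence. A differently ordered spelling of the
# same marks has the same full decomposition.
misordered = "s\u0307\u0323"
equivalent = unicodedata.normalize("NFD", misordered) == full

# Step 4: a normal form picks one representative of the equivalence class.
nfc = unicodedata.normalize("NFC", misordered)

print(one_step, classes, [f"U+{ord(c):04X}" for c in full], equivalent)
```

The one-step mapping stops at U+1E63 plus U+0307, while the full decomposition recurses down to the letter s and two marks.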

...
 It's true that not all (only most)  combining non-spacing
 characters have a non-combining spacing counterpart.

Only a *few* g.c. Mn characters have spacing counterparts!
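
One rough way to check this claim against whatever UCD version ships with a Python interpreter is sketched below. It assumes that a "spacing counterpart" means a character whose decomposition mapping is exactly <compat> SPACE + mark, so it ignores other compatibility tags such as <isolated>; spacing_counterparts is a hypothetical helper name.

```python
import sys
import unicodedata


def spacing_counterparts():
    """Map each Mn mark to the characters that decompose to SPACE + mark."""
    found = {}
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Looking for exactly: <compat> 0020 XXXX
        if len(fields) == 3 and fields[:2] == ["<compat>", "0020"]:
            mark = chr(int(fields[2], 16))
            if unicodedata.category(mark) == "Mn":
                found.setdefault(mark, []).append(chr(cp))
    return found


counterparts = spacing_counterparts()
total_mn = sum(
    1 for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Mn"
)
print(len(counterparts), "of", total_mn, "Mn characters have a spacing counterpart")
```

The exact counts depend on the Unicode version of the interpreter, so none are quoted here.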

/kent k

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Noah Levitt

According to the docs at
http://www.microsoft.com/typography/otfntdev/indicot/other.htm,
uniscribe renders combining marks in isolation when they are
applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted
circle.) Perhaps this is an acceptable solution to the
people calling for a new character.

  Combining marks and signs that appear in text not in
  conjunction with a valid consonant base are considered
  invalid. Uniscribe displays these marks using the fallback
  rendering mechanism defined in the Unicode Standard
  (section 5.12, 'Rendering Non-Spacing Marks' of the
  Unicode Standard 3.1), i.e. positioned on a dotted circle. 

  Please note that to render a sign standalone (in apparent
  isolation from any base) one should apply it on a space
  (see section 2.5 'Combining Marks' of the Unicode
  Standard). Uniscribe requires a ZWJ to be placed between
  the space and a mark for them to combine into a standalone
  sign.
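
For comparison, the two encodings discussed here can be built as follows. This is a sketch only: isolated_mark is a hypothetical helper, and the code says nothing about how any particular renderer will display the result.

```python
import unicodedata

SPACE, NBSP, ZWJ = "\u0020", "\u00A0", "\u200D"


def isolated_mark(mark: str, carrier: str = SPACE, joiner: str = "") -> str:
    """Return carrier [+ joiner] + mark, for showing a mark on its own."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return carrier + joiner + mark


candrabindu = "\u0901"  # DEVANAGARI SIGN CANDRABINDU

plain = isolated_mark(candrabindu)                   # SPACE + mark
with_zwj = isolated_mark(candrabindu, joiner=ZWJ)    # SPACE + ZWJ + mark

for s in (plain, with_zwj):
    print(" + ".join(unicodedata.name(c) for c in s))
```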

Noah

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Curtis Clark

on 2003-08-06 15:24 Doug Ewell wrote:

 I'm not a typographer (intelligent or otherwise), but I'm having a tough
 time seeing how Section 2.10 *requires* fonts and rendering engines to
 give a space-plus-combining-diacritic combination the exact minimum
 width of the diacritic alone, or to leave equal space before and after
 such a combination.  All I think it is saying is that, for example, the
 combination i-plus-tilde may be wider than i alone, because tilde is
 wider than i.

Considering that one approach is to use OpenType to map a letter plus
diacritical to a single glyph, an obvious solution would be to include
space + diacritical combos in that table. An important font issue, but a
font issue nonetheless.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Philippe Verdy

On Sunday, August 10, 2003 9:30 AM, Mark Davis [EMAIL PROTECTED] wrote:

  As for oe-ligature, the
  French representative to WG3 (or its predecessor) said that France
  could live without it.
 
 Even worse; the story I heard was that the committee had planned from
 the start to have Œ and œ in positions D7 and F7, but that late in the
 process the representative from France objected, so they replaced them
 by × and ÷. That would certainly explain why these symbols are in the
 middle of a batch of letters...

It's true that in French these are really ligatures, and not plain letters,
meaning that this is mostly a standard typographic convention, rather
than orthographic. The national AFNOR may have opted for this solution
thinking that these holes would have benefited other languages
commonly used in Europe, and there were probably other candidate
characters that finally got encoded in a separate ISO-8859-* set.

I don't know which compromise was reached, but the original DEC VT set
also had holes at those positions. It's just strange that the ISO working
group opted for those two characters at D7 and F7, when there could
have been a pair of characters coded for Finnish, or Catalan (like the
dotted L which is still coded with a separate middle dot symbol instead
of a true diacritic, and that renders quite poorly with ISO-8859-1 and
even with Windows 1252). Well, French and Catalan writers have lived
with those encoded sequences, and fixed the rendering using ligating
rules in their renderers or fonts (or used the oe/OE ligatures in
Windows1252).

I just suspect that the French objection on oe/OE was related to the
fear of modifying keyboards that were previously created based on
the French version of ISO646, where such ligature could not be coded.
Since then, the AFNOR version of ISO646-FR has been simplified to
remove the tricky combining sequences built with BACKSPACE,
like C+BACKSPACE+COMMA to code a C WITH CEDILLA, as they
were no longer necessary with a more universally used 8-bit set (7-bit
sets have survived only within Teletex/Videotex standards, built in
accordance with ISO646 with SS2 sequences to encode non-spacing
diacritics *before* the base character with which they combine, to
match the keyboard input order based on dead keys for combining
diacritics, and this 7-bit set is probably the only one remaining in
large use today for French, with ISO646-FR now nearly extinct
in favor of ISO646-US/ASCII).

-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 05/08/2003 14:40, Mark Davis wrote:

Where did you get the notion that space is not a base character? And
base characters include those that are not control or format
characters. Space is neither one.
The standard specifically states in a number of places that to exhibit
a combining mark in isolation you use a space (or NBSP).
Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄
 

I got this from the Unicode Standard 4.0, as quoted by Jim Allan:

In http://www.unicode.org/book/preview/ch03.pdf the space characters 
in general are given class Zs:

 Zs, Zl, and Zp are considered format characters, but their 
membership in the Z (separator) class takes precedence over their 
membership in the Cf class, because the General Category assigns only 
a single value to each character. 

So the various space characters (class Zs) are also classified as 
format characters.

From http://www.unicode.org/book/ch04.pdf:

 _D13  Base character:_ a character that does not graphically 
combine with preceding character, and that is neither control nor a 
format character. 

Accordingly, by definition, spaces are not base characters.


--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 08/08/2003 09:54, Jim Allan wrote:

...

 It certainly makes sense that in the case of space characters that
 have a defined width that this width is innate to the definition of
 the character and in such a case should take precedence over the width
 of the normally non-spacing combining character.

 I would welcome clear instructions by Unicode on this point where
 either result would be useful in order that applications may be
 expected to produce results that are consistent with each other. :-)

Agreed!

 I would think it would be consistent with Unicode for an application
 to shrink the width of normal space followed by a diacritic such as a
 single overdot as exact formatting behavior is not defined in such cases.

Well, is a space followed by a diacritic actually a space, or is it the
same code point reused or overloaded "By convention" (to quote the
standard) for a logically distinct purpose? Some of the discussions here
have implied the latter. Either way, the best clarification would be to
add a character whose explicit function is to form non-spacing variants
of diacritics.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kenneth Whistler

Ted Hopp asked:

 I believe that reasonable people might reasonably conclude from factoids 1
 and 2 that SPACE is indeed a format character.
 
 Reasonable, but evidently wrong. Explanation, please?

I provided the text deconstruction in my last email, but to
continue, the confusion arises from the strange nature of
SPACE in the history of character encoding.

SPACE, for a long time now in the history of character encodings,
has been classified as a *graphic* character. Certainly, in
the general SC2 character encoding context of ISO 2022,
SPACE always shows up in the G0 set, with other graphic
characters, instead of in the various control functions
encoded in C0 or C1 sets.

But looked at from the legacy of device control, SPACE
could just as well have been categorized as a control function:
MOVE PRINT HEAD ONE UNIT RIGHT, comparable to BACKSPACE.

And in the context of the Unicode Standard, people often
loosely talk about space characters as being format
characters, since they are a) more akin to punctuation than
normal letters, b) have no glyph associated with them,
and c) impact line-breaking and other aspects of the formatting
of characters in their vicinity.

But the *formal* categorization of Unicode characters,
defined by the UTC to help eliminate this kind of
ambiguity in talk about the character types, is spelled
out in Figure 2.5 of Unicode 4.0 now:

http://www.unicode.org/book/preview/ch02.pdf

and the *formal* meaning of format control character
(Basic type = Format) in Unicode is now any character 
with the General Category of {Cf, Zl, Zp}.

The space characters are all lumped in with graphic characters.

So while there are still some ambiguities to be worked out
in the definition of base character in the Unicode Standard,
neither the status of SPACE as a graphic character nor the
recommendation of the standard that non-spacing marks be
applied to SPACE as a means of showing them in isolation
is in question.
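
The categorization described above might be sketched like this in Python. Only the Format set {Cf, Zl, Zp} and the placement of Zs with the graphic characters come from the message; the Control and Other branches are my assumptions, and basic_type is a hypothetical helper name.

```python
import unicodedata


def basic_type(ch: str) -> str:
    """Coarse character type derived from the General Category."""
    gc = unicodedata.category(ch)
    if gc in {"Cf", "Zl", "Zp"}:
        return "Format"
    if gc == "Cc":                      # assumption: C0/C1 controls
        return "Control"
    if gc[0] in "LMNPS" or gc == "Zs":  # spaces lumped in with graphic
        return "Graphic"
    return "Other"                      # Co, Cs, Cn: not discussed above


for ch in ("A", " ", "\u00A0", "\u0008", "\u200D", "\u2028", "\u2029"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), basic_type(ch))
```

On this reading SPACE and NBSP are Graphic, while BACKSPACE is Control and the line and paragraph separators are Format.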

--Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 05/08/2003 09:42, Jim Allan wrote:

Peter Kirk posted:

If I want to do this, should I explicitly encode a dotted circle, or
should I encode nothing and expect the font to generate the dotted
circle, as it often does? 


I think that the practice of a font or application automatically inserting
a dotted circle under an orphaned combining character is dubiously
compliant with Unicode specifications.

...


Thanks, Jim, for all this data, but now I am totally confused. Well, at 
least it seems clear that if I want a dotted circle I should explicitly 
encode it. But if I don't...

Suppose for example I want to write a sentence like "In this language
the diacritic ^ may appear above the letters ...", but instead of ^ I
want to use a combining character, a diacritic regularly positioned
centred above the letter, which does not have a defined spacing variant. I
don't want a dotted circle. And I want it to be spaced as here, i.e. 
with one space before the diacritic and one after it. It seems to me 
that at one place in the standard I am told to encode space - combining 
mark - space, for the combining mark will not combine with the space 
because the space is not a base character; and in another place I am 
implicitly told to encode space - space - combining mark - space, 
because the second space acts as a carrier for the combining mark.

I hope that wanting to display this correctly is not another place where 
I have stepped over the boundaries of what is reasonable to expect 
plain text to convey, but that this too can be grist for the Unicode 
5.0 mill to grind very finely - both quotes from Ken Whistler earlier 
today. And I think that if this issue is clarified it will also become 
clear what should be done about string initial holam and alef etc.

Perhaps a simple way ahead would be to define a new character something 
like COMBINING MARK HOLDER with no glyph, which is defined specifically 
for this purpose, is a base character and not a format character, and is 
expected to be just as wide as is necessary to display the combining 
mark. Then we could say that a spacing accent is equivalent  (possibly 
even canonically if made a composition exclusion?) to COMBINING MARK 
HOLDER plus a non-spacing accent, and remove the misleading 
compatibility equivalences to SPACE plus a non-spacing accent.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kent Karlsson


 The NFD decompositions of spacing marks are already defined as a SPACE
 plus a non-spacing combining character. 

Philippe, please!  Those are *compatibility* decompositions. The normal
form NFD only uses *canonical* decompositions. And there is no such
thing as NFD decompositions.
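
The distinction is easy to observe with Python's standard unicodedata module. A small sketch, using the spacing ACUTE ACCENT (U+00B4) as my own choice of example:

```python
import unicodedata

ACUTE_ACCENT = "\u00B4"  # the spacing ACUTE ACCENT

# The decomposition mapping is tagged <compat>, so it is not canonical.
mapping = unicodedata.decomposition(ACUTE_ACCENT)

# NFD applies canonical decompositions only, so the character is unchanged.
nfd = unicodedata.normalize("NFD", ACUTE_ACCENT)

# NFKD also applies compatibility decompositions: SPACE + COMBINING ACUTE.
nfkd = unicodedata.normalize("NFKD", ACUTE_ACCENT)

print(mapping)
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```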

/kent k

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Philippe Verdy

On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 OK, what kind of markup should I use, in any well-known markup
 language, to ensure that an isolated diacritic is centred in the
 space between the words before and after it?

In plain text, I think that this encoding:
<...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2...>
is what you need, as it creates the following combining sequences:
<...endOfWord1>, <SPACE>, <SPACE, diacritic>, <SPACE>, <startOfWord2...>

If you don't want any space around the diacritic which must be displayed
isolated but in the middle of a word, the following would work:
...endOfWord1, SPACE, diacritic, startOfWord2...
Here the SPACE is not a break opportunity, but just the base character
for the diacritic inserted. What is missing in the standard is defining the
property of such SPACE+diacritic sequence: normally it inherits the
properties of the base character, and properties of diacritics are ignored.

But when using a SPACE or NBSP base character new properties may
be needed. If there's still a break opportunity on the base SPACE of a
combining sequence, it is not clear where the break occurs: before the
SPACE (i.e. before the combining sequence), or after the diacritic (i.e.
after the combining sequence)?

I think that the second option applies here, i.e. the base SPACE would
create a break opportunity at end of the whole combining sequence
made with a SPACE and the following combining characters (including
CGJ if needed to fix canonical ordering).

Another similar case would be the use of an isolated nukta (which
normally modifies a following base character): the sequence
nukta, SPACE is a single combining sequence with a break
opportunity. So a sequence like nukta, SPACE, acute accent
would be unbreakable but would include a break opportunity at its
end, unless it is followed by a NBSP.
And the sequence nukta, NBSP, acute accent would also be
unbreakable either in the middle or on both ends.
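
How a string splits into combining sequences can be approximated in a few lines of Python. This is a simplified sketch: it attaches every General Category M* character to whatever precedes it and ignores ZWJ, line breaking and full grapheme cluster rules. combining_sequences is a hypothetical helper name.

```python
import unicodedata


def combining_sequences(text: str) -> list:
    """Group each non-mark character with the marks that follow it."""
    sequences = []
    for ch in text:
        if unicodedata.category(ch).startswith("M") and sequences:
            sequences[-1] += ch   # mark: attach to the preceding sequence
        else:
            sequences.append(ch)  # base (or a leading, defective mark)
    return sequences


CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

# Two spaces before the mark: the second SPACE carries it.
spaced = combining_sequences("a  " + CIRCUMFLEX + " b")
# One space before the mark and none after it.
tight = combining_sequences("a " + CIRCUMFLEX + "b")

print([[f"U+{ord(c):04X}" for c in seq] for seq in spaced])
print([[f"U+{ord(c):04X}" for c in seq] for seq in tight])
```

With two spaces before the mark, one SPACE remains a free-standing separator and the other becomes the carrier, which matches the grouping described above.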

-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Jony Rosenne

I would like to point out that with all due respect, how particular fonts or rendering 
engines behave is only marginally relevant to the Unicode list. I think that we should 
deal only with the Unicode specification.

A particular implementation or many implementations may not behave as expected, and 
then may be either conformant or non-conformant, or may behave as expected and still 
be either conformant or non-conformant. Messages such as the attached help the 
discussion of the specification only as illustrations and as a basis for discussing 
conformity.

Jony

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk
 Sent: Wednesday, August 06, 2003 12:11 PM
 To: Curtis Clark
 Cc: Unicode List
 Subject: Re: Display of Isolated Nonspacing Marks (was Re: 
 Questions on ZWNBS...)
 
 
 On 05/08/2003 16:59, Curtis Clark wrote:
 
  on 2003-08-05 15:31 Peter Kirk wrote:
 
  Thank you, Mark. This helps to clarify things, but still doesn't
  explicitly answer my question of how to encode a sentence like "In
  this language the diacritic ^ may appear above the letters ...", but
  instead of ^ I want to use a combining character and want to
  display exactly one space before the combining character - do I
  encode two spaces or one?
 
 
  In this language the diacritic   may appear above the letters...
 
  Two spaces, at least in Thunderbird Mail.
 
 
 Thank you. Well, this sort of works. I looked in various fonts. In some
 of them the diacritic is centred in the space between the words
 "diacritic" and "may", but in others it is offset to the left or the
 right. The problem is that the space is wider than the diacritic, which
 confuses things, and all the more so no doubt if it expands for
 justification. NBSP would probably be a better choice in that it is less
 likely to expand. But what I am looking for is a diacritic holder which
 is defined to be only as wide as the diacritic. On the principle that
 base characters expand to fit the width of the diacritic, ZWSP or,
 better, a real (rather than misnamed) zero width no break space would
 seem to have the right properties for that.
 
 -- 
 Peter Kirk
 [EMAIL PROTECTED]
 http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Jim Allan

Philippe Verdy posted:

 Could ZWSP+combining diacritic be the best solution for
 isolated diacritics in text?

From http://www.unicode.org/book/ch04.pdf:

 * Such characters may be large enough to affect the placement of
 their base character relative to preceding and succeeding base
 characters. For example, a circumflex applied to an i may affect
 spacing (î), as might the character U+20DD COMBINING ENCLOSING CIRCLE.

Unless Unicode 4.0 has changed this, the words "may" and "might" here
would indicate that ZWSP is not *necessarily* the best solution.

There is no specification about what an application *must* do to be
conforming in this circumstance, merely an indication that an application
that does expand spacing for the sake of appearance is not
non-conforming. It is *probably* implied that this is the right way to go.

But I would guess that it would also be conforming for an application to 
not expand spacing at all on ZWSP so that coding of _o_ + ZWSP + 
COMBINING CIRCUMFLEX + _o_ would place the circumflex centered over _oo_ 
with its center point between the two letters.

Either result would be useful for different purposes.

It certainly makes sense that in the case of space characters that have
a defined width that this width is innate to the definition of the
character and in such a case should take precedence over the width of
the normally non-spacing combining character.

I would welcome clear instructions by Unicode on this point where either
result would be useful in order that applications may be expected to
produce results that are consistent with each other. :-)

I would think it would be consistent with Unicode for an application to
shrink the width of normal space followed by a diacritic such as a
single overdot as exact formatting behavior is not defined in such cases.
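
The alternative behaviours described above can be compared with a toy layout model. Everything here is hypothetical: the function cluster_advance, the widths (in made-up font units) and the idea that a renderer uses such a rule are illustrative assumptions, not anything the standard specifies.

```python
def cluster_advance(base_width: int, mark_width: int, expand_to_fit: bool = True) -> int:
    """Advance width of base + mark under one of two toy policies."""
    if expand_to_fit:
        # The base widens when the mark is wider than it.
        return max(base_width, mark_width)
    # The mark never changes the advance. It overhangs its neighbours.
    return base_width


# Invented widths, for illustration only.
SPACE_W, ZWSP_W, CIRCUMFLEX_W, OVERDOT_W = 250, 0, 333, 100

print("SPACE + circumflex:", cluster_advance(SPACE_W, CIRCUMFLEX_W))
print("SPACE + overdot:   ", cluster_advance(SPACE_W, OVERDOT_W))
print("ZWSP + circumflex: ", cluster_advance(ZWSP_W, CIRCUMFLEX_W))
print("ZWSP, no expansion:", cluster_advance(ZWSP_W, CIRCUMFLEX_W, expand_to_fit=False))
```

In this model a narrow overdot on SPACE keeps the full space width, a zero-width carrier is exactly as wide as its mark when expansion is applied, and without expansion the mark sits between its neighbours with no advance at all.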

Jim Allan

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Philippe Verdy

On Thursday, August 07, 2003 8:06 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 On 06/08/2003 15:47, Philippe Verdy wrote:
 
  On Wednesday, August 06, 2003 11:48 PM, Peter Kirk
  [EMAIL PROTECTED] wrote: 
  
  
  
   OK, what kind of markup should I use, in any well-known markup
   language, to ensure that an isolated diacritic is centred in the
   space between the words before and after it?
   
   
  
  In plain text, I think that this encoding:
  <...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2...>
  is what you need, as it creates the following combining sequences:
  <...endOfWord1>, <SPACE>, <SPACE, diacritic>, <SPACE>, <startOfWord2...>
  
  
 Thank you, Philippe. This is where we started. But I noted that some
 current implementations render the space diacritic combination as a
 full 
 width space with the diacritic not centred over it. I suggested that
 this was wrong, that the diacritic should be centred. Doug suggested I
 used markup outside the scope of Unicode.
 
  ...
  
  Another similar case would be the use of an isolated nukta (which
  normally modifies a following base character): the sequence
  nukta, SPACE is a single combining sequence with a break
  opportunity. So a sequence like nukta, SPACE, acute accent
  would be unbreakable but would include a break opportunity at its
  end, unless it is followed by a NBSP.
  And the sequence nukta, NBSP, acute accent would also be
  unbreakable either in the middle or on both ends.
  
  
  
 Tell me more about these nuktas which modify a FOLLOWING base
 character. 
 This is just what I have been told is illegal, non-conformant or
 something. But if this is allowed for nuktas, why shouldn't it be
 allowed for Hebrew holam?

Sorry, I should have checked my code to see exactly which character
combines with the following base character. In fact there's
already a special treatment for nukta, which gets internally swapped in
front of its base character for glyph processing, and this was a source
of confusion for me (yes, nuktas have CC=7 and are combined with the
previous base character, but only in the standard Unicode encoding
sequence, not in all legacy codepages, and not in some other
text processing that puts it in front).

In fact, I may have been thinking of the Candrabindu, which is combining
with CC=230 (above?), except in the Devanagari, Bengali, Gujarati, and
Oriya scripts, where it is combining but with CC=0 like a base character,
and in Telugu and Gurmukhi (Adak Bindi), where it is Mc (spacing)
instead of Mn (non-spacing).

This reflects a different usage of the Candrabindu in ISCII, and this is
a source of difficulty when transcoding from ISCII to Unicode...
And I'm not sure if the CC=230 for the Tibetan Candrabindu is really
accurate with its specific combining model.

The treatment of Anusvara (or Tibetan JeSuNgaRo or Gurmukhi Bindi
or Sinhala Anusvaraya) as a combining character with CC=0 is also
script specific, as it is either Mc or Mn. The same thing may be said
about Visarga signs (or Sinhala Visargaya)

Such special treatment is not needed for the Viramas (CC=9), as it
more or less behaves like a standard vowel sign, i.e. a regular diacritic.
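
The combining classes mentioned here can be inspected with Python's standard unicodedata module. A sketch, with the caveat that the module reflects a much later UCD than the one under discussion, so categories of some signs may differ from their 2003 values:

```python
import unicodedata

KA = "\u0915"           # DEVANAGARI LETTER KA
NUKTA = "\u093C"        # DEVANAGARI SIGN NUKTA
VIRAMA = "\u094D"       # DEVANAGARI SIGN VIRAMA
CANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU

for ch in (NUKTA, VIRAMA, CANDRABINDU):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), unicodedata.combining(ch))

# Canonical reordering sorts marks by class, so a nukta (7) typed after
# a virama (9) is moved in front of it by normalization.
reordered = unicodedata.normalize("NFD", KA + VIRAMA + NUKTA)
print([f"U+{ord(c):04X}" for c in reordered])
```

Marks with class 0, such as the Devanagari candrabindu, are never moved by canonical reordering, which is one practical consequence of the CC=0 assignments described above.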

The original encoding model for Indian scripts has a lot of legacy text
resources coded with ISCII with a unified model that Unicode treats
more or less specially, but with its own difficulties (we can ignore the
ISCII font controls, or we can consider other ISCII control signs to
manage it like ISO2022 with script switch controls, which are not
encoded in Unicode). Despite what the Unicode reference section
documents in the specific chapter for Brahmic scripts, there's little
help here to avoid the confusions, notably because the same
chapter covers scripts that have been encoded with distinct
character models (notably Thai and Lao).

For now the current text in Unicode 3 seems not very helpful to
disambiguate things, and I hope that this chapter about Indic
scripts will be greatly enhanced to cover the actual usages, and
that Thai and Lao will be discussed separately from other
Indic scripts. For now, I think that the ISCII or TIS620 standards
are much more precise and helpful than the Unicode reference
for the scripts they cover in a different way, with lots of conversion
caveats not explained. At first read this chapter seems to make
prominent reference to ISCII and TIS620, but there are
some quirks where both references seem to contradict the
actual usage of combining sequences, for which new Unicode
properties should be added and made precise (even if combining
classes cannot be changed for stability reasons, and even if
normalized forms are considered canonically equivalent, or
distinct, when in reality they combine the same way and
one form is considered normal and the others non-standard
or defective according to the original ISCII or TIS620 standard).

-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Mark Davis

Moreover, as I wrote before, the wording in that one paragraph in 3.0
is not clearly stated, but it is clear from a reading of the rest of
the standard -- with numerous examples -- and from the UCD 3.0
properties, that space *is not* a format character, and *is* a
suitable base for combining marks. So the little coy remark below is
not warranted with respect to combining marks on space.

  OK, understood now. As the previous version is obsolete, and the new one
  is unavailable, we can all take a break from conforming to Unicode at

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, August 06, 2003 15:48
Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions
on ZWNBS...)


 Peter Kirk responded to my plea for everyone to relax a bit:

  If everyone would just go off for a week or two on their
  August vacation, like they should be, we could all come back
  about Labor Day and we wouldn't have to be having these
  discussions. ;-)
  
  --Ken

  OK, understood now. As the previous version is obsolete, and the new one
  is unavailable, we can all take a break from conforming to Unicode at
  all and take a vacation! Sounds a good idea to me  ;-)

 Just in the interest of truth in advertising, the previous
 version(s) are not obsolete, but are superseded by Unicode 4.0.
^^^

 Applications claiming conformance to Unicode 3.0 will continue
 to claim conformance to that version, and that version is
 relevant to their claim. And so on for Unicode 3.1 and
 Unicode 3.2.

 But if and when people move on to claiming conformance to
 Unicode 4.0, then it is the text of *that* version which becomes
 relevant to their claim.

 We are simply in the inconvenient transition state where people
 are building Unicode 4.0 implementations, but the final, final
 text of the *book* (as opposed to the various UAX's and all
 the data files) is not available. There were similar
 transition periods for Unicode 1.0, Unicode 2.0, and Unicode 3.0,
 and nearly everyone understands that is the nature of things.

 So yes, please, it's time to take a vacation! :)

 --Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 05/08/2003 15:53, Ted Hopp wrote:

On Tuesday, August 05, 2003 5:40 PM, Mark Davis wrote:
 

Where did you get the notion that space is not a base character? And
base characters include those that are not control or format
characters. Space is neither one.
   

Well, I think Jim Allan pointed to the source of this notion in his email of
a few hours ago.
1) From the UCD:
0020;SPACE;Zs;...
2) From Unicode 3, Section 4.5, third paragraph (in its entirety):
Zs, Zl, and Zp are considered format characters, but their membership in
the Z (separator) class takes precedence over their membership in the Cf
class, because General Category assigns only a single value to each
character.
I believe that reasonable people might reasonably conclude from factoids 1
and 2 that SPACE is indeed a format character.
Reasonable, but evidently wrong. Explanation, please?

Ted

Ted Hopp, Ph.D.
ZigZag, Inc.
[EMAIL PROTECTED]
+1-301-990-7453
newSLATE is your personal learning workspace
  ...on the web at http://www.newSLATE.com/


 

From what Ken says, it sounds like it will be wrong from whenever 
Unicode 4.0 is officially issued because this paragraph  has been 
excised from that standard. But until then it seems to be correct, SPACE 
is indeed considered a format character. I was misled by Jim's 
reference to the URL of the final draft (as clearly stamped on the first 
page) of 4.0, but since in fact he was quoting from 3.0 what he says can 
hardly be considered obsolete yet.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread John Cowan

Peter Kirk scripsit:

 This is a clear demonstration that Microsoft also has problems with the 
 mechanism which has been defined in the standard for ten years,  

This is a clear demonstration that Uniscribe fails to implement a
standard correctly, a property unique neither to Microsoft nor to the
Unicode Standard.

-- 
Knowledge studies others / Wisdom is self-known;  John Cowan
Muscle masters brothers / Self-mastery is bone;   [EMAIL PROTECTED]
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 06/08/2003 05:58, Jony Rosenne wrote:

I would like to point out that with all due respect, how particular fonts or rendering engines behave is only marginally relevant to the Unicode list. I think that we should deal only with the Unicode specification.

A particular implementation or many implementations may not behave as expected, and then may be either conformant or non-conformant, or may behave as expected and still be either conformant or non-conformant. Messages such as the attached help the discussion of the specification only as illustrations and as a basis for discussing conformity.

Jony

 

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk
Sent: Wednesday, August 06, 2003 12:11 PM
To: Curtis Clark
Cc: Unicode List
Subject: Re: Display of Isolated Nonspacing Marks (was Re: 
Questions on ZWNBS...)

On 05/08/2003 16:59, Curtis Clark wrote:

   

on 2003-08-05 15:31 Peter Kirk wrote:

 

Thank you, Mark. This helps to clarify things, but still doesn't
explicitly answer my question of how to encode a sentence like "In
this language the diacritic ^ may appear above the letters ...", but
instead of ^ I want to use a combining character and want to
display exactly one space before the combining character - do I
encode two spaces or one?
   

In this language the diacritic   may appear above the letters...

Two spaces, at least in Thunderbird Mail.

 

Thank you. Well, this sort of works. I looked in various fonts. In some
of them the diacritic is centred in the space between the words
"diacritic" and "may", but in others it is offset to the left or the
right. The problem is that the space is wider than the diacritic, which
confuses things, and all the more so no doubt if it expands for
justification. NBSP would probably be a better choice in that it is less
likely to expand. But what I am looking for is a diacritic holder which
is defined to be only as wide as the diacritic. On the principle that
base characters expand to fit the width of the diacritic, ZWSP or,
better, a real (rather than misnamed) zero width no break space would
seem to have the right properties for that.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




   



 

Point taken. But when different fonts and rendering engines give 
different results because the standard is unclear or ambiguous, that is 
a matter for the discussion here. And when conforming fonts and 
rendering engines fail to give the required results, that may also be 
because of a deficiency in the standard.

It seems that many rendering engines give to the sequence space, 
combining mark the width normally assigned to a space. Is this actually 
what the standard suggests? I have identified a need to display 
combining marks with no extra width, only the width required by the 
mark. Should the sequence space, combining mark do what I want, or 
shouldn't it? If so, this needs to be spelled out so that rendering 
engines know what they are supposed to do. If not, there may be a need 
for a new character. This is a deficiency in the standard, not in the 
rendering engines.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 09/08/2003 13:23, Noah Levitt wrote:

According to the docs at
http://www.microsoft.com/typography/otfntdev/indicot/other.htm,
uniscribe renders combining marks in isolation when they are
applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted
circle.) Perhaps this is an acceptable solution to the
people calling for a new character.
 Combining marks and signs that appear in text not in
 conjunction with a valid consonant base are considered
 invalid. Uniscribe displays these marks using the fallback
 rendering mechanism defined in the Unicode Standard
 (section 5.12, 'Rendering Non-Spacing Marks' of the
 Unicode Standard 3.1), i.e. positioned on a dotted circle. 

 Please note that to render a sign standalone (in apparent
 isolation from any base) one should apply it on a space
 (see section 2.5 'Combining Marks' of the Unicode
 Standard). Uniscribe requires a ZWJ to be placed between
 the space and a mark for them to combine into a standalone
 sign.
Noah

 

This is a clear demonstration that Microsoft also has problems with the 
mechanism which has been defined in the standard for ten years, that 
space followed by diacritic is legal and should be rendered as the 
isolated diacritic. But the alternative mechanism which they have 
implemented is non-standard and apparently a defective combining 
sequence, as ZWJ (if I remember correctly) is not a base character. The 
best way to fix this situation is to define a new character with the 
correct properties.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Jon Hanna

 (provided that the whitespace normalization algorithm will not
 include ZWSP in the whitespaces sequence and treat it
 isolately, something that a conforming HTML or XML processor
 should not do, as it should unify only sequences of SPACE,
 TAB, CR, LF, and only according to the context of the
 containing element whitespace properties controlling the
 normalization of XML whitespace sequences (leading, trailing,
 line break preservation, tabulator)...

ZWSP being normalised would be quite a bizarre bug; I can see it happening only if 
someone relied on an isWhiteSpace function provided by a non-XML-aware library and that 
function considered ZWSP to be whitespace. I've never seen this, although I have seen 
similar assumptions made about how characters act in XML, and some deeply incorrect 
ones about how octets act in XML (that is, they made incorrect assumptions about 
encodings, or even had no thoughts about encodings at all, an error which some 
environments and languages can lead the naïve to).

NEL and LSEP are added to your list of characters affected by whitespace 
normalisation for XML 1.1. Possibly some people implemented the suggestion in 
http://www.w3.org/TR/newline before 1.1.
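
The kind of mistake described above can be sketched in Python. This is only an illustration, not the code of any actual XML processor: the helper name collapse_xml_whitespace is made up for this example, and the character set is the XML 1.0 whitespace set (SPACE, TAB, CR, LF). It contrasts that fixed set with a general-purpose whitespace test such as str.isspace().

```python
import re

# XML 1.0 whitespace: SPACE, TAB, CR, LF only.
# (collapse_xml_whitespace is a hypothetical helper for illustration.)
XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse_xml_whitespace(text):
    """Collapse each run of XML 1.0 whitespace to a single SPACE."""
    return XML_WS_RUN.sub(" ", text)


ZWSP = "\u200b"  # ZERO WIDTH SPACE
NBSP = "\u00a0"  # NO-BREAK SPACE

sample = "a \t\n b" + ZWSP + "c" + NBSP + "d"
collapsed = collapse_xml_whitespace(sample)

# A generic whitespace test is not the XML definition: NBSP passes
# str.isspace() even though it is not XML whitespace.
nbsp_is_generic_space = NBSP.isspace()
```

With the explicit character class, ZWSP and NBSP pass through untouched; a processor built on a generic "is whitespace" test would treat at least NBSP differently.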

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread John Cowan

Peter Kirk scripsit:

 Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only 
 compatibility equivalences to space plus diacritic, and U+005E and 
 U+0060 don't even have compatibility equivalences.

Indeed.  The last two, BTW, are because the ASCII repertoire has taken
on a life of its own:  ^ is not merely a spacing clone of COMBINING
CIRCUMFLEX, but has become a fully distinct character with many functions.
In particular, none of the Unicode canonical forms will affect text
written solely in the ASCII repertoire.  Every character has its
own story.

Someone asked about whether XML documents SHOULD or MUST be in NFC.
The answer is SHOULD, and this is formally applied only to the
not-yet-promulgated XML 1.1.  XML documents *on the Web* SHOULD be in
NFC by reason of the W3C Character Model.
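
The decomposition data mentioned above can be inspected with Python's standard unicodedata module. This is a minimal sketch; the helper name is_unchanged_by_all_forms is made up for illustration, and the results reflect whatever version of the Unicode Character Database the local Python ships.

```python
import unicodedata

# Spacing diacritics outside ASCII carry a <compat> mapping to
# SPACE + combining mark.
acute = unicodedata.decomposition("\u00b4")  # ACUTE ACCENT
breve = unicodedata.decomposition("\u02d8")  # BREVE

# The ASCII look-alikes have no decomposition mapping at all.
circumflex = unicodedata.decomposition("^")  # U+005E
grave = unicodedata.decomposition("`")       # U+0060


def is_unchanged_by_all_forms(text):
    """True if NFC, NFD, NFKC and NFKD all leave the text as it is."""
    return all(
        unicodedata.normalize(form, text) == text
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


ascii_stable = is_unchanged_by_all_forms("a^b `c` ~d")
nfkc_of_acute = unicodedata.normalize("NFKC", "\u00b4")
```

Under compatibility normalization U+00B4 becomes SPACE followed by U+0301, while the ASCII-only string is left alone by every form.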

-- 
John Cowan  [EMAIL PROTECTED]http://www.reutershealth.com
Not to know The Smiths is not to know K.X.U.  --K.X.U.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 06/08/2003 15:24, Doug Ewell wrote:

Like Freud's cigar, sometimes a "may" is just a "may".  And I suspect
the phrase "any intelligent typographer" MAY generate some flak from
typographers on this list who consider themselves intelligent enough
yet have a different opinion.
I'm not a typographer (intelligent or otherwise), but I'm having a tough
time seeing how Section 2.10 *requires* fonts and rendering engines to
give a space-plus-combining-diacritic combination the exact minimum
width of the diacritic alone, or to leave equal space before and after
such a combination.  All I think it is saying is that, for example, the
combination i-plus-tilde may be wider than i alone, because tilde is
wider than i.
 

OK, Doug, I accept that a "may" is a "may" and an implementation in 
which the tilde on an i collides with neighbouring characters is Unicode 
compliant. It's just bad typography (unless some special effect is 
intended). Any typographers on the list care to disagree? I would 
suggest that it is also bad typography for a space-plus-diacritic 
combination to be wider than the diacritic, as long as the typographer 
realises that space is being used here as a convention and, according to 
the standard, does not have the usual properties of a space.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 06/08/2003 15:47, Philippe Verdy wrote:

On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 

OK, what kind of markup should I use, in any well-known markup
language, to ensure that an isolated diacritic is centred in the
space between the words before and after it?
   

In plain text, I think that this encoding:
   ...endOfWord1, SPACE, SPACE, diacritic, SPACE,
   startOfWord2...
is what you need, as it creates the following combining sequences:
   ...endOfWord1, SPACE, SPACE, diacritic, SPACE,
   startOfWord2...
 

Thank you, Philippe. This is where we started. But I noted that some 
current implementations render the space-plus-diacritic combination as a 
full width space with the diacritic not centred over it. I suggested that 
this was wrong, that the diacritic should be centred. Doug suggested I 
use markup outside the scope of Unicode.

...

Another similar case would be the use of an isolated nukta (which
normally modifies a following base character): the sequence
nukta, SPACE is a single combining sequence with a break
opportunity. So a sequence like nukta, SPACE, acute accent
would be unbreakable but would include a break opportunity at its
end, unless it is followed by a NBSP.
And the sequence nukta, NBSP, acute accent would also be
unbreakable either in the middle or on both ends.
 

Tell me more about these nuktas which modify a FOLLOWING base character. 
This is just what I have been told is illegal, non-conformant or 
something. But if this is allowed for nuktas, why shouldn't it be 
allowed for Hebrew holam?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Mark Davis

Where did you get the notion that space is not a base character? And
base characters include those that are not control or format
characters. Space is neither one.

The standard specifically states in a number of places that to exhibit
a combining mark in isolation you use a space (or NBSP).
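
As a rough check of the point above, the General Category values can be read with Python's standard unicodedata module. This is a sketch under stated assumptions: the set FORMAT_OR_CONTROL (Cc plus the {Cf, Zl, Zp} grouping cited elsewhere in this thread) and the helper is_format_or_control are illustrative names, not definitions taken from the standard's text.

```python
import unicodedata

# Assumption for illustration: "control or format" is modelled as
# Cc plus the {Cf, Zl, Zp} grouping mentioned in this thread.
FORMAT_OR_CONTROL = {"Cc", "Cf", "Zl", "Zp"}


def is_format_or_control(ch):
    """True if the character's General Category is in the set above."""
    return unicodedata.category(ch) in FORMAT_OR_CONTROL


space_category = unicodedata.category(" ")       # SPACE
nbsp_category = unicodedata.category("\u00a0")   # NO-BREAK SPACE
acute_category = unicodedata.category("\u0301")  # COMBINING ACUTE ACCENT

# SPACE followed by a combining mark: the recommended way to show the
# mark in isolation.
isolated_acute = " " + "\u0301"
```

Both SPACE and NBSP report Zs, which falls outside the format/control set used here, while U+0301 reports Mn.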

Mark
__
http://www.macchiato.com
 ►  “Eppur si muove” ◄

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Jim Allan [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, August 05, 2003 13:47
Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions
on ZWNBS...)


 On 05/08/2003 09:42, Jim Allan wrote:

  Peter Kirk posted:
 
  If I want to do this, should I explicitly encode a dotted circle,
or
  should I encode nothing and expect the font to generate the
dotted
  circle, as it often does?
 
 
  I think that the practice of a font or application automatically inserting
  a dotted circle under an orphaned combining character is dubiously
  compliant with Unicode specifications.
 
  ...
 
 
 Thanks, Jim, for all this data, but now I am totally confused. Well,
at
 least it seems clear that if I want a dotted circle I should
explicitly
 encode it. But if I don't...

 Suppose for example I want to write a sentence like "In this language
 the diacritic ^ may appear above the letters ...", but instead of ^ I
 want to use a combining character, a regularly positioned centred above
 the letter diacritic, which does not have a defined spacing variant. I
 don't want a dotted circle. And I want it to be spaced as here, i.e.
 with one space before the diacritic and one after it. It seems to me
 that at one place in the standard I am told to encode space - combining
 mark - space, for the combining mark will not combine with the space
 because the space is not a base character; and in another place I am
 implicitly told to encode space - space - combining mark - space,
 because the second space acts as a carrier for the combining mark.

 I hope that wanting to display this correctly is not another place
where
 I have stepped over the boundaries of what is reasonable to expect
 plain text to convey, but that this too can be grist for the
Unicode
 5.0 mill to grind very finely - both quotes from Ken Whistler
earlier
 today. And I think that if this issue is clarified it will also
become
 clear what should be done about string initial holam and alef etc.

 Perhaps a simple way ahead would be to define a new character
something
 like COMBINING MARK HOLDER with no glyph, which is defined
specifically
 for this purpose, is a base character and not a format character,
and is
 expected to be just as wide as is necessary to display the combining
 mark. Then we could say that a spacing accent is equivalent
(possibly
 even canonically if made a composition exclusion?) to COMBINING MARK
 HOLDER plus a non-spacing accent, and remove the misleading
 compatibility equivalences to SPACE plus a non-spacing accent.

 -- 
 Peter Kirk
 [EMAIL PROTECTED]
 http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-12 Thread Peter Kirk

On 05/08/2003 17:13, Kenneth Whistler wrote:

Peter Kirk said:

 

From what Ken says, it sounds like it will be wrong from whenever 
Unicode 4.0 is officially issued 
   

Actually Unicode 4.0 was officially issued on April 17, 2003.

What we are waiting on now is for the publication of the text
of the book to catch up to that fact. ;-)
 

...

I was misled by Jim's 
reference to the URL of the final draft (as clearly stamped on the first 
page) of 4.0, but since in fact he was quoting from 3.0 what he says can 
hardly be considered obsolete yet.
   

Actually it can. And that would have been obvious to everyone if
a preview version of Chapter 4 had also been posted.
Once again, I appeal to people to stop trying to second-guess
the text of the standard. The final pdf for the online version
is in preparation even as I write this. The final final
proofs for the book itself have already been produced by
the printer -- all they need to do now is turn on the press
and start the binder.
If everyone would just go off for a week or two on their
August vacation, like they should be, we could all come back
about Labor Day and we wouldn't have to be having these
discussions. ;-)
--Ken



 

OK, understood now. As the previous version is obsolete, and the new one 
is unavailable, we can all take a break from conforming to Unicode at 
all and take a vacation! Sounds a good idea to me  ;-)

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-11 Thread Kent Karlsson


 It *is* part of the Unicode Standard.  You want a stand-alone
diacritic?
 Use SP or NBSP followed by the combining diacritic.  It says so, right
there.

Yes. But it is not quite clear how this should interact with combining
characters
that aren't purely 'above' or 'below' a single character (in horizontal
writing): in
particular double diacritics (SPACE, dbl diacritic or SPACE, dbl
diacritic, SPACE
to get an isolated one?), and left-side or right-side combining
characters
(SPACE, rightside comb. char does that give unwanted space on the left
or not?).

/kent k

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Mark Davis

 As for oe-ligature, the
 French representative to WG3 (or its predecessor) said that France
could
 live without it.

Even worse; the story I heard was that the committee had planned from
the start to have Œ and œ in positions D7 and F7, but that late in the
process the representative from France objected, so they replaced them
by × and ÷. That would certainly explain why these symbols are in the
middle of a batch of letters...

Mark
__
http://www.macchiato.com
 ►  “Eppur si muove” ◄

- Original Message - 
From: John Cowan [EMAIL PROTECTED]
To: Philippe Verdy [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Saturday, August 09, 2003 20:13
Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions
on ZWNBS...)


 Philippe Verdy scripsit:

  Except that in that case, we are not speaking about something that has
  already been standardized, but only used as a legacy means to achieve
  some results with more or less success.

 It *is* part of the Unicode Standard.  You want a stand-alone
diacritic?
 Use SP or NBSP followed by the combining diacritic.  It says so,
right
 there.

 Your implementation doesn't work?  Complain to the implementor,
switch to
 another implementation, fix the implementation yourself, or pay
someone
 to fix it.

  SPACE+diacritic is still a hack, and certainly not a canonical
equivalent
  (including for its properties), of the existing spacing
diacritics, which
  also do not fit all usages because they are symbols.

 It's the spacing diacritics that are a hack, for the most part.  The
 ASCII ones have, as I said, taken on a life of their own.

  * [OT] This was a shame when ISO adapted the DEC VT charset to
  create ISO-8859-1, but forgot important characters needed for the
  languages that this charset was supposed to cover (like the French
  oe and OE ligatures, and a few characters missing for Baltic
languages,
  Icelandic, and Catalan.)

 ISO-8859-1 was not meant to cover the whole of Europe; it was part
of
 a quartet, parts 1 to 4.  The fact that parts 3 and 4 didn't work
out was
 not ISO's fault: it didn't foresee how important European as opposed to
 merely regional data interchange would be.  As for oe-ligature, the
 French representative to WG3 (or its predecessor) said that France
could
 live without it.


 -- 
 John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan
www.reutershealth.com
 If I have seen farther than others, it is because I am surrounded
by dwarves.
 --Murray Gell-Mann

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Philippe Verdy

On 04/08/2003 17:36, Kenneth Whistler wrote:
 Peter Kirk asked:
  A similar issue which is not Hebrew related would be a (mythical)
  requirement to display a diacritic like 0315, 031B or 0322 in
  isolation. It would not always be appropriate to use a space or
  NBSP as a base character as this would indent the glyph from the
  beginning of a line in a way which might not be wanted. What
  would be the recommended encoding if one wanted to display one of
  these characters with no leading white space?
 If you want to display some character like U+0315 COMBINING COMMA
 ABOVE RIGHT *and* you want to do it in isolation *and* you want
 it to occur at the beginning of a line *and* you want there to
 be no display width between the margin and the left edge of the
 display bits of the glyph, then you have stepped over the boundaries
 of what is reasonable to expect plain text to convey. Feel free
 to make use of the higher-level capabilities of your word
 processor or page layout program to individually adjust the
 positioning of particular glyphs displayed in particular fonts.

That's true for such defective sequences that may be used temporarily
during text handling operations (where the combining mark should be
rendered in editors with the dotted circle glyph).

But one can still represent an isolated combining character in a non
defective way by putting it after a Zero-Width Space, without creating
any margin. This can be done due to the Zs category of this character
which qualifies it the same way as an ASCII SPACE would:

0020;SPACE;Zs;0;WS;N;
200B;ZERO WIDTH SPACE;Zs;0;BN;N;

In fact, using ZWS may even be more accurate than using SPACE
in bidirectional contexts, as it is bidirectionally neutral, and does not
break directionality clusters for display reordering (so such encoded
isolated diacritic can appear even in a RTL sequence, as if it was a
single character with the current directionality).
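
The property values quoted above can be checked against whatever version of the Unicode Character Database a given runtime ships. The sketch below uses Python's standard unicodedata module; the helper name describe is made up for illustration. One caveat, stated as a fact about the data rather than about this thread: later versions of the UCD list U+200B with General Category Cf rather than Zs, so the code reports the local value instead of assuming either one.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, bidi class) from the local UCD."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


space_info = describe("\u0020")
zwsp_info = describe("\u200b")

# The UCD version bundled with this Python build, e.g. for a bug report.
ucd_version = unicodedata.unidata_version

for info in (space_info, zwsp_info):
    print(ucd_version, info)
```

Whether ZWSP can serve as a base for a combining mark depends on which category the implementation's data assigns it, so the printed UCD version matters when comparing results between systems.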

I just wonder what would be the width of the combination of ZWS plus
a diacritic: logically the ZWS has width 0, but diacritics are supposed
to expand, if needed, the width of the base character, unless kerning
is used to reduce the interletter spacing. But I doubt that any font
would define a kerning pair for a preceding grapheme cluster plus
this isolated diacritic (ZWS+combining), or for that isolated diacritic
and the next grapheme cluster, so in the absence of such a kerning pair,
most programs will just use the default combined width.

I just tried to see how Windows XP represent the sequences:
A, SPACE, ZWS, COMBINING MACRON, SPACE, B
A, ZWS, COMBINING MACRON, B
And it shows the spaces correctly even in HTML with IE6, with
Arial, Arial Unicode MS, Times New Roman, Courier New...

On the opposite, the sequence SPACE, COMBINING MACRON
is incorrectly rendered with a too large width (larger than a single
space or a single non-combining macron).

Could ZWS+combining diacritic be the best solution for
isolated diacritics in text?

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Peter Kirk

On 10/08/2003 10:09, Michael Everson wrote:

At 01:30 +0200 2003-08-10, Philippe Verdy wrote:

Whatever you think, the SPACE+diacritic is still a hack, and certainly 
not a canonical equivalent (including for its properties), of the 
existing spacing diacritics, which also do not fit all usages because 
they are symbols.


It is the formally specified way to represent what you say you want to 
represent. If an implementation doesn't do that nicely enough, 
complain to the implementors. (This has already been suggested to you.)
As has already been clearly pointed out by Philippe, Kent and myself 
(and ignored by those opposed to any change), the combination SPACE + 
diacritic does not have the required categories, properties and 
specification for the function it is supposed to perform. Either these 
categories etc need to be adjusted (and I don't expect the general 
category of SPACE to be changed!), or some exceptional mechanism needs 
to be clearly defined, or, by far the simplest solution, a new base 
character can be defined which, when combined with the diacritic, has 
the required categories and properties.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Philippe Verdy

On Sunday, August 10, 2003 12:32 AM, John Cowan [EMAIL PROTECTED] wrote:

 Peter Kirk scripsit:
 
  This is a clear demonstration that Microsoft also has problems with
  the mechanism which has been defined in the standard for ten years,
 
 This is a clear demonstration that Uniscribe fails to implement a
 standard correctly, a property unique neither to Microsoft nor to the
 Unicode Standard.


Except that in that case, we are not speaking about something that has
already been standardized, but only used as a legacy means to achieve
some results with more or less success. Whatever you think, the
SPACE+diacritic is still a hack, and certainly not a canonical equivalent
(including for its properties), of the existing spacing diacritics, which
also do not fit all usages because they are symbols.

The fact that there are compatibility decompositions of these spacing
diacritics is just to match those legacy uses, but it is not a solution.
It just resembles the way many keyboard drivers allow users to enter
those spacing diacritics, but input methods and keyboard drivers
prove nothing with regard to Unicode, as the keyboard driver will still
only return a combined spacing diacritic, but not the sequence
SPACE+diacritics (whose real usage in text seems to occur only in
old texts where non-spacing combining diacritics were not
encodable or renderable, or just to allow speaking in full text about
the individual diacritics themselves, a rarer case).

Maybe I'm wrong in this assertion, but this is my feeling and experience
about these characters, which were merely symbols or hacks to represent
non-English text with a restricted ASCII alphabet as an approximate
representation (the inclusion of other spacing diacritics in the high range
of an 8-bit ISO-8859-1 encoding was very strange for me, as if they were
there only to allow approximating other missing precombined characters
which could not fit in the table, but produced poor results so that most
texts were never encoded with this charset but with other more appropriate
charsets when needed). *

* [OT] This was a shame when ISO adapted the DEC VT charset to
create ISO-8859-1, but forgot important characters needed for the
languages that this charset was supposed to cover (like the French
oe and OE ligatures, and a few characters missing for Baltic languages,
Icelandic, and Catalan.) ISO-8859-15 is certainly better now than ISO-8859-1
for the same languages and for even more than initially defined, and in
practice it was Microsoft that filled the gap with Windows-1252 when
dropping the unnecessary C1 controls (forgetting the legacy roundtrip
compatibility of controls with the dying EBCDIC).

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Philippe Verdy

On Sunday, August 10, 2003 9:17 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 On 10/08/2003 10:09, Michael Everson wrote:
 
  It is the formally specified way to represent what you say you want
  to represent. If an implementation doesn't do that nicely enough,
  complain to the implementors. (This has already been suggested to
  you.) 
 
 As has already been clearly pointed out by Philippe, Kent and myself
 (and ignored by those opposed to any change), the combination SPACE +
 diacritic does not have the required categories, properties and
 specification for the function it is supposed to perform. Either these
 categories etc need to be adjusted (and I don't expect the general
 category of SPACE to be changed!), or some exceptional mechanism needs
 to be clearly defined, or, by far the simplest solution, a new base
 character can be defined which, when combined with the diacritic, has
 the required categories and properties.

That's exactly what I suggested (and I used the word suggest, and
wanted to show the inaccuracy of the SPACE or NBSP to represent
spacing diacritics as a normal symbol, due to the undocumented
properties for that combination). Due to the lack of formal
documentation (no one here demonstrated that such sequence with
SPACE was really documented as such somewhere in the Unicode
specs), such legacy usage is still just a hack which only works
sometimes, but not always as intended because it contradicts some
other principles like the inheritance of the base character properties
to the whole combining sequence using it.

And still, even if SPACE+diacritics is documented now as officially
producing a symbol, its properties are still not defined (not interoperable,
as they vary among implementations), and it still gives problems with the
huge legacy use of SPACE as a padding character or with
space normalizations like in XML, HTML and SGML.

In addition, it still does not solve the problem of its insertion within
words, and of its directionality for BiDi, its parsing for breaking
(line breaking, word breaking, ...) where distinct base character(s)
for the correct interpretation would be needed.

Yes I have read your comment, and Yes I know that
SPACE+diacritics is widely used. But this is with many unsolved
problems that one could legitimately want to solve with more precise:
- definition of such combining sequence with SPACE
- definition of its properties
- documentation within the Unicode breaking algorithms
- adjustments to the BiDi specs
- etc...

If all these adjustments are made, there will be many, all of them
handled like exceptions to the normal rules, when a much simpler
approach (which would not require all these changes in specs),
would consist in defining other(s) more explicit base character(s)
for the appropriate function.

If Ken, Michael, Kent and other respectable UTC members can't
see the problem, who will? Please consider the problem itself and
don't be too much focused on the exact terminology that you would
have used yourself to better describe the problem and its solutions.

I am not discussing the terminology itself, but the lack of
documentation and support for what seems a true interoperability
problem. So please don't flame me with sarcasm; that's not the
subject of my messages, which do not want to comment on
the respective Unicode expertise of respectable UTC members...

Sorry if this message seems still too long for you. But each time
I want to be short, I am flamed for inaccuracies, or imprecisions,
or suspected of claiming something about the standard when in
fact I am not discussing what is currently in the standard itself,
but what is not there now and causes problems. It's easy to
be short if you only refer to the standard itself, and only respond
as if this list was just a FAQ.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-10 Thread Kenneth Whistler

Peter Kirk asked:

 If I want to do this, should I explicitly encode a dotted circle, or 
 should I encode nothing and expect the font to generate the dotted 
 circle, as it often does?

If you want to represent the text content of a dotted circle with
an accent on it, the recommended representation would be, for
example:

25CC, 0301

A compliant renderer that supports those characters should always
then display a dotted circle with an acute accent over it.

If you just leave a 0301 in isolation, then you are at
the mercy of what a renderer might do in a fallback situation
for a defective combining character sequence. It *might* show
it on a dotted circle, or it might show it in some other way.

And if that combining character is in any other context, it
may end up being misapplied to the wrong preceding character --
wrong in the sense that that was not your intention.
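
The two cases described above can be inspected with a minimal Python sketch. It relies only on the standard unicodedata module; the names explicit, isolated and describe are made up for this illustration, and the sketch says nothing about what any particular renderer will draw.

```python
import unicodedata

# Explicit dotted-circle base followed by a combining acute accent (25CC, 0301).
explicit = "\u25CC\u0301"
# A lone combining acute accent: no base character precedes it.
isolated = "\u0301"

def describe(text):
    """Return (code point, character name, General Category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

for label, text in (("explicit", explicit), ("isolated", isolated)):
    print(label, describe(text))
```

The first sequence starts with a character of category So, so the accent has a base to attach to; the second consists of a single Mn character, which leaves the display to the renderer's fallback.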

--Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-09 Thread Peter Kirk

On 05/08/2003 16:59, Curtis Clark wrote:

on 2003-08-05 15:31 Peter Kirk wrote:

Thank you, Mark. This helps to clarify things, but still doesn't 
explicitly answer my question of how to encode a sentence like In 
this language the diacritic ^ may appear above the letters ..., but 
instead of ^ I want to use a combining character  and want to 
display exactly one space before the combining character - do I 
encode two spaces or one?


In this language the diacritic   may appear above the letters...

Two spaces, at least in Thunderbird Mail.


Thank you. Well, this sort of works. I looked in various fonts. In some 
of them the diacritic is centred in the space between the words 
diacritic and may, but in others it is offset to the left or the 
right. The problem is that the space is wider than the diacritic, which 
confuses things, and all the more so no doubt if it expands for 
justification. NBSP would probably be a better choice in that it is less 
likely to expand. But what I am looking for is a diacritic holder which 
is defined to be only as wide as the diacritic. On the principle that 
base characters expand to fit the width of the diacritic,  ZWSP or, 
better, a real (rather than misnamed) zero width no break space would 
seem to have the right properties for that.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-08 Thread Mark Davis

  Zs, Zl, and Zp are considered format characters, but their
 membership in the Z (separator) class takes precedence over their
 membership in the Cf class, because the General Category assigns
 only a single value to each character.

Whenever you have a question about the status of a character, you need
to look it up in the UCD. You can either do that by going through the
unicode website, or if you want a more readable interface, use the ICU
character browser, which formats that data.

Look at space, U+0020.

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?go=0020ch.x=4ch.y=7

The general category is Space_Separator, *not* a format character.
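
As a rough illustration of this kind of lookup, the sketch below queries the General Category through Python's standard unicodedata module instead of the UCD files or the ICU browser. The samples table and the general_categories helper are assumptions made for the example, and the values reflect whichever UCD version ships with the Python in use.

```python
import unicodedata

# A few characters discussed in this thread, keyed by their Unicode names.
samples = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "ZERO WIDTH NO-BREAK SPACE": "\uFEFF",
}

def general_categories(table):
    """Map each name to the two-letter General Category of its character."""
    return {name: unicodedata.category(ch) for name, ch in table.items()}

for name, gc in general_categories(samples).items():
    print(f"{name}: {gc}")
```

SPACE and NO-BREAK SPACE come back as Zs, the line and paragraph separators as Zl and Zp, and only ZERO WIDTH NO-BREAK SPACE as Cf.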

Now the wording there could definitely be clearer, but the operative phrase
is:

 ...but their
 membership in the Z (separator) class *takes precedence* over their
 membership in the Cf class...

So it would be clearer to say something like:

In many ways the characters, Zs, Zl, and Zp, are similar to format
characters, but because their general usage is significantly different
they are broken out into a separate General Category, as Separator
characters.

Mark
__
http://www.macchiato.com
  Eppur si muove 

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Tuesday, August 05, 2003 14:50
Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions
on ZWNBS...)


 On 05/08/2003 14:40, Mark Davis wrote:

 Where did you get the notion that space is not a base character? And
 base characters include those that are not control or format
 characters. Space is neither one.
 
 The standard specifically states in a number of places that to exhibit
 a combining mark in isolation you use a space (or NBSP).
 
 Mark
 __
 http://www.macchiato.com
   Eppur si muove 
 
 
 
 I got this from the Unicode Standard 4.0, as quoted by Jim Allan:

  In http://www.unicode.org/book/preview/ch03.pdf the space characters
  in general are given class Zs:
 
   Zs, Zl, and Zp are considered format characters, but their
  membership in the Z (separator) class takes precedence over their
  membership in the Cf class, because the General Category assigns
  only a single value to each character.
 
  So the various space characters (class Zs) are also classified as
  format characters.
 
  From http://www.unicode.org/book/ch04.pdf:
 
   _D13  Base character:_ a character that does not graphically
  combine with preceding character, and that is neither control nor a
  format character.
 
  Accordingly, by definition, spaces are not base characters.



 -- 
 Peter Kirk
 [EMAIL PROTECTED]
 http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-08 Thread Kenneth Whistler

Peter Kirk responded to my plea for everyone to relax a bit:

 If everyone would just go off for a week or two on their
 August vacation, like they should be, we could all come back
 about Labor Day and we wouldn't have to be having these
 discussions. ;-)
 
 --Ken

 OK, understood now. As the previous version is obsolete, and the new one 
 is unavailable, we can all take a break from conforming to Unicode at 
 all and take a vacation! Sounds a good idea to me  ;-)

Just in the interest of truth in advertising, the previous
version(s) are not obsolete, but are *superseded* by Unicode 4.0.
   
Applications claiming conformance to Unicode 3.0 will continue
to claim conformance to that version, and that version is
relevant to their claim. And so on for Unicode 3.1 and
Unicode 3.2.

But if and when people move on to claiming conformance to
Unicode 4.0, then it is the text of *that* version which becomes
relevant to their claim.

We are simply in the inconvenient transition state where people
are building Unicode 4.0 implementations, but the final, final
text of the *book* (as opposed to the various UAX's and all
the data files) is not available. There were similar
transition periods for Unicode 1.0, Unicode 2.0, and Unicode 3.0,
and nearly everyone understands that is the nature of things.

So yes, please, it's time to take a vacation! :)

--Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-07 Thread John Cowan

Mark Davis scripsit:

 Where did you get the notion that space is not a base character? And
 base characters include those that are not control or format
 characters. Space is neither one.

Unfortunately, p. 88 of TUS3.0 (section 4.5, paragraph 3) says
"Zs, Zl, and Zp [characters] are considered format characters."
This is obviously wrong, but there it is.

-- 
Kill Gorgûn!  Kill orc-folk!   John Cowan
No other words please Wild Men. [EMAIL PROTECTED]
Drive away bad air and darkness http://www.reutershealth.com
with bright iron!  --Ghân-buri-Ghân  http://www.ccil.org/~cowan

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-07 Thread Kent Karlsson


I was so glad that you got things so nearly right for once, and then
you go and spoil it with:

 Another similar case would be the use of a isolated nukta (which
 normally modifies a following base character): the sequence
 nukta, SPACE 

Like all other combining characters, NUKTA follows the base
character (the consonant) in the character stream. But I'm not
sure if consonant, nukta, vowel *should* be any different
from consonant, vowel, nukta, but maybe they should be
different since they are not canonically equivalent. (But...)

/kent k

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Peter Kirk

On 04/08/2003 17:36, Kenneth Whistler wrote:

Peter Kirk asked:

 

A similar issue which is not Hebrew related would be a (mythical) 
requirement to display a diacritic like 0315, 031B or 0322 in isolation. 
It would not always be appropriate to use a space or NBSP as a base 
character as this would indent the glyph from the beginning of a line in 
a way which might not be wanted. What would be the recommended encoding 
if one wanted to display one of these characters with no leading white 
space?
   

If you just want to display a nonspacing mark in isolation, then
you apply it to a SPACE (or NO-BREAK SPACE) and typically let the
metrics of the font then handle how the mark is going to appear
floating in space as it were.
If you want to display some character like U+0315 COMBINING COMMA
ABOVE RIGHT *and* you want to do it in isolation *and* you want
it to occur at the beginning of a line *and* you want there to
be no display width between the margin and the left edge of the
display bits of the glyph, then you have stepped over the boundaries
of what is reasonable to expect plain text to convey. Feel free
to make use of the higher-level capabilities of your word
processor or page layout program to individually adjust the
positioning of particular glyphs displayed in particular fonts.
 

Thank you. Understood.

More generally, however, when the issue of the relative
position of a non-spacing mark with respect to its base
glyph is what is in question, the standard recommends
(and uses) the convention of displaying the non-spacing
mark on a dotted circle as a base. This makes it clear that
we are talking about the non-spacing mark itself, but also
makes clear the positional differences between left, centered,
and right forms, for example.
 

If I want to do this, should I explicitly encode a dotted circle, or 
should I encode nothing and expect the font to generate the dotted 
circle, as it often does?

--Ken

 



--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Philippe Verdy

On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:

 on 2003-08-05 15:31 Peter Kirk wrote:
  Thank you, Mark. This helps to clarify things, but still doesn't
  explicitly answer my question of how to encode a sentence like In
  this language the diacritic ^ may appear above the letters ...,
  but instead of ^ I want to use a combining character  and want to
  display exactly one space before the combining character - do I
  encode two spaces or one? 
 
 In this language the diacritic   may appear above the letters...
 
 Two spaces, at least in Thunderbird Mail.

The NFD decompositions of spacing marks are already defined as a SPACE
plus a non-spacing combining character. This officially documents the
usage of SPACE as a base character, and its use in combining sequences.
In the context of XML processing, where strings should (must?) be
presented in NFC form, this extra SPACE will be invisible, hidden within the
precomposed sequence, so this space does not have the line-breaking
property.

Breaking properties apply only to combining sequences, not to isolated
encoded characters. It's illegal to break in the middle of a combining
sequence. So as soon as a SPACE is followed by a combining character,
it loses its breaking properties, as those properties are only defined for
the combining sequence containing only a SPACE. So I don't think there's
any ambiguity: parsers and renderers must correctly identify combining
sequences before applying any algorithm.

This means that an algorithm like normalization of whitespace sequences
in XML or HTML should not include SPACEs that are used as base
characters in a combining sequence, and so it should keep two spaces
if the intent is to encode a logical space followed by a logical spacing
diacritic. (This is not a problem for XML which processes strings in their
NFC form).
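
A minimal sketch of the kind of whitespace normalization argued for here, written in Python. It is not how any XML or HTML processor actually behaves; the function names are made up for the illustration, and "combining character" is approximated as any character whose General Category starts with M.

```python
import re
import unicodedata

def is_mark(ch):
    """True for characters in General Category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")

def collapse_naive(text):
    """Collapse every run of U+0020 to a single space."""
    return re.sub(r" +", " ", text)

def collapse_keep_carriers(text):
    """Collapse runs of U+0020, but never touch a space that is
    immediately followed by a combining mark (a carrier space)."""
    out = []
    pending_space = False  # a run of ordinary spaces waiting to be emitted once
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and not (nxt and is_mark(nxt)):
            pending_space = True
            continue
        if pending_space:
            out.append(" ")
            pending_space = False
        out.append(ch)
    if pending_space:
        out.append(" ")
    return "".join(out)

sample = "diacritic   \u0302   may"
print([f"U+{ord(c):04X}" for c in collapse_naive(sample)[9:]])
print([f"U+{ord(c):04X}" for c in collapse_keep_carriers(sample)[9:]])
```

The naive version leaves a single space before the circumflex, so the mark ends up on the only remaining space; the carrier-aware version keeps one separating space plus the space that carries the mark.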

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Peter Kirk

On 06/08/2003 03:54, Philippe Verdy wrote:

On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:

 

on 2003-08-05 15:31 Peter Kirk wrote:
   

Thank you, Mark. This helps to clarify things, but still doesn't
explicitly answer my question of how to encode a sentence like In
this language the diacritic ^ may appear above the letters ...,
but instead of ^ I want to use a combining character  and want to
display exactly one space before the combining character - do I
encode two spaces or one? 
 

In this language the diacritic   may appear above the letters...

Two spaces, at least in Thunderbird Mail.
   

The NFD decompositions of spacing marks are already defined as a SPACE
plus a non-spacing combining character. ...

Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only
compatibility equivalences to space plus diacritic, and U+005E and
U+0060 don't even have compatibility equivalences.
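
This distinction can be checked against the character database with a short Python sketch. It uses the standard unicodedata module, where a decomposition mapping that begins with a tag such as <compat> is a compatibility mapping and an untagged one is canonical. The helper decomposition_kind is a name invented for the example.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify a character's decomposition mapping as 'none',
    'canonical' or 'compatibility' (tagged mappings start with '<')."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"

spacing_marks = ["\u00B4"] + [chr(cp) for cp in range(0x02D8, 0x02DE)] + ["^", "`"]

for ch in spacing_marks:
    print(
        f"U+{ord(ch):04X}",
        decomposition_kind(ch),
        unicodedata.decomposition(ch) or "-",
    )

# NFD applies canonical mappings only, so U+00B4 is left untouched;
# NFKD also applies compatibility mappings and yields SPACE + U+0301.
print(unicodedata.normalize("NFD", "\u00B4") == "\u00B4")
print(unicodedata.normalize("NFKD", "\u00B4") == " \u0301")
```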

... 
This means that an algorithm like normalization of whitespace sequences
in XML or HTML should not include SPACEs that are used as base
characters in a combining sequence, and so it should keep two spaces
if the intent is to encode a logical space followed by a logical spacing
diacritic. (This is not a problem for XML which processes strings in their
NFC form).

 

It is,  because there are very many combining marks which do not have 
spacing equivalents (even for compatibility), and so with these the NFC 
form will certainly be space plus diacritic.
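
A small Python sketch of this point, again using only the standard unicodedata module. The helpers spacing_clones and nfc_keeps_space_carrier are hypothetical names; which marks turn out to have a spacing clone depends on the UCD version bundled with the Python in use, so the sketch prints the result rather than assuming it.

```python
import sys
import unicodedata

def spacing_clones(mark):
    """Code points whose compatibility decomposition is SPACE + mark."""
    wanted = f"<compat> 0020 {ord(mark):04X}"
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.decomposition(chr(cp)) == wanted
    ]

def nfc_keeps_space_carrier(mark):
    """True if NFC leaves the sequence SPACE + mark unchanged."""
    seq = " " + mark
    return unicodedata.normalize("NFC", seq) == seq

for mark in ("\u0301", "\u0323"):
    clones = [f"U+{cp:04X}" for cp in spacing_clones(mark)]
    print(f"U+{ord(mark):04X}", "clones:", clones,
          "NFC unchanged:", nfc_keeps_space_carrier(mark))
```

NFC composes only through canonical mappings, so SPACE plus a combining mark stays as two code points whether or not a spacing clone exists.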

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Kenneth Whistler

Peter Kirk said:

  From what Ken says, it sounds like it will be wrong from whenever 
 Unicode 4.0 is officially issued 

Actually Unicode 4.0 was officially issued on April 17, 2003.

What we are waiting on now is for the publication of the text
of the book to catch up to that fact. ;-)

 because this paragraph  has been 
 excised from that standard. But until then it seems to be correct, SPACE 
 is indeed considered a format character.

Nope. It is incorrect to try to mix and match between versions
of the standard.

In Unicode 3.0 this was an ambiguity in the meaning and usage
of the term format character, and for Unicode 3.0, we can
all see how people who ran into section 4.5 of the standard
could be a little confused about the status of SPACE.

The actual intent of that offending paragraph was to attempt to
explain the somewhat procrustean nature of the General Category
classes, which may not do justice to the complicated behavior
of some of the characters in Unicode, rather than to explain the
status of SPACE in particular. 

 I was misled by Jim's 
 reference to the URL of the final draft (as clearly stamped on the first 
 page) of 4.0, but since in fact he was quoting from 3.0 what he says can 
 hardly be considered obsolete yet.

Actually it can. And that would have been obvious to everyone if
a preview version of Chapter 4 had also been posted.

Once again, I appeal to people to stop trying to second-guess
the text of the standard. The final pdf for the online version
is in preparation even as I write this. The final final
proofs for the book itself have already been produced by
the printer -- all they need to do now is turn on the press
and start the binder.

If everyone would just go off for a week or two on their
August vacation, like they should be, we could all come back
about Labor Day and we wouldn't have to be having these
discussions. ;-)

--Ken

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Doug Ewell

Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

 Or it may not.  It may be a deficiency in the level of Unicode
 support afforded by the fonts and rendering engines. ...

 If there are such deficiencies in fonts and rendering engines which
 purport to be Unicode compliant, that suggests a lack of clarity in
 the standard which should be rectified.

I wish I had a dollar for every Unicode-compliant font, rendering
engine, or other software that was in some way less compliant than
advertised.  Only a fraction of the non-compliances are traceable to
ambiguities or deficiencies in the Unicode Standard.

 ... It may simply reflect a difference between your requirements
 and what the standard promises, and doesn't promise.

 If Unicode doesn't promise what I require, surely it is at least
 reasonable for me to ask on this list whether it ought to be extended
 or clarified to do so. The UTC may choose not to make any changes, but
 I don't see why they shouldn't even be asked to.

Absolutely, you are allowed to ask.  Go ahead.  I wasn't trying to
prevent questions from being asked, only trying to state why I think the
problem is out of scope for Unicode.

 The standard doesn't say anything about width in this case.  It
 leaves it up to the display engine, which is as it should be.

 The standard does say, section 2.10 of 4.0, that In rendering, the
 combination of a base character and a nonspacing character may have a
 different advance width than the base character itself.

I apologize for missing this reference.

 And any intelligent typographer will realise that this may is a
 must, with regular character designs but not of course in monospace,
 in some cases like the example given of i with circumflex. This
 sentence applies to spaces with diacritics as space is a base
 character, as we have been informed. The subsection of 2.10 entitled
 Spacing Clones of European Diacritical Marks (by the way, why
 European when the text appears to apply to all diacritical marks?)
 should suggest to any intelligent typographer that the sequence space,
 diacritic is intended to be spaced as the diacritic and not as a
 space, but it would help for this to be clarified as not all
 typographers are very intelligent and some may not be aware that this
 space has actually lost most of the properties of a space e.g. line
 breaking and is being used only By convention.

Like Freud's cigar, sometimes a may is just a may.  And I suspect
the phrase any intelligent typographer MAY generate some flak from
typographers on this list who consider themselves intelligent enough
yet have a different opinion.

I'm not a typographer (intelligent or otherwise), but I'm having a tough
time seeing how Section 2.10 *requires* fonts and rendering engines to
give a space-plus-combining-diacritic combination the exact minimum
width of the diacritic alone, or to leave equal space before and after
such a combination.  All I think it is saying is that, for example, the
combination i-plus-tilde may be wider than i alone, because tilde is
wider than i.

 When the specific alignment of isolated glyphs is important to me, I
 use markup.  I'm a big supporter of plain text, as many members of
 this list know, but the exact spacing of isolated combining marks
 seems like a layout issue to me.

 OK, what kind of markup should I use, in any well-known markup
 language, to ensure that an isolated diacritic is centred in the space
 between the words before and after it?

All right, you've got me there.  I'll have to think about it.  But I
still think this is a layout problem, a problem having to do with glyphs
and not characters.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-05 Thread Jim Allan

Peter Kirk posted:

If I want to do this, should I explicitly encode a dotted circle, or
should I encode nothing and expect the font to generate the dotted
circle, as it often does? 
I think that the practice of a font or application automatically inserting a 
dotted circle under an orphaned combining character is dubiously compliant 
with Unicode specifications.

In http://www.unicode.org/book/preview/ch03.pdf the space characters in 
general are given class Zs:

 Zs, Zl, and Zp are considered format characters, but their membership 
in the Z (separator) class takes precedence over their membership in the 
Cf class, because the General Category assigns only a single value to 
each character. 

So the various space characters (class Zs) are also classified as format 
characters.

From http://www.unicode.org/book/ch04.pdf:

 _D13  Base character:_ a character that does not graphically combine 
with preceding character, and that is neither control nor a format 
character. 

Accordingly, by definition, spaces are not base characters.

Also from http://www.unicode.org/book/ch04.pdf:

 _D14  Combining character:_  a character that graphically combines 
with a preceding base character. The combining character is said to 
_apply_ to the base character. 

So we know what happens when a combining character follows a base 
character. It combines with it.

What happens when a combining character follows a character that is not 
a base character or appears initially? The same source explains:

 o Even though a combining character is intended to be presented in 
graphical combination with a base character, circumstances may arise 
where either (1) no base character precedes the combining character or 
(2) a process is unable to perform graphical combination. In both cases 
it may present a combining character without graphical combination; that 
is, it may present it as if it were a base character.

o The representative images of combining characters are depicted with a 
dotted circle in the code charts; when presented in a graphical 
combination with a preceding base character, that base character is 
intended to appear in the position occupied by the dotted circle. 

So a display device *may* present an orphaned combining character as 
suggested.

But the word may is weak.  Are there other things it may do that would 
still be compliant with Unicode?  May it ignore the character 
altogether? May it display the character as U+FFFD REPLACEMENT 
CHARACTER? May it display it over some other character altogether, 
perhaps even U+25CC DOTTED CIRCLE? This is the only way I can justify 
the display of U+25CC DOTTED CIRCLE in such cases by the Unicode 
specifications.

But then is there any display that is not acceptable according to 
these specifications?

Note that even if an application takes the suggestion made here, the 
combination of the non-base character SPACE followed by a combining 
character would be rendered as the non-base character SPACE followed by 
the combining character rendered as a base character. They would not be 
combined.

From the same source:

 _D17a  Defective combining character sequence:_ a combining character 
sequence that does not start with a base character.

o Defective combining character sequences occur when a sequence of 
combining characters appears at the start of a string or follows a 
control or format character. Such sequences are defective from the point 
of view of handling of combining marks, but are not _ill-formed_. (See D30.)

Accordingly any space character followed by a combining character is a 
defective combining character sequence.
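
For readers who want to test that conclusion, here is a minimal Python sketch of D17a as a check. It assumes the reading given by other posters in this thread, in which the format characters are Cf, Zl and Zp but not Zs; under that assumption SPACE counts as a base character and the result differs from the conclusion above. The names NON_BASE, is_base and defective_starts are invented for the example, and unassigned or private-use code points are not treated specially.

```python
import unicodedata

# Assumption: combining marks (Mn, Mc, Me), controls (Cc) and format
# characters (Cf, Zl, Zp) are the categories that cannot be a base.
NON_BASE = {"Mn", "Mc", "Me", "Cc", "Cf", "Zl", "Zp"}

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    return unicodedata.category(ch) not in NON_BASE

def defective_starts(text):
    """Indices of combining marks that begin a defective combining
    character sequence: at the start of the string, or directly after
    a character that is neither a base nor another combining mark."""
    hits = []
    for i, ch in enumerate(text):
        if not is_mark(ch):
            continue
        prev = text[i - 1] if i > 0 else None
        if prev is None or not (is_base(prev) or is_mark(prev)):
            hits.append(i)
    return hits

print(defective_starts("\u0301"))        # lone mark at start of string
print(defective_starts(" \u0301"))       # mark applied to SPACE
print(defective_starts("\u200D\u0301"))  # mark after ZERO WIDTH JOINER (Cf)
```

Adding "Zs" to NON_BASE reproduces the other reading, in which SPACE followed by a combining mark is reported as defective.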

From http://unicode.org/book/ch07.pdf

 *Marks as Spacing Characters.* By convention, combining marks may be 
exhibited in (apparent) isolation by applying them to U+0020 SPACE or to 
U+00A0 NO-BREAK SPACE. This approach might be taken, for example, when 
referring to the diacritical mark itself as a mark, rather than by using 
it in its normal way in text. The use of U+0020 SPACE versus U+00A0 
NO-BREAK SPACE affects line-break behavior.

The words by convention are odd. They perhaps acknowledge that this 
shouldn't work according to other general Unicode rules and definitions.

This passage, however, does not even hint that by convention a dotted 
circle should appear under the diacritic.

Presumably if someone wanted a combining character applied to a dotted 
circle that person would code U+25CC followed by the combining character.

One could fix this messiness by changing the definition of base 
character to specifically include U+0020 SPACE and U+00A0 NO-BREAK 
SPACE. That in effect is exactly what the above passage does. So do it in a 
structured manner by making it part of the rule instead of burying it in 
the text as an odd exception to the rule.

But it does seem philosophically odd that U+0020 and U+00A0 alone of the 
category Zs characters should be especially singled out.

It would be more intuitive if all Zs characters could be included in the

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-05 Thread Ted Hopp

On Tuesday, August 05, 2003 5:40 PM, Mark Davis wrote:
 Where did you get the notion that space is not a base character? And
 base characters include those that are not control or format
 characters. Space is neither one.

Well, I think Jim Allan pointed to the source of this notion in his email of
a few hours ago.

1) From the UCD:
0020;SPACE;Zs;...

2) From Unicode 3, Section 4.5, third paragraph (in its entirety):
Zs, Zl, and Zp are considered format characters, but their membership in
the Z (separator) class takes precedence over their membership in the Cf
class, because General Category assigns only a single value to each
character.

I believe that reasonable people might reasonably conclude from factoids 1
and 2 that SPACE is indeed a format character.

Reasonable, but evidently wrong. Explanation, please?

Ted

Ted Hopp, Ph.D.
ZigZag, Inc.
[EMAIL PROTECTED]
+1-301-990-7453

newSLATE is your personal learning workspace
   ...on the web at http://www.newSLATE.com/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-05 Thread Curtis Clark

on 2003-08-05 15:31 Peter Kirk wrote:
Thank you, Mark. This helps to clarify things, but still doesn't 
explicitly answer my question of how to encode a sentence like In this 
language the diacritic ^ may appear above the letters ..., but instead 
of ^ I want to use a combining character  and want to display exactly 
one space before the combining character - do I encode two spaces or one?
In this language the diacritic   may appear above the letters...

Two spaces, at least in Thunderbird Mail.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-05 Thread Peter Kirk

On 05/08/2003 15:09, Mark Davis wrote:

 Zs, Zl, and Zp are considered format characters, but their
membership in the Z (separator) class takes precedence over their
membership in the Cf class, because the General Category assigns
only a single value to each character.

Whenever you have a question about the status of a character, you need
to look it up in the UCD. You can either do that by going through the
unicode website, or if you want a more readable interface, use the ICU
character browser, which formats that data.
Look at space, U+0020.

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?go=0020ch.x=4ch.y=7

The general category is Space_Separator, *not* a format character.

Now the wording there could definitely be clearer, but the operative phrase
is:

...but their
membership in the Z (separator) class *takes precedence* over their
membership in the Cf class...

So it would be clearer to say something like:

In many ways the characters, Zs, Zl, and Zp, are similar to format
characters, but because their general usage is significantly different
they are broken out into a separate General Category, as Separator
characters.
Mark
__
http://www.macchiato.com
  Eppur si muove 
 

 

Thank you, Mark. This helps to clarify things, but still doesn't 
explicitly answer my question of how to encode a sentence like In this 
language the diacritic ^ may appear above the letters ..., but instead 
of ^ I want to use a combining character  and want to display exactly 
one space before the combining character - do I encode two spaces or one?

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-05 Thread Doug Ewell

Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

 Suppose for example I want to write a sentence like In this language
 the diacritic ^ may appear above the letters ..., but instead of ^ I
 want to use a combining character, a regularly positioned centred
 above the letter diacritic, which does not have a defined spacing
 variant. I don't want a dotted circle. And I want it to be spaced as
 here, i.e. with one space before the diacritic and one after it. It
 seems to me that at one place in the standard I am told to encode
 space - combining mark - space, for the combining mark will not
 combine with the space because the space is not a base character; and
 in another place I am implicitly told to encode space - space -
 combining mark - space, because the second space acts as a carrier for
 the combining mark.

space + (space + combining character) + space
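
Spelled out as code points, that encoding might look like the following Python sketch. U+0302 COMBINING CIRCUMFLEX ACCENT stands in for the unspecified diacritic, and the helper names are made up for the illustration; grouping a mark with the character before it is a simplification of combining character sequences, not a full segmentation algorithm.

```python
import unicodedata

MARK = "\u0302"  # assumption: any nonspacing mark would do here

# separating space + (carrier space + combining mark) + separating space
middle = " " + (" " + MARK) + " "
sentence = "the diacritic" + middle + "may appear"

def code_points(text):
    return [f"U+{ord(ch):04X}" for ch in text]

def combining_sequences(text):
    """Group each character with the combining marks that follow it."""
    seqs = []
    for ch in text:
        if seqs and unicodedata.category(ch).startswith("M"):
            seqs[-1] += ch
        else:
            seqs.append(ch)
    return seqs

print(code_points(middle))
print([code_points(seq) for seq in combining_sequences(middle)])
```

The second line of output shows three units: a plain space, the space carrying the mark, and another plain space.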

 Perhaps a simple way ahead would be to define a new character
 something like COMBINING MARK HOLDER...

Uhh, no.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
