Re: Questions on ZWNBS - for line initial holam plus alef

Philippe Verdy Mon, 11 Aug 2003 14:44:25 -0700

On Monday, August 11, 2003 2:05 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:


> Um, no. Precisely because it would introduce *another* way
> to do what is already specified in the standard. It would, I
> predict, lead to nothing but more trouble.
> 
> You might, perhaps, find it satisfying, but I can guarantee
> that there would then be a future critic complaining about
> an unnecessary distinction introduced into the standard. And
> then there would be *more* text in different places of the
> standard to try to correct and change, in an attempt to
> try to make consistent distinctions between the behavior
> of <SPACE, NSM> and <ACCENT_ANCHOR, NSM>.

I don't think so: for texts that are already coded with SPACE+NSM,
it won't be needed to do changes, as long as applications using
them are satisfied with their existing behavior, even if it's ambiguous
or causes problems in other applications. The rule would be not to
change things, but offer to writers a way to create new texts without
those ambiguities and problems, and correct them if authors wish it.

For me, the "ACCENT ANCHOR" if you call it like this, is solving the
usage of isolated diacritics as plain letters (such as the implied missing
y in Hebrew Yerushala(y)im), and so would behave like an alphabetic
character (whose directionality is still to define...)

Existing coded spacing diacritics are coded as symbols (Sk) and
mostly for accents used in LTR scripts, so the confusion of these
symbols with letters behavior in some UAX's which give them the AL
property (including for one case of SPACE+NSM) is not a problem.

The usage as symbols is mostly correct for the case where a text is
speaking about a diacritic as a isolated symbol and not within words
(this is correct for most languages).

The usage within words (for an implied missing base letter, including
when this missing letter is an initial) leaves a distinct hole (for example
if one was trying to encode a word like "(Y)erushala(y)im", where the
missing base letter is the initial. For languages like Arabic and South-Asian
scripts, there's no problem as there already is a base letter to hold
initial combining vowel signs, which also works for the case of multiple
combining vowels which should not stack but be writtenon this base
letter. In fact in those languages, the missing consonnantal base letter
is actually written with a visible glyph.

But for Latin, Cyrillic, Greek, Hebrew, and probably other scripts, their
isolated diacritics are missing a explicit coded form. And there is still
the need even for Arabic and Brahmic scripts to be able to speak about
the diacritic itself, without an explicit base letter, and so the SPACE+NSM
combining sequence is for now the only solution with its undocumented
properties problems.

Reread some UAXes to see the problematic impact of SPACE+NSM in
areas which are NOT related to rendering, notably when extracting word
sequences (for search and indexing), managing keyboard selections,
computing line breaks, and handling the directionality. Now consider the
even greater impact with the legac use os SPACE as a normalizable
padding whitespace (a key feature of SGML, HTML and XML), and the
legacy use of SPACE+NSM cause too many problems that won't
satisfy authors, which in some case will not be able to use it as it will
not work as expected. Due to these problems, authors are then using
even worse hacks, like using a control before the NSM, even if it creates
"defective" combining sequences, and the dotted circle is sometimes
displayed, and even if it is parsed with an invisible but still additional
grapheme cluster for the control itself, whose presence is a pollution.

Instead of forcing authors to use defective combining sequences like
control+NSM, which would be a even worse hack, why not designating
a clean and pure invisible base character with the required properties,
so that it creates a pure combining sequence for the isolated diacritic(s)?

So the question is which invisible base character(s) to define, with
which properties?
- A invisible symbolic base character (Sk), with neutral directionality (I
called it a INVISIBLE SYMBOL);
- A invisible letter base character (Lo) with neutral directionality (you call
it a ACCENT ANCHOR, and I called it a INVISIBLE LETTER), or
- A invisible letter base character (Lo) with LTR directionality and
- A invisible letter base character (Lo) with RTL directionality

Personnally, the term ACCENT ANCHOR seems ambiguous and does
not indicate precisely the usage (it fits more like the current ambiguous
usage of SPACE as this anchor for accents), and it seems restrictive to
the kind of diacritic or other combining mark that may (should?) be
applied to it. In addition, nothing would forbid to combine several
diacritics or marks on this base character.

Consider then these new characters are better base characters than
SPACE, and define them with only a compatibility decomposition to
SPACE, to match the previous encoding. If those new base characters
are used without diacritics, they will be shown like the glyph for NBSP,
and not necessarily as zero-width (there's no requirement for these
invisible symbols to be zero-width in all cases, as this is a more precise
substitution for the legacy SPACE, but without the associated whitespace
properties). With these new characters, there is no need to change the
rules in the various UAX's and other Unicode algorithms.

-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Re: Questions on ZWNBS - for line initial holam plus alef

Reply via email to