Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread David Hopwood

Kenneth Whistler wrote:
> ... See CompositionExclusions.txt, which
> has a special section mentioning just these four oddballs:
> 
> # 
> # (4) Non-Starter Decompositions
> # These characters can be derived from the UnicodeData file
> # by including all characters whose canonical decomposition consists
> # of a sequence of characters, the first of which has a non-zero
> # combining class.

Shouldn't that say, "a sequence of two characters"? Taken literally
this definition includes characters with a canonical decomposition
that is a single combining character.

(To forestall the obvious objection, no, using the plural does not
imply more than one: any decomposition is "a sequence of characters".)
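
The difference between the two readings can be checked mechanically. The following is a minimal sketch, assuming Python's standard `unicodedata` module, whose tables follow whichever Unicode version the interpreter ships rather than the version discussed in this thread. The helper name `nonstarter_decompositions` is made up for illustration.

```python
import sys
import unicodedata

def nonstarter_decompositions(min_length=1):
    """Code points whose canonical decomposition starts with a non-starter.

    min_length=1 is the literal reading of the comment (any sequence);
    min_length=2 is the "sequence of two characters" reading.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters without a decomposition, and compatibility
        # decompositions, which carry a tag such as "<compat>".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [int(p, 16) for p in decomp.split()]
        first_class = unicodedata.combining(chr(parts[0]))
        if len(parts) >= min_length and first_class != 0:
            found.append(cp)
    return found

literal = nonstarter_decompositions(1)
two_or_more = nonstarter_decompositions(2)

# Characters caught only by the literal reading (singleton decompositions).
singletons_only = sorted(set(literal) - set(two_or_more))
for cp in singletons_only:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Under the literal reading the result should also pick up singleton cases such as U+0340, U+0341 and U+0343, which is the discrepancy pointed out above.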

-- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip






Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread Kenneth Whistler

Juliusz continued:

> KW> There is no good reason to invent composite combining marks
> KW> involving two accents together. (In fact, there are good reasons
> KW> *not* to do so.) The few that exist, e.g. U+0344, cause
> KW> implementation problems and are discouraged from use.
> 
> What are those problems?  As long as they have canonical
> decompositions, won't such precomposed characters be discarded at
> normalisation time, hopefully during I/O?
> 
> (I'm not arguing in favour of precomposed characters; I'm just saying
> that my gut instinct is that we have to deal with normalisation
> anyway, and hence they don't complicate anything further; I'd be
> curious to hear why you think otherwise.)

Perhaps I overstated the case slightly. It is true enough that if
you are working with normalized data, U+0344 gets normalized away:

% egrep 0344 NormalizationTest-3.2.0d6.txt
0344;0308 0301;0308 0301;0308 0301;0308 0301; # ... COMBINING GREEK DIALYTIKA TONOS

and you just end up with an otherwise typical sequence of combining marks.
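
The same check can be reproduced without the test file. This is a small sketch, assuming Python's standard `unicodedata` module (its data version depends on the interpreter, not on the 3.2.0d6 file quoted above):

```python
import unicodedata

DIALYTIKA_TONOS = "\u0344"  # COMBINING GREEK DIALYTIKA TONOS

# Normalize the single character under each of the four forms.
results = {
    form: unicodedata.normalize(form, DIALYTIKA_TONOS)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    # Print the code points in the same hex notation as the test file.
    print(form, " ".join(f"{ord(ch):04X}" for ch in text))
```

Every form should come out as the two-mark sequence 0308 0301, matching the four identical columns in the test line.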

However, the complication is in the statement of the algorithm,
where you end up having to talk about (and include in your tables)
the "Non-Starter Decompositions". See CompositionExclusions.txt, which
has a special section mentioning just these four oddballs:

# 
# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.
# These characters are simply quoted here for reference.
# 

# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

Note also that all four of these characters get "use of this character
is discouraged" notes in the Unicode names list.

These characters also create a problematic edge case when processing
the tables for the Unicode Collation Algorithm to provide proper
weightings.

> >> does anyone [have] a map from mathematical characters to the
> >> Geometric Shapes, Misc. symbols and Dingbats that would be useful
> >> for rendering?
> 
> KW> As opposed to the characters themselves? I'm not sure what you
> KW> are getting at here.
> 
> The user invokes a search for ``f o g'' (the composite of g with f),
> and she entered U+25CB WHITE CIRCLE.  The document does contain the
> required formula, but encoded with U+2218 RING OPERATOR.  The user's
> input was arguably incorrect, but I hope you'll agree that the search
> should match.
> 
> I'm rendering a document that contains U+2218.  The current font
> doesn't contain a glyph associated to this codepoint, but it has a
> perfectly good glyph for U+25CB.  The rendering software should
> silently use the latter.
> 
> Analogous examples can be made for the ``modifier letters''.
> 
> I'll mention that I do understand why these are encoded separately[1],
> and I do understand why and how they will behave differently in a
> number of situations.  I am merely noting that there are applications
> (useful-in-practice search, rendering) where they may be identified or
> at least related, and I am wondering whether people have already
> compiled the data necessary to do so.

I don't think so -- at least not officially within the Unicode
Consortium. This is concerned with shape similarities that go
beyond the kind of character folding implicit in the Unicode
Collation Algorithm.

The Unicode names list provides a considerable number of cross-references
for similarly-shaped characters and confusables, but this is, of
course, far short of a detailed listing that could be used as
the basis of a specification for shape-based folding for search
purposes.
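
Since no such data exists officially, an application would have to supply its own table. The sketch below shows one possible shape for it in Python. The table `SHAPE_FOLD` and both function names are hypothetical, and the single entry is just the example pair from this thread, not a vetted list.

```python
# Hypothetical, hand-picked look-alike pairs; this table is NOT derived
# from any Unicode data file.
SHAPE_FOLD = str.maketrans({
    "\u25CB": "\u2218",  # WHITE CIRCLE -> RING OPERATOR
})

def fold_shapes(text):
    """Map look-alike symbols to one representative before comparing."""
    return text.translate(SHAPE_FOLD)

def shape_insensitive_find(haystack, needle):
    """Return the index of needle in haystack after folding both, or -1."""
    return fold_shapes(haystack).find(fold_shapes(needle))

document = "let h = f \u2218 g"   # document uses RING OPERATOR
query = "f \u25CB g"              # user typed WHITE CIRCLE
print(shape_insensitive_find(document, query))
```

Because the table maps one character to one character, indices in the folded text line up with the original. A font-fallback table for rendering could reuse the same pairs in the opposite direction.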

--Ken





Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread Juliusz Chroboczek

Thanks a lot for the explanations.

KW> There is no good reason to invent composite combining marks
KW> involving two accents together. (In fact, there are good reasons
KW> *not* to do so.) The few that exist, e.g. U+0344, cause
KW> implementation problems and are discouraged from use.

What are those problems?  As long as they have canonical
decompositions, won't such precomposed characters be discarded at
normalisation time, hopefully during I/O?

(I'm not arguing in favour of precomposed characters; I'm just saying
that my gut instinct is that we have to deal with normalisation
anyway, and hence they don't complicate anything further; I'd be
curious to hear why you think otherwise.)

>> As far as I can tell, there is nothing in the Unicode database that
>> relates a ``modifier letter'' to the associated punctuation mark.

KW> Correct. They are viewed as distinct classes.

>> does anyone [have] a map from mathematical characters to the
>> Geometric Shapes, Misc. symbols and Dingbats that would be useful
>> for rendering?

KW> As opposed to the characters themselves? I'm not sure what you
KW> are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f),
and she entered U+25CB WHITE CIRCLE.  The document does contain the
required formula, but encoded with U+2218 RING OPERATOR.  The user's
input was arguably incorrect, but I hope you'll agree that the search
should match.

I'm rendering a document that contains U+2218.  The current font
doesn't contain a glyph associated to this codepoint, but it has a
perfectly good glyph for U+25CB.  The rendering software should
silently use the latter.

Analogous examples can be made for the ``modifier letters''.

I'll mention that I do understand why these are encoded separately[1],
and I do understand why and how they will behave differently in a
number of situations.  I am merely noting that there are applications
(useful-in-practice search, rendering) where they may be identified or
at least related, and I am wondering whether people have already
compiled the data necessary to do so.

Thanks again,

Juliusz

[1] Offtopic: I have mixed feelings on the inclusion of STIX.  On the
one hand it's great to at last have a standardised encoding for math
characters, on the other I feel it is based on very different encoding
principles than the rest of Unicode.




Re: A few questions about decomposition, equivalence and rendering

2002-02-06 Thread Juliusz Chroboczek

JC> It's pretty much a given that a normalization form that meddles with
JC> plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC> The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with what?

Juliusz




Re: A few questions about decomposition, equivalence and rendering

2002-02-05 Thread Kenneth Whistler

Juliusz wrote:

> Spacing diacritical marks (e.g. U+00A8) have compatibility
> decompositions of the form 0020 followed by a combining mark.  Why are
> these not canonical decompositions? 

Unicode legacy, and the involvement of the SPACE. This stuff goes
back to Unicode 1.1, which attempted to codify the Unicode
recommendation for how to display accents in isolation (by applying
them to SPACE characters) by putting "[0020]&[0301]", etc. decompositions
in the character list. In Unicode 2.0, the UTC had to make a decision
as to whether these were canonical or compatibility decompositions,
and decided on compatibility basically for two reasons: 1. the
spacing diacritics were considered compatibility characters (even
the ASCII ones), and 2. application of a non-spacing accent onto
a SPACE wasn't guaranteed to produce exactly the same rendition
as use of an atomic non-spacing form, particularly for the ASCII
spacing accents, so it seemed inadvisable to assert that these were
canonical equivalences.
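
The practical effect of that decision is visible in any normalizer. This is a brief sketch, assuming Python's standard `unicodedata` module:

```python
import unicodedata

DIAERESIS = "\u00A8"  # spacing DIAERESIS

# Raw decomposition field from UnicodeData; the "<compat>" tag marks it
# as a compatibility (not canonical) decomposition.
decomposition = unicodedata.decomposition(DIAERESIS)
print(decomposition)

# Canonical decomposition leaves the character alone ...
nfd = unicodedata.normalize("NFD", DIAERESIS)
# ... while compatibility decomposition yields SPACE + COMBINING DIAERESIS.
nfkd = unicodedata.normalize("NFKD", DIAERESIS)

print([f"{ord(ch):04X}" for ch in nfd])
print([f"{ord(ch):04X}" for ch in nfkd])
```

So the spacing mark survives NFD and NFC untouched and is only rewritten by the compatibility forms.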

> Under what circumstances would you expect the spacing
> marks to behave differently from their decompositions?

Well, for example, when used as, or confused with, modifier
letters.

> 
> The two that are in ASCII don't decompose.  Is that because they're
> overloaded?

John Cowan provided the answer here: normalization considerations.

> 
> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character.  In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances.  What's
> the deal?

Assertions by the UTC that these are effectively duplicate characters.
U+0340 and U+0341 were mistakes made when Vietnamese was being
considered for encoding. U+0343 is a result of the Greek committee
asserting that Koronis and Psili are distinct (which is how they got
into 10646), but the UTC assertion is that they are the same *character*,
although with slightly different functions (as for different functions
of other accent marks and diacritics in Latin).

"Über das Zeichen der Mischung wird ein Häkchen, korooni's, gesetzt.
Es sieht aus wie ein spiritus lenis, ist aber keiner, denn dieser
kann nur am Anfang eines Wortes stehen. Die Koronis zeigt nur an, daß
eine Krasis stattgefunden hat."

"62. Crasis (kraasis _mingling_) is the contraction of a vowel or
diphthong at the end of a word with a vowel or diphthong beginning the
following word. Over the syllable resulting from contraction is
placed a {{apostrophe glyph}} called coroonis (hook), as ... [examples
follow]" -- Smyth.

So you can get a psili (smooth breathing, spiritus lenis) at the
beginning of a vowel-initial Greek word in polytoniko, but a
koronis in the middle of a word, over the vowel where crasis has
occurred. In any case, the hook itself appears identical, and there
is no character encoding reason to distinguish them. Hence
the canonical decomposition of U+0343.
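
The singleton decompositions mentioned here can be observed directly. This is a minimal sketch, assuming Python's standard `unicodedata` module:

```python
import unicodedata

# The three combining characters with single-character canonical
# decompositions discussed above.
SINGLETONS = ["\u0340", "\u0341", "\u0343"]

# Singletons never recompose, so even NFC replaces them with their targets.
nfc = {ch: unicodedata.normalize("NFC", ch) for ch in SINGLETONS}

for source, target in nfc.items():
    print(
        f"U+{ord(source):04X} {unicodedata.name(source)} -> "
        f"U+{ord(target):04X} {unicodedata.name(target)}"
    )
```

In other words, text that has passed through any normalization form should no longer contain U+0340, U+0341 or U+0343 at all.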

> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1).  However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions.  What's the rationale here?

They have combining versions of each of the pieces, for the
composite ones. There is no good reason to invent composite
combining marks involving two accents together. (In fact, there
are good reasons *not* to do so.) The few that exist, e.g. U+0344,
cause implementation problems and are discouraged from use.

> 
> (Similar precomposed diacritical marks do not seem to exist for
> Vietnamese, which makes me think they've been included for
> compatibility with legacy encodings rather than for a good reason.

The Greek precomposed spacing accents came in as a lump from
the Greek national body ELOT for polytoniko Greek, and had to
be accepted into Unicode as part of the merger compromise with
10646.

> Still, because their decompositions are not canonical, they need to be
> taken into account, which in my case complicates what would otherwise
> be somewhat cleaner code.)

True enough. The UTC disliked them from the beginning, but had no
choice in the matter.

> When rendering stacked combining characters (i.e. sequences of
> combining characters with the same non-zero combining class), which
> sequences need to be treated specially (as opposed to being stacked on
> top of each other)?  I already know about the pairs needed for Greek
> (both Mono- and Polytonic) and Vietnamese.

I don't know of any other regular, language-specific exceptions.
But you can expect to occasionally run into typographically-based
exceptional behavior whenever an orthography results in a requirement
to stack diacritics top or bottom.
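
A renderer first has to find the sequences in question. The following sketch, assuming Python's standard `unicodedata` module, only locates runs of adjacent marks that share a non-zero combining class. The function name `same_class_runs` is made up for illustration, and deciding which runs need special placement is left to the font or renderer.

```python
import unicodedata
from itertools import groupby

def same_class_runs(text):
    """Return (combining_class, marks) for each run of two or more
    adjacent characters sharing the same non-zero combining class."""
    runs = []
    for ccc, group in groupby(text, key=unicodedata.combining):
        marks = "".join(group)
        if ccc != 0 and len(marks) > 1:
            runs.append((ccc, marks))
    return runs

# a + COMBINING CIRCUMFLEX + COMBINING ACUTE (a Vietnamese-style pair):
# both marks have class 230, so they form one run.
stacked = same_class_runs("a\u0302\u0301")

# a + COMBINING DOT BELOW (220) + COMBINING CIRCUMFLEX (230):
# different classes, so there is nothing to report.
unstacked = same_class_runs("a\u0323\u0302")

print(stacked, unstacked)
```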

> 
> As far as I can tell, there is nothing in the Unicode database that
> relates a ``modifier letter'' to the associated punctuation mark.  Is
> that right? 

Correct. They are viewed as distinct classes.

> Does anyone have s

Re: A few questions about decomposition, equivalence and rendering

2002-02-05 Thread John Cowan

Lukas Pietsch wrote:


> U+1FC1 is spacing in all the fonts that I've seen.


Oops.  Of course it is.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,  http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_





Re: A few questions about decomposition, equivalence and rendering

2002-02-05 Thread Lukas Pietsch

John Cowan wrote:
>
> Eh?  U+1FC1 *is* nonspacing.  The U+1Fxx ones are the spacing
> compatibility equivalents, except for this one.
>

U+1FC1 is spacing in all the fonts that I've seen. And it decomposes to
U+00A8 U+0342 (canonically), i.e. to a sequence of spacing plus
non-spacing character. At least it did so in Unicode 3.0.
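
This can be verified against the character database. This is a short sketch, assuming Python's standard `unicodedata` module, which reflects the Unicode version bundled with the interpreter rather than 3.0:

```python
import unicodedata

CH = "\u1FC1"  # GREEK DIALYTIKA AND PERISPOMENI

# General category: "Sk" is a spacing modifier symbol, "Mn" would be a
# non-spacing mark.
category = unicodedata.category(CH)

# Raw decomposition field; no "<...>" tag means it is canonical.
decomposition = unicodedata.decomposition(CH)

nfd = unicodedata.normalize("NFD", CH)
nfkd = unicodedata.normalize("NFKD", CH)

print(category, decomposition)
print([f"{ord(c):04X}" for c in nfd])
print([f"{ord(c):04X}" for c in nfkd])
```

NFD stops at the spacing U+00A8 plus the combining perispomeni, while NFKD goes one step further and also breaks U+00A8 into SPACE plus a combining diaeresis.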

Not that I would bother much - I have no idea where that character
should ever be used.

Lukas Pietsch






Re: A few questions about decomposition, equivalence and rendering

2002-02-05 Thread John Cowan

Juliusz Chroboczek wrote:

> The two that are in ASCII don't decompose.  Is that because they're
> overloaded?


It's pretty much a given that a normalization form that meddles with
plain ASCII text isn't going to get used.  It was I (ahem) who
spotted this discrepancy a while back, and the compatibility
decompositions of ASCII characters were quickly removed.
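
That property is easy to confirm with any current implementation. This is a minimal sketch, assuming Python's standard `unicodedata` module:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Collect every (form, code point) pair where normalization alters an
# ASCII character; the list is expected to stay empty.
changed = [
    (form, cp)
    for form in FORMS
    for cp in range(0x80)
    if unicodedata.normalize(form, chr(cp)) != chr(cp)
]

print(changed)
```

This covers the two ASCII spacing accents, U+005E and U+0060, along with everything else in the range.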


> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character.  In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances.  What's
> the deal?


The first two are deprecated.  They were originally intended to deal
with the special treatment of acute and grave in Vietnamese, which
are kerned next to rather than above the circumflex accent when they
are used together.  (Acute and grave are tone marks; circumflex marks
a distinct vowel.)  However, this is properly a font issue, not a
character issue.

I don't know the exact story for CORONIS, but I bet it's some kind
of political issue.

 
> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1).  However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions.  What's the rationale here?


Eh?  U+1FC1 *is* nonspacing.  The U+1Fxx ones are the spacing
compatibility equivalents, except for this one.

 

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,  http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_