Re: A few questions about decomposition, equvalence and rendering
Kenneth Whistler wrote:

> ... See CompositionExclusions.txt, which
> has a special section mentioning just these four oddballs:
>
> #
> # (4) Non-Starter Decompositions
> # These characters can be derived from the UnicodeData file
> # by including all characters whose canonical decomposition consists
> # of a sequence of characters, the first of which has a non-zero
> # combining class.

Shouldn't that say, "a sequence of two characters"? Taken literally this
definition includes characters with a canonical decomposition that is a
single combining character.

(To forestall the obvious objection, no, using the plural does not imply
more than one: any decomposition is "a sequence of characters".)

--
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
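The difference between the two readings of the derivation rule can be checked mechanically. What follows is only a sketch, assuming Python's standard `unicodedata` module (which reflects whatever Unicode version the interpreter ships, not the 3.2 draft discussed here); the helper names `canonical_decomposition` and `non_starter_decompositions` are made up for illustration.

```python
import sys
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Compatibility mappings (tagged like '<compat>') and characters with
    no decomposition both yield an empty list.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return []
    return [int(field, 16) for field in fields]


def non_starter_decompositions(at_least_two=False):
    """Collect code points whose canonical decomposition starts with a
    character of non-zero combining class.

    With at_least_two=True, single-character decompositions are skipped,
    which is the narrower reading suggested in the message above.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        decomp = canonical_decomposition(chr(cp))
        if not decomp:
            continue
        if at_least_two and len(decomp) < 2:
            continue
        if unicodedata.combining(chr(decomp[0])) != 0:
            found.append(cp)
    return found


literal = non_starter_decompositions()
restricted = non_starter_decompositions(at_least_two=True)
print("literal reading:  ", [f"{cp:04X}" for cp in literal])
print("two or more chars:", [f"{cp:04X}" for cp in restricted])
```

Under the literal reading the singleton cases such as U+0340 show up alongside the four listed characters; the restricted reading drops them.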
Re: A few questions about decomposition, equvalence and rendering
Juliusz continued:

> KW> There is no good reason to invent composite combining marks
> KW> involving two accents together. (In fact, there are good reasons
> KW> *not* to do so.) The few that exist, e.g. U+0344, cause
> KW> implementation problems and are discouraged from use.
>
> What are those problems? As long as they have canonical
> decompositions, won't such precomposed characters be discarded at
> normalisation time, hopefully during I/O?
>
> (I'm not arguing in favour of precomposed characters; I'm just saying
> that my gut instinct is that we have to deal with normalisation
> anyway, and hence they don't complicate anything further; I'd be
> curious to hear why you think otherwise.)

Perhaps I overstated the case slightly. It is true enough that if
you are working with normalized data, U+0344 gets normalized away:

% egrep 0344 NormalizationTest-3.2.0d6.txt
0344;0308 0301;0308 0301;0308 0301;0308 0301; # ... COMBINING GREEK DIALYTIKA TONOS

and you just end up with an otherwise typical sequence of combining
marks.

However, the complication is in the statement of the algorithm, where
you end up having to talk about (and include in your tables) the
"Non-Starter Decompositions". See CompositionExclusions.txt, which
has a special section mentioning just these four oddballs:

#
# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.
# These characters are simply quoted here for reference.
#
# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

Note also that all four of these characters get "use of this character
is discouraged" notes in the Unicode names list.

These characters also result in a problematical edge case for
processing of the tables for the Unicode Collation Algorithm to
provide proper weightings.

> >> does anyone [have] a map from mathematical characters to the
> >> Geometric Shapes, Misc. symbols and Dingbats that would be useful
> >> for rendering?
>
> KW> As opposed to the characters themselves? I'm not sure what you
> KW> are getting at here.
>
> The user invokes a search for ``f o g'' (the composite of g with f),
> and she entered U+25CB WHITE CIRCLE. The document does contain the
> required formula, but encoded with U+2218 RING OPERATOR. The user's
> input was arguably incorrect, but I hope you'll agree that the search
> should match.
>
> I'm rendering a document that contains U+2218. The current font
> doesn't contain a glyph associated to this codepoint, but it has a
> perfectly good glyph for U+25CB. The rendering software should
> silently use the latter.
>
> Analogous examples can be made for the ``modifier letters''.
>
> I'll mention that I do understand why these are encoded separately[1],
> and I do understand why and how they will behave differently in a
> number of situations. I am merely noting that there are applications
> (useful-in-practice search, rendering) where they may be identified or
> at least related, and I am wondering whether people have already
> compiled the data necessary to do so.

I don't think so -- at least not officially within the Unicode
Consortium. This is concerned with shape similarities that go beyond
the kind of character folding implicit in the Unicode Collation
Algorithm.

The Unicode names list provides a considerable number of
cross-references for similarly-shaped characters and confusables, but
this is, of course, far short of a detailed listing that could be used
as the basis of a specification for shape-based folding for search
purposes.

--Ken
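For anyone wanting to reproduce the NormalizationTest line without grepping the data file, here is a small sketch using Python's standard `unicodedata` module. It assumes the interpreter's bundled Unicode data rather than the 3.2.0d6 draft quoted above, though the behaviour of U+0344 should be the same.

```python
import unicodedata

DIALYTIKA_TONOS = "\u0344"  # COMBINING GREEK DIALYTIKA TONOS

# Normalize the lone character under all four normalization forms.
forms = {
    form: unicodedata.normalize(form, DIALYTIKA_TONOS)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, result in forms.items():
    print(form, " ".join(f"{ord(c):04X}" for c in result))

# Both resulting marks are ordinary non-starters in the same class.
classes = [unicodedata.combining(c) for c in forms["NFD"]]
print("combining classes:", classes)

# After a base letter, the decomposed marks recompose normally:
# iota + U+0344 ends up as the precomposed iota with dialytika and tonos.
composed = unicodedata.normalize("NFC", "\u03b9" + DIALYTIKA_TONOS)
print("iota + U+0344 in NFC:", f"{ord(composed):04X}")
```

The character never survives normalization in any form, which is why the only remaining cost is the special case in the algorithm's statement and tables.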
Re: A few questions about decomposition, equvalence and rendering
Thanks a lot for the explanations.

KW> There is no good reason to invent composite combining marks
KW> involving two accents together. (In fact, there are good reasons
KW> *not* to do so.) The few that exist, e.g. U+0344, cause
KW> implementation problems and are discouraged from use.

What are those problems? As long as they have canonical
decompositions, won't such precomposed characters be discarded at
normalisation time, hopefully during I/O?

(I'm not arguing in favour of precomposed characters; I'm just saying
that my gut instinct is that we have to deal with normalisation
anyway, and hence they don't complicate anything further; I'd be
curious to hear why you think otherwise.)

>> As far as I can tell, there is nothing in the Unicode database that
>> relates a ``modifier letter'' to the associated punctuation mark.

KW> Correct. They are viewed as distinct classes.

>> does anyone [have] a map from mathematical characters to the
>> Geometric Shapes, Misc. symbols and Dingbats that would be useful
>> for rendering?

KW> As opposed to the characters themselves? I'm not sure what you
KW> are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f),
and she entered U+25CB WHITE CIRCLE. The document does contain the
required formula, but encoded with U+2218 RING OPERATOR. The user's
input was arguably incorrect, but I hope you'll agree that the search
should match.

I'm rendering a document that contains U+2218. The current font
doesn't contain a glyph associated to this codepoint, but it has a
perfectly good glyph for U+25CB. The rendering software should
silently use the latter.

Analogous examples can be made for the ``modifier letters''.

I'll mention that I do understand why these are encoded separately[1],
and I do understand why and how they will behave differently in a
number of situations. I am merely noting that there are applications
(useful-in-practice search, rendering) where they may be identified or
at least related, and I am wondering whether people have already
compiled the data necessary to do so.

Thanks again,

Juliusz

[1] Offtopic: I have mixed feelings on the inclusion of STICS. On the
one hand it's great to at last have a standardised encoding for math
characters, on the other I feel it is based on very different encoding
principles than the rest of Unicode.
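To make the search and rendering cases concrete, here is a rough sketch of what such a table might look like in use. Everything in it is hypothetical: the `LOOKALIKE_GROUPS` pairs are hand-picked for illustration and are not Unicode Consortium data, and `fold_shapes`, `shape_insensitive_find` and `pick_renderable` are invented names.

```python
# Hypothetical look-alike groups; the first member of each group is the
# representative that the others are folded to for searching.
LOOKALIKE_GROUPS = [
    ("\u2218", "\u25CB"),  # RING OPERATOR, WHITE CIRCLE
    ("\u2019", "\u02BC"),  # RIGHT SINGLE QUOTATION MARK, MODIFIER LETTER APOSTROPHE
]

# Translation table for str.translate: code point -> representative.
FOLD = {ord(member): group[0] for group in LOOKALIKE_GROUPS for member in group}


def fold_shapes(text):
    """Replace every listed look-alike by its group's representative."""
    return text.translate(FOLD)


def shape_insensitive_find(document, query):
    """Like str.find, but treats listed look-alikes as equal.

    The folding is one character to one character, so the returned index
    is valid for the original document as well.
    """
    return fold_shapes(document).find(fold_shapes(query))


def pick_renderable(ch, font_coverage):
    """Return ch if the font covers it, otherwise a covered look-alike,
    otherwise ch unchanged (leaving the usual missing-glyph handling)."""
    if ch in font_coverage:
        return ch
    for group in LOOKALIKE_GROUPS:
        if ch in group:
            for alternative in group:
                if alternative in font_coverage:
                    return alternative
    return ch


# The formula uses RING OPERATOR, the query uses WHITE CIRCLE.
print(shape_insensitive_find("h = f \u2218 g", "f \u25CB g"))
# The font lacks U+2218 but has U+25CB.
print(repr(pick_renderable("\u2218", {"f", "g", "\u25CB"})))
```

The hard part is of course the data, not the code: the two entries above stand in for a table that someone would have to compile and maintain.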
Re: A few questions about decomposition, equvalence and rendering
JC> It's pretty much a given that a normalization form that meddles with
JC> plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC> The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with what?

Juliusz
Re: A few questions about decomposition, equvalence and rendering
Juliusz wrote:

> Spacing diacritical marks (e.g. U+00A8) have compatibility
> decompositions of the form 0020 . Why are these not canonical
> decompositions?

Unicode legacy, and the involvement of the SPACE. This stuff goes
back to Unicode 1.1, which attempted to codify the Unicode
recommendation for how to display accents in isolation (by applying
them to SPACE characters) by putting "[0020]&[0301]", etc.
decompositions in the character list. In Unicode 2.0, the UTC had to
make a decision as to whether these were canonical or compatibility
decompositions, and decided on compatibility basically for two
reasons:

1. the spacing diacritics were considered compatibility characters
   (even the ASCII ones), and
2. application of a non-spacing accent onto a SPACE wasn't guaranteed
   to produce exactly the same rendition as use of an atomic
   non-spacing form, particularly for the ASCII spacing accents, so it
   seemed inadvisable to assert that these were canonical
   equivalences.

> Under what circumstances would you expect the spacing
> marks to behave differently from their decompositions?

Well, for example, when used as, or confused with, modifier letters.

>
> The two that are in ASCII don't decompose. Is that because they're
> overloaded?

John Cowan provided the answer here: normalization considerations.

>
> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character. In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances. What's
> the deal?

Assertions by the UTC that these are effectively duplicate characters.

U+0340 and U+0341 were mistakes made when Vietnamese was being
considered for encoding.

U+0343 is a result of the Greek committee asserting that Koronis and
Psili are distinct (which is how they got into 10646), but the UTC
assertion is that they are the same *character*, although with
slightly different functions (as for different functions of other
accent marks and diacritics in Latin).

"Über das Zeichen der Mischung wird ein Häkchen, korooni's, gesetzt.
Es sieht aus wie ein spiritus lenis, ist aber keiner, denn dieser kann
nur am Anfang eines Wortes stehen. Die Koronis zeigt nur an, daß eine
Krasis stattgefunden hat."

[In English: "Over the sign of the mingling a small hook, the koronis,
is placed. It looks like a spiritus lenis, but is not one, since the
latter can only stand at the beginning of a word. The koronis merely
indicates that a crasis has taken place."]

"62. Crasis (kraasis _mingling_) is the contraction of a vowel or
diphthong at the end of a word with a vowel or diphthong beginning the
following word. Over the syllable resulting from contraction is placed
a {{apostrophe glyph}} called coroonis (hook), as ... [examples
follow]" -- Smyth.

So you can get a psili (smooth breathing, spiritus lenis) at the
beginning of a vowel-initial Greek word in polytoniko, but a koronis
in the middle of a word, over the vowel where crasis has occurred.

In any case, the hook itself appears identical, and there is no
character encoding reason to distinguish them. Hence the canonical
decomposition of U+0343.

> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1). However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions. What's the rationale here?

They have combining versions of each of the pieces, for the composite
ones. There is no good reason to invent composite combining marks
involving two accents together. (In fact, there are good reasons
*not* to do so.) The few that exist, e.g. U+0344, cause implementation
problems and are discouraged from use.

>
> (Similar precomposed diacritical marks do not seem to exist for
> Vietnamese, which makes me think they've been included for
> compatibility with legacy encodings rather than for a good reason.

The Greek precomposed spacing accents came in as a lump from the Greek
national body ELOT for polytoniko Greek, and had to be accepted into
Unicode as part of the merger compromise with 10646.

> Still, because their decompositions are not canonical, they need to be
> taken into account, which in my case complicates what would otherwise
> be somewhat cleaner code.)

True enough. The UTC disliked them from the beginning, but had no
choice in the matter.

> When rendering stacked combining characters (i.e. sequences of
> combining characters with the same non-zero combining class), which
> sequences need to be treated specially (as opposed to being stacked on
> top of each other)? I already know about the pairs needed for Greek
> (both Mono- and Polytonic) and Vietnamese.

I don't know of any other regular, language-specific exceptions. But
you can expect to occasionally run into typographically-based
exceptional behavior whenever an orthography results in a requirement
to stack diacritics top or bottom.

>
> As far as I can tell, there is nothing in the Unicode database that
> relates a ``modifier letter'' to the associated punctuation mark. Is
> that right?

Correct. They are viewed as distinct classes.

> Does anyone have s
Re: A few questions about decomposition, equvalence and rendering
Lukas Pietsch wrote:

> U+1FC1 is spacing in all the fonts that I've seen.

Oops. Of course it is.

--
John Cowan <[EMAIL PROTECTED]>
http://www.reutershealth.com
http://www.ccil.org/~cowan
I amar prestar aen, han mathon ne nen,
han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_
Re: A few questions about decomposition, equvalence and rendering
John Cowan wrote:

>
> Eh? U+1FC1 *is* nonspacing. The U+1Fxx ones are the spacing
> compatibility equivalents, except for this one.
>

U+1FC1 is spacing in all the fonts that I've seen. And it decomposes
to U+00A8 U+0342 (canonically), i.e. to a sequence of spacing plus
non-spacing character. At least it did so in Unicode 3.0.

Not that I would bother much - I have no idea where that character
should ever be used.

Lukas Pietsch
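The properties mentioned above are easy to look up. A quick check, assuming Python's standard `unicodedata` module and its bundled Unicode data (which is newer than Unicode 3.0):

```python
import unicodedata

ch = "\u1FC1"  # GREEK DIALYTIKA AND PERISPOMENI

name = unicodedata.name(ch)
category = unicodedata.category(ch)            # general category
combining_class = unicodedata.combining(ch)    # 0 means starter
decomposition = unicodedata.decomposition(ch)  # no <tag> means canonical

print(name, category, combining_class, decomposition)

# Canonical decomposition: spacing DIAERESIS plus a combining mark.
nfd = unicodedata.normalize("NFD", ch)
print("NFD: ", " ".join(f"{ord(c):04X}" for c in nfd))

# Compatibility decomposition additionally splits the spacing DIAERESIS
# into SPACE plus COMBINING DIAERESIS.
nfkd = unicodedata.normalize("NFKD", ch)
print("NFKD:", " ".join(f"{ord(c):04X}" for c in nfkd))
```

A general category of Sk (modifier symbol) together with combining class 0 is what "spacing" amounts to in the database, whatever individual fonts do.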
Re: A few questions about decomposition, equvalence and rendering
Juliusz Chroboczek wrote:

> The two that are in ASCII don't decompose. Is that because they're
> overloaded?

It's pretty much a given that a normalization form that meddles with
plain ASCII text isn't going to get used. It was I (ahem) who spotted
this discrepancy a while back, and the compatibility decompositions of
ASCII characters were quickly removed.

> A number of combining characters (e.g. U+0340, U+0341, U+0343) have
> canonical equivalents, i.e. canonical decompositions that are a single
> character. In other words, we have pairs of codepoints that are bound
> to behave in exactly the same manner under all circumstances. What's
> the deal?

The first two are deprecated. They were originally intended to deal
with the special treatment of acute and grave in Vietnamese, which are
kerned next to rather than above the circumflex accent when they are
used together. (Acute and grave are tone marks; circumflex marks a
distinct vowel.) However, this is properly a font issue, not a
character issue.

I don't know the exact story for CORONIS, but I bet it's some kind of
political issue.

> Unicode contains a number of precomposed spacing diacritical marks for
> Greek (e.g. U+1FC1). However, and unless I've missed something, with
> the exception of U+0385, they do not have combining (non-spacing)
> versions. What's the rationale here?

Eh? U+1FC1 *is* nonspacing. The U+1Fxx ones are the spacing
compatibility equivalents, except for this one.

--
John Cowan <[EMAIL PROTECTED]>
http://www.reutershealth.com
http://www.ccil.org/~cowan
I amar prestar aen, han mathon ne nen,
han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_
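The three cases discussed in this message (a non-ASCII spacing diacritic, the ASCII accents, and the singleton combining marks) can be compared side by side. This is a sketch assuming Python's standard `unicodedata` module; the `describe` helper is just an illustrative name.

```python
import unicodedata


def describe(ch):
    """Return (raw decomposition field, NFC as hex, NFKC as hex) for ch."""
    def as_hex(text):
        return " ".join(f"{ord(c):04X}" for c in text)
    return (
        unicodedata.decomposition(ch),
        as_hex(unicodedata.normalize("NFC", ch)),
        as_hex(unicodedata.normalize("NFKC", ch)),
    )


samples = {
    "\u00A8": "DIAERESIS (spacing, non-ASCII)",
    "\u005E": "CIRCUMFLEX ACCENT (ASCII)",
    "\u0060": "GRAVE ACCENT (ASCII)",
    "\u0340": "COMBINING GRAVE TONE MARK",
    "\u0341": "COMBINING ACUTE TONE MARK",
    "\u0343": "COMBINING GREEK KORONIS",
}

for ch, label in samples.items():
    decomposition, nfc, nfkc = describe(ch)
    print(f"U+{ord(ch):04X} {label}: decomp={decomposition!r} "
          f"NFC={nfc} NFKC={nfkc}")
```

U+00A8 only changes under the compatibility forms, the ASCII accents never change, and the singletons are replaced by their equivalents even in NFC, which is what makes each pair interchangeable in normalized text.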