Re: Major Defect in Combining Classes of Tibetan Vowels
Rick McGowan [EMAIL PROTECTED] has privately suggested moving the discussion of Combining Classes of *Tibetan* Characters from the main Unicode list [EMAIL PROTECTED] to the TIBEX list [EMAIL PROTECTED] - an experts list which was set up several years ago specifically to discuss proposals for encoding Tibetan characters in Unicode. If there are people who have a particular interest in Tibetan characters and have been following the thread here who would like to continue following this thread - perhaps they could ask Rick how they can join that list. I'll follow Rick's advice - perhaps this discussion is more appropriate on the TIBEX list - even though similar issues with some Hebrew characters, which have been raised here (again) as a result of this thread, make me think there may be a need for a non-script-specific solution or work-around to problems with canonical combining class values. Anyway I'm going to move this discussion over there with a parting shot... Off-list Robert Chilton has pointed out to me the following: 3. A very common occasion of 0F7E occurring with a vowel is in the stack HaUm (orthographic sequence of 0F67 0F71 0F74 0F7E). Because 0F7E is currently assigned a cc of zero, this *same glyph-form* could theoretically be encoded with a total of 6 different character sequences, resulting in 4(!) different sequences following normalization. Properly, all 6 sequences should normalize to the same sequence -- which is indeed the case if 0F82 or 0F83 is used in place of 0F7E. Obviously a major problem, not only for rendering but also for searching and sorting. FOUR different sequences possible *after* normalisation ??? Personally I would have rather seen all Tibetan characters having a CCV of 0 (and all pre-combined Tibetan characters *strongly* deprecated) rather than this.
If someone simply follows the normal rules for writing Tibetan, then characters will be entered in a very predictable order which is far easier to process than the one(s) they can end up in after Unicode normalisation. - Chris Fynn BTW My apologies to anyone who receives two copies of this message.
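Robert Chilton's count for the HaUm stack can be checked mechanically. The following is only an illustrative sketch, not part of the original message: it uses Python's standard `unicodedata` module, the helper name `normalized_orderings` is made up for this example, and the result depends on the Unicode data shipped with the Python build (it assumes U+0F7E still has combining class 0 there, as it did when this thread was written).

```python
import itertools
import unicodedata

BASE = "\u0F67"  # TIBETAN LETTER HA

def normalized_orderings(marks):
    """Return the set of NFD results over every ordering of `marks` after BASE."""
    return {
        unicodedata.normalize("NFD", BASE + "".join(order))
        for order in itertools.permutations(marks)
    }

# The stack as described in the message: U+0F71, U+0F74 and U+0F7E.
with_0f7e = normalized_orderings(["\u0F71", "\u0F74", "\u0F7E"])
# The same stack with U+0F82 used in place of U+0F7E.
with_0f82 = normalized_orderings(["\u0F71", "\u0F74", "\u0F82"])

# Print the combining class of each mark, then the number of distinct results.
for ch in "\u0F71\u0F74\u0F7E\u0F82":
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
print(len(with_0f7e), "distinct normalized sequence(s) with U+0F7E")
print(len(with_0f82), "distinct normalized sequence(s) with U+0F82")
```

A mark with combining class 0 acts as a barrier to canonical reordering, which is why the six orderings containing U+0F7E do not collapse to a single normalized form.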
Re: Major Defect in Combining Classes of Tibetan Vowels
Ken Whistler wrote on 06/25/2003 05:29:59 PM: The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, This is true. except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things. This is not. Ken, I think you're reading John differently than he intended: the Unicode character sequences hiriq, patah and patah, hiriq *are* canonically equivalent, but the requirements for Biblical Hebrew are that alternate visual orders would correspond to different vocalizations, and thus the visual ordering of these does matter semantically, and therefore the encoded orders should *not* be canonically equivalent. The current situation is not optimal for implementations, nor does canonically ordered text follow traditional preferences for spelling order -- that we can agree on. But I think the claims of inadequacy for the representation or rendering of Biblical Hebrew text are overblown. The serious problem is that the writing distinctions that matter cannot currently be reliably represented, as they are not preserved under canonical ordering / normalization. This is all just a rehash of discussions we had on this list back in December, at which time it was acknowledged that this was the case, and that this was a problem. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
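The canonical equivalence Peter describes can be observed directly. This is a minimal sketch using Python's standard `unicodedata` module; the lamed base letter and the helper name `canonically_equivalent` are choices made for this illustration, not something taken from the thread.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED (arbitrary base letter for the demo)
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ
PATAH = "\u05B7"  # HEBREW POINT PATAH

hiriq_first = LAMED + HIRIQ + PATAH
patah_first = LAMED + PATAH + HIRIQ

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two points have different, non-zero combining classes,
# so normalization is free to swap them.
print(unicodedata.combining(HIRIQ), unicodedata.combining(PATAH))
print(canonically_equivalent(hiriq_first, patah_first))
# Normalizing the patah-first spelling yields the hiriq-first spelling.
print(unicodedata.normalize("NFD", patah_first) == hiriq_first)
```

Once text has been normalized, the patah-first spelling no longer exists as a distinct sequence, which is the representation problem raised for Biblical Hebrew.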
Re: Major Defect in Combining Classes of Tibetan Vowels
Michael Everson wrote on 06/25/2003 04:36:20 PM: [ re Biblical Hebrew ] Write it up with glyphs and minimal pairs and people will see the problem, if any. Or propose some solution. (That isn't add duplicate characters.) The only solution that UTC is willing to consider I have already submitted in a proposal (L2/03-195). - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
Jony Rosenne wrote on 06/26/2003 12:16:22 AM: When, in the Bible, one sees two vowels on a given consonant, it isn't so. That's silly. When one sees two vowels on a given consonant in the Bible, it *is* so: the two vowels are written there. It may not correspond to actual phonology, i.e. what is spoken, but as has been made clear on many occasions, Unicode is not encoding phonology, it is encoding text. And in relation to text, your statement is simply wrong. There is one vowel for the consonant one sees, and another vowel for an invisible consonant. The proper way to encode it is to use some code to represent the invisible consonant. Then the problem mentioned below does not arise. The idea of an invisible consonant would amount to encoding a phonological entity, which is the kind of thing that was at one time approved for Khmer (invisible characters representing inherent vowels), but later turned into an albatross, and when I proposed the same thing (invisible inherent vowel) for Syloti Nagri, it was made very clear to me that it would not go down well with UTC. Also, the proposed solution of an invisible consonant would leave unresolved the problem of meteg-vowel ordering distinctions, while the alternate proposal of having meteg and vowels all with a class of 230 solves both problems at once. Two ad hoc solutions (one for multi-vowel ordering, and another for meteg-vowel ordering) must certainly be far less preferable than one motivated solution (having characters with canonical combining classes that are appropriate for the writing behaviours exhibited). I invite people to review the discussions from the unicoRe list from last December, at which time everyone (including you, Jony) was concluding that the solution which I proposed in L2/03-195 was the best solution to pursue. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
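The reason a shared combining class would preserve typed order is that canonical reordering only swaps adjacent marks whose classes differ. The sketch below illustrates that general rule with two Latin accents that already share class 230; it does not show the Hebrew values proposed in L2/03-195, and the choice of characters is purely an assumption for the demo.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

acute_first = "a" + ACUTE + GRAVE
grave_first = "a" + GRAVE + ACUTE

# Both marks carry the same canonical combining class.
same_class = unicodedata.combining(ACUTE) == unicodedata.combining(GRAVE)

# NFD leaves each spelling exactly as typed, so the two stay distinct.
nfd_acute_first = unicodedata.normalize("NFD", acute_first)
nfd_grave_first = unicodedata.normalize("NFD", grave_first)

print(same_class)
print(nfd_acute_first == acute_first, nfd_grave_first == grave_first)
print(nfd_acute_first != nfd_grave_first)
```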
Re: Major Defect in Combining Classes of Tibetan Vowels
John Hudson wrote on 06/25/2003 06:47:44 PM: This is not. The Unicode Standard makes no assumptions or claims about what the phonological or meaning equivalence of hiriq, patah or patah, hiriq is for Biblical Hebrew. But it does make assumptions about the canonical equivalence of the mark orders U+05B4, U+05B7 and U+05B7, U+05B4, unless my understanding of the purpose of combining classes is completely mistaken. Your understanding on this point is correct. My understanding is that any ordering of two marks with different combining classes is canonically equivalent; Yes. further, I understand that some normalisation forms will re-order marks to move marks with lower combining class values closer to the base character. *Every* Unicode normalization form will apply canonical reordering. * Meteg re-ordering is in some respects even more problematic than multi-vowel re-ordering And it is because of meteg-vowel ordering distinctions that the ordering of things like patah + hiriq should not be solved in any way other than the two having the same canonical combining class, because that is exactly what will be needed to deal with meteg-vowel ordering distinctions. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
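Peter's statement that every normalization form applies canonical reordering can be checked for the meteg case. This is a small sketch with Python's standard `unicodedata` module; the lamed base letter is an arbitrary choice for the demo.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED (arbitrary base letter for the demo)
PATAH = "\u05B7"  # HEBREW POINT PATAH
METEG = "\u05BD"  # HEBREW POINT METEG

# Meteg typed before the vowel, the order that some texts need to keep distinct.
meteg_first = LAMED + METEG + PATAH

results = {
    form: unicodedata.normalize(form, meteg_first)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Show the code point sequence each form produces.
for form, text in results.items():
    print(form, [f"U+{ord(c):04X}" for c in text])
```

In all four forms the vowel ends up before the meteg, because meteg has the higher combining class.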
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
It may look silly, but it is correct. What you see are letters according to the writing tradition, which does not include a Yod, and vowels according to the reading tradition which does. There are in the Bible other, more extreme cases. I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. Moreover, everybody who knows his Hebrew Bible knows the Yod is there although it isn't written. The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Thursday, June 26, 2003 9:43 AM To: [EMAIL PROTECTED] Subject: Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew) Jony Rosenne wrote on 06/26/2003 12:16:22 AM: When, in the Bible, one sees two vowels on a given consonant, it isn't so. That's silly. When one sees two vowels on a given consonant in the Bible, it *is* so: the two vowels are written there. It may not correspond to actual phonology, i.e. what is spoken, but as has been made clear on many occasions, Unicode is not encoding phonology, it is encoding text. And in relation to text, your statement is simply wrong. There is one vowel for the consonant one sees, and another vowel for an invisible consonant. The proper way to encode it is to use some code to represent the invisible consonant. Then the problem mentioned below does not arise. 
The idea of an invisible consonant would amount to encoding a phonological entity, which is the kind of thing that was at one time approved for Khmer (invisible characters representing inherent vowels), but later turned into an albatross, and when I proposed the same thing (invisible inherent vowel) for Syloti Nagri, it was made very clear to me that it would not go down well with UTC. Also, the proposed solution of an invisible consonant would leave unresolved the problem of meteg-vowel ordering distinctions, while the alternate proposal of having meteg and vowels all with a class of 230 solves both problems at once. Two ad hoc solutions (one for multi-vowel ordering, and another for meteg-vowel ordering) must certainly be far less preferable than one motivated solution (having characters with canonical combining classes that are appropriate for the writing behaviours exhibited). I invite people to review the discussions from the unicoRe list from last December, at which time everyone (including you, Jony) was concluding that the solution which I proposed in L2/03-195 was the best solution to pursue. - Peter -- - Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
At 04:26 AM 6/26/2003, Jony Rosenne wrote: I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. ZERO WIDTH SPACE would screw up search and sort algorithms, I think, because it is not a control character per se and may not be ignored as desired. I've made some tests using Ken's ZWJ suggestion and, as feared, it messes with the glyph positioning lookups. The results varied slightly between MS RichText clients and InDesign ME, but both displayed marks incorrectly when ZWJ was inserted. I strongly suspect that this is not something that can easily be resolved in the glyph shaping model. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
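At the encoding level, the ZWJ idea John tested works because ZWJ has combining class 0 and therefore stops canonical reordering across it. The sketch below shows only that normalization behaviour, using Python's standard `unicodedata` module; it says nothing about the glyph positioning failures he reports, and the lamed base letter is an arbitrary choice for the demo.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED (arbitrary base letter for the demo)
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ
PATAH = "\u05B7"  # HEBREW POINT PATAH
ZWJ = "\u200D"    # ZERO WIDTH JOINER

plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + ZWJ + HIRIQ

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_guarded = unicodedata.normalize("NFD", guarded)

print(unicodedata.combining(ZWJ))   # combining class of the joiner
print(nfd_plain == plain)           # is the unguarded order preserved?
print(nfd_guarded == guarded)       # is the guarded order preserved?
```

The guarded sequence survives normalization unchanged, but it is no longer canonically equivalent to the unguarded one, so searching and rendering both have to cope with the extra character.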
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
Jony Rosenne wrote on 06/26/2003 06:26:02 AM: It may look silly, but it is correct. What you see are letters according to the writing tradition, which does not include a Yod, and vowels according to the reading tradition which does. I understand that. My point was, you were talking about phonology, but in terms of the text, it was not correct: there *are* multiple vowels on a single consonant. There are in the Bible other, more extreme cases. I'd be interested in whatever info you can provide in that regard. I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. No, that's a terrible solution: a space creates unwanted word boundaries. Moreover, everybody who knows his Hebrew Bible knows the Yod is there although it isn't written. But the point is, how do people encode the text? The yod is not there in the text. How does a publisher encode text in the typesetting process? How do researchers encode the text they want to analyze? Saying everybody knows there's a yod there doesn't provide a solution, particularly given that the researchers know in point of fact that the consonantal text explicitly does not include a yod. The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named. That is a potential solution, though it would have to be *two* additional metegs. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
That may be what you see. Myself, every time I look at it, I see an orphaned Hiriq without a consonant. It is normally placed in between the Lamed and the Mem, to make certain the point isn't missed (a pun). Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Thursday, June 26, 2003 7:09 PM To: [EMAIL PROTECTED] Subject: RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew) Jony Rosenne wrote on 06/26/2003 06:26:02 AM: It may look silly, but it is correct. What you see are letters according to the writing tradition, which does not include a Yod, and vowels according to the reading tradition which does. I understand that. My point was, you were talking about phonology, but in terms of the text, it was not correct: there *are* multiple vowels on a single consonant. There are in the Bible other, more extreme cases. I'd be interested in whatever info you can provide in that regard. I don't think we need any new characters, ZERO WIDTH SPACE would do and it requires no new semantics. No, that's a terrible solution: a space creates unwanted word boundaries. Moreover, everybody who knows his Hebrew Bible knows the Yod is there although it isn't written. But the point is, how do people encode the text? The yod is not there in the text. How does a publisher encode text in the typesetting process? How do researchers encode the text they want to analyze? Saying everybody knows there's a yod there doesn't provide a solution, particularly given that the researchers know in point of fact that the consonantal text explicitly does not include a yod. The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named. That is a potential solution, though it would have to be *two* additional metegs. 
- Peter -- - Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
Christopher John Fynn wrote on 06/21/2003 08:23:17 PM: Any suggestions as to how to create a standardized work around for these incorrect values? Propose new characters, and deprecate the old ones? - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
At 00:56 -0500 2003-06-25, [EMAIL PROTECTED] wrote: Christopher John Fynn wrote on 06/21/2003 08:23:17 PM: Any suggestions as to how to create a standardized work around for these incorrect values? Propose new characters, and deprecate the old ones? Fix the bloody errors, for heaven's sake. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, Jun 25, 2003 at 02:10:44 -0700, Andrew C. West wrote: I've never really understood normalization, but it seems to me that normalising bcuig 0F56, 0F45, 0F74, 0F72, 0F42 to bciug 0F56, 0F45, 0F72, 0F74, 0F42 is wrong as bciug could conceivably be a shorthand abbreviation for a completely different word with a gigu [i] on the first syllable and a shabkyu [u] on the second syllable. Err, as in this particular case one vowel sign is above and the other one is below the stack - i.e. they don't interact spatially - you cannot really distinguish them. ;) SY, Uwe -- [EMAIL PROTECTED] | Zu Grunde kommen http://www.ptc.spbu.ru/~uwe/| Ist zu Grunde gehen
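The reordering Andrew describes is easy to reproduce. This is a minimal sketch using Python's standard `unicodedata` module; the variable names simply follow the transliterations used in the message.

```python
import unicodedata

# Code point sequences exactly as quoted in the message.
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"  # shabkyu [u] typed before gigu [i]
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"  # gigu [i] typed before shabkyu [u]

# Combining classes of the two vowel signs decide which one is moved first.
print(unicodedata.combining("\u0F72"), unicodedata.combining("\u0F74"))

nfc = unicodedata.normalize("NFC", bcuig)
nfd = unicodedata.normalize("NFD", bcuig)
print([f"{ord(c):04X}" for c in nfd])
print(nfc == bciug, nfd == bciug)
```

Whether the lost typing order ever carries a lexical distinction is the open question in this part of the thread; the code only shows that normalization cannot keep it.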
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, 25 Jun 2003 15:05:26 +0400, Valeriy E. Ushakov wrote: Err, as in this particular case one vowel sign is above and the other one is below the stack - i.e. they don't interact spatially - you cannot really distinguish them. ;) I know that the vowel signs do not interact with each other typographically, but what's that got to do with anything ? I'm talking about the logical ordering of the Unicode codepoints used to encode some Tibetan text, not the physical appearance of the glyphs that are used to render that sequence of codepoints. What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. Andrew
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, Jun 25, 2003 at 07:31:51 -0700, Andrew C. West wrote: Err, as in this particular case one vowel sign is above and the other one is below the stack - i.e. they don't interact spatially - you cannot really distinguish them. ;) I know that the vowel signs do not interact with each other typographically, but what's that got to do with anything ? I'm talking about the logical ordering of the Unicode codepoints used to encode some Tibetan text, not the physical appearance of the glyphs that are used to render that sequence of codepoints. What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. And given that the two look identical in writing in the first place, this lexical difference had a chance to originate exactly *where*? You are putting the cart before the horse. Also note that the original question from Chris is about things that do interact spatially. SY, Uwe -- [EMAIL PROTECTED] | Zu Grunde kommen http://www.ptc.spbu.ru/~uwe/| Ist zu Grunde gehen
Re: Major Defect in Combining Classes of Tibetan Vowels
Let me add that this was the case recently for Hebrew (to mention one example). So it is certainly not impossible. But we have enough real work to do that we should do our best to veer from the theoretical. :-) MichKa - Original Message - From: Michael (michka) Kaplan [EMAIL PROTECTED] To: [EMAIL PROTECTED]; Andrew C. West [EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 8:11 AM Subject: Re: Major Defect in Combining Classes of Tibetan Vowels From: Andrew C. West [EMAIL PROTECTED] What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. Do you (or does anyone) have an actual example where this is the case? It may well be true but until someone has a proof there is not really an indication of a specific problem for the UTC to address. The current discussion is like arguing about a color that none of the participants have ever seen. MichKa
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wednesday, June 25, 2003 4:31 PM, Andrew C. West [EMAIL PROTECTED] wrote: On Wed, 25 Jun 2003 15:05:26 +0400, Valeriy E. Ushakov wrote: What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. This is an excellent argument, and that's why the Vietnamese usage of multiple diacritics was studied so that it can preserve the logical ordering of accents on Latin letters. However if the actual rendered text cannot be distinguished, the effective order of diacritics is only important in the mind of the reader but does not exist in the written form. This would be important if there was a need to create a transliteration rule (for example from Tibetan to Latin script). But even in that case, knowledge of the origin language is required, as no transliteration rule works well using only the script information. So transliteration rules are very often context-sensitive. What is important is how a native Tibetan reader would read the grapheme cluster. If the reader reads it as ciu then it is to be interpreted as ciu, and then the logical order is more important than the encoding order, because such a difference does not exist in the actual written script. If I just take the example of the Latin script, a sequence like C, COMBINING CEDILLA, COMBINING ACUTE ACCENT will have a canonical order for the two last diacritics which is not important at the linguistic level if looking at the written script. The canonical order and combining classes exist only BECAUSE the encoding would allow several *equivalent* sequences that no reader would be able to distinguish. When confusion is possible, and the distinction does not exist in the original script before its encoding, there should be a way to unify all these.
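The Latin example above can be verified with Python's standard `unicodedata` module. This is only an illustrative sketch; the variable names are invented for the demo.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA (attached below)
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT (above)

cedilla_first = "C" + CEDILLA + ACUTE
acute_first = "C" + ACUTE + CEDILLA

# Both spellings decompose to the same canonical order (cedilla, then acute)...
nfd_a = unicodedata.normalize("NFD", cedilla_first)
nfd_b = unicodedata.normalize("NFD", acute_first)

# ...and compose to the same precomposed letter.
nfc_a = unicodedata.normalize("NFC", cedilla_first)
nfc_b = unicodedata.normalize("NFC", acute_first)

print(nfd_a == nfd_b, nfc_a == nfc_b)
print(unicodedata.name(nfc_a))
```

Here the two typing orders produce the same visible letter and nobody reads them differently, which is the situation the combining classes were designed for.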
So even if the canonical ordering of Tibetan vowel signs is not logical, as long as it produces the same written text, this is not a problem, and there is no more loss of semantics than in the original script. So if the Tibetan script cannot make a distinction between ciu and cui, this is *not* a Unicode defect. This confusion already exists in the original script, and there is no loss of semantics in the Unicode encoding when compared to the actual written script. Let's not make a problem by adding new semantics to the Tibetan language (such as creating a distinction between ciu and cui) *because* this seems /possible/ in Unicode. If we respect a script or language, we must not tolerate such artificial distinctions. It's true that the canonical ordering should match the logical ordering, but I think that there are a lot of exceptions, notably within Brahmic scripts with disjoint letters, or in Thai (encoded according to a previous existing standard TIS620 which used the visual ordering), or even in many Hebrew or Arabic texts (sometimes encoded also with a visual ordering, and requiring some tools to reverse the encoding according to a preferred order, because this cannot be decided without an out-of-band specification of the actual ordering used in the text)... So if one wants to really handle the logical ordering, it's perfectly possible to exchange the i and u in cui without affecting the canonical equivalence and without changing the semantics of the original Tibetan text. Canonical ordering is only needed to unify equivalences, but is not intended to sort distinct strings (this is not part of the Unicode encoding, but part of a collation algorithm like UCA, tailored appropriately for each language on top of the default UCA order for the script). A correct UCA collation for the Tibetan script can perfectly well be created, and then tailored for the Tibetan language to reorder the vowel signs.
(This is not more complicated than handling a French reordering for accents). This just requires a multi-level sort algorithm, where u and i would have the same collation keys at level N, and could be reordered using a French-style reordering of vowel signs for keywords or grapheme clusters at level N+1 or N+2.
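The multi-level idea can be sketched without a real collation library. The following is not the UCA and does not use real collation weights; it is a toy two-level comparison key, under the assumption that the first level should treat canonically equivalent spellings as equal and a later level may still look at the order in which the vowel signs were typed. The function name `two_level_key` is hypothetical.

```python
import unicodedata

def two_level_key(text):
    """Toy sort key: level 1 is the canonical (NFD) form, so equivalent
    spellings tie; level 2 is the string as typed, used only to break ties."""
    return (unicodedata.normalize("NFD", text), text)

ca = "\u0F45"               # bare consonant
cui = "\u0F45\u0F74\u0F72"  # u typed before i
ciu = "\u0F45\u0F72\u0F74"  # i typed before u

ordered = sorted([cui, ca, ciu], key=two_level_key)

# cui and ciu tie at level 1 and are separated only at level 2.
print(two_level_key(cui)[0] == two_level_key(ciu)[0])
print([[f"{ord(c):04X}" for c in word] for word in ordered])
```

A production implementation would use a tailored collation table rather than raw code points, but the layering is the same: the typed order of the vowel signs only matters once everything more significant has compared equal.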
Re: Major Defect in Combining Classes of Tibetan Vowels
At 08:11 -0700 2003-06-25, Michael \(michka\) Kaplan wrote: Do you (or does anyone) have an actual example where this is the case? It may well be true but until someone has a proof there is not really an indication of a specific problem for the UTC to address. A document showing what happens in Case A and what happens in Case B with actual glyphs would be helpful. The current discussion is like arguing about a color that none of the participants have ever seen. Indeed. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
Michael, that is like saying move the bloody character or remove the bloody character. Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 03:14 Subject: Re: Major Defect in Combining Classes of Tibetan Vowels At 00:56 -0500 2003-06-25, [EMAIL PROTECTED] wrote: Christopher John Fynn wrote on 06/21/2003 08:23:17 PM: Any suggestions as to how to create a standardized work around for these incorrect values? Propose new characters, and deprecate the old ones? Fix the bloody errors, for heaven's sake. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
At 8:11 AM -0700 6/25/03, Michael (michka) Kaplan wrote: From: Andrew C. West [EMAIL PROTECTED] What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. Do you (or does anyone) have an actual example where this is the case? It may well be true but until someone has a proof there is not really an indication of a specific problem for the UTC to address. The current discussion is like arguing about a color that none of the participants have ever seen. A list of common contractions would help here. I've seen at least one such published collection in the past which listed common contractions found in U-Med running text. However I don't have it with me. Does anyone on-line have access to a document like this? Peter
Re: Major Defect in Combining Classes of Tibetan Vowels
this was the case Someone might misread your statement. We did not change the combining classes for Hebrew. Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Michael (michka) Kaplan [EMAIL PROTECTED] To: [EMAIL PROTECTED]; Andrew C. West [EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 08:55 Subject: Re: Major Defect in Combining Classes of Tibetan Vowels Let me add that this was the case recently for Hebrew (to mention one example). So it is certainly not impossible. But we have enough real work to do that we should do our best to veer from the theoretical. :-) MichKa - Original Message - From: Michael (michka) Kaplan [EMAIL PROTECTED] To: [EMAIL PROTECTED]; Andrew C. West [EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 8:11 AM Subject: Re: Major Defect in Combining Classes of Tibetan Vowels From: Andrew C. West [EMAIL PROTECTED] What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. Do you (or does anyone) have an actual example where this is the case? It may well be true but until someone has a proof there is not really an indication of a specific problem for the UTC to address. The current discussion is like arguing about a color that none of the participants have ever seen. MichKa
Re: Major Defect in Combining Classes of Tibetan Vowels
At 09:13 -0700 2003-06-25, Mark Davis wrote: Michael, that is like saying move the bloody character or remove the bloody character. Fix the bloody errors, for heaven's sake. You'd like to think so. But Deprecate TIBETAN THINGY and add TIBETAN THINGY BIS so that we can fix the problem is utterly ridiculous. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wednesday, June 25, 2003 6:13 PM, Mark Davis [EMAIL PROTECTED] wrote: Michael Everson wrote: [EMAIL PROTECTED] wrote: Christopher John Fynn wrote: Any suggestions as to how to create a standardized work around for these incorrect values? Propose new characters, and deprecate the old ones? Fix the bloody errors, for heaven's sake. Michael, that is like saying move the bloody character or remove the bloody character. If there are real distinct semantics that were abusively unified by the canonicalization, the only safe way would be to create a second character that would have another combining class than the existing one, to be used when lexical distinction from the most common use is necessary. So the added character for the modified vowel signs would have the same representative glyph, but would have the additional semantic contraction (clearly indicated in their name). This does not break the existing encoding of most texts, but allows a specific usage for contractions where the existing canonical equivalences would be inappropriate. -- Philippe.
Re: Major Defect in Combining Classes of Tibetan Vowels
At 18:26 +0100 2003-06-25, Michael Everson wrote: You'd like to think so. But Deprecate TIBETAN THINGY and add TIBETAN THINGY BIS so that we can fix the problem is utterly ridiculous. And by that I mean, given the TWO standards Unicode and ISO/IEC 10646, adding duplicate characters is frowned upon, so it should be less preferable than UTC fixing broken classes if they really are broken. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
From: Michael (michka) Kaplan [EMAIL PROTECTED] From: Michael (michka) Kaplan [EMAIL PROTECTED] From: Andrew C. West [EMAIL PROTECTED] What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. Do you (or does anyone) have an actual example where this is the case? It may well be true but until someone has a proof there is not really an indication of a specific problem for the UTC to address. Let me add that this was the case recently for Hebrew (to mention one example). So it is certainly not impossible. But we have enough real work to do that we should do our best to veer from the theoretical. :-) Another option would be, for the encoding of contractions, to encode an invisible letter (with combining class 0) that would prevent the reordering of combining characters. To be valid with the usage of Tibetan vowels, this character should be treated as a base consonant, and then it would explicitly form a ligature with the previous encoding cluster, to create the actual grapheme cluster. Why not use a halant (virama) character in that case to encode these contractions (which would be implicitly obvious for a native Tibetan reader of a rendered or printed text, but explicit for a computer program such as a generic indexing engine) ? -- Philippe.
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wednesday, June 25, 2003 8:14 PM, Peter Lofting [EMAIL PROTECTED] wrote: At 7:41 PM +0200 6/25/03, Philippe Verdy wrote: If there are real distinct semantics that were abusively unified by the canonicalization, the only safe way would be to create a second character that would have another combining class than the existing one, to be used when lexical distinction from the most common use is necessary. So the added character for the modified vowel signs would have the same representative glyph, but would have the additional semantic contraction (clearly indicated in their name). This does not break the existing encoding of most texts, but allows a specific usage for contractions where the existing canonical equivalences would be inappropriate. How do you envisage this getting into the data? Often in Tibetan data capture, operators are keying in the appearance of a text and do not know what a stack represents. So the data then requires expert review after input to verify and assign the semantic representation. This is not a major problem, in fact this occurs every day in all scripts: there are correctors, and some dictionary-based corrections that may be used to help correct the incorrectly or ambiguously encoded string... This is true even for all Latin-based languages, where the incorrect accents are used, or missing, and only native readers will be able to see the incorrect interpretation of a grapheme cluster, using their own knowledge of the language when the error (introduced by some intermediate technical constraint such as a past missing standard) appears. I still think that the contraction problem has a limited impact, which does not affect the normal written form of the Tibetan language which clearly uses a single interpretation. 
If both interpretations of a grapheme cluster are needed, then we should keep the encoding of the existing characters for the most common interpretation (without the contraction semantics), and assign a variant specially to allow encoding the other interpretation or reading of the grapheme cluster. Legacy encoded text may still contain such ambiguous encodings that will look erroneous with the new updated standard, but this offers a way to correct the encoded text later, by looking at occurrences of such ambiguous sequences, and letting actual native readers correct these interpretations, if the correction is absolutely required for some text processing. I do think that most already encoded text will not need such correction, if the encoding is just a way to transport a text which is only intended to be rendered or printed, but not used with automated lexical analysis. And even in that case, if the encoding ambiguity is well documented in a revision of the standard, there is a possibility to enhance tools like automated full-text search engines to search for both encodings of the character, based on their actually identical glyphic representation. -- Philippe.
Re: Major Defect in Combining Classes of Tibetan Vowels
At 7:41 PM +0200 6/25/03, Philippe Verdy wrote: If there are real distinct semantics that were abusively unified by the canonicalization, the only safe way would be to create a second character that would have another combining class than the existing one, to be used when lexical distinction from the most common use is necessary. So the added character for the modified vowel signs would have the same representative glyph, but would have the additional semantic contraction (clearly indicated in their name). This does not break the existing encoding of most texts, but allows a specific usage for contractions where the existing canonical equivalences would be inappropriate. How do you envisage this getting into the data? Often in Tibetan data capture, operators are keying in the appearance of a text and do not know what a stack represents. So the data then requires expert review after input to verify and assign the semantic representation. Peter
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, Jun 25, 2003 at 09:08:10 -0700, Peter Lofting wrote: A list of common contractions would help here. I've seen at least one such published collection in the past which listed common contractions found in U-Med running text. However I don't have it with me. Does anyone on-line have access to a document like this? A sample list of dbu can contractions from Schmidt grammar: http://snark.ptc.spbu.ru/~uwe/tibex/contractions/contractions.html SY, Uwe -- [EMAIL PROTECTED] | Zu Grunde kommen http://www.ptc.spbu.ru/~uwe/| Ist zu Grunde gehen
Re: Major Defect in Combining Classes of Tibetan Vowels
Let me remind you: Talk on this list doesn't mean that the issue is automatically brought up for UTC deliberation. If no documents are formally submitted, nothing will happen. After all the discussion of Tibetan, if anyone has a serious concrete proposal for a specific change to the Unicode Standard, please write it up in detail and submit it. If you develop such a document you can comment via our reporting page here: http://www.unicode.org/reporting.html If the document is more than plain-text, you can arrange to send it by talking with me off-list, and I will see that it is properly registered for UTC discussion. Rick
Re: Major Defect in Combining Classes of Tibetan Vowels
At 12:15 -0700 2003-06-25, John Hudson wrote: In this case, any existing normalisation for Hebrew is already broken -- in the sense of destroying Biblical Hebrew text -- but still the argument from the UTC seems to be that even broken implementations -- broken because the standard is broken -- must not be broken. That seems very short-sighted indeed. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
Rick McGowan posted and was answered by John Hudson: If there isn't a visual difference here, how could there be a lexical difference? Imagine the age before computers. All you have to go on is what's on the page. There isn't an inherent order in those elements; they could have been written by the scribe in any order. If they appear the same, you can't assign different meanings -- except by some extra-syllabic informational context... right? On the page, you would know -- or hopefully know -- from context. But a search engine or a sorting algorithm looking at the characters presumably needs to know the difference without additional context, hence the character ordering is important. I think such distinctions are more than one should expect from a standard search engine or from simple sortation. To move to French, for example, I would not expect to be able to tell whether the abbreviation M. in M. Bouteillier stands for Monsieur or a name like Marcel. How do you know except from context whether med. stands for medical or medieval? In a company name such as Perrault & Lavigne, should & sort according to default Unicode or as "and" or as "et"? Should it be found from searches on "and", "et", "und" and so forth? This is the business of application protocol and application utilities. Indication of proper expansion of abbreviations for sorting and searching seems to me to be beyond what Unicode tries to do and what it can do reasonably. If lexical forms in any language have variant meanings, then they are not for Unicode to distinguish except occasionally when Unicode provides identical glyphs that represent characters with very different properties such as ! (U+0021) for punctuation and ǃ (U+01C3) for a Zulu click in the hope, probably vain, that people in general will recognize the difference. Jim Allan
Re: Major Defect in Combining Classes of Tibetan Vowels
Michael Kaplan wrote on 06/25/2003 10:55:47 AM: Let me add that this was the case recently for Hebrew (to mention one example). So it is certainly not impossible. The Hebrew issue is different: that involves things that *are* visually distinct, and that distinction cannot be represented in a reliable manner. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
John Hudson scripsit: I'm not saying I like this, but this is how it has been explained to me with regard to the very clearly erroneous Hebrew mark combining classes which demonstrably break Biblical Hebrew text. In this case, any existing normalisation for Hebrew is already broken -- in the sense of destroying Biblical Hebrew text -- but still the argument from the UTC seems to be that even broken implementations -- broken because the standard is broken -- must not be broken. I don't understand how the current implementation breaks BH text. At worst, normalization may put various combining marks in a non-traditional order, but all alternative orders are canonically equivalent anyway, and no (ordinary) Unicode process should depend on any specific order. -- Not to perambulate John Cowan [EMAIL PROTECTED] the corridors http://www.reutershealth.com during the hours of repose http://www.ccil.org/~cowan in the boots of ascension. --Sign in Austrian ski-resort hotel
Re: Major Defect in Combining Classes of Tibetan Vowels
Andrew C. West wrote on 06/25/2003 09:31:51 AM: What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45, 0F72, 0F74 should be rendered identically, the logical ordering of the codepoints representing the vowels may represent lexical differences that would be lost during the process of normalisation. How can things that are visually indistinguishable be lexically different? We don't encode the phonological distinctions between homographs; we encode text. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
Thank you for [indirectly] making my point for me. I am saying that if someone has an issue that *does* make a difference then they should bring it up. Otherwise, I say that a difference that makes no difference, makes no difference. And we can move on to actual problems. :-) MichKa - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 1:08 PM Subject: Re: Major Defect in Combining Classes of Tibetan Vowels Michael Kaplan wrote on 06/25/2003 10:55:47 AM: Let me add that this was the case recently for Hebrew (to mention one example). So it is certainly not impossible. The Hebrew issue is different: that involves things that *are* visually distinct, and that distinction cannot be represented in a reliable manner. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Major Defect in Combining Classes of Tibetan Vowels
Peter asked: How can things that are visually indistinguishable be lexically different? chat (en) chat (fr) We don't encode the phonological distinctions between homographs; we encode text. But I agree that we encode text. Both words above, which are *lexically* distinct, would have the same encoded character representation, and no amount of inspection of the encoding per se is going to distinguish them. --Ken
Re: Major Defect in Combining Classes of Tibetan Vowels
At 18:26 +0100 2003-06-25, Michael Everson wrote: You'd like to think so. But "Deprecate TIBETAN THINGY and add TIBETAN THINGY BIS so that we can fix the problem" is utterly ridiculous. And by that I mean, given the TWO standards Unicode and ISO/IEC 10646, adding duplicate characters is frowned upon, so it should be less preferable than UTC fixing broken classes if they really are broken. This neglects the fact that for the Unicode Standard (although not ISO/IEC 10646, for which combining classes and normalization are irrelevant), destabilization of normalization is as serious a business as adding duplicate characters. That is why Mark chimed in earlier with: Michael, that is like saying "move the bloody character" or "remove the bloody character". This issue should not be framed as if it were one where character identity is the higher glory, enshrined in the superior standard, so that to fix a problem, the lesser standard, the Unicode Standard, should simply relent on its own stability guarantees. Instead, the two standards have synchronized guarantees regarding character identity, but the Unicode Standard has its own scope beyond 10646, and in that realm it must respect its own guarantees of stability, because the users of that standard depend on them. In any case, even with the clarification that there are instances, in Tibetan contractions, of cooccurrence of shabkyu and vowels above on the same consonant stack, I am failing to see how the particular combining class assignment for U+0F74 is creating any serious problem for the representation of such Tibetan data. --Ken
Re: Major Defect in Combining Classes of Tibetan Vowels
At 01:15 PM 6/25/2003, John Cowan wrote: I don't understand how the current implementation breaks BH text. At worst, normalization may put various combining marks in a non-traditional order, but all alternative orders are canonically equivalent anyway, and no (ordinary) Unicode process should depend on any specific order. In Biblical Hebrew, it is possible for more than one vowel to be attached to a single consonant. This means that it is very important to maintain the ordering of vowels applied to a single consonant. The Unicode Standard assigns an individual combining class to every vowel, meaning that NFC normalisation may re-order vowels on a consonant. This is not simply 'non-traditional' but results in incorrect rendering and a different vocalisation of the text. The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things. In order to correctly encode and render the Biblical Hebrew text, it is necessary to either a) never use normalisation routines that re-order marks (which is beyond the control of document authors), or b) re-classify the existing Hebrew marks so that all vowels are in a single class and will not be re-ordered during normalisation, or c) encode new marks for Biblical Hebrew with all vowels in a single class. There are a few other desirable changes to the combining class assignments for some Hebrew accents, which make rendering easier and are more linguistically logical, but the vowels are the most problematic. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
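The re-ordering described in this message can be checked directly. What follows is only a minimal sketch, assuming Python's standard unicodedata module (whose data follows whichever Unicode version ships with the interpreter); the variable and helper names are illustrative and not taken from the thread. It looks up the canonical combining classes of hiriq (U+05B4) and patah (U+05B7) and normalises the sequence lamed, patah, hiriq, final mem.

```python
import unicodedata

LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"

def codepoints(s):
    """Return the code points of s as four-digit hex strings."""
    return [f"{ord(c):04X}" for c in s]

# Canonical combining classes of the two vowel points.
ccc = {unicodedata.name(c): unicodedata.combining(c) for c in (HIRIQ, PATAH)}

# Sequence as the text has it: lamed, patah, hiriq, final mem.
written = LAMED + PATAH + HIRIQ + FINAL_MEM

# NFC applies canonical ordering: the mark with the lower class moves first.
normalized = unicodedata.normalize("NFC", written)

print(ccc)
print(codepoints(written))
print(codepoints(normalized))
```

Because the two points carry different non-zero classes, the normalised string has hiriq before patah, which is the re-ordering objected to above.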
Re: Major Defect in Combining Classes of Tibetan Vowels
At 14:20 -0700 2003-06-25, John Hudson wrote: John, Write it up with glyphs and minimal pairs and people will see the problem, if any. Or propose some solution. (That isn't "add duplicate characters".) In Biblical Hebrew, it is possible for more than one vowel to be attached to a single consonant. This means that it is very important to maintain the ordering of vowels applied to a single consonant. The Unicode Standard assigns an individual combining class to every vowel, meaning that NFC normalisation may re-order vowels on a consonant. This is not simply 'non-traditional' but results in incorrect rendering and a different vocalisation of the text. The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things. In order to correctly encode and render the Biblical Hebrew text, it is necessary to either a) never use normalisation routines that re-order marks (which is beyond the control of document authors), or b) re-classify the existing Hebrew marks so that all vowels are in a single class and will not be re-ordered during normalisation, or c) encode new marks for Biblical Hebrew with all vowels in a single class. There are a few other desirable changes to the combining class assignments for some Hebrew accents, which make rendering easier and are more linguistically logical, but the vowels are the most problematic. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Major Defect in Combining Classes of Tibetan Vowels
John Hudson wrote: In Biblical Hebrew, it is possible for more than one vowel to be attached to a single consonant. This means that it is very important to maintain the ordering of vowels applied to a single consonant. The Unicode Standard assigns an individual combining class to every vowel, meaning that NFC normalisation may re-order vowels on a consonant. This is true. This is not simply 'non-traditional' but results in incorrect rendering and a different vocalisation of the text. I don't think this is true. First, the intent of the (admittedly problematical) fixed position combining classes was that the position of the relevant marks, including the relevant Hebrew points, was fixed with respect to the consonant base letter, so that application of one would not impact the rendering of application of another. Unlike the generic above and below combining classes, the general inside-out positioning rule would not apply to sequences of fixed position marks. It may be more *difficult* for applications to do correct rendering, but there was never any intention in the standard that I know of that a sequence hiriq, patah would render differently than a sequence patah, hiriq. And never any intent that it would represent a different vocalisation of the text. The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, This is true. except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things. This is not. The Unicode Standard makes no assumptions or claims about what the phonological or meaning equivalence of hiriq, patah or patah, hiriq is for Biblical Hebrew. The fact that traditional Biblical Hebrew spelling prefers one order of representation and canonically ordered Unicode text specifies the opposite order may be a problem for implementations, but that problem does not extend to the claims that John is making here. 
In order to correctly encode and render the Biblical Hebrew text, it is necessary to either a) never use normalisation routines that re-order marks (which is beyond the control of document authors), or b) re-classify the existing Hebrew marks so that all vowels are in a single class and will not be re-ordered during normalisation, or c) encode new marks for Biblical Hebrew with all vowels in a single class. I don't think these conclusions follow from the current situation. Such changes are certainly not necessary in order to *render* Biblical Hebrew text correctly, nor to accurately represent the content of Biblical Hebrew text. The current situation is not optimal for implementations, nor does canonically ordered text follow traditional preferences for spelling order -- that we can agree on. But I think the claims of inadequacy for the representation or rendering of Biblical Hebrew text are overblown. --Ken
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, 25 Jun 2003 19:47:26 +0400, Valeriy E. Ushakov wrote: And given that the two look identical in writing in the first place, this lexical difference had a chance to originate exactly *where*? You are putting the cart before the horse. Well, unless the text has been scanned with OCR, a human user will have to enter Tibetan text manually, and if the user encounters a base consonant with two different vowel signs joined to it, they will have to make a choice as to which order the vowel signs are entered. For example, if the word bcuig (with the letter CA carrying both a shabkyu [u] and gigu [i] sign) is encountered in a text that is being transcribed into electronic form, and the user recognises it from its context as a contraction for bcu gcig (eleven), then it would be natural to enter b-c-u-i-g 0F56, 0F45, 0F74, 0F72, 0F42. On the other hand, if a syllable (tsheg bar) comprising the base consonant GA with a shabkyu [u] sign below and a gigu [i] sign above is encountered (this is a plausible but hypothetical contraction), and the user recognises this from its context as a contraction for the word gi gu (the name for the I vowel sign), then it would be natural to enter g-i-u 0F42, 0F72, 0F74, even though when writing it by hand the shabkyu would be written before the gigu (calligraphic order does not necessarily equate to logical order). In the one case a base consonant plus shabkyu and gigu is entered as 0FXX, 0F74, 0F72, in the other case as 0FXX, 0F72, 0F74. Unfortunately it is precisely at this point that my argument starts to crumble, and I am forced to throw in the towel, and admit defeat. The key question is, if 0F56, 0F45, 0F74, 0F72, 0F42 (bcuig) gets normalised to 0F56, 0F45, 0F72, 0F74, 0F42 (bciug), then so what? Well, so nothing, unless 0F56, 0F45, 0F74, 0F72, 0F42 (bcuig) is a shared contraction for two different words, and the order of the U and I distinguishes what the contraction is. 
As Tibetan shorthand abbreviations are an informal, non-standardised method of abbreviating words, it is hypothetically possible that two different scribes could come up with the same contracted form for two differently spelled words, but I very much doubt that this would ever happen in reality. If I do find such a case, I will certainly let this list know, but in the meanwhile I agree that perhaps it would be more productive to return to Chris's original question, rather than travel too far down this detour, scenic though it is. Regards, Andrew
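The bcuig case above can be reproduced in a few lines. This is a rough sketch only, assuming Python's standard unicodedata module; the variable names are illustrative. It normalises the b-c-u-i-g and g-i-u sequences from the message and shows that the two orders of shabkyu (U+0F74) and gigu (U+0F72) on the letter CA collapse to one string.

```python
import unicodedata

def codepoints(s):
    """Return the code points of s as a space-separated hex string."""
    return " ".join(f"{ord(c):04X}" for c in s)

def nfc(s):
    return unicodedata.normalize("NFC", s)

bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"  # b-c-u-i-g as keyed
giu = "\u0F42\u0F72\u0F74"                # g-i-u as keyed

# Combining classes that drive the re-ordering.
ccc_gigu = unicodedata.combining("\u0F72")     # vowel sign I
ccc_shabkyu = unicodedata.combining("\u0F74")  # vowel sign U

print(ccc_gigu, ccc_shabkyu)
print(codepoints(bcuig), "->", codepoints(nfc(bcuig)))
print(codepoints(giu), "->", codepoints(nfc(giu)))

# After normalisation the two keying orders on CA are indistinguishable.
same_after_nfc = nfc("\u0F45\u0F74\u0F72") == nfc("\u0F45\u0F72\u0F74")
print(same_after_nfc)
```

The g-i-u sequence is already in canonical order and comes back unchanged, while b-c-u-i-g comes back as b-c-i-u-g, so any distinction carried by the keying order would not survive normalisation.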
Re: Major Defect in Combining Classes of Tibetan Vowels
On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote: Peter asked: How can things that are visually indistinguishable be lexically different? chat (en) chat (fr) And if Unicode reordered vowels in front of consonants, then we wouldn't be able to distinguish : chat (en) chat (fr) acht (de) Andrew
Re: Major Defect in Combining Classes of Tibetan Vowels
At 03:29 PM 6/25/2003, Kenneth Whistler wrote: This is not simply 'non-traditional' but results in incorrect rendering and a different vocalisation of the text. I don't think this is true. First, the intent of the (admittedly problematical) fixed position combining classes was that the position of the relevant marks, including the relevant Hebrew points, was fixed with respect to the consonant base letter, so that application of one would not impact the rendering of application of another. This idea of Hebrew vowels as 'fixed' marks is problematical, because in Biblical Hebrew they are not fixed: they move relative to additional marks (other vowels or cantillation marks). It may be more *difficult* for applications to do correct rendering, but there was never any intention in the standard that I know of that a sequence hiriq, patah would render differently than a sequence patah, hiriq. Yes, this is what I am saying is wrong: hiriq, patah *should* render differently from patah, hiriq. This example is particularly important, because it occurs in the spelling of yerushalaim, the Masoretic approximation of yerushalayim. Correct rendering requires that the hiriq follows the patah, and not vice versa. And never any intent that it would represent a different vocalisation of the text. Fair enough for modern Hebrew. Fair enough for phonetically accurate Hebrew. Not good enough for Biblical Hebrew in which vocalisation reflects Masoretic pronunciation applied to ancient consonant structures. The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, This is true. except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things. This is not. The Unicode Standard makes no assumptions or claims about what the phonological or meaning equivalence of hiriq, patah or patah, hiriq is for Biblical Hebrew. 
But it does make assumptions about the canonical equivalence of the mark orders U+05B4, U+05B7 and U+05B7, U+05B4, unless my understanding of the purpose of combining classes is completely mistaken. My understanding is that any ordering of two marks with different combining classes is canonically equivalent; further, I understand that some normalisation forms will re-order marks to move marks with lower combining class values closer to the base character. If the sequence lamed, patah, hiriq, final mem is what the text says, normalisation that re-orders the sequence as lamed, hiriq, patah, final mem is erroneous. The fact that traditional Biblical Hebrew spelling prefers one order of representation and canonically ordered Unicode text specifies the opposite order may be a problem for implementations, but that problem does not extend to the claims that John is making here. This isn't a problem for implementations. This is a problem of Unicode canonical ordering re-ordering marks whose order is lexically significant. The fact that, in some cases, the canonical ordering also cannot be rendered with existing implementations simply makes the problem visually obvious. In order to correctly encode and render the Biblical Hebrew text, it is necessary to either a) never use normalisation routines that re-order marks (which is beyond the control of document authors), or b) re-classify the existing Hebrew marks so that all vowels are in a single class and will not be re-ordered during normalisation, or c) encode new marks for Biblical Hebrew with all vowels in a single class. I don't think these conclusions follow from the current situation. Such changes are certainly not necessary in order to *render* Biblical Hebrew text correctly, nor to accurately represent the content of Biblical Hebrew text. They are necessary to render Biblical Hebrew text correctly using current font and layout engine technologies. 
These technologies work perfectly for Biblical Hebrew so long as Unicode canonical ordering is ignored. I think there is very little impetus to change or develop new implementations to take into account what strikes most of those involved with Biblical Hebrew text processing as an error in Unicode. The current situation is not optimal for implementations, nor does canonically ordered text follow traditional preferences for spelling order -- that we can agree on. But I think the claims of inadequacy for the representation or rendering of Biblical Hebrew text are overblown. I've spent nine months working on Biblical Hebrew rendering for the major user community (the Society of Biblical Literature and their Font Foundation partners), and their take on this is that a) they want a solution that works with today's technology, and b) they will avoid Unicode canonical ordering like the plague and use custom normalisations instead. When we conducted normalisation tests, switching from Unicode normalisation of to a custom normalisation that does not re-order vowels or meteg*, we
Re: Major Defect in Combining Classes of Tibetan Vowels
On Thursday, June 26, 2003 1:04 AM, Andrew C. West [EMAIL PROTECTED] wrote: On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote: Peter asked: How can things that are visually indistinguishable be lexically different? chat (en) chat (fr) And if Unicode reordered vowels in front of consonants, then we wouldn't be able to distinguish : chat (en) chat (fr) acht (de) Andrew Such distinction by language is futile: you try to add a language-specific lexical meaning that simply does not exist in Unicode, which only standardizes the *script* so that it *can* be rendered correctly independently of the actual language... So you need to assume a unique language when interpreting an encoded string, but this is out of scope of Unicode (which at best will define language-dependent character properties, but not language-dependent canonical equivalences). When Unicode defines such canonical equivalence, the contract must be *only* based on the rendered text: if the text is rendered identically so that it becomes impossible to determine which order was used to encode it in abstract character sequences, then all these orders should be made canonically equivalent. The only exception is for abstract character properties, which MUST be language independent for normative properties (the only exception is character transformations such as case mappings, which change the semantic of the text) but need sometimes to be distinct for correct processing in the rendering process (for example the Mathematics Symbol category and the Letter category, as they influence the layout in actual renderers, notably for the choice of font styles or point sizes or alignment, or extraction of entities sharing a common set of properties, such as breaking rules that also influence the correct rendering of text in variable display environments with different capabilities). 
Labelling the text with extra information such as language or word semantics or phonetic values is not part of the Unicode standard. The Unicode standard stops at the point where a text *can* be rendered with its original semantics, and this excludes all phonological, phonetical, or logical ordering analysis that can be made equivalently on the rendered text or on the encoded text. -- Philippe.
Re: Major Defect in Combining Classes of Tibetan Vowels
Valeriy E. Ushakov [EMAIL PROTECTED] wrote: A sample list of dbu can contractions from Schmidt grammar: http://snark.ptc.spbu.ru/~uwe/tibex/contractions/contractions.html When these combinations are written in dbu-can script, as they are here, the problem may not look too bad. However, in semi-cursive and cursive forms of Tibetan script subjoined vowels are completely connected with the preceding consonants, and the combination of consonant(s) + subjoined vowel(s) needs to be implemented in a font as a single ligature, while the above-headline vowel(s) can still be a separate combining glyph. Hence it is important to have subjoined vowel signs ordered before those which can occur above the stack. - Chris
Re: Major Defect in Combining Classes of Tibetan Vowels
John Hudson wrote: This idea of Hebrew vowels as 'fixed' marks is problematical, because in Biblical Hebrew they are not fixed: they move relative to additional marks (other vowels or cantillation marks). It may be more *difficult* for applications to do correct rendering, but there was never any intention in the standard that I know of that a sequence hiriq, patah would render differently than a sequence patah, hiriq. Yes, this is what I am saying is wrong: hiriq, patah *should* render differently from patah, hiriq. This example is particularly important, because it occurs in the spelling of yerushalaim, the Masoretic approximation of yerushalayim. Correct rendering requires that the hiriq follows the patah, and not vice versa. Understood. See my separate response on the Biblical Hebrew thread. They are necessary to render Biblical Hebrew text correctly using current font and layout engine technologies. These technologies work perfectly for Biblical Hebrew so long as Unicode canonical ordering is ignored. I think there is very little impetus to change or develop new implementations to take into account what strikes most of those involved with Biblical Hebrew text processing as an error in Unicode. so long as Unicode canonical ordering is ignored. But as you and Peter point out, you cannot actually ignore canonical ordering, since in the Internet context it is outside of the end user's control. Once text escapes your own system for interchange, it may be subject to normalization, and you are kaputt. As stated, this is also turning into a typical--dare I say, religious-- confrontation of I'm right and you're wrong with no compromise in prospect and people getting ready to shoot themselves in the foot to prove they are right. 
You say there is little impetus to change or develop new implementations, and yet the very solutions being proposed, e.g., by Peter, would force reencoding of all the Biblical Hebrew text to work at all, and would, ipso facto, require new implementations and new fonts to work right. The alternative I suggested, of agreeing on a text representational convention of vowel, ZWJ, vowel for those instances of sequences which should not reorder could be implemented *now* with existing characters, and only minor extensions to the fonts and to keyboard methods. Any existing corpus could be updated en masse (and more easily than switching over to Peter's scheme), or incrementally, as appropriate. The other alternative that some seem to prefer: just change the combining classes and be done with it -- is *not* going to happen. It would fly in the face of politically committed stability guarantees by the UTC and required by the IETF and W3C. An inconvenience for Biblical Hebrew implementations is not going to outweigh that, for any of the committees involved. And even, if by some miracle, it *were* to happen, you would also be awaiting the rollout of new implementations, since you'd have to wait through the chaotic transition while everyone updated their normalization algorithms. Just picking up the marbles and going home isn't an option, either. As you indicate, so long as Unicode canonical ordering is ignored the existing layout technologies work just fine. So address the problem with an appropriate fix. Insert a ZWJ (for instance) at the point where the canonical reordering needs to be blocked on a vowel sequence, and you are then in a situation where even though you are not ignoring canonical ordering (which in distributed systems you cannot), you end up preserving the order you need, anyway. 
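The effect of the proposed convention can be checked with Python's standard unicodedata module. What follows is only a minimal sketch, not anything taken from the thread: the lamed base letter is chosen to match the yerushalaim example, the variable names are made up for illustration, and NFD is used although every Unicode normalization form applies the same canonical reordering. It shows that patah, hiriq is reordered to hiriq, patah, while patah, ZWJ, hiriq keeps its order because ZWJ has combining class 0.

```python
import unicodedata

# Characters used in the illustration (lamed as the base consonant is an assumption).
LAMED = "\u05DC"   # HEBREW LETTER LAMED, combining class 0
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, combining class 14
PATAH = "\u05B7"   # HEBREW POINT PATAH, combining class 17
ZWJ = "\u200D"     # ZERO WIDTH JOINER, combining class 0


def codepoints(text):
    """Render a string as a list of U+XXXX code points for display."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Order wanted for correct rendering: patah first, then hiriq.
intended = LAMED + PATAH + HIRIQ
normalized = unicodedata.normalize("NFD", intended)

# Same sequence with a ZWJ between the two vowels.
blocked = LAMED + PATAH + ZWJ + HIRIQ
blocked_normalized = unicodedata.normalize("NFD", blocked)

print(codepoints(intended), "->", codepoints(normalized))
print(codepoints(blocked), "->", codepoints(blocked_normalized))
```

The first line printed shows the two vowels swapped; the second shows the sequence unchanged. Whether fonts and layout engines render the ZWJ sequence acceptably is a separate question that this sketch does not address.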
I've spent nine months working on Biblical Hebrew rendering for the major user community (the Society of Biblical Literature and their Font Foundation partners), and their take on this is that a) they want a solution that works with today's technology, and b) they will avoid Unicode canonical ordering like the plague and use custom normalisations instead.

And how is implementing a custom normalization not a matter of developing a new implementation? It doesn't even begin to deal with the problem of what happens if the text escapes out into the Internet context, which won't be using the same custom normalization. Implementing a custom text representational convention seems like a much more straightforward task to me.

When we conducted normalisation tests, switching from Unicode normalisation to a custom normalisation that does not re-order vowels or meteg*, we increased the number of unique consonant + mark(s) sequences encoded in the Old Testament text by more than 340. This means that Unicode normalisation was creating 340 textual ambiguities by treating lexically distinct sequences as canonically equivalent. I don't think that kind of textual ambiguity is 'overblown'.

Introduce a canonical reordering blocker (cc=0) into the textual sequences which get ordered in ways that lead to textual ambiguities, and the textual ambiguities should
Re: Major Defect in Combining Classes of Tibetan Vowels: Illustration
Difficulties due to the present combining class values attached to these characters most frequently occur with abbreviations/contractions and/or with cursive scripts. With abbreviations it is common to have two or more vowels on a consonant stack. In cursive or semi-cursive forms of Tibetan script the subjoined vowels 0F71, 0F74 and 0F75 form ligatures with the consonant(s) in the stack, while above headline vowel(s) such as U+0F72, U+0F7A and U+0F7C sometimes form a ligature with the following consonant or punctuation mark.

In Dzongkha (Bhutanese) abbreviated spellings are often the usual way of writing words and a semi-cursive form of Tibetan script (Joyig) is standard - so the problem frequently occurs. I have a 225 page dictionary, and several other lists, of common abbreviations which are full of examples where this problem occurs. I've attached a couple of real and fairly simple examples.

Example 1

Following normal orthographic rules the characters to produce Example1_gtuig.jpg would be entered as:

U+0F42 U+0F4F U+0F74 U+0F72 U+0F42

If the characters remain in that order there is no problem:

- the first U+0F42 is straightforward, the isolated character is displayed as a simple glyph uni0F42
- the sequence U+0F4F U+0F74 is replaced by a ligature uni0F4F0F74
- U+0F72 U+0F42 is replaced by a ligature uni0F720F42

Now if the text goes through a normalisation process the same text ends up reordered as:

U+0F42 U+0F4F U+0F72 U+0F74 U+0F42

because the combining class value of U+0F72 is less than that of U+0F74.

To render this there is no change for the first character but I now need a lookup to render the whole sequence:

U+0F4F U+0F72 U+0F74 U+0F42

with two glyphs uni0F4F0F74 uni0F720F42

Example 2

Following normal orthographic rules the characters to produce Example1_gtuop.jpg would be entered as:

U+0F42 U+0F4F U+0F74 U+0F7C U+0F54

If the characters remain in that order there is no problem:

- the first U+0F42 is as in the first example
- the sequence U+0F4F U+0F74 is replaced by a ligature uni0F4F0F74
- U+0F7C U+0F54 is replaced by a ligature uni0F7C0F54

However, since the combining class value of U+0F7C is less than that of U+0F74, after a normalisation process the same text ends up reordered as:

U+0F42 U+0F4F U+0F7C U+0F74 U+0F54

and the whole sequence:

U+0F4F U+0F7C U+0F74 U+0F54

needs to be replaced with the two glyphs uni0F4F0F74 uni0F7C0F54.

Example 3 - (Example3_aMi-aiM.jpg)

This is taken from an entirely different source, the TibetBT font which was specially created for a project in Sichuan digitising the Tibetan bstan-'gyur (a vast canonical collection of texts in over 200 large volumes originally translated from Sanskrit into Tibetan). The glyph set of the font is the same as the set of Tibetan stacks found in that collection. All stacks including any combining vowels are implemented as precomposed ligatures. This font can be downloaded from (though it is wrapped-up in a Windows setup.exe file).

Here we have two stacks which one would naturally enter as U+0F68 U+0F7E U+0F72 and U+0F68 U+0F72 U+0F7E respectively. No problem so long as the characters remain in that order. However since U+0F72 has a combining class value greater than that of U+0F7E - in a process of normalisation U+0F72 would always float to the end and both strings would end up as U+0F68 U+0F7E U+0F72 and be indistinguishable.
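The reordering described in Examples 1 and 2 can be reproduced with Python's standard unicodedata module. This is a minimal sketch rather than anything from the original message: the helper names parse and show are made up for illustration, and NFD is used here although every normalization form applies the same canonical reordering of combining marks.

```python
import unicodedata


def parse(sequence):
    """Turn a string like 'U+0F42 U+0F4F' into the characters it names."""
    return "".join(chr(int(token[2:], 16)) for token in sequence.split())


def show(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The sequences as entered, following normal orthographic rules.
example1 = parse("U+0F42 U+0F4F U+0F74 U+0F72 U+0F42")
example2 = parse("U+0F42 U+0F4F U+0F74 U+0F7C U+0F54")

for entered in (example1, example2):
    normalized = unicodedata.normalize("NFD", entered)
    # Print the entered order next to the order after normalisation.
    print(show(entered), "->", show(normalized))
```

In both cases the shabkyu U+0F74 (combining class 132) ends up after the above headline vowel (combining class 130), which separates it from the consonant it should ligate with.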
If there were only a small and fixed number of cases like the first two examples it would not be *much* of a problem to add the extra lookups - even though my font would need both many to one and many to many lookups to handle it. But there are *numerous* cases I already know of and there is no fixed and final list of such abbreviations. So I should really build the tables in my font to be able to handle almost any possibility.

If the combining classes of vowel marks were based on the expected order, where subjoined vowels are always written before any above headline vowels, this would be reasonably straightforward to do - but given the order in which they may now wind up after normalisation, it requires adding a huge number of complex lookups to the tables in my font. Once I've done this it is going to be very difficult to test all the permutations. Because of the number of additional lookups I need it is also likely there will be a hefty performance hit - especially on reflowing large documents.

Unfortunately the third example can't simply be fixed by font lookups since two distinct combinations wind up being identical and hence would have to be rendered identically.

If I wrote a piece of software where values I'd assigned caused problems and inefficiencies like this, I'd count it as a major fault or bug and hurry to fix it by assigning the correct values.

I know the Tibetan characters were discussed in great detail by a number of experts at the time they were encoded - however there was little or no substantial discussion
Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)
When, in the Bible, one sees two vowels on a given consonant, it isn't so. There is one vowel for the consonant one sees, and another vowel for an invisible consonant. The proper way to encode it is to use some code to represent the invisible consonant. Then the problem mentioned below does not arise. For example, the word Jerusalem is often spelled without the Yod, to which the Hiriq belongs.

Jony

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of John Hudson
Sent: Wednesday, June 25, 2003 11:21 PM
To: John Cowan
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: SPAM: Re: Major Defect in Combining Classes of Tibetan Vowels

At 01:15 PM 6/25/2003, John Cowan wrote:

I don't understand how the current implementation breaks BH text. At worst, normalization may put various combining marks in a non-traditional order, but all alternative orders are canonically equivalent anyway, and no (ordinary) Unicode process should depend on any specific order.

In Biblical Hebrew, it is possible for more than one vowel to be attached to a single consonant. This means that it is very important to maintain the ordering of vowels applied to a single consonant. The Unicode Standard assigns an individual combining class to every vowel, meaning that NFC normalisation may re-order vowels on a consonant. This is not simply 'non-traditional' but results in incorrect rendering and a different vocalisation of the text. The point is that hiriq before patah is *not* canonically equivalent to patah before hiriq, except in the erroneous assumption of the Unicode Standard: the order of vowels makes words sound different and mean different things.
In order to correctly encode and render the Biblical Hebrew text, it is necessary to either

a) never use normalisation routines that re-order marks (which is beyond the control of document authors), or
b) re-classify the existing Hebrew marks so that all vowels are in a single class and will not be re-ordered during normalisation, or
c) encode new marks for Biblical Hebrew with all vowels in a single class.

There are a few other desirable changes to the combining class assignments for some Hebrew accents, which make rendering easier and are more linguistically logical, but the vowels are the most problematic.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Major Defect in Combining Classes of Tibetan Vowels
Chris Fynn wrote:

In Unicode's UnicodeData.txt ( http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ) 0F7E has a Canonical Combining Class Value (CCCV) of 0; 0F71 a CCCV of 129; 0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130; 0F74 a CCCV of 132; and 0F82 and 0F83 have a CCCV of 230.

By normal Tibetan Dzongkha spelling, writing, and input rules Tibetan script stacks should be entered and written:

1 headline consonant (0F40-0F6A), any subjoined consonant(s) (0F90-0F9C), achung (0F71), shabkyu (0F74), any above headline vowel(s) (0F72 0F7A 0F7B 0F7C 0F7D and 0F80); any ngaro (0F7E, 0F82 and 0F83)

So following normal Tibetan Dzongkha input and spelling rules the relative ordering of these characters should be:

A. 0F71
B. 0F74
C. 0F72 0F7A 0F7B 0F7C 0F7D and 0F80
D. 0F7E, 0F82 and 0F83

The fact that, in a process of canonical decomposition or normalisation, these combining characters can get reordered in a bizarre order relative to each other

Actually, looking at this data, while I can see that the combining classes are assigned less than optimally, I don't see that this makes any practical problem for Tibetan data.

You are saying, in effect, that the stack structure has the following position classes (treating the consonant stack itself as the more tightly bound unit that I will just symbolize as CS):

CS - achung - shabkyu - vowelsabove - ngaro

And since shabkyu has cc=132 whereas the vowelsabove have cc=130, they would reorder out of expected order if normalized.

However, for most text the shabkyu (u-below) would be in complementary distribution with the vowels above, so the effective positional classes are:

{ vowelsabove } CS - achung - { shabkyu } - ngaro

And in this case, the relative combining class of the vowels doesn't really matter, since we wouldn't be seeing both present to reorder around each other.

I'm guessing that you are claiming there are instances where the shabkyu does co-occur with other vowels above as well.
Wouldn't those, if they do occur, represent a distinctly minority case in terms of the overall processing? The short summaries of Tibetan writing that I've seen don't even mention it as a possibility, since even the few diphthongs in -u are written with a separate stack 0F60, 0F74 to the right of the main stack.

causes difficulties with culturally correct collation (where 0F7E, 0F82 and 0F83 should have an equal value) - and especially it necessitates making lookups in smart fonts far more complex and inefficient than they should have to be.

And I'm not seeing the problem here, either. Since the combining class of 0F7E is 0, and not some other random value, it isn't going to reorder around the other vowel marks. If it is entered in the traditional spelling order you have indicated, then it is going to stay in that position; normalization won't move it. And since the equivalent 0F82 and 0F83 sift to the end of the syllable, with their high combining class, they'll end up in the same position as the 0F7E ngaro if normalized.

The only problem you'd have is with Tibetan data where a 0F7E ngaro is entered in other than the optimal spelling order you indicated. Such a sequence won't compare equal unless you add a spelling equivalence rule on top of the canonical equivalence. But there are a number of such edge cases for Brahmic scripts -- not just Tibetan.

Culturally correct collation is first a matter of giving the three ngaro characters equivalent weights. Beyond that, as you indicated, the weighting of the syllables (or stacks) is complicated, and isn't going to be affected by 0F7E having combining class 0 in any case.

(In Tibetan script fonts 0F71 and 0F74 are often ligated with preceding consonant (+ subjoined consonants) combined as a single glyph whereas above headline vowels are almost always treated as non spacing combining marks.)

Yes, but the only point where this would be a problem would be for stacks with a shabkyu (u vowel) *and* another vowel.
And even for such cases, wouldn't this be handled effectively by 6 triples in the ligature tables which would identify any shabkyu moved after one of the other 6 vowels?

Currently there seems to be no easy or standardized work around for these problems and the standard seems to say that the relative values of assigned Canonical Combining Class Values cannot be changed.

They cannot.

Any suggestions as to how to create a standardized work around for these incorrect values?

I guess I'm not getting it. I don't see the need for a standardized work around, here.

--Ken

- Chris
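The point about 0F7E staying where it is typed, while 0F82 and 0F83 sift to the end, can be illustrated with Python's standard unicodedata module. This is a rough sketch under stated assumptions: the base letter 0F67 and the helper name distinct_normal_forms are chosen here for illustration only, and NFD stands in for any normalization form. It counts how many distinct strings remain after normalising every possible entry order of achung, shabkyu and a ngaro on one base letter.

```python
import unicodedata
from itertools import permutations

BASE = "\u0F67"  # TIBETAN LETTER HA, used only as an example base consonant


def distinct_normal_forms(marks):
    """Return the set of NFD results over every ordering of the given marks."""
    return {
        unicodedata.normalize("NFD", BASE + "".join(order))
        for order in permutations(marks)
    }


# achung (ccc 129), shabkyu (ccc 132) and the ngaro 0F7E (ccc 0)
with_0f7e = distinct_normal_forms("\u0F71\u0F74\u0F7E")

# the same stack with 0F82 (ccc 230) as the ngaro instead
with_0f82 = distinct_normal_forms("\u0F71\u0F74\u0F82")

print(len(with_0f7e), "distinct forms with 0F7E")
print(len(with_0f82), "distinct form with 0F82")
```

With 0F82 all six entry orders collapse to one normalised string. With 0F7E several distinct strings survive, because a character with combining class 0 never moves and splits the other marks into separate runs; this is the case that would need a spelling equivalence rule on top of canonical equivalence.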
Re: Major Defect in Combining classes of Tibetan Vowels
From: Christopher John Fynn [EMAIL PROTECTED]

So following normal Tibetan Dzongkha input and spelling rules the relative ordering of these characters should be:

A. 0F71 (CCV=129)
B. 0F74 (CCV=132)
C. 0F72, 0F7A, 0F7B, 0F7C, 0F7D, 0F80 (CCV=130)
D. 0F7E (CCV=0), 0F82, 0F83 (CCV=230)

Apart from defining a UCA-based decomposition, there does not seem to be an easy solution. This would require preprocessing of text similar to what is done for Arabic or Brahmic script layout processing (where ligature and character or subglyph reordering is performed before looking up glyphs and ligatures in fonts). On Windows, it would require using Uniscribe, but for collation, changes are still possible, because the UCA order can still be modified to document these reordering rules.
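As a rough illustration of the kind of preprocessing mentioned above, the following Python sketch first prints the combining classes Python's unicodedata module reports for these characters, then re-sorts runs of vowel marks back into the orthographic order A, B, C, D before any glyph lookup. Everything here is an assumption made for illustration: the names ORTHO_RANK and to_orthographic_order are hypothetical and are not part of Uniscribe or any layout engine, and 0F7E is deliberately left out of the table because, with combining class 0, its position is significant and should not be changed.

```python
import unicodedata

# Orthographic rank of each mark (A=0, B=1, C=2, D=3), as listed in the message.
ORTHO_RANK = {"\u0F71": 0, "\u0F74": 1, "\u0F82": 3, "\u0F83": 3}
ORTHO_RANK.update({ch: 2 for ch in "\u0F72\u0F7A\u0F7B\u0F7C\u0F7D\u0F80"})

# Combining classes as reported by the Unicode data bundled with Python.
for ch in sorted(ORTHO_RANK) + ["\u0F7E"]:
    print(f"{ord(ch):04X} ccc={unicodedata.combining(ch)}")


def to_orthographic_order(text):
    """Re-sort each run of ranked marks into orthographic order (stable sort)."""
    out, run = [], []
    for ch in text:
        if ch in ORTHO_RANK:
            run.append(ch)          # collect consecutive vowel marks
        else:
            out.extend(sorted(run, key=ORTHO_RANK.__getitem__))
            run = []
            out.append(ch)          # any other character ends the run
    out.extend(sorted(run, key=ORTHO_RANK.__getitem__))
    return "".join(out)


# A normalised string with shabkyu after the above headline vowel ...
normalised = unicodedata.normalize("NFD", "\u0F42\u0F4F\u0F74\u0F72\u0F42")
# ... is put back into the order a font's ligature lookups would expect.
restored = to_orthographic_order(normalised)
```

The restored string is canonically equivalent to the normalised one, so a step like this would only be safe inside a rendering pipeline, not as a change to stored text.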
Re: Major Defect in Combining classes of Tibetan Vowels
Philippe,

By relative ordering I did not mean relative collation weights but the order in which these combining characters are usually entered relative to other characters and each other - and the order relative to each other in which they should be stored in a string.

The current CCCV weights for these characters mean that they can end up in a bizarre order which makes no sense, serves no useful purpose and complicates rendering and collation.

The only thing I did mention specifically about collation is that 0F7E, 0F82 and 0F83 should generally be treated as equivalent for collation purposes. Culturally correct collation rules for Dzongkha and Tibetan are *very* complex when compared with those for any other language I know of and I don't want to get into all that here.

- Chris