Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-27 Thread Christopher John Fynn


Rick McGowan [EMAIL PROTECTED] has privately suggested moving
the discussion of Combining Classes of *Tibetan* Characters
from the main Unicode list [EMAIL PROTECTED] to the TIBEX list
[EMAIL PROTECTED] - an experts list which was set up several
years ago specifically to discuss proposals for encoding Tibetan
characters in Unicode.  If there are people who have a
particular interest in Tibetan characters and have been
following the thread here who would like to continue following
this thread - perhaps they could ask Rick how they can join that
list.

I'll follow Rick's advice - perhaps this discussion is more
appropriate on the TIBEX list - even though similar issues with
some Hebrew characters, which have been raised here (again) as a
result of this thread, make me think there may be a need for a
non-script-specific solution or work-around to problems with
canonical combining class values.

Anyway I'm going to move this discussion over there with a
parting shot...

Off-list Robert Chilton has pointed out to me the following:

 3. A very common occasion of 0F7E occurring with a vowel is in the
 stack HaUm (orthographic sequence of 0F67 0F71 0F74 0F7E).  Because
 0F7E is currently assigned a cc of zero, this *same glyph-form* could
 theoretically be encoded with a total of 6 different character
 sequences, resulting in 4(!) different sequences following
 normalization.  Properly, all 6 sequences should normalize to the
 same sequence -- which is indeed the case if 0F82 or 0F83 is used in
 place of 0F7E.  Obviously a major problem, not only for rendering but
 also for searching and sorting.
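
The count Robert gives can be checked with a minimal Python sketch. It assumes the standard library's unicodedata tables carry the combining classes under discussion (0F71 = 129, 0F74 = 132, 0F7E = 0, 0F82 and 0F83 = 230); the helper name distinct_after_nfd is made up for illustration.

```python
import itertools
import unicodedata

BASE = "\u0F67"                # TIBETAN LETTER HA
VOWELS = ("\u0F71", "\u0F74")  # vowel signs with classes 129 and 132


def distinct_after_nfd(sign):
    """Count the distinct NFD results over all orderings of the two
    vowel signs plus `sign` written after the base letter."""
    marks = VOWELS + (sign,)
    results = {
        unicodedata.normalize("NFD", BASE + "".join(order))
        for order in itertools.permutations(marks)
    }
    return len(results)


for sign in ("\u0F7E", "\u0F82", "\u0F83"):
    ccc = unicodedata.combining(sign)
    print(f"U+{ord(sign):04X} ccc={ccc}: {distinct_after_nfd(sign)} form(s)")
# U+0F7E gives 4 forms; U+0F82 and U+0F83 give 1 form each
```

Because 0F7E has class 0 it acts as a barrier: marks on either side of it are never reordered across it, which is why six input orders collapse to only four.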

FOUR different sequences possible *after* normalisation ???

Personally I would rather have seen all Tibetan characters
given a CCV of 0 (and all pre-combined Tibetan characters
*strongly* deprecated) than this. If someone simply
follows the normal rules for writing Tibetan, then characters
will be entered in a very predictable order which is far easier
to process than the one(s) they can end up in after Unicode
normalisation.

- Chris Fynn

BTW My apologies to anyone who receives two copies of this
message.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable

Ken Whistler wrote on 06/25/2003 05:29:59 PM:

  The point is that hiriq before patah is *not* 
  canonically equivalent to patah before hiriq,
 
 This is true. 
 
  except in the erroneous 
  assumption of the Unicode Standard: the order of vowels makes words 
  sound different and mean different things.
 
 This is not.

Ken, I think you're reading John differently than he intended: the Unicode 
character sequences <hiriq, patah> and <patah, hiriq> *are* 
canonically equivalent, but the requirements for Biblical Hebrew are that 
alternate visual orders would correspond to different vocalizations, and 
thus the visual ordering of these does matter semantically, and therefore 
the encoded orders should *not* be canonically equivalent.
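
The equivalence itself is easy to observe. The following is a small Python sketch, assuming the standard unicodedata module; lamed is chosen as the base letter purely for illustration, and canonically_equivalent is a hypothetical helper name.

```python
import unicodedata

LAMED, HIRIQ, PATAH = "\u05DC", "\u05B4", "\u05B7"


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


hiriq_first = LAMED + HIRIQ + PATAH
patah_first = LAMED + PATAH + HIRIQ

# Combining classes: hiriq is 14, patah is 17
print(unicodedata.combining(HIRIQ), unicodedata.combining(PATAH))

# Both spellings normalize to the hiriq-first order
print(canonically_equivalent(hiriq_first, patah_first))  # True
print(unicodedata.normalize("NFD", patah_first) == hiriq_first)  # True
```

Whatever order the author typed, a normalizing process returns hiriq before patah, so the distinction is not preserved.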


 The current situation is not optimal for implementations, nor
 does canonically ordered text follow traditional preferences
 for spelling order -- that we can agree on. But I think the
 claims of inadequacy for the representation or rendering
 of Biblical Hebrew text are overblown.

The serious problem is that the writing distinctions that matter cannot 
currently be reliably represented, as they are not preserved under 
canonical ordering / normalization. This is all just a rehash of 
discussions we had on this list back in December, at which time it was 
acknowledged that this was the case, and that this was a problem.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable

Michael Everson wrote on 06/25/2003 04:36:20 PM:

[ re Biblical Hebrew ]

 Write it up with glyphs and minimal pairs and people will see the 
 problem, if any. Or propose some solution. (That isn't add duplicate 
 characters.)

The only solution that UTC is willing to consider I have already submitted 
in a proposal (L2/03-195).



- Peter



Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Peter_Constable

Jony Rosenne wrote on 06/26/2003 12:16:22 AM:

 When, in the Bible, one sees two vowels on a given consonant, it isn't so.

That's silly. When one sees two vowels on a given consonant in the Bible, 
it *is* so: the two vowels are written there. It may not correspond to 
actual phonology, ie what is spoken, but as has been made clear on many 
occasions, Unicode is not encoding phonology, it is encoding text. And in 
relation to text, your statement is simply wrong.


 There is one vowel for the consonant one sees, and another vowel for an
 invisible consonant. The proper way to encode it is to use some code to
 represent the invisible consonant. Then the problem mentioned below does
 not arise.

The idea of an invisible consonant would amount to encoding a phonological 
entity, which is the kind of thing that was at one time approved for Khmer 
(invisible characters representing inherent vowels), but later turned into 
an albatross, and when I proposed the same thing (invisible inherent 
vowel) for Syloti Nagri, it was made very clear to me that it would not go 
down well with UTC.

Also, the proposed solution of an invisible consonant would leave 
unresolved the problem of meteg-vowel ordering distinctions, while the 
alternate proposal of having meteg and vowels all with a class of 230 
solves both problems at once. Two ad hoc solutions (one for multi-vowel 
ordering, and another for meteg-vowel ordering) must certainly be far less 
preferable than one motivated solution (having characters with canonical 
combining classes that are appropriate for the writing behaviours 
exhibited).
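
Why a shared class preserves the written order can be sketched in Python. This is only a simulation: canonical_order is a simplified reimplementation of canonical reordering for already-decomposed text, and proposed_ccc is a hypothetical class table in which hiriq, patah and meteg all get 230; it is not taken from the text of L2/03-195.

```python
import unicodedata


def canonical_order(text, ccc=unicodedata.combining):
    """Stable-sort each run of non-starters by combining class.
    Assumes `text` is already fully decomposed."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


LAMED, HIRIQ, PATAH, METEG = "\u05DC", "\u05B4", "\u05B7", "\u05BD"
SAME_CLASS = {HIRIQ, PATAH, METEG}  # hypothetical: all treated as class 230


def proposed_ccc(ch):
    return 230 if ch in SAME_CLASS else unicodedata.combining(ch)


meteg_first = LAMED + METEG + PATAH

# Current classes (meteg 22, patah 17): the order is swapped
print(canonical_order(meteg_first) == LAMED + PATAH + METEG)  # True

# Shared class: the stable sort leaves the typed order alone
print(canonical_order(meteg_first, proposed_ccc) == meteg_first)  # True
```

Marks with equal classes are never reordered relative to each other, so one change would cover both the multi-vowel and the meteg-vowel cases.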

I invite people to review the discussions from the unicoRe list from last 
December, at which time everyone (including you, Jony) was concluding 
that the solution which I proposed in L2/03-195 was the best solution to 
pursue.


- Peter



Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable

John Hudson wrote on 06/25/2003 06:47:44 PM:

 This is not. The Unicode Standard makes no assumptions or claims
 about what the phonological or meaning equivalence of hiriq, patah
 or patah, hiriq is for Biblical Hebrew.
 
 But it does make assumptions about the canonical equivalence of the mark
 orders U+05B4, U+05B7 and U+05B7, U+05B4, unless my understanding of
 the purpose of combining classes is completely mistaken.

Your understanding on this point is correct.


 My understanding 
 is that any ordering of two marks with different combining classes is 
 canonically equivalent; 

Yes.


 further, I understand that some normalisation forms 
 will re-order marks to move marks with lower combining class values 
 closer to the base character.

*Every* Unicode normalization form will apply canonical reordering.
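
A short Python sketch, assuming the standard unicodedata module and an illustrative lamed base, shows all four forms returning the same reordered sequence:

```python
import unicodedata

# lamed + patah (class 17) + hiriq (class 14), i.e. not in canonical order
text = "\u05DC\u05B7\u05B4"

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, text)
    results[form] = normalized
    print(form, " ".join(f"{ord(c):04X}" for c in normalized))
# Each form prints 05DC 05B4 05B7
```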



 * Meteg re-ordering is in some respects even more problematic than 
 multi-vowel re-ordering

And it is because of meteg-vowel ordering distinctions that the ordering 
of things like patah + hiriq should not be solved in any way other than 
the two having the same canonical combining class, because that is exactly 
what will be needed to deal with meteg-vowel ordering distinctions.



- Peter



RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Jony Rosenne

It may look silly, but it is correct. What you see are letters according to
the writing tradition, which does not include a Yod, and vowels according to
the reading tradition which does. There are in the Bible other, more extreme
cases. 

I don't think we need any new characters, ZERO WIDTH SPACE would do and it
requires no new semantics. Moreover, everybody who knows his Hebrew Bible
knows the Yod is there although it isn't written.

The Meteg is a completely different issue. There is a small number of places
where the Meteg is placed differently. Since it does not behave the same as
the regular Meteg, and is thus visually distinguishable, it should be
possible to add a character, as long as it is clearly named.

Jony


RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread John Hudson

At 04:26 AM 6/26/2003, Jony Rosenne wrote:

 I don't think we need any new characters, ZERO WIDTH SPACE would do and it
 requires no new semantics.

ZERO WIDTH SPACE would screw up search and sort algorithms, I think, 
because it is not a control character per se and may not be ignored as desired.

I've made some tests using Ken's ZWJ suggestion and, as feared, it messes 
with the glyph positioning lookups. The results varied slightly between MS 
RichText clients and InDesign ME, but both displayed marks incorrectly when 
ZWJ was inserted. I strongly suspect that this is not something that can 
easily be resolved in the glyph shaping model.
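
On the normalization side only, the effect of an inserted ZWJ or ZERO WIDTH SPACE can be sketched in Python. This assumes the standard unicodedata module and an illustrative lamed base; it says nothing about glyph positioning, searching or sorting, which are the problems described above.

```python
import unicodedata

LAMED, HIRIQ, PATAH = "\u05DC", "\u05B4", "\u05B7"
ZWJ, ZWSP = "\u200D", "\u200B"

survives = {}
for name, separator in (("ZWJ", ZWJ), ("ZWSP", ZWSP)):
    # patah, then the invisible character, then hiriq
    text = LAMED + PATAH + separator + HIRIQ
    normalized = unicodedata.normalize("NFD", text)
    survives[name] = normalized == text
    print(
        name,
        "category:", unicodedata.category(separator),
        "ccc:", unicodedata.combining(separator),
        "order preserved:", survives[name],
    )
```

Both characters have combining class 0, so the marks on either side are not reordered across them; the cost is an extra character inside the cluster that every other process then has to cope with.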

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco

RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Peter_Constable

Jony Rosenne wrote on 06/26/2003 06:26:02 AM:

 It may look silly, but it is correct. What you see are letters according
 to the writing tradition, which does not include a Yod, and vowels
 according to the reading tradition which does.

I understand that. My point was, you were talking about phonology, but in 
terms of the text, it was not correct: there *are* multiple vowels on a 
single consonant.


 There are in the Bible other, more extreme
 cases. 

I'd be interested in whatever info you can provide in that regard.


 
 I don't think we need any new characters, ZERO WIDTH SPACE would do and
 it requires no new semantics.

No, that's a terrible solution: a space creates unwanted word boundaries.


 Moreover, everybody who knows his Hebrew Bible
 knows the Yod is there although it isn't written.

But the point is, how do people encode the text? The yod is not there in 
the text. How does a publisher encode text in the typesetting process? How 
do researchers encode the text they want to analyze? Saying "everybody 
knows there's a yod there" doesn't provide a solution, particularly given 
that the researchers know in point of fact that the consonantal text 
explicitly does not include a yod.


 
 The Meteg is a completely different issue. There is a small number of
 places were the Meteg is placed differently. Since it does not behave the
 same as the regular Meteg, and is thus visually distinguishable, it should
 be possible to add a character, as long as it is clearly named.

That is a potential solution, though it would have to be *two* additional 
metegs.



- Peter



RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Jony Rosenne

That may be what you see. Myself, every time I look at it, I see an orphaned
Hiriq without a consonant. It is normally placed in between the Lamed and
the Mem, to make certain the point isn't missed (a pun). 

Jony


Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Peter_Constable

Christopher John Fynn wrote on 06/21/2003 08:23:17 PM:

  Any suggestions as to how to create a standardized work around
  for these incorrect values?

Propose new characters, and deprecate the old ones?



- Peter



Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 00:56 -0500 2003-06-25, [EMAIL PROTECTED] wrote:
 Christopher John Fynn wrote on 06/21/2003 08:23:17 PM:

   Any suggestions as to how to create a standardized work around
   for these incorrect values?

 Propose new characters, and deprecate the old ones?

Fix the bloody errors, for heaven's sake.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Valeriy E. Ushakov

On Wed, Jun 25, 2003 at 02:10:44 -0700, Andrew C. West wrote:

 I've never really understood normalization, but it seems to me that
 normalising bcuig 0F56, 0F45, 0F74, 0F72, 0F42 to bciug 0F56,
 0F45, 0F72, 0F74, 0F42 is wrong as bciug could conceivably be a
 shorthand abbreviation for a wcompletely different word with a gigu
 [i] on the first syllable and a shabkyu [u] on the second syllable.

Err, as in this particular case one vowel sign is above and the other
one is below the stack - i.e. they don't interact spatially - you
cannot really distinguish them. ;)

SY, Uwe
-- 
[EMAIL PROTECTED] |   Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/|   Ist zu Grunde gehen

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West

On Wed, 25 Jun 2003 15:05:26 +0400, Valeriy E. Ushakov wrote:

 Err, as in this particular case one vowel sign is above and the other
 one is below the stack - i.e. they don't interact spatially - you
 cannot really distinguish them. ;)

I know that the vowel signs do not interact with each other typographically, but
what's that got to do with anything ? I'm talking about the logical ordering of
the Unicode codepoints used to encode some Tibetan text, not the physical
appearance of the glyphs that are used to render that sequence of codepoints.

What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45,
0F72, 0F74 should be rendered identically, the logical ordering of the
codepoints representing the vowels may represent lexical differences that would
be lost during the process of normalisation.
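
The reordering in question can be reproduced with a short Python sketch, assuming the standard unicodedata module. The variable names are illustrative; gigu (0F72) has combining class 130 and shabkyu (0F74) has 132.

```python
import unicodedata

CA, GIGU, SHABKYU = "\u0F45", "\u0F72", "\u0F74"

cui = CA + SHABKYU + GIGU  # 0F45, 0F74, 0F72
ciu = CA + GIGU + SHABKYU  # 0F45, 0F72, 0F74

# The u-before-i spelling does not survive normalization
print(unicodedata.normalize("NFD", cui) == ciu)  # True
print(unicodedata.normalize("NFD", ciu) == ciu)  # True

# The longer example from earlier in the thread behaves the same way
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"
print(unicodedata.normalize("NFD", bcuig) == bciug)  # True
```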

Andrew

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Valeriy E. Ushakov

On Wed, Jun 25, 2003 at 07:31:51 -0700, Andrew C. West wrote:

  Err, as in this particular case one vowel sign is above and the other
  one is below the stack - i.e. they don't interact spatially - you
  cannot really distinguish them. ;)
 
 I know that the vowel signs do not interact with each other
 typographically, but what's that got to do with anything ? I'm
 talking about the logical ordering of the Unicode codepoints used to
 encode some Tibetan text, not the physical appearance of the glyphs
 that are used to render that sequence of codepoints.
 
 What I'm suggesting is that although cui 0F45, 0F74, 0F72 and
 ciu 0F45, 0F72, 0F74 should be rendered identically, the logical
 ordering of the codepoints representing the vowels may represent
 lexical differences that would be lost during the process of
 normalisation.

And given that the two look identical in writing in the first place,
this lexical difference had a chance to originate exactly *where*?
You are putting the cart before the horse.

Also note that the original question from Chris is about things that
do interact spatially.

SY, Uwe
-- 
[EMAIL PROTECTED] |   Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/|   Ist zu Grunde gehen

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael \(michka\) Kaplan

Let me add that this was the case recently for Hebrew (to mention one
example). So it is certainly not impossible.

But we have enough real work to do that we should do our best to veer from
the theoretical. :-)

MichKa

- Original Message - 
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; Andrew C. West
[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 8:11 AM
Subject: Re: Major Defect in Combining Classes of Tibetan Vowels


 From: Andrew C. West [EMAIL PROTECTED]

  What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu
  0F45, 0F72, 0F74 should be rendered identically, the logical ordering
  of the codepoints representing the vowels may represent lexical
  differences that would be lost during the process of normalisation.

 Do you (or does anyone) have an actual example where this is the case? It
 may well be true but until someone has a proof there is not really an
 indication of a specific problem for the UTC to address.

 The current discussion is like arguing about a color that none of the
 participants have ever seen.

 MichKa

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Philippe Verdy

On Wednesday, June 25, 2003 4:31 PM, Andrew C. West [EMAIL PROTECTED] wrote:
 On Wed, 25 Jun 2003 15:05:26 +0400, Valeriy E. Ushakov wrote:
 What I'm suggesting is that although cui 0F45, 0F74, 0F72 and
 ciu 0F45, 0F72, 0F74 should be rendered identically, the logical
 ordering of the codepoints representing the vowels may represent
 lexical differences that would be lost during the process of
 normalisation. 

This is an excellent argument, and that's why the Vietnamese usage of multiple 
diacritics was studied so that it can preserve the logical ordering of accents on 
Latin letters. However if the actual rendered text cannot be distinguished, the 
effective order of diacritics is only important in the mind of the reader but does not 
exist in the written form.

This would be important if there was a need to create a transliteration rule (for 
example from Tibetan to Latin script). But even in that case, knowledge of the origin 
language is required, as no transliteration rule works well using only the script 
information. So transliteration rules are very often context-sensitive.

What is important is how a native Tibetan reader would read the grapheme cluster. If 
they read it as ciu then it is to be interpreted as ciu, and then the logical order 
is more important than the encoding order, because such differences do not exist in the 
actual written script.

If I just take the example of the Latin script, a sequence like C, COMBINING CEDILLA, 
COMBINING ACUTE ACCENT will have a canonical order for the last two diacritics which 
is not important at the linguistic level if looking at the written script. The 
canonical order and combining classes exist only BECAUSE the encoding would allow 
several *equivalent* sequences that no reader would be able to read distinctly. When 
confusion is possible, and these distinctions do not exist in the original 
script before its encoding, there should exist a way to unify all of them.
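
The Latin case can be sketched in Python, assuming the standard unicodedata module. COMBINING CEDILLA has class 202 and COMBINING ACUTE ACCENT has class 230, so both typing orders end up as the same string.

```python
import unicodedata

CEDILLA, ACUTE = "\u0327", "\u0301"

cedilla_first = "C" + CEDILLA + ACUTE
acute_first = "C" + ACUTE + CEDILLA

nfd_a = unicodedata.normalize("NFD", cedilla_first)
nfd_b = unicodedata.normalize("NFD", acute_first)
print(nfd_a == nfd_b == cedilla_first)  # True: cedilla sorts before acute

# Composed form: both collapse to U+1E08,
# LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE
nfc_a = unicodedata.normalize("NFC", cedilla_first)
nfc_b = unicodedata.normalize("NFC", acute_first)
print(nfc_a == nfc_b == "\u1E08")  # True
```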

So even if the canonical ordering of Tibetan vowel signs is not logical, as long as it 
allows the same written text to be produced, this is not a problem, and there is no more 
loss of semantics than in the original script.

So if the Tibetan script cannot make a distinction between ciu and cui, this is 
*not* a Unicode defect. This confusion already exists in the original script, and 
there is no loss of semantics in the Unicode encoding when compared to the actual 
written script. Let's not make a problem by adding new semantics to the Tibetan 
language (such as creating a distinction between ciu and cui) *because* this seems 
/possible/ in Unicode. If we respect a script or language, we must not tolerate such 
artificial distinctions.

It's true that the canonical ordering should match the logical ordering, but I 
think that there are a lot of exceptions, notably within Brahmic scripts with disjoint 
letters, or in Thai (encoded according to a previously existing standard, TIS620, which 
used the visual ordering), or even in many Hebrew or Arabic texts (sometimes encoded 
also with a visual ordering, and requiring some tools to reverse the encoding 
according to a preferred order, because this cannot be decided without an out-of-band 
specification of the actual ordering used in the text)...

So if one wants to really handle the logical ordering, it's perfectly possible to 
exchange the i and u in cui without affecting the canonical equivalence and 
without changing the semantics of the original Tibetan text. Canonical ordering is only 
needed to unify equivalences, but is not intended to sort distinct strings (this is 
not part of the Unicode encoding, but part of a collation algorithm like UCA, tailored 
appropriately for each language on top of the default UCA order for the script).

A correct UCA collation for the Tibetan script can perfectly well be created, and then 
tailored for the Tibetan language to reorder the vowel signs. (This is no more 
complicated than handling a French reordering for accents.) This just requires a 
multi-level sort algorithm, where u and i would have the same collation keys at 
level N, and could be reordered using a French-style reordering of vowel signs for 
keywords or grapheme clusters at level N+1 or N+2.
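
The idea of a multi-level key with French-style accent ordering can be illustrated with a toy Python sketch. This is not the UCA and uses no real collation weights: sort_key is a hypothetical helper that uses base letters as the first level and the combining marks, compared from the end of the word when backward_accents is set, as the second level.

```python
import unicodedata


def sort_key(word, backward_accents=False):
    """Toy two-level key: base letters first, then accents per letter.
    With backward_accents=True, accents are compared from the end of
    the word, in the manner of French dictionary ordering."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += (ord(ch),)  # attach mark to the previous base
        else:
            primary.append(ch)
            secondary.append(())
    if backward_accents:
        secondary.reverse()
    return (primary, secondary)


words = ["c\u00F4t\u00E9", "cot\u00E9", "c\u00F4te", "cote"]

forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backward_accents=True))

print(forward)   # cote, coté, côte, côté
print(backward)  # cote, côte, coté, côté
```

All four words share the same first-level key, so only the second level decides their order, which is the property the paragraph above relies on for vowel signs.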

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 08:11 -0700 2003-06-25, Michael \(michka\) Kaplan wrote:

 Do you (or does anyone) have an actual example where this is the case? It
 may well be true but until someone has a proof there is not really an
 indication of a specific problem for the UTC to address.

A document showing what happens in Case A and what happens in Case B 
with actual glyphs would be helpful.

 The current discussion is like arguing about a color that none of the
 participants have ever seen.

Indeed.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Mark Davis

Michael, that is like saying "move the bloody character" or "remove
the bloody character".

Mark
__
http://www.macchiato.com
  Eppur si muove 


Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Peter Lofting

At 8:11 AM -0700 6/25/03, Michael (michka) Kaplan wrote:
 From: Andrew C. West [EMAIL PROTECTED]

  What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu
  0F45, 0F72, 0F74 should be rendered identically, the logical ordering
  of the codepoints representing the vowels may represent lexical
  differences that would be lost during the process of normalisation.

 Do you (or does anyone) have an actual example where this is the case? It
 may well be true but until someone has a proof there is not really an
 indication of a specific problem for the UTC to address.

 The current discussion is like arguing about a color that none of the
 participants have ever seen.

A list of common contractions would help here. I've seen at least one 
such published collection in the past which listed common 
contractions found in U-Med running text. However I don't have it 
with me. Does anyone on-line have access to a document like this?

Peter

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Mark Davis

this was the case

Someone might misread your statement. We did not change the combining
classes for Hebrew.

Mark
__
http://www.macchiato.com
  Eppur si muove 


Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 09:13 -0700 2003-06-25, Mark Davis wrote:
 Michael, that is like saying "move the bloody character" or "remove
 the bloody character".

   Fix the bloody errors, for heaven's sake.

You'd like to think so. But "Deprecate TIBETAN THINGY and add TIBETAN 
THINGY BIS so that we can fix the problem" is utterly ridiculous.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Philippe Verdy

On Wednesday, June 25, 2003 6:13 PM, Mark Davis [EMAIL PROTECTED] wrote:
 Michael Everson wrote:
  [EMAIL PROTECTED] wrote:
   Christopher John Fynn wrote:
  Any suggestions as to how to create a standardized work around
  for these incorrect values?
   
   Propose new characters, and deprecate the old ones?
  
  Fix the bloody errors, for heaven's sake.

 Michael, that is like saying move the bloody character or remove
 the bloody character.

If there are real distinct semantics that were wrongly unified by the 
canonicalization, the only safe way would be to create a second character that would 
have another combining class than the existing one, to be used when lexical 
distinction from the most common use is necessary.

So the added characters for the modified vowel signs would have the same representative 
glyphs, but would have the additional semantic of "contraction" (clearly indicated in 
their names). This does not break the existing encoding of most texts, but allows a 
specific usage for contractions where the existing canonical equivalences would be 
inappropriate.

-- Philippe.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 18:26 +0100 2003-06-25, Michael Everson wrote:

 You'd like to think so. But "Deprecate TIBETAN THINGY and add 
 TIBETAN THINGY BIS so that we can fix the problem" is utterly 
 ridiculous.

And by that I mean, given the TWO standards Unicode and ISO/IEC 
10646, adding duplicate characters is frowned upon, so it should be 
less preferable than UTC fixing broken classes if they really are 
broken.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Philippe Verdy

 From: Michael (michka) Kaplan [EMAIL PROTECTED]
  From: Michael (michka) Kaplan [EMAIL PROTECTED]
   From: Andrew C. West [EMAIL PROTECTED]
What I'm suggesting is that although cui 0F45, 0F74, 0F72
and ciu 0F45, 0F72, 0F74 should be rendered identically,
the logical ordering of the codepoints representing the vowels
may represent lexical differences that would
be lost during the process of normalisation.

   Do you (or does anyone) have an actual example where this is the
   case? It may well be true but until someone has a proof there is
   not really an indication of a specific problem for the UTC to
   address.

  Let me add that this was the case recently for Hebrew (to mention one
  example). So it is certainly not impossible.

  But we have enough real work to do that we should do our best to
  veer from the theoretical. :-)

Another option for encoding contractions would be to encode an invisible 
letter (with combining class 0) that would prevent the reordering of combining 
characters. To be valid with Tibetan vowels, this character should be 
treated as a base consonant; it would then explicitly form a ligature with the 
preceding encoded cluster to create the actual grapheme cluster.
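
As a rough illustration of how a class-0 character stops reordering, the Python sketch below uses U+034F COMBINING GRAPHEME JOINER purely as a stand-in. That choice is an assumption made for this example: it is a combining mark rather than the invisible base letter described above, and nothing in this message proposes it. The sketch only shows that a character with combining class 0 between two vowel signs keeps normalisation from swapping them.

```python
import unicodedata

CA, SHABKYU, GIGU = "\u0F45", "\u0F74", "\u0F72"
BLOCKER = "\u034F"  # stand-in class-0 invisible character (assumption, see above)

plain = CA + SHABKYU + GIGU
blocked = CA + SHABKYU + BLOCKER + GIGU

def codepoints(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# Without the blocker the two vowel signs are put into canonical order.
print(codepoints(unicodedata.normalize("NFC", plain)))    # 0F45 0F72 0F74
# With a class-0 character between them, the entered order survives.
print(codepoints(unicodedata.normalize("NFC", blocked)))  # 0F45 0F74 034F 0F72
```

Whether a renderer would still draw the blocked sequence as one stack is a separate question that this sketch does not address.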

Why not use, in that case, a halant (virama) character to encode these contractions 
(which would be implicitly obvious to a native Tibetan reader of a rendered or 
printed text, but explicit for a computer program such as a generic indexing engine)?

-- Philippe.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Philippe Verdy

On Wednesday, June 25, 2003 8:14 PM, Peter Lofting [EMAIL PROTECTED] wrote:

 At 7:41 PM +0200 6/25/03, Philippe Verdy wrote:
  If there are real distinct semantics that were abusively unified
  by the canonicalization, the only safe way would be to create a
  second character that would have another combining class than the
  existing one, to be used when lexical distinction from the most
  common use is necessary.
  
  So the added character for the modified vowel signs would have the
  same representative glyph, but would have the additional semantic
  contraction (clearly indicated in their name). This does not break
  the existing encoding of most texts, but allows a specific usage for
  contractions where the existing canonical equivalences would be
  inappropriate.
 
 How do you envisage this getting into the data?
 
 Often in Tibetan data capture, operators are keying in the appearance
 of a text and do not know what a stack represents.
 
 So the data then requires expert review after input to verify and
 assign the semantic representation.

This is not a major problem; in fact it occurs every day in all scripts: there are 
proofreaders, and some dictionary-based corrections that may be used to help correct 
an incorrectly or ambiguously encoded string...

This is true even for all Latin-based languages, where the incorrect accents are used, 
or missing, and only native readers will be able to see the incorrect interpretation 
of a grapheme cluster, using their own knowledge of the language when the error 
(introduced by some intermediate technical constraint such as a past missing standard) 
appears.

I still think that the contraction problem has a limited impact, which does not 
affect the normal written form of the Tibetan language, which clearly uses a single 
interpretation. If both interpretations of a grapheme cluster are needed, then we 
should keep the encoding of the existing characters for the most common interpretation 
(without the contraction semantics), and assign a variant specifically to allow encoding 
the other interpretation or reading of the grapheme cluster.

Legacy encoded text may still contain such ambiguous encodings that will look 
erroneous under the updated standard, but this offers a way to correct the 
encoded text later, by looking for occurrences of such ambiguous sequences and letting 
native readers correct these interpretations, if the correction is absolutely required 
for some text processing.

I do think that most already encoded text will not need such correction, if the 
encoding is just a way to transport a text which is only intended to be rendered or 
printed, but not used with automated lexical analysis. And even in that case, if the 
encoding ambiguity is well documented in a revision of the standard, there is a 
possibility to enhance tools like automated full-text search engines to search for 
both encodings of the character, based on their actually identical glyphic 
representation.

-- Philippe.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Peter Lofting

At 7:41 PM +0200 6/25/03, Philippe Verdy wrote:
If there are real distinct semantics that were abusively unified 
by the canonicalization, the only safe way would be to create a 
second character that would have another combining class than the 
existing one, to be used when lexical distinction from the most 
common use is necessary.

So the added character for the modified vowel signs would have the 
same representative glyph, but would have the additional semantic 
contraction (clearly indicated in their name). This does not break 
the existing encoding of most texts, but allows a specific usage for 
contractions where the existing canonical equivalences would be 
inappropriate.
How do you envisage this getting into the data?

Often in Tibetan data capture, operators are keying in the appearance 
of a text and do not know what a stack represents.

So the data then requires expert review after input to verify and 
assign the semantic representation.

Peter

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Valeriy E. Ushakov

On Wed, Jun 25, 2003 at 09:08:10 -0700, Peter Lofting wrote:

 A list of common contractions would help here. I've seen at least one 
 such published collection in the past which listed common 
 contractions found in U-Med running text. However I don't have it 
 with me. Does anyone on-line have access to a document like this?

A sample list of dbu can contractions from Schmidt grammar:

http://snark.ptc.spbu.ru/~uwe/tibex/contractions/contractions.html


SY, Uwe
-- 
[EMAIL PROTECTED] |   Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/|   Ist zu Grunde gehen

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Rick McGowan

Let me remind you: Talk on this list doesn't mean that the issue is  
automatically brought up for UTC deliberation. If no documents are formally  
submitted, nothing will happen.

After all the discussion of Tibetan, if anyone has a serious concrete  
proposal for a specific change to the Unicode Standard, please write it up  
in detail and submit it.

If you develop such a document you can comment via our reporting page here:

http://www.unicode.org/reporting.html

If the document is more than plain-text, you can arrange to send it by  
talking with me off-list, and I will see that it is properly registered for  
UTC discussion.

Rick

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 12:15 -0700 2003-06-25, John Hudson wrote:

In this case, any existing normalisation for Hebrew is already 
broken -- in the sense of destroying Biblical Hebrew text -- but 
still the argument from the UTC seems to be that even broken 
implementations -- broken because the standard is broken -- must not 
be broken.
That seems very short-sighted indeed.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Jim Allan

Rick McGowan posted and was answered by John Hudson:

If there isn't a visual difference here, how could there be a lexical
difference? Imagine the age before computers. All you have to go on is
what's on the page. There isn't an inherent order in those elements; they
could have been written by the scribe in any order. If they appear the
same, you can't assign different meanings -- except by some extra-syllabic
informational context... right?
On the page, you would know -- or hopefully know -- from context. But a
search engine or a sorting algorithm looking at the characters presumably
needs to know the difference without additional context, hence the
character ordering is important.
I think such distinctions are more than one should expect from a 
standard search engine or from simple sortation.

To move to French, for example, I would not expect to be able to tell 
whether the abbreviation "M." in "M. Bouteillier" stands for "Monsieur" 
or a name like "Marcel".

How do you know except from context whether "med." stands for "medical" 
or "medieval"?

In a company name such as "Perrault & Lavigne", should "&" sort according 
to default Unicode or as "and" or as "et"?

Should it be found from searches on "and", "et", "und" and so forth?

This is the business of application protocol and application utilities.

Indication of proper expansion of abbreviations for sorting and 
searching seems to me to be beyond what Unicode tries to do and what it 
can do reasonably.

If lexical forms in any language have variant meanings, then they are 
not for Unicode to distinguish except occasionally when Unicode provides 
identical glyphs that represent characters with very different 
properties, such as ! (U+0021) for punctuation and ǃ (U+01C3) for a Zulu 
click, in the hope, probably vain, that people in general will recognize 
the difference.

Jim Allan

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Peter_Constable

Michael Kaplan wrote on 06/25/2003 10:55:47 AM:

 Let me add that this was the case recently for Hebrew (to mention one
 example). So it is certainly not impossible.

The Hebrew issue is different: that involves things that *are* visually 
distinct, and that distinction cannot be represented in a reliable manner.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread John Cowan

John Hudson scripsit:

 I'm not saying I like this, but this is how it has been explained to 
 me with regard to the very clearly erroneous Hebrew mark combining classes 
 which demonstrably break Biblical Hebrew text. In this case, any existing 
 normalisation for Hebrew is already broken -- in the sense of destroying 
 Biblical Hebrew text -- but still the argument from the UTC seems to be 
 that even broken implementations -- broken because the standard is broken 
 -- must not be broken.

I don't understand how the current implementation breaks BH text.
At worst, normalization may put various combining marks in a non-traditional
order, but all alternative orders are canonically equivalent anyway, and
no (ordinary) Unicode process should depend on any specific order.

-- 
Not to perambulate John Cowan [EMAIL PROTECTED]
the corridors  http://www.reutershealth.com
during the hours of repose http://www.ccil.org/~cowan
in the boots of ascension.   --Sign in Austrian ski-resort hotel

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Peter_Constable

Andrew C. West wrote on 06/25/2003 09:31:51 AM:

 What I'm suggesting is that although cui 0F45, 0F74, 0F72 and ciu 0F45,
 0F72, 0F74 should be rendered identically, the logical ordering of the
 codepoints representing the vowels may represent lexical differences that
 would be lost during the process of normalisation.

How can things that are visually indistinguishable be lexically different? 
We don't encode the phonological distinctions between homographs; we 
encode text.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael \(michka\) Kaplan

Thank you for [indirectly] making my point for me. I am saying that if
someone has an issue that *does* make a difference then they should bring it
up.

Otherwise, I say that a difference that makes no difference, makes no
difference. And we can move on to actual problems. :-)

MichKa

- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 1:08 PM
Subject: Re: Major Defect in Combining Classes of Tibetan Vowels


 Michael Kaplan wrote on 06/25/2003 10:55:47 AM:

  Let me add that this was the case recently for Hebrew (to mention one
  example). So it is certainly not impossible.

 The Hebrew issue is different: that involves things that *are* visually
 distinct, and that distinction cannot be represented in a reliable manner.


 - Peter


 ---
 Peter Constable

 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler

Peter asked:

 How can things that are visually indistinguishable be lexically different? 

chat (en)
chat (fr)

 We don't encode the phonological distinctions between homographs; we 
 encode text.

But I agree that we encode text. Both words above, which are
*lexically* distinct, would have the same encoded character
representation, and no amount of inspection of the encoding
per se is going to distinguish them.

--Ken

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler

 At 18:26 +0100 2003-06-25, Michael Everson wrote:
 
 You'd like to think so. But Deprecate TIBETAN THINGY and add 
 TIBETAN THINGY BIS so that we can fix the problem is utterly 
 ridiculous.
 
 And by that I mean, given the TWO standards Unicode and ISO/IEC 
 10646, adding duplicate characters is frowned upon, so it should be 
 less preferable than UTC fixing broken classes if they really are 
 broken.

This neglects the fact that for the Unicode Standard (although
not ISO/IEC 10646, for which combining classes and normalization
are irrelevant), destabilization of normalization is as
serious a business as adding duplicate characters. That is
why Mark chimed in earlier with:

 Michael, that is like saying move the bloody character or remove
 the bloody character.

This issue should not be framed as if it were one where
character identity is the higher glory, enshrined in
the superior standard, so that to fix a problem, the
lesser standard, the Unicode Standard, should simply relent
on its own stability guarantees. Instead, the two standards
have synchronized guarantees regarding character identity,
but the Unicode Standard has its own scope beyond 10646,
and in that realm it must respect its own guarantees of
stability, because the users of that standard depend on
them.

In any case, even with the clarification that there are
instances, in Tibetan contractions, of cooccurrence of
shabkyu and vowels above on the same consonant stack, I
am failing to see how the particular combining class
assignment for U+0F74 is creating any serious problem for
the representation of such Tibetan data.

--Ken

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread John Hudson

At 01:15 PM 6/25/2003, John Cowan wrote:

I don't understand how the current implementation breaks BH text.
At worst, normalization may put various combining marks in a non-traditional
order, but all alternative orders are canonically equivalent anyway, and
no (ordinary) Unicode process should depend on any specific order.
In Biblical Hebrew, it is possible for more than one vowel to be attached 
to a single consonant. This means that it is very important to maintain the 
ordering of vowels applied to a single consonant. The Unicode Standard 
assigns an individual combining class to every vowel, meaning that NFC 
normalisation may re-order vowels on a consonant. This is not simply 
'non-traditional' but results in incorrect rendering and a different 
vocalisation of the text. The point is that hiriq before patah is *not* 
canonically equivalent to patah before hiriq, except in the erroneous 
assumption of the Unicode Standard: the order of vowels makes words sound 
different and mean different things.

In order to correctly encode and render the Biblical Hebrew text, it is 
necessary to either a) never use normalisation routines that re-order marks 
(which is beyond the control of document authors), or b) re-classify the 
existing Hebrew marks so that all vowels are in a single class and will not 
be re-ordered during normalisation, or c) encode new marks for Biblical 
Hebrew with all vowels in a single class.
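
To make option (b) concrete, here is a minimal Python sketch of the mark-reordering step with all Hebrew vowel points collapsed into one class. The vowel set (U+05B0 to U+05BB), the shared class value 10 and the function names are assumptions chosen for illustration; this is not the Unicode algorithm, not a full normalisation (it does no decomposition or composition), and not any existing implementation.

```python
import unicodedata

# Hebrew vowel points U+05B0..U+05BB (assumed set for this sketch).
HEBREW_VOWELS = {chr(cp) for cp in range(0x05B0, 0x05BC)}
SINGLE_VOWEL_CLASS = 10  # hypothetical shared class, chosen for illustration

def custom_class(ch):
    """Combining class with all Hebrew vowels collapsed into one class."""
    if ch in HEBREW_VOWELS:
        return SINGLE_VOWEL_CLASS
    return unicodedata.combining(ch)

def reorder_marks(text, class_of=custom_class):
    """Stable-sort each run of combining marks by class."""
    out, run = [], []
    for ch in text:
        if class_of(ch) == 0:
            out.extend(sorted(run, key=class_of))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=class_of))
    return "".join(out)

word = "\u05DC\u05B7\u05B4\u05DD"  # lamed, patah, hiriq, final mem

# With one shared class, the two vowels keep the order they were typed in.
assert reorder_marks(word) == word
# With the standard per-vowel classes, hiriq (14) moves ahead of patah (17).
assert reorder_marks(word, unicodedata.combining) == "\u05DC\u05B4\u05B7\u05DD"
```

Because the sort is stable, marks that share a class are never swapped, which is the property option (b) relies on.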

There are a few other desirable changes to the combining class assignments 
for some Hebrew accents, which make rendering easier and are more 
linguistically logical, but the vowels are the most problematic.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Michael Everson

At 14:20 -0700 2003-06-25, John Hudson wrote:

John,

Write it up with glyphs and minimal pairs and people will see the 
problem, if any. Or propose some solution. (That isn't add duplicate 
characters.)

In Biblical Hebrew, it is possible for more than one vowel to be 
attached to a single consonant. This means that it is very important 
to maintain the ordering of vowels applied to a single consonant. 
The Unicode Standard assigns an individual combining class to every 
vowel, meaning that NFC normalisation may re-order vowels on a 
consonant. This is not simply 'non-traditional' but results in 
incorrect rendering and a different vocalisation of the text. The 
point is that hiriq before patah is *not* canonically equivalent to 
patah before hiriq, except in the erroneous assumption of the 
Unicode Standard: the order of vowels makes words sound different 
and mean different things.

In order to correctly encode and render the Biblical Hebrew text, it 
is necessary to either a) never use normalisation routines that 
re-order marks (which is beyond the control of document authors), or 
b) re-classify the existing Hebrew marks so that all vowels are in a 
single class and will not be re-ordered during normalisation, or c) 
encode new marks for Biblical Hebrew with all vowels in a single 
class.

There are a few other desirable changes to the combining class 
assignments for some Hebrew accents, which make rendering easier and 
are more linguistically logical, but the vowels are the most 
problematic.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco


--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler

John Hudson wrote:

 In Biblical Hebrew, it is possible for more than one vowel to be attached 
 to a single consonant. This means that it is very important to maintain the 
 ordering of vowels applied to a single consonant. The Unicode Standard 
 assigns an individual combining class to every vowel, meaning that NFC 
 normalisation may re-order vowels on a consonant. 

This is true.

 This is not simply 
 'non-traditional' but results in incorrect rendering and a different 
 vocalisation of the text. 

I don't think this is true. 

First, the intent of the (admittedly problematical) fixed position
combining classes was that the position of the relevant marks,
including the relevant Hebrew points, was fixed with respect to
the consonant base letter, so that application of one would not
impact the rendering of application of another. Unlike the
generic above and below combining classes, the general inside-out
positioning rule would not apply to sequences of fixed position
marks.

It may be more *difficult* for applications to do correct rendering,
but there was never any intention in the standard that I know
of that a sequence hiriq, patah would render differently
than a sequence patah, hiriq. And never any intent that it
would represent a different vocalisation of the text.

 The point is that hiriq before patah is *not* 
 canonically equivalent to patah before hiriq,

This is true. 

 except in the erroneous 
 assumption of the Unicode Standard: the order of vowels makes words sound 
 different and mean different things.

This is not. The Unicode Standard makes no assumptions or claims
about what the phonological or meaning equivalence of hiriq, patah
or patah, hiriq is for Biblical Hebrew.

The fact that traditional Biblical Hebrew spelling prefers one
order of representation and canonically ordered Unicode text
specifies the opposite order may be a problem for implementations,
but that problem does not extend to the claims that John is
making here.

 
 In order to correctly encode and render the Biblical Hebrew text, it is 
 necessary to either a) never use normalisation routines that re-order marks 
 (which is beyond the control of document authors), or b) re-classify the 
 existing Hebrew marks so that all vowels are in a single class and will not 
 be re-ordered during normalisation, or c) encode new marks for Biblical 
 Hebrew with all vowels in a single class.

I don't think these conclusions follow from the current
situation.

Such changes are certainly not necessary in order to *render*
Biblical Hebrew text correctly, nor to accurately represent
the content of Biblical Hebrew text.

The current situation is not optimal for implementations, nor
does canonically ordered text follow traditional preferences
for spelling order -- that we can agree on. But I think the
claims of inadequacy for the representation or rendering
of Biblical Hebrew text are overblown.

--Ken

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West

On Wed, 25 Jun 2003 19:47:26 +0400, Valeriy E. Ushakov wrote:

 And given that the two look identical in writing in the first place,
 this lexical difference had a chance to originate exactly *where*?
 You are putting the cart before the horse.

Well, unless the text has been scanned with OCR, a human user will have to enter
Tibetan text manually, and if the user encounters a base consonant with two
different vowel signs joined to it, they will have to make a choice as to which
order the vowel signs are entered.

For example, if the word bcuig (with the letter CA carrying both a shabkyu [u]
and gigu [i] sign) is encountered in a text that is being transcribed into
electronic form, and the user recognises it from its context as a contraction
for bcu gcig (eleven), then it would be natural to enter  b-c-u-i-g 0F56,
0F45, 0F74, 0F72, 0F42. On the other hand, if a syllable (tsheg bar) comprising
the base consonant GA with a shabkyu [u] sign below and a gigu [i] sign above is
encountered (this is a plausible but hypothetical contraction), and the user
recognises this from its context as a contraction for the word gi gu (the name
for the I vowel sign), then it would be natural to enter g-i-u 0F42, 0F72,
0F74, even though when writing it by hand the shabkyu would be written before
the gigu (calligraphic order does not necessarily equate to logical order). In
the one case a base consonant plus shabkyu and gigu is entered as 0FXX, 0F74,
0F72, in the other case as 0FXX, 0F72, 0F74.

Unfortunately it is precisely at this point that my argument starts to crumble,
and I am forced to throw in the towel, and admit defeat.

The key question is, if 0F56, 0F45, 0F74, 0F72, 0F42 (bcuig) gets normalised
to 0F56, 0F45, 0F72, 0F74, 0F42 (bciug), then so what ? Well, so nothing,
unless 0F56, 0F45, 0F74, 0F72, 0F42 (bcuig) is a shared contraction for two
different words, and the order of the U and I distinguishes what the contraction
is. As Tibetan shorthand abbreviations are an informal, non-standardised method
of abbreviating words, it is hypothetically possible that two different scribes
could come up with the same contracted form for two differently spelled words,
but I very much doubt that this would ever happen in reality. If I do find such
a case, I will certainly let this list know, but in the meanwhile I agree that
perhaps it would be more productive to return to Chris's original question,
rather than travel too far down this detour, scenic though it is.
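
The reordering discussed above can be checked with a short Python sketch using the standard `unicodedata` module. The variable names are only illustrative, and the values printed depend on the Unicode data shipped with the Python build, although canonical combining classes are meant to be stable.

```python
import unicodedata

bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"  # BA, CA, shabkyu (u), gigu (i), GA
bciug = "\u0F56\u0F45\u0F72\u0F74\u0F42"  # same letters, gigu before shabkyu

# The vowel sign with the lower combining class is ordered first.
for name, ch in (("gigu", "\u0F72"), ("shabkyu", "\u0F74")):
    print(name, unicodedata.combining(ch))  # gigu 130, shabkyu 132

# Normalising the u-i spelling yields the i-u spelling.
print(unicodedata.normalize("NFC", bcuig) == bciug)  # True

# Both spellings are canonically equivalent: their NFD forms match.
same = unicodedata.normalize("NFD", bcuig) == unicodedata.normalize("NFD", bciug)
print(same)  # True
```

So a process that normalises its input cannot tell which of the two orders was typed.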

Regards,

Andrew

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West

On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote:

 
 Peter asked:
 
  How can things that are visually indistinguishable be lexically different? 
 
 chat (en)
 chat (fr)

And if Unicode reordered vowels in front of consonants, then we wouldn't be able
to distinguish :

chat (en)
chat (fr)
acht (de)

Andrew

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread John Hudson

At 03:29 PM 6/25/2003, Kenneth Whistler wrote:

 This is not simply
 'non-traditional' but results in incorrect rendering and a different
 vocalisation of the text.
I don't think this is true.

First, the intent of the (admittedly problematical) fixed position
combining classes was that the position of the relevant marks,
including the relevant Hebrew points, was fixed with respect to
the consonant base letter, so that application of one would not
impact the rendering of application of another.
This idea of Hebrew vowels as 'fixed' marks is problematical, because in 
Biblical Hebrew they are not fixed: they move relative to additional marks 
(other vowels or cantillation marks).

It may be more *difficult* for applications to do correct rendering,
but there was never any intention in the standard that I know
of that a sequence hiriq, patah would render differently
than a sequence patah, hiriq.
Yes, this is what I am saying is wrong: hiriq, patah *should* render 
differently from patah, hiriq. This example is particularly important, 
because it occurs in the spelling of yerushalaim, the Masoretic 
approximation of yerushalayim. Correct rendering requires that the hiriq 
follows the patah, and not vice versa.

And never any intent that it
would represent a different vocalisation of the text.
Fair enough for modern Hebrew. Fair enough for phonetically accurate 
Hebrew. Not good enough for Biblical Hebrew in which vocalisation reflects 
Masoretic pronunciation applied to ancient consonant structures.

 The point is that hiriq before patah is *not*
 canonically equivalent to patah before hiriq,
This is true.

 except in the erroneous
 assumption of the Unicode Standard: the order of vowels makes words sound
 different and mean different things.
This is not. The Unicode Standard makes no assumptions or claims
about what the phonological or meaning equivalence of hiriq, patah
or patah, hiriq is for Biblical Hebrew.
But it does make assumptions about the canonical equivalence of the mark 
orders U+05B4, U+05B7 and U+05B7, U+05B4, unless my understanding of 
the purpose of combining classes is completely mistaken. My understanding 
is that any ordering of two marks with different combining classes is 
canonically equivalent; further, I understand that some normalisation forms 
will re-order marks to move marks with lower combining class values closer 
to the base character. If the sequence lamed, patah, hiriq, final mem is 
what the text says, normalisation that re-orders the sequence as lamed, 
hiriq, patah, final mem is erroneous.
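
A small Python sketch of the reordering described here, using the standard `unicodedata` module; the constant and function names are illustrative only. It shows what normalisation does to the code point sequence and makes no claim about how either order is rendered.

```python
import unicodedata

LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"

as_written = LAMED + PATAH + HIRIQ + FINAL_MEM

def show(s):
    """List each code point with its canonical combining class."""
    return " ".join(f"U+{ord(c):04X}({unicodedata.combining(c)})" for c in s)

print("input", show(as_written))
# input U+05DC(0) U+05B7(17) U+05B4(14) U+05DD(0)

for form in ("NFC", "NFD"):
    print(form, show(unicodedata.normalize(form, as_written)))
# Both forms give: U+05DC(0) U+05B4(14) U+05B7(17) U+05DD(0)
```

Hiriq (class 14) is moved in front of patah (class 17) in both forms, so the order in which the two vowels were entered is not preserved.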

The fact that traditional Biblical Hebrew spelling prefers one
order of representation and canonically ordered Unicode text
specifies the opposite order may be a problem for implementations,
but that problem does not extend to the claims that John is
making here.
This isn't a problem for implementations. This is a problem of Unicode 
canonical ordering re-ordering marks whose order is lexically significant. 
The fact that, in some cases, the canonical ordering also cannot be 
rendered with existing implementations simply makes the problem visually 
obvious.

 In order to correctly encode and render the Biblical Hebrew text, it is
 necessary to either a) never use normalisation routines that re-order marks
 (which is beyond the control of document authors), or b) re-classify the
 existing Hebrew marks so that all vowels are in a single class and will not
 be re-ordered during normalisation, or c) encode new marks for Biblical
 Hebrew with all vowels in a single class.

I don't think these conclusions follow from the current
situation.
Such changes are certainly not necessary in order to *render*
Biblical Hebrew text correctly, nor to accurately represent
the content of Biblical Hebrew text.
They are necessary to render Biblical Hebrew text correctly using current 
font and layout engine technologies. These technologies work perfectly for 
Biblical Hebrew so long as Unicode canonical ordering is ignored. I think 
there is very little impetus to change or develop new implementations to 
take into account what strikes most of those involved with Biblical Hebrew 
text processing as an error in Unicode.

The current situation is not optimal for implementations, nor
does canonically ordered text follow traditional preferences
for spelling order -- that we can agree on. But I think the
claims of inadequacy for the representation or rendering
of Biblical Hebrew text are overblown.
I've spent nine months working on Biblical Hebrew rendering for the major 
user community (the Society of Biblical Literature and their Font 
Foundation partners), and their take on this is that a) they want a 
solution that works with today's technology, and b) they will avoid Unicode 
canonical ordering like the plague and use custom normalisations instead. 
When we conducted normalisation tests, switching from Unicode normalisation 
of  to a custom normalisation that does not re-order vowels or meteg*, we

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Philippe Verdy

On Thursday, June 26, 2003 1:04 AM, Andrew C. West [EMAIL PROTECTED] wrote:

 On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote:
 
  
  Peter asked:
  
   How can things that are visually indistinguishable be lexically
   different? 
  
  chat (en)
  chat (fr)
 
 And if Unicode reordered vowels in front of consonants, then we
 wouldn't be able to distinguish :
 
 chat (en)
 chat (fr)
 acht (de)
 
 Andrew

Such a distinction by language is futile: you are trying to add a language-specific lexical 
meaning that simply does not exist in Unicode, which only standardizes the *script* so 
that it *can* be rendered correctly independently of the actual language...

So you need to assume a unique language when interpreting an encoded string, but this 
is out of scope for Unicode (which at best will define language-dependent character 
properties, but not language-dependent canonical equivalences).

When Unicode defines such canonical equivalence, the contract must be *only* based on 
the rendered text: if the text is rendered identically so that it becomes impossible 
to determine which order was used to encode it in abstract character sequences, then 
all these orders should be made canonically equivalent.

The only exception is for abstract character properties, which MUST be language 
independent for normative properties (the only exception is character transformations 
such as case mappings, which change the semantic of the text) but need sometimes to be 
distinct for correct processing in the rendering process (for example the Mathematics 
Symbol category and the Letter category, as they influence the layout in actual 
renderers, notably for the choice of font styles or point sizes or alignment, or 
extraction of entities sharing a common set of properties, such as breaking rules that 
also influence the correct rendering of text in variable display environments with 
different capabilities).

Labelling the text with extra information such as language or word semantics or 
phonetic values is not part of the Unicode standard. The Unicode standard stops at the 
point where a text *can* be rendered with its original semantics, and this excludes 
all phonological, phonetic, or logical ordering analysis that can be made 
equivalently on the rendered text or on the encoded text.

-- Philippe.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Christopher John Fynn




Valeriy E. Ushakov [EMAIL PROTECTED] wrote:

 A sample list of dbu can contractions from Schmidt grammar:


http://snark.ptc.spbu.ru/~uwe/tibex/contractions/contractions.html

When these combinations are written in dbu-can script, as they
are here, the problem may not look too bad. However, in
semi-cursive and cursive forms of Tibetan script subjoined
vowels are completely connected with the preceding consonants,
and the combination of consonant(s) + subjoined vowel(s) needs to
be implemented in a font as a single ligature, while the above
headline vowel(s) can still be a separate combining glyph.
Hence it is important to have subjoined vowel signs ordered
before those which can occur above the stack.

- Chris

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler

John Hudson wrote:

 This idea of Hebrew vowels as 'fixed' marks is problematical, because in 
 Biblical Hebrew they are not fixed: they move relative to additional marks 
 (other vowels or cantillation marks).
 
 It may be more *difficult* for applications to do correct rendering,
 but there was never any intention in the standard that I know
 of that a sequence hiriq, patah would render differently
 than a sequence patah, hiriq.
 
 Yes, this is what I am saying is wrong: hiriq, patah *should* render 
 differently from patah, hiriq. This example is particularly important, 
 because it occurs in the spelling of yerushalaim, the Masoretic 
 approximation of yerushalayim. Correct rendering requires that the hiriq 
 follows the patah, and not vice versa.

Understood. See my separate response on the Biblical Hebrew thread.

 They are necessary to render Biblical Hebrew text correctly using current 
 font and layout engine technologies. These technologies work perfectly for 
 Biblical Hebrew so long as Unicode canonical ordering is ignored. I think 
 there is very little impetus to change or develop new implementations to 
 take into account what strikes most of those involved with Biblical Hebrew 
 text processing as an error in Unicode.

"So long as Unicode canonical ordering is ignored." But as you
and Peter point out, you cannot actually ignore canonical
ordering, since in the Internet context it is outside of
the end user's control. Once text escapes your own system
for interchange, it may be subject to normalization, and you
are kaputt.

As stated, this is also turning into a typical--dare I say, religious--
confrontation of "I'm right and you're wrong" with no compromise
in prospect and people getting ready to shoot themselves in the
foot to prove they are right.

You say there is little impetus to change or develop new implementations,
and yet the very solutions being proposed, e.g., by Peter, would
force reencoding of all the Biblical Hebrew text to work at all,
and would, ipso facto, require new implementations and new fonts
to work right.

The alternative I suggested, of agreeing on a text representational
convention of vowel, ZWJ, vowel for those instances of sequences
which should not reorder could be implemented *now* with
existing characters, and only minor extensions to the fonts and
to keyboard methods. Any existing corpus could be updated
en masse (and more easily than switching over to Peter's scheme),
or incrementally, as appropriate.

The other alternative that some seem to prefer: just change the
combining classes and be done with it -- is *not* going to
happen. It would fly in the face of politically committed
stability guarantees by the UTC and required by the IETF and
W3C. An inconvenience for Biblical Hebrew implementations is
not going to outweigh that, for any of the committees involved.
And even, if by some miracle, it *were* to happen, you would
also be awaiting the rollout of new implementations, since
you'd have to wait through the chaotic transition while everyone
updated their normalization algorithms.

Just picking up the marbles and going home isn't an option,
either. As you indicate, "so long as Unicode canonical ordering 
is ignored" the existing layout technologies work just fine.
So address the problem with an appropriate fix. Insert a
ZWJ (for instance) at the point where the canonical reordering
needs to be blocked on a vowel sequence, and you are then in
a situation where even though you are not ignoring canonical
ordering (which in distributed systems you cannot), you
end up preserving the order you need, anyway.
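
The effect of such a blocker can be checked with a short sketch. It assumes Python's standard unicodedata module and picks lamed (U+05DC) as an arbitrary base consonant for illustration; the names used are hypothetical, not part of any proposal in this thread. Patah (U+05B7) has combining class 17 and hiriq (U+05B4) has class 14, so a plain patah, hiriq sequence is reordered, while ZWJ (U+200D, class 0) between them keeps the typed order.

```python
import unicodedata

LAMED = "\u05DC"  # base consonant, chosen only for illustration
HIRIQ = "\u05B4"  # canonical combining class 14
PATAH = "\u05B7"  # canonical combining class 17
ZWJ = "\u200D"    # combining class 0, so marks cannot reorder across it


def codepoints(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


plain = LAMED + PATAH + HIRIQ          # typed order: patah, then hiriq
blocked = LAMED + PATAH + ZWJ + HIRIQ  # same, with ZWJ between the vowels

plain_nfc = unicodedata.normalize("NFC", plain)
blocked_nfc = unicodedata.normalize("NFC", blocked)

print("plain   :", codepoints(plain), "->", codepoints(plain_nfc))
print("blocked :", codepoints(blocked), "->", codepoints(blocked_nfc))
```

Whether fonts and layout engines then render the ZWJ sequence as desired is a separate question that this sketch does not address.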

 I've spent nine months working on Biblical Hebrew rendering for the major 
 user community (the Society of Biblical Literature and their Font 
 Foundation partners), and their take on this is that a) they want a 
 solution that works with today's technology, and b) they will avoid Unicode 
 canonical ordering like the plague and use custom normalisations instead. 

And how is implementing a custom normalization not a matter of
developing a new implementation? It doesn't even begin to
deal with the problem of what happens if the text escapes out
into the Internet context, which won't be using the same
custom normalization.

Implementing a custom text representational convention seems
like a much more straightforward task to me.

 When we conducted normalisation tests, switching from Unicode normalisation 
 of  to a custom normalisation that does not re-order vowels or meteg*, we 
 increased the number of unique consonant + mark(s) sequences encoded in the 
 Old Testament text by more than 340. This means that Unicode normalisation was 
 creating 340 textual ambiguities by treating lexically distinct sequences 
 as canonically equivalent. I don't think that kind of textual ambiguity is 
 'overblown'.

Introduce a canonical reordering blocker (cc=0) into the textual
sequences which get ordered in ways that lead to textual ambiguities,
and the textual ambiguities should

Re: Major Defect in Combining Classes of Tibetan Vowels: Illustration

2003-06-25 Thread Christopher John Fynn

Difficulties due to the present combining class values attached
to these characters most frequently occur with
abbreviations/contractions and/or with cursive scripts.  With
abbreviations it is common to have two or more vowels on a
consonant stack. In cursive or semi-cursive forms of Tibetan
script the subjoined vowels 0F71, 0F74 and 0F75 form ligatures
with the consonant(s) in the stack, while above headline
vowel(s) such as U+0F72, U+0F7A and U+0F7C sometimes form a
ligature with the following consonant or punctuation mark.

 In Dzongkha (Bhutanese) abbreviated spellings are often the
usual way of writing words and a semi-cursive form of Tibetan
script (Joyig) is standard - so the problem frequently occurs.
I have a 225 page dictionary, and several other lists, of common
abbreviations which are full of examples where this problem
occurs.

I've attached a couple of real and fairly simple examples.

Example 1

Following normal orthographic rules the characters to produce
Example1_gtuig.jpg  would  be entered as:

U+0F42 U+0F4F U+0F74 U+0F72 U+0F42

If the characters remain in that order there is no problem -

the first U+0F42 is straightforward: the isolated character is
displayed as a simple glyph uni0F42
the sequence U+0F4F U+0F74 is replaced by a ligature
uni0F4F0F74
U+0F72  U+0F42 is replaced by a ligature uni0F720F42

Now if the text goes through a normalisation process the same
text ends up reordered as:
U+0F42 U+0F4F U+0F72 U+0F74  U+0F42
because the combining class value of U+0F72 is less than that of
U+0F74.

To render this there is no change for the first character, but I
now need a lookup to render the whole sequence
U+0F4F U+0F72 U+0F74 U+0F42 with the two glyphs
uni0F4F0F74 uni0F720F42.
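
The reordering in Example 1 can be reproduced with a minimal sketch, assuming Python's standard unicodedata module (the helper name is hypothetical). NFD is used here so that only canonical ordering is exercised.

```python
import unicodedata


def codepoints(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Orthographic (typed) order from Example 1: shabkyu before gigu.
typed = "\u0F42\u0F4F\u0F74\u0F72\u0F42"
normalized = unicodedata.normalize("NFD", typed)

print("typed      :", codepoints(typed))
print("normalized :", codepoints(normalized))
print("ccc(U+0F72) =", unicodedata.combining("\u0F72"))
print("ccc(U+0F74) =", unicodedata.combining("\u0F74"))
```

The normalized string has U+0F72 ahead of U+0F74, which is the order a font lookup would then have to match.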

Example 2

Following normal orthographic rules the characters to produce
Example1_gtuop.jpg  would  be entered as:

U+0F42 U+0F4F U+0F74 U+0F7C U+0F54

If the characters remain in that order there is no problem -

the first U+0F42 is as in the first example
the sequence U+0F4F U+0F74 is replaced by a ligature
uni0F4F0F74
U+0F7C  U+0F54 is replaced by a ligature uni0F7C0F54

However, since the combining class value of U+0F7C is less than
that of U+0F74, after a normalisation process the same text ends
up reordered as:
U+0F42 U+0F4F U+0F7C U+0F74 U+0F54

and the whole sequence
U+0F4F U+0F7C U+0F74 U+0F54 needs to be replaced with the
two glyphs uni0F4F0F74 uni0F7C0F54.


Example 3 - (Example3_aMi-aiM.jpg)
==

This is taken from an entirely different source, the TibetBT
font, which was specially created for a project in Sichuan
digitising the Tibetan bstan-'gyur (a vast canonical collection
of texts in over 200 large volumes originally translated
from Sanskrit into Tibetan). The glyph set of the font is the
same as the set of Tibetan stacks found in that collection.
All stacks including any combining vowels are implemented as
precomposed ligatures. This font can be downloaded from
(though it is wrapped-up in a Windows setup.exe file).

Here we have two stacks which one would naturally enter as
U+0F68 U+0F7E U+0F72 and U+0F68 U+0F72 U+0F7E respectively. No
problem so long as the characters remain in that order. However,
since U+0F72 has a combining class value greater than that of
U+0F7E, in a process of normalisation U+0F72 would always
float to the end and both strings would end up as U+0F68 U+0F7E
U+0F72 and be indistinguishable.

If there were only a small and fixed number of cases like the
first two examples it would not be *much* of a problem to add
the extra lookups - even though my font would need both many to
one and many to many lookups to handle it. But there are
*numerous* cases I already know of and there is no fixed and
final list of such abbreviations. So I should really build the
tables in my font to be able to handle almost any possibility.
If the combining classes of vowel marks were based on the
expected order, where subjoined vowels are always written before
any above headline vowels, this would be reasonably
straightforward to do - but given the order in which they may now
wind up after normalisation it requires adding a huge number of
complex lookups to the tables in my font. Once I've done this it is
going to be very difficult to test all the permutations.
Because of the number of additional lookups I need it is also
likely there will be a hefty performance hit - especially on
reflowing large documents. Unfortunately the third example
can't simply be fixed by font lookups since two distinct
combinations wind up being identical and hence would have to be
rendered identically.

If I wrote a piece of software where values I'd assigned caused
problems and inefficiencies like this, I'd count it as a major
fault or bug and hurry to fix it by assigning the correct
values. I know the Tibetan characters were discussed in great
detail by a number of experts at the time they were encoded -
however there was little or no substantial discussion

Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-25 Thread Jony Rosenne

When, in the Bible, one sees two vowels on a given consonant, it isn't so.
There is one vowel for the consonant one sees, and another vowel for an
invisible consonant. The proper way to encode it is to use some code to
represent the invisible consonant. Then the problem mentioned below does not
arise.

For example, the word Jerusalem is often spelled without the Yod, to which
the Hiriq belongs.

Jony

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson
 Sent: Wednesday, June 25, 2003 11:21 PM
 To: John Cowan
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: SPAM: Re: Major Defect in Combining Classes of Tibetan Vowels
 
 
 At 01:15 PM 6/25/2003, John Cowan wrote:
 
 I don't understand how the current implementation breaks BH text. At
 worst, normalization may put various combining marks in a
 non-traditional order, but all alternative orders are canonically
 equivalent anyway, and no (ordinary) Unicode process should depend on
 any specific order.
 
 In Biblical Hebrew, it is possible for more than one vowel to be attached
 to a single consonant. This means that it is very important to maintain the
 ordering of vowels applied to a single consonant. The Unicode Standard
 assigns an individual combining class to every vowel, meaning that NFC
 normalisation may re-order vowels on a consonant. This is not simply
 'non-traditional' but results in incorrect rendering and a different
 vocalisation of the text. The point is that hiriq before patah is *not*
 canonically equivalent to patah before hiriq, except in the erroneous
 assumption of the Unicode Standard: the order of vowels makes words sound
 different and mean different things.
 
 In order to correctly encode and render the Biblical Hebrew text, it is
 necessary to either a) never use normalisation routines that re-order marks
 (which is beyond the control of document authors), or b) re-classify the
 existing Hebrew marks so that all vowels are in a single class and will not
 be re-ordered during normalisation, or c) encode new marks for Biblical
 Hebrew with all vowels in a single class.
 
 There are a few other desirable changes to the combining class assignments
 for some Hebrew accents, which make rendering easier and are more
 linguistically logical, but the vowels are the most problematic.
 
 John Hudson
 
 Tiro Typeworks    www.tiro.com
 Vancouver, BC [EMAIL PROTECTED]
 
 If you browse in the shelves that, in American bookstores,
 are labeled New Age, you can find there even Saint Augustine, 
 who, as far as I know, was not a fascist. But combining Saint 
 Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
  
 - Umberto Eco

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-24 Thread Kenneth Whistler

Chris Fynn wrote:

 In Unicode's UnicodeData.txt   (
  http://www.unicode.org/Public/UNIDATA/UnicodeData.txt )
  0F7E has a Canonical Combining Class Value (CCCV) of 0;
  0F71 a CCCV of 129;
  0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130;
  0F74 a CCCV of 132;
  and 0F82 and 0F83 have a CCCV of 230.
 
  By normal Tibetan & Dzongkha spelling, writing, and input rules
  Tibetan script stacks should be entered and written: 1 headline
  consonant (0F40-0F6A), any  subjoined consonant(s) (0F90-
  0F9C),  achung (0F71), shabkyu (0F74), any above headline
  vowel(s) (0F72 0F7A 0F7B 0F7C 0F7D and 0F80) ; any ngaro (0F7E,
  0F82 and 0F83)
 
  So following normal Tibetan & Dzongkha input and spelling rules
  the relative ordering of these characters should be:
  A.  0F71
  B.  0F74
  C.  0F72 0F7A 0F7B 0F7C 0F7D and 0F80
  D.  0F7E,  0F82 and 0F83
 
  The fact that, in a process of canonical decomposition or
  normalisation,  these combining characters can get reordered
  in a bizarre order relative to each other 

Actually, looking at this data, while I can see that the
combining classes are assigned less than optimally, I don't
see that this makes any practical problem for Tibetan data.

You are saying, in effect, that the stack structure has
the following position classes (treating the consonant stack
itself as the more tightly bound unit that I will just
symbolize as CS):

   CS - achung - shabkyu - vowelsabove - ngaro
   
And since shabkyu has cc=132 whereas the vowelsabove have
cc=130, they would reorder out of expected order if
normalized. However, for most text the shabkyu (u-below)
would be in complementary distribution with the vowels
above, so the effective positional classes are:

                 { vowelsabove }
   CS - achung - { shabkyu     } - ngaro
   
And in this case, the relative combining class of the vowels
doesn't really matter, since we wouldn't be seeing both
present to reorder around each other.
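
For reference, the combining classes under discussion can be read directly from the character database. This is a small sketch assuming Python's standard unicodedata module; the grouping labels follow the position classes named in this thread, and the variable names are hypothetical.

```python
import unicodedata

# Position classes as described in the thread, in orthographic order.
POSITION_CLASSES = [
    ("achung", [0x0F71]),
    ("shabkyu", [0x0F74]),
    ("vowels above", [0x0F72, 0x0F7A, 0x0F7B, 0x0F7C, 0x0F7D, 0x0F80]),
    ("ngaro", [0x0F7E, 0x0F82, 0x0F83]),
]

# Map each code point to its canonical combining class.
classes = {
    cp: unicodedata.combining(chr(cp))
    for _, cps in POSITION_CLASSES
    for cp in cps
}

for label, cps in POSITION_CLASSES:
    for cp in cps:
        print(f"{label:13} U+{cp:04X}  ccc={classes[cp]}")
```

The listing makes the mismatch visible: shabkyu (132) sorts after the vowels above (130) although it is written before them.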

I'm guessing that you are claiming there are instances where
the shabkyu does cooccur with other vowels above as well.
Wouldn't those, if they do occur, represent a distinctly
minority case in terms of the overall processing? The short
summaries of Tibetan writing that I've seen don't even mention
it as a possibility, since even the few diphthongs in -u
are written with a separate stack 0F60, 0F74 to the
right of the main stack.

  causes difficulties
  with culturally correct collation (where  0F7E,  0F82 and 0F83
  should have an equal value) - and especially it necessitates
  making lookups in smart fonts far more complex and inefficient
  than they should have to be.

And I'm not seeing the problem here, either. Since the
combining class of 0F7E is 0, and not some other random
value, it isn't going to reorder around the other vowel
marks. If it is entered in the traditional spelling order you
have indicated, then it is going to stay in that position;
normalization won't move it. And since the equivalent
0F82 and 0F83 sift to the end of the syllable, with their
high combining class, they'll end up in the same position
as the 0F7E ngaro if normalized.

The only problem you'd have is with Tibetan data where a
0F7E ngaro is entered in other than the optimal spelling
order you indicated. Such a sequence won't compare equal
unless you add a spelling equivalence rule on top of the
canonical equivalence. But there are a number of such edge
cases for Brahmic scripts -- not just Tibetan.
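
The edge case can be made concrete with a sketch, assuming Python's standard unicodedata and itertools modules. The stack used (U+0F67 with achung, shabkyu and a ngaro) and the function name are chosen for illustration only. It normalizes every ordering of the three marks and counts how many distinct strings survive.

```python
import itertools
import unicodedata

BASE = "\u0F67"  # base consonant, chosen for illustration


def distinct_after_nfd(marks):
    """Normalize every ordering of the marks after BASE and
    return the distinct results in sorted order."""
    forms = {
        unicodedata.normalize("NFD", BASE + "".join(order))
        for order in itertools.permutations(marks)
    }
    return sorted(forms)


# Ngaro written with U+0F7E (combining class 0): it never moves, so
# orderings that differ in where it was typed stay distinct.
with_0f7e = distinct_after_nfd("\u0F71\u0F74\u0F7E")

# Ngaro written with U+0F82 (combining class 230): it sifts to the end.
with_0f82 = distinct_after_nfd("\u0F71\u0F74\u0F82")

print("distinct forms with U+0F7E:", len(with_0f7e))
print("distinct forms with U+0F82:", len(with_0f82))
```

Only input that departs from the traditional spelling order produces the extra forms, which is why a spelling equivalence rule on top of canonical equivalence would be needed to compare them as equal.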

Culturally correct collation is first a matter of giving
the three ngaro characters equivalent weights. Beyond that,
as you indicated, the weighting of the syllables (or stacks)
is complicated, and isn't going to be affected by 0F7E
having combining class 0 in any case.

 
  (In Tibetan script  fonts 0F71 and 0F74 are often  ligated with
  preceding consonant (+ subjoined consonants) combined as a
  single glyph whereas above headline vowels are almost always
  treated as non spacing combining marks.)

Yes, but the only point where this would be a problem would
be for stacks with a shabkyu (u vowel) *and* another vowel.
And even for such cases, wouldn't this be handled effectively
by 6 triples in the ligature tables which would identify
any shabkyu moved after one of the other 6 vowels?

 
  Currently there seems to be no easy or standardized work around
  for these problems and the standard seems to say that the
  relative values of assigned Canonical Combining Class Values
  cannot be changed.

They cannot.

  Any suggestions as to how to create a standardized work around
  for these incorrect values?

I guess I'm not getting it. I don't see the need for a
standardized work around, here. 

--Ken

 
  - Chris

Re: Major Defect in Combining classes of Tibetan Vowels

2003-06-21 Thread Philippe Verdy

From: Christopher John Fynn [EMAIL PROTECTED]
 So following normal Tibetan & Dzongkha input and spelling rules
 the relative ordering of these characters should be:
 A.  0F71 (CCV=129)
 B.  0F74 (CCV=132)
 C.  0F72, 0F7A, 0F7B, 0F7C, 0F7D, 0F80 (CCV=130)
 D.  0F7E, (CCV=0) 0F82, 0F83 (CCV=230)

Apart from defining a UCA-based decomposition, there does not
seem to be an easy solution. This would require preprocessing
of text similar to what is done for Arabic or Brahmic script
layout processing (where ligature and character or subglyph
reordering is performed before looking up glyphs and ligatures
in fonts). On Windows, it would require using UniScribe, but
for collation, changes are still possible, because the UCA
order can still be modified to document these reordering rules.
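
One possible shape for such a preprocessing step is sketched below. It is an illustration only, not something any layout engine is known to do: the function name and the rank table are hypothetical. It puts runs of achung, shabkyu and above-headline vowels back into orthographic order before glyph lookup, and deliberately leaves the ngaro characters untouched so that sequences differing only in ngaro position are not merged.

```python
import re

# Hypothetical orthographic rank: achung, then shabkyu, then vowels above.
ORTHOGRAPHIC_RANK = {"\u0F71": 0, "\u0F74": 1}
for vowel_above in "\u0F72\u0F7A\u0F7B\u0F7C\u0F7D\u0F80":
    ORTHOGRAPHIC_RANK[vowel_above] = 2

# A run of two or more of the marks listed above.
VOWEL_RUN = re.compile("[" + "".join(ORTHOGRAPHIC_RANK) + "]{2,}")


def to_orthographic_order(text):
    """Stable-sort each run of vowel marks by orthographic rank."""
    def reorder(match):
        return "".join(sorted(match.group(), key=ORTHOGRAPHIC_RANK.__getitem__))
    return VOWEL_RUN.sub(reorder, text)


# Normalized form of a stack with gigu before shabkyu, restored to typed order.
restored = to_orthographic_order("\u0F42\u0F4F\u0F72\u0F74\u0F42")
print(" ".join(f"U+{ord(ch):04X}" for ch in restored))
```

Because the sort is stable, marks of the same rank keep their relative order, so text already in orthographic order passes through unchanged.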

Re: Major Defect in Combining classes of Tibetan Vowels

2003-06-21 Thread Christopher John Fynn


Philippe

By relative ordering I did not mean relative collation weights
but  the order in which these combining characters are usually
entered relative to other characters and each other  - and the
order relative to each other in which they should be stored in a
string. The current CCCV weights for these characters mean that
they can end up in a bizarre order which makes no sense, serves
no useful purpose and complicates rendering and collation.

The only thing I did mention specifically about collation is
that 0F7E, 0F82 and 0F83 should generally be treated as equivalent
for collation purposes. Culturally correct collation rules for
Dzongkha and Tibetan are *very* complex when compared with those
for any other language I know of and I don't want to get into
all that here.

- Chris
