Re: Serious problems with Arabic

2001-01-19 Thread Roozbeh Pournader


Dear Kenneth,

Due to some problems with Unicode Arabic behaviour, which I posted on the
mailing list in November, and using your guidance, I'm preparing a 
suggestion for UTC.

I think I know what I should suggest for the shaping issues, but not for the
following problem, which I am attaching below as a reminder.

Do you think a proposal for changing the decomposition of U+06C0 to my
suggestion, but without the ZWNJ, may have a chance? The current
decomposition is really a bug, even in semantics. The semantics is really
a Heh plus a Hamza Above. The current decomposition was possibly chosen
only because of the glyph shape in the charts...

--roozbeh

On Tue, 21 Nov 2000, Kenneth Whistler wrote:

  My suggestion would be decomposing U+06C0 to
   
  U+0647 U+0654 U+200C
  ARABIC LETTER HEH ARABIC HAMZA ABOVE ZERO WIDTH NON-JOINER
   
  which seems to be the only solution for this. I again insist that this
  case appears really frequently in Persian, where HEH WITH YEH ABOVE is
  very common.
 
 Changing decompositions like this -- particularly to include a ZWNJ --
 is not going to be possible, because of the implications for
 normalization.
 
 Instead, the feasible way forward here is to write explicit exceptions
 for Arabic shaping rules, to account for instances such as this one.
 The shaping rules, unlike the decompositions, are not bound by
 ironclad guarantees of no further changes.




Serious problems with Arabic

2000-11-16 Thread Roozbeh Pournader


Dear All,
 
I have serious problems with Unicode Arabic. The main problem is with the
Arabic shaping rules in TUS 3.0, pages 192--197. I think these should be
changed in some suggested ways. Would someone please guide me on how
I should prepare an official suggestion?
 
1. "Bidi and Cursive Joining". Page 192 mentions:
 
"An implementation may choose to restate the following rules
 according to logical order so as to apply before the bidirectional
 algorithm's reordering phase. In this case, the words right and
 left as used in this section would become preceding and
 following."
 
But the effect is not the same! Consider the sequence
 
U+0628 U+202D U+0627 U+0631 U+202C
BEH    LRO    ALEF   REH    PDF
 
If you apply bidi to this, you'll obtain
 
ALEF REH BEH
 
which will then become
 
ALEF-isolated REH-final BEH-initial
 
after cursive joining. But now reverse the order: first apply joining
and then bidi. Keep in mind that LRO is transparent with regard to joining
(page 192, table 8-2 lists all format marks as transparent; RLM is
included as an example, so we can deduce that by format marks TUS means the
characters in the general category Cf, "Other, Format"). First you'll have
 
BEH-initial LRO ALEF-final REH-isolated PDF
 
and after bidi,
 
ALEF-final REH-isolated BEH-initial
 
The former case is unacceptable because BEH and REH, which are not adjacent
in logical order (the order in which one reads the text aloud), have joined
together, and a reader cannot tell that they were not adjacent. The latter
form is also unacceptable, since you have a final ALEF that joins to nothing
(you have not requested this, because you have not included any ZWJ
in the text). This case may well occur in Arabic-enabled editors when the
user is playing with the text, and both solutions are problematic.
UAX #9, in Reordering Resolved Levels, recommends the latter case.
 
My suggestion is to make the five controls RLE, LRE, RLO, LRO, and PDF
non-joining rather than transparent, which will solve the problem. First,
when someone uses the explicit marks, he wants to render the text in
different levels; second, applications may then apply joining either before
or after bidi (they should consider the Retaining Format Codes part of
UAX #9 if they want to do joining after bidi).
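
The two orderings discussed above can be reproduced with a small sketch.
It is only an illustration under stated assumptions, not a real shaping or
bidi implementation: the joining table covers just the characters of the
example, the paragraph is assumed to be right-to-left, reorder() handles a
single LRO..PDF span and drops the controls, and the names shape, reorder,
JOINING and PROPOSED are all hypothetical. PROPOSED models the suggestion of
treating LRO and PDF as non-joining, and is applied here only to the
joining-before-bidi order.

```python
# Joining classes for the characters in the example (D = dual-joining,
# R = right-joining, T = transparent, U = non-joining).
JOINING = {"BEH": "D", "ALEF": "R", "REH": "R", "LRO": "T", "PDF": "T"}
CONTROLS = {"LRO", "PDF"}
LOGICAL = ["BEH", "LRO", "ALEF", "REH", "PDF"]
FORMS = {(True, True): "medial", (True, False): "final",
         (False, True): "initial", (False, False): "isolated"}


def shape(names, joining):
    """Assign joining forms; names[0] is the first character read."""
    def neighbour(candidates):
        # Class of the nearest character that is not transparent.
        return next((joining[n] for n in candidates if joining[n] != "T"), None)

    shaped = []
    for i, name in enumerate(names):
        if name in CONTROLS:
            shaped.append((name, None))  # controls have no glyph form
            continue
        cls = joining[name]
        joins_prev = cls in ("D", "R") and neighbour(reversed(names[:i])) == "D"
        joins_next = cls == "D" and neighbour(names[i + 1:]) in ("D", "R")
        shaped.append((name, FORMS[joins_prev, joins_next]))
    return shaped


def reorder(items):
    """Left-to-right visual order of (name, form) pairs, RTL paragraph."""
    level, line = 1, []
    for name, form in items:
        if name == "LRO":
            level = 2
        elif name == "PDF":
            level = 1
        else:
            line.append((level, name, form))
    # Reverse runs at level >= 2, then the whole line (level >= 1).
    for lvl in (2, 1):
        i = 0
        while i < len(line):
            j = i
            while j < len(line) and line[j][0] >= lvl:
                j += 1
            line[i:j] = reversed(line[i:j])
            i = max(j, i + 1)
    return [(name, form) for _, name, form in line]


def bidi_then_join(joining=JOINING):
    visual = [name for name, _ in reorder([(n, None) for n in LOGICAL])]
    # Visual order is left to right; Arabic joins starting from the right.
    return shape(visual[::-1], joining)[::-1]


def join_then_bidi(joining=JOINING):
    return reorder(shape(LOGICAL, joining))


# The suggestion: LRO and PDF become non-joining instead of transparent.
PROPOSED = dict(JOINING, LRO="U", PDF="U")

print(bidi_then_join())          # REH joins BEH across the override
print(join_then_bidi())          # ALEF is final but joins to nothing
print(join_then_bidi(PROPOSED))  # no joining across the controls
```

With the transparent controls the two functions disagree on ALEF and REH,
which is the inconsistency described above; with PROPOSED, joining before
bidi no longer joins anything across the override boundary.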
 
2. "Transparency of Canonical Decomposition". The standard claims
transparency with respect to canonical decomposition: the text should have
the same behaviour if it is decomposed. But this is not true regarding
the shaping of U+06C0, ARABIC LETTER HEH WITH YEH ABOVE. It decomposes to
U+06D5 U+0654, which is ARABIC LETTER AE + ARABIC HAMZA ABOVE, while
HEH WITH YEH ABOVE is in the right-joining class and AE is in the
non-joining class. This will create problems, for example, with normal
Persian texts using the HEH WITH YEH ABOVE. If one has the very common
 
   KHAH ALEF NOON HEH WITH YEH ABOVE
 
(I'll follow the logical order, bidi is of no importance here), and then
shapes that, he will get
 
   KHAH-initial ALEF-final NOON-initial HEH WITH YEH ABOVE-final
 
but if he decomposes that and then applies the shaping, he will get
 
   KHAH ALEF NOON AE HAMZA ABOVE
 
and then
 
   KHAH-initial ALEF-final NOON-isolated AE-isolated HAMZA ABOVE
 
The last two are visually equal to HEH WITH YEH ABOVE-isolated. You can
see the difference between the shaping of NOON and AE. This is unbearable.
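
The mismatch can be checked with a short sketch. The decomposition comes
from Python's standard unicodedata module; everything else is an assumption
made for illustration: the joining classes are hard-coded as stated in this
post (later versions of the standard may classify these characters
differently), and forms() is a hypothetical, heavily simplified joining
routine, not a real shaping engine.

```python
import unicodedata

# Joining classes as stated in the post (D = dual-joining, R = right-joining,
# U = non-joining, T = transparent). Hard-coded on purpose: this is an
# assumption for the example, not data read from the Unicode database.
JOINING = {
    "\u062E": "D",  # KHAH
    "\u0627": "R",  # ALEF
    "\u0646": "D",  # NOON
    "\u06C0": "R",  # HEH WITH YEH ABOVE
    "\u06D5": "U",  # AE
    "\u0654": "T",  # HAMZA ABOVE (combining mark)
}
FORMS = {(True, True): "medial", (True, False): "final",
         (False, True): "initial", (False, False): "isolated"}


def forms(text):
    """Joining form of each non-transparent character, in logical order."""
    def neighbour(candidates):
        # Class of the nearest character that is not transparent.
        return next((JOINING[c] for c in candidates if JOINING[c] != "T"), None)

    result = []
    for i, ch in enumerate(text):
        cls = JOINING[ch]
        if cls == "T":
            continue  # marks take no joining form of their own
        joins_prev = cls in ("D", "R") and neighbour(reversed(text[:i])) == "D"
        joins_next = cls == "D" and neighbour(text[i + 1:]) in ("D", "R")
        result.append(FORMS[joins_prev, joins_next])
    return result


composed = "\u062E\u0627\u0646\u06C0"  # KHAH ALEF NOON HEH WITH YEH ABOVE
decomposed = unicodedata.normalize("NFD", composed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(forms(composed))    # NOON is initial, HEH WITH YEH ABOVE is final
print(forms(decomposed))  # NOON and AE are both isolated
```

The only difference between the two inputs is canonical decomposition, yet
NOON changes from initial to isolated under the classes assumed here.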
 
My suggestion would be decomposing U+06C0 to
 
U+0647 U+0654 U+200C
ARABIC LETTER HEH ARABIC HAMZA ABOVE ZERO WIDTH NON-JOINER
 
which seems to be the only solution for this. I again insist that this
case appears really frequently in Persian, where HEH WITH YEH ABOVE is
very common.