Proposed Update UAX #9, Unicode Bidirectional Algorithm

CE Whitehead Sat, 19 Jan 2013 10:24:01 -0800



Hi, I am commenting on Marcin Grzegorczyk's comments here; I also have one
comment on Philippe Verdy's comments, which follow Marcin's on the feedback
page (http://www.unicode.org/review/pri232/). I hope it's clear who is
speaking below.





Date/Time: Thu Dec 13 17:39:27 CST 2012

Contact: [email protected]

Name: Marcin Grzegorczyk

Report Type: Public Review Issue 

Opt Subject: Feedback on PRI #232 



> In addition to Aharon Lanin’s comments, I would like to point out that the
> term “external neighbor” in the proposed rule N0 is ambiguous without a
> definition. It could mean either the adjacent character, or the nearest
> strong type; and in both cases sos/eos may be included or not.


> Also, I am not happy about the idea of having the UBA refer to properties
> not directly related to bidi (namely, the General Category property). In
> fact, since the proposed update already adds new bidi classes for isolates,
> it might add new bidi classes for paired punctuation as well. I believe it
> would not only allow for more flexibility (e.g. if a need arises to include
> characters of a different General Category), but also enable expressing
> rule N0 more clearly.


>


> Below is my list of proposed changes (relative to UAX #9 rev. 28 draft 4) 
> based on this idea.


>

>------

> 


> Add two new values to X_Bidi_Class (or Bidi_Class_X as per Asmus’s
> suggestion): Opening_Punctuation (OP) and Closing_Punctuation (CP).


> Assign OP to all characters with General_Category=Open_Punctuation for which 
> Bidi_Mirroring_Glyph is not .


I like this idea, which has been discussed previously.


> Assign CP to all characters with General_Category=Close_Punctuation for which 
> Bidi_Mirroring_Glyph is not .


> In Tables 3 and 4, add the new two classes to the Neutral category.


O.k. so far.


> [Note: I believe that as of 6.2.0 all characters with gc=Ps or gc=Pe have 
> bc=ON.]


This is my understanding, too.


>


> Add a new definition:

>    BD11. Character A forms a mirrored pair with character B if the property
>    Bidi_Mirrored is Yes for both A and B, and Bidi_Mirroring_Glyph of A is B.
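
The BD11 test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's unicodedata module exposes Bidi_Mirrored (as unicodedata.mirrored) but not Bidi_Mirroring_Glyph, so the MIRRORING_GLYPH table below is a hypothetical hand-copied excerpt, and forms_mirrored_pair is a name made up for this example.

```python
import unicodedata

# Hypothetical excerpt of the Bidi_Mirroring_Glyph property; a real
# implementation would load the full mapping from the Unicode data files.
MIRRORING_GLYPH = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
}

def forms_mirrored_pair(a, b):
    """BD11 as quoted: Bidi_Mirrored is Yes for both A and B,
    and Bidi_Mirroring_Glyph of A is B."""
    return bool(
        unicodedata.mirrored(a)
        and unicodedata.mirrored(b)
        and MIRRORING_GLYPH.get(a) == b
    )

if __name__ == "__main__":
    for a, b in [("(", ")"), ("(", "]"), ("a", "b")]:
        print(repr(a), repr(b), forms_mirrored_pair(a, b))
```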


> Rephrase N0 as follows:


>    N0. Search backward from each instance of a closing punctuation (CP)
>    until either the first opening punctuation (OP) or sos is found. If an OP
>    is found, and it does not form a mirrored pair with the CP character,
>    change that OP and all OPs preceding it in the isolating run sequence to
>    Other Neutral (ON). [1] If an OP is found, and it forms a mirrored pair
>    with the CP character, then:
>
>        If the text between the OP and the CP contains at least one
>        non-neutral type [2] (L, R, EN or AN) of the same direction as the
>        embedding direction [3], change both the OP and the CP to the strong
>        type (L or R) corresponding to the embedding direction.
>
>        Otherwise, if the text between the OP and the CP contains at least
>        one non-neutral type of the direction opposite to the embedding
>        direction, and at least one of the following conditions is true:
>
>            the last non-neutral type, if any, preceding the OP [4] is also
>            of the direction opposite to the embedding direction,
>
>            the first non-neutral type, if any, following the CP is also of
>            the direction opposite to the embedding direction,
>
>        then change both the OP and the CP to the strong type opposite to
>        the embedding direction.
>
>        Otherwise, change both the OP and the CP to ON. [5]
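
To make it easier to trace examples through Marcin's wording, here is a rough Python sketch of the rule as I read it. It is not a reference implementation. The assumptions are mine: the input is one isolating run sequence given as parallel lists of characters and already-resolved types, EN and AN count as R (as they do in N1), and the PAIRS table is a small hypothetical stand-in for the BD11 mirrored-pair test. The names resolve_n0, direction and PAIRS are invented for the example.

```python
# Hypothetical stand-in for BD11: opening character -> its mirrored closer.
PAIRS = {"(": ")", "[": "]", "{": "}"}

def direction(bidi_type):
    """Direction of a non-neutral type; EN and AN are assumed to act as R."""
    if bidi_type == "L":
        return "L"
    if bidi_type in ("R", "EN", "AN"):
        return "R"
    return None  # neutral (ON, WS, OP, CP, ...)

def resolve_n0(chars, types, embedding):
    """Apply the rephrased N0 to one isolating run sequence."""
    types = list(types)
    opposite = "R" if embedding == "L" else "L"
    for cp in range(len(types)):
        if types[cp] != "CP":
            continue
        # Search backward for the first OP; stopping at index 0 models sos.
        op = next((i for i in range(cp - 1, -1, -1) if types[i] == "OP"), None)
        if op is None:
            continue
        if PAIRS.get(chars[op]) != chars[cp]:
            # Mismatch: neutralize this OP and every OP before it (note [1]).
            for i in range(op + 1):
                if types[i] == "OP":
                    types[i] = "ON"
            continue
        inside = {direction(t) for t in types[op + 1:cp]}
        before = next((direction(t) for t in reversed(types[:op])
                       if direction(t)), None)
        after = next((direction(t) for t in types[cp + 1:]
                      if direction(t)), None)
        if embedding in inside:
            new_type = embedding
        elif opposite in inside and opposite in (before, after):
            new_type = opposite
        else:
            new_type = "ON"
        types[op] = types[cp] = new_type
    return types

if __name__ == "__main__":
    print(resolve_n0("A(B)c", ["R", "OP", "R", "CP", "L"], "L"))
```

Because a resolved OP is no longer an OP, inner pairs are settled before the pair that encloses them, which is how I understand the backward search to behave.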


I do think mirrored characters need to be addressed in UAX #9, and so far they
are:
"Paired punctuation marks are considered as a pair so that they both resolve
to the same direction."

(http://www.unicode.org/reports/tr9/tr9-28.html#Resolving_Neutral_Types)


but I am not completely in agreement with Marcin's algorithm above.

The original algorithm discussed for mirrored pairs (which I like; it may be
found at http://www.unicode.org/review/pri231/pri231-background.pdf) was, as
I understand things (I am quoting here):
"Once the paired punctuation marks have been identified, 
they should be resolved to the embedding direction except in the 
following cases which are resolved, based on context, opposite the 
embedding direction:

"* The directionality of the enclosed content is opposite the embedding
direction, and at least one neighbor has a bidi level opposite to the
embedding direction O(O)E, E(O)O, or O(O)O.

"* The enclosed content is neutral and both neighbors have a bidi level
opposite to the embedding direction O(N)O. Resolving to opposite to the
embedding direction is current behavior under the UBA (N1)."



Here the algorithm is again, expressed as a rule:

"*N0. Paired punctuation marks take the embedding direction if the 
enclosed text contains a strong type of the same direction. Else, if the
 enclosed text contains a strong type of the opposite direction and at 
least one external neighbor also has that direction the paired 
punctuation marks take the direction opposite the embedding direction."
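
The quoted cases can be written as a small decision function, which makes the E/O/N patterns easy to check one by one. This is a sketch under my own assumptions, not text from the background document: resolve_pair is an invented name, "inside" is the set of strong directions found in the enclosed text (empty when the content is neutral), and "before" and "after" are the directions of the two neighbors.

```python
def resolve_pair(inside, before, after, embedding):
    """Direction taken by a matched pair of punctuation marks.

    inside    -- set of strong directions in the enclosed text, e.g. {"R"}
    before    -- direction of the neighbor preceding the opening mark
    after     -- direction of the neighbor following the closing mark
    embedding -- "L" or "R"
    """
    opposite = "R" if embedding == "L" else "L"
    if embedding in inside:
        # Enclosed text contains the embedding direction: pair follows it.
        return embedding
    if opposite in inside:
        # O(O)E, E(O)O, O(O)O resolve opposite; E(O)E stays with embedding.
        return opposite if opposite in (before, after) else embedding
    # Neutral content: only O(N)O resolves opposite (current N1 behavior).
    return opposite if before == after == opposite else embedding

if __name__ == "__main__":
    # With an LTR embedding, "E" is L and "O" is R.
    for inside, before, after in [({"R"}, "R", "L"), ({"R"}, "L", "L"),
                                  (set(), "R", "R"), (set(), "L", "R")]:
        print(sorted(inside), before, after, "->",
              resolve_pair(inside, before, after, "L"))
```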

This rule amounts to: take the embedding direction if any enclosed text
matches the embedding direction, since the "if" and "else if" clauses are
applied in sequence. This is fine IMO. (Otherwise, if all the text inside the
mirrored punctuation is neutral, I would suppose the embedding direction
should be taken, not a neutral direction, based on the algorithm given at the
URL above, which, as I've said, I like.)


However, as far as the bidi parentheses algorithm goes, what about the
following symbols formed from various punctuation marks?

(-: , :-) 

Would I treat the text between the two happy faces as neutral opening and
closing text? These sequences should be somehow excepted, I think.

The above text is a comma separating two happy faces, all of which will
probably work as neutrals. I believe that these characters would be treated
as the following sequence (a "/" separates each character):
ON/ES/CS/WS/CS/WS/CS/ES/ON. That is, these are all weak or neutral types. So
this case might pose no problem.
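
The bidi class of each character can be checked directly rather than counted by hand. A minimal sketch using Python's standard unicodedata module (the helper name bidi_classes is just for this example; the result reflects whatever Unicode version the Python build ships):

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class of every character in text, in order."""
    return [unicodedata.bidirectional(ch) for ch in text]

if __name__ == "__main__":
    sample = "(-: , :-)"
    # Print the classes separated by "/", one per character including spaces.
    print("/".join(bidi_classes(sample)))
```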

I suppose we have to resolve the "ES" and "CS" characters first though, 
which then are resolved to other neutrals so all we have are neutrals, 
which take the embedding direction, and now of course the parentheses 
are interpreted as such.
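
The step where ES and CS fall back to neutrals can be sketched as follows. This is a simplified illustration of rules W4 and W6 only, assuming the input is a plain list of bidi types with no European terminators (ET), so W5 is left out; resolve_separators is an invented name.

```python
def resolve_separators(types):
    """Simplified W4 and W6: separators become numbers only when a single
    separator sits between two numbers of the right type, else ON."""
    out = list(types)
    for i, t in enumerate(types):
        prev = types[i - 1] if i > 0 else None
        nxt = types[i + 1] if i + 1 < len(types) else None
        if t == "ES" and prev == nxt == "EN":
            out[i] = "EN"   # W4: single ES between two European numbers
        elif t == "CS" and prev == nxt and prev in ("EN", "AN"):
            out[i] = prev   # W4: single CS between two numbers of one type
        elif t in ("ES", "CS"):
            out[i] = "ON"   # W6: remaining separators become Other Neutral
    return out

if __name__ == "__main__":
    emoticons = ["ON", "ES", "CS", "WS", "CS", "WS", "CS", "ES", "ON"]
    print(resolve_separators(emoticons))
```

With no numbers anywhere in the emoticon sequence, every separator ends up as ON, which is the point made above.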



But what about the following text (set off from my comments by asterisks)?


* * *


Salam my friend!  KAYFA HALUK? ANAA LHAMDU ULLAA (-: some problems 
though making my emails work with this new algorithm so ANAA LASTU 
SA'IYDUN )-: any suggestions?


* * *


Although I would tend to support exempting the happy face sequence from 
the parentheses algorithm, the happy faces here enclose parenthetical 
text.


According to the rules Marcin has suggested, but not really according to
those of the parentheses algorithm, the above "enclosed text" would be
treated as RTL, and thus some ordering would be reversed, though I've not
traced it through. Marcin's algorithm treats this as RTL since an R character
immediately precedes the parenthetical comment and since there are some R
(strong RTL) characters within the parenthetical comment.

The levels are: 0s for the L text

then 1 for text  KAYFA HALUK ANAA LHAMDU ULLA

Then we find a mirrored piece of punctuation, and then a bit later a closing
parenthesis (now we have to pop the stack back to the previous one, and we
find a match, and so have opening and closing punctuation). Whatever
algorithm we use for display, I hope these two faces, if they are to be
treated as mirrored at all, will display as left-to-right.

One question: what level will the text in parentheses/happy faces be: 1s and
0s still (or 3s and 2s)? Sorry for asking this, but would it work better to
treat the text inside the mirrored punctuation as a new embedding level? (I'm
not a developer but may try to think through this sometime. I don't see how
it will improve things to treat this text as a new embedding level.)


> Notes:

> [1] This means that, if there is any mismatched pair of punctuation marks,
> the rule will be applied neither to that pair, nor to any enclosing pair.
> From Aharon Lanin’s comment #5 I understand that to be the original intent
> of the BPA, the current (ambiguous) wording notwithstanding. If a more
> complicated algorithm is desired, it would have to be spelled out here.

> [2] I prefer “non-neutral” to “strong” here, to remind the reader that EN
> and AN also have to be taken into account (other weak types and AL having
> been resolved already).

> [3] A check for mixed types seems to be redundant; if there are
> mixed-direction types, then at least one is in the embedding direction.
> (This is based on my reading of the current draft; if “mixed strong types”
> was intended to include mix of e.g. R and AN, then this condition would
> have to read “… more than one non-neutral type (L, R, EN or AN), or at
> least one non-neutral type of the same direction as the embedding
> direction”.)

> [4] This is based on the way I understand what “external neighbor” was
> intended to mean. The wording “if any” indicates that sos/eos are not
> included (if they are, then every character in an isolating run sequence is
> preceded and followed by some strong type).

> [5] This covers the case when the enclosed text does not contain any strong
> character; changing both marks to ON prevents mis-pairing the OP with a
> later CP. Note that the CP does not actually have to be changed to ON, as
> it makes no difference to further applications of rule N0 or to rules N1
> and N2. (However, if a more complicated pairing algorithm is specified, it
> may become important to change both OP and CP here.)


> Note also that the new bidi classes may create additional ‘legacy’ classes
> of conforming systems (see chapter 4.2), namely those that use Bidi_Class
> instead of X_Bidi_Class (and thus effectively ignore rule N0).





One more comment from me at this point: I tend to agree with one of Philippe's
comments:

Date/Time: Sat Dec 22 09:08:55 CST 2012
Contact: [email protected]

Name: Philippe Verdy

Report Type: Public Review Issue 

Opt Subject: UAX#9 (UBA) PRI 3.3.4 Resolving Neutral Types and stability 



> "(5) If a character A is mapped to a character B for mirroring
> (Bidi_Mirroring_Glyph=code point B), the character A and B must be distinct
> and NOT canonically equivalent to A: NFC(A) != NFC(B)"
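
The condition NFC(A) != NFC(B) is easy to test mechanically. A minimal sketch with Python's unicodedata.normalize (the function name satisfies_constraint is invented for this example; which mappings to feed it is left open):

```python
import unicodedata

def satisfies_constraint(a, b):
    """True if A and B are distinct after NFC, i.e. not canonically
    equivalent, as the quoted constraint (5) requires."""
    return unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b)

if __name__ == "__main__":
    # An ordinary mirrored pair passes the check.
    print(satisfies_constraint("(", ")"))
    # U+2329 has a canonical singleton decomposition to U+3008, so these two
    # are canonically equivalent and would fail if mapped to each other.
    print(satisfies_constraint("\u2329", "\u3008"))
```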


(I may also agree with the comment that follows, [6], which, I am sorry, I
still need to think through.)




Best,


--C. E. Whitehead

[email protected]




