Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/20/2014 6:54 PM, James Clark wrote:
On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag asm...@ix.netcom.com wrote:


On 4/20/2014 3:24 AM, Eli Zaretskii wrote:

Would someone please help understand the following subtleties and
obscure language in the UBA document found at
http://www.unicode.org/reports/tr9/?  Thanks in advance.
3. Paragraph 3.3.2 says, under Non-formatting characters:

X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI,
FSI, and PDI:

. Set the current character’s embedding level to the embedding
  level of the last entry on the directional status stack.

 [...]

Note that the current embedding level is not changed by this rule.

What does this last sentence mean by the current embedding level?
The first bullet of X6 mandates that the current character’s
embedding level _is_ changed by this rule, so what other current
embedding level is alluded to here?

I'm punting on that one - can someone else answer this?


I assume current embedding level here meant the embedding level of 
the last entry on the directional status stack. (This is a natural 
slip to make if you think in terms of an optimized implementation that 
stores each component of the top of the directional status stack in a 
variable, as suggested in 3.3.2.)


James

In general, I heartily dislike specifications that just narrate a 
particular implementation...


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: The application of localized read-out labels

2014-04-21 Thread William_J_G Overington
The text of the first post in this thread was not recorded in the archive of 
the Unicode Public Email List. Maybe because there was an attachment to the 
post?

This post is so as to include a transcript of the text of that post in the 
archive of the Unicode Public Email List.

William Overington

21 April 2014

Transcript:

 William, the UTC is not in the business of creating file formats for 
 localization data.

 Peter

Thank you for replying.

Feeling that a format for the particular application is important I have now 
produced a format myself and published it.

Please find a copy attached.

Posting the publication as an attachment here will also hopefully place it in 
the mailing list archives for long-term availability.

I have also sent a copy to the British Library for Legal Deposit.

The publication has the following title.

The format of the readouts.dat file suggested for possible use in the 
application of localized read-out labels

The file has the following file name.

The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf

William Overington

16 April 2014



Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Eli Zaretskii
 Date: Sun, 20 Apr 2014 12:58:23 -0700
 From: Asmus Freytag asm...@ix.netcom.com
 
 On 4/20/2014 3:24 AM, Eli Zaretskii wrote:
  Would someone please help understand the following subtleties and
  obscure language in the UBA document found at
  http://www.unicode.org/reports/tr9/?  Thanks in advance.
 
 Eli,
 
 I've tried to give you some explanations

Thanks!

 in some places, I concur with you that the wording could be improved
 and that such improved wording should be proposed to the UTC (or its
 editorial committee) for incorporation into a future update.

How do we do that?

 For details, see below.
 
  1. In paragraph 3.1.2, near its very end, we have this sentence (with
  my emphasis):
 
  As rule X10 will specify, an isolating run sequence is the unit to
  which the rules following _it_ are applied, and the last character of
  one level run in the sequence is considered to be immediately
  followed by the first character of the next level run in the
  sequence during this phase of the algorithm.
 
  What does it mean here by "the rules following it"?  Following what?
 
 That looks like a bad referent, but from context, this "it" must be X10

Ah, so simply saying "the following rules" or "rules following X10"
would be enough.

 Bullet 1 could be changed to
 
 . Create a stack for elements each consisting of a *code point*
   (Bidi_Paired_Bracket property value) and a text position.
   Initialize it to empty.
 
 to make things more clear. And a slight wording change might help the 
 reader with item 2:
 
 2. Compare the *code point for the* closing paired bracket being
    inspected or its canonical equivalent to the *code point*
    (Bidi_Paired_Bracket property value) in the current stack element.
 
 
 And, to continue
 
 3. If the values match, meaning *the character being inspected and the
    character at the text position in the stack* form a bracket pair,
    then [...]

Right, this makes the description a whole lot more clear.

 Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run 
 sequences.
 For each sequence, [completely] apply each rule in the order in which 
 they appear below.
 The order that one isolating run sequence is treated relative to another 
 does not matter.
 
 I believe the above restatement expresses the same thing in fewer words.

It does, thanks.
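
The ordering in the restatement above can be sketched in a few lines of Python. This is only an illustration: apply_rules and the logging placeholder rules are hypothetical names, not anything defined by UAX #9, and the real W, N and I rules would go where the placeholders are.

```python
def apply_rules(sequences, rules):
    """Apply every rule, in order, to each isolating run sequence."""
    for seq in sequences:       # the order of the sequences does not matter
        for rule in rules:      # a rule finishes on the whole sequence
            rule(seq)           # before the next rule starts

# Placeholder rules that only record when they were called (hypothetical).
log = []

def make_rule(name):
    def rule(seq):
        log.append((name, tuple(seq)))
    return rule

rules = [make_rule(name) for name in ("W1", "W2", "N0")]
apply_rules([["a"], ["b"]], rules)
```

After the call, log shows W1, W2 and N0 applied to the first sequence before any rule touches the second one.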

  5. Rule N0 says:
 
  . For each bracket-pair element in the list of pairs of text positions
 
a. Inspect the bidirectional types of the characters enclosed
  within the bracket pair.
b. If any strong type (either L or R) matching the embedding
  direction is found, set the type for both brackets in the pair
  to match the embedding direction.
 
  First, what is meant here by strong type [...] matching the embedding
  direction?  Does the match here consider only the odd/even value of
  the current embedding level vs R/L type, in the sense that odd levels
  match R and even levels match L?  Or does this mean some other
  kind of matching?  Table 3, which is the only place that seems to refer
  to the issue, is not entirely clear, either:
 
 e   The text ordering type (L or R) that matches the embedding level
 direction (even or odd).
 
  Again, the sense of the match here is not clear.
 
 even/odd <-> L/R match, might be made more explicit

I agree this should be made more explicit, as this is a somewhat
subtle issue that might trip the reader.

  Next, what is meant here by the characters enclosed within the
  bracket pair?  If the bracket pair encloses another bracket pair,
  which is inner to it, do the characters inside the inner pair count
  for the purposes of resolving the level of the outer pair?
 They do, so there's no need to change the text.

It might be a good idea to say that explicitly, e.g. as a note, or at
least provide another example where the strong characters are only
inside an inner bracket pair, which will send the same message to the
reader.
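
The two points clarified above (even levels match L and odd levels match R, and characters inside a nested pair do count) can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes the types have already been reduced to "L", "R" or something else; it covers only step b of N0 as quoted, not the rest of the rule.

```python
def embedding_direction(level):
    """Even embedding levels are left-to-right (L), odd ones right-to-left (R)."""
    return "L" if level % 2 == 0 else "R"

def encloses_matching_strong(types, open_pos, close_pos, level):
    """True if any character strictly between the two brackets has the
    strong type that matches the embedding direction.  Nested bracket
    pairs are not skipped: everything between the positions is inspected."""
    e = embedding_direction(level)
    return any(t == e for t in types[open_pos + 1:close_pos])

# Outer pair at positions 0 and 4, inner pair at 1 and 3; the only strong
# character (R) sits inside the inner pair.
types = ["ON", "ON", "R", "ON", "ON"]
outer_at_odd_level = encloses_matching_strong(types, 0, 4, level=1)
outer_at_even_level = encloses_matching_strong(types, 0, 4, level=0)
```

At an odd level the R inside the inner pair is enough for the outer pair to match the embedding direction; at an even level it is not.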

Thanks again for the clarifications.


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Eli Zaretskii
 From: James Clark j...@jclark.com
 Date: Mon, 21 Apr 2014 08:54:34 +0700
 Cc: Eli Zaretskii e...@gnu.org, unicode@unicode.org, Kenneth Whistler 
 k...@unicode.org
 
 X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI,
 FSI, and PDI:
 
 . Set the current character’s embedding level to the embedding
   level of the last entry on the directional status stack.
 
  [...]
 
 Note that the current embedding level is not changed by this rule.
 
  What does this last sentence mean by the current embedding level?
  The first bullet of X6 mandates that the current character’s
  embedding level _is_ changed by this rule, so what other current
  embedding level is alluded to here?
 
   I'm punting on that one - can someone else answer this?
 
 
 I assume current embedding level here meant the embedding level of the
 last entry on the directional status stack.

Thanks, that was my guess as well, but I wanted to be sure.

IMO, the unfortunate wording here is that the same phrase (current
embedding level) was used just before the problematic sentence to
mean something completely different.  Having identical phrases close
to one another always tricks readers into thinking they are describing
the same thing; when they aren't, confusion settles in.  So I would
suggest to reword one or both of these references to the current
embedding level.

Btw, why is that note, about the current embedding level not being
changed by X6, important?  Why would someone mistakenly think the
contrary?
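
To make the distinction between the two "current embedding levels" concrete, here is a minimal Python sketch of the first bullet of X6 as quoted above. StatusEntry and apply_x6 are hypothetical names, the stack entries carry only the level (the other fields of a directional status stack entry are left out), and the part of X6 elided as [...] in the quote is not modelled.

```python
from dataclasses import dataclass

# Types that rule X6 excludes, as listed in the rule text quoted above.
EXCLUDED = {"B", "BN", "RLE", "LRE", "RLO", "LRO", "PDF", "RLI", "LRI", "FSI", "PDI"}

@dataclass
class StatusEntry:
    level: int  # embedding level held by this stack entry

def apply_x6(bidi_type, stack):
    """Return the level X6 gives to one character.  The stack is only read,
    so the level of its last entry is the same before and after the call."""
    if bidi_type in EXCLUDED:
        raise ValueError(f"X6 does not apply to {bidi_type}")
    return stack[-1].level

stack = [StatusEntry(0), StatusEntry(1)]   # hypothetical state: one embedding open
levels = [apply_x6(t, stack) for t in ["R", "ON", "L"]]
```

Each character receives a level, while the stack, and with it the level of the last entry, stays exactly as it was.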


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Eli Zaretskii
 Date: Sun, 20 Apr 2014 23:03:20 -0700
 From: Asmus Freytag asm...@ix.netcom.com
 CC: Eli Zaretskii e...@gnu.org, unicode@unicode.org, 
  Kenneth Whistler k...@unicode.org
 
  Note that the current embedding level is not changed by this rule.
 
  What does this last sentence mean by the current embedding level?
  The first bullet of X6 mandates that the current character’s
  embedding level _is_ changed by this rule, so what other current
  embedding level is alluded to here?
  I'm punting on that one - can someone else answer this?
 
 
  I assume current embedding level here meant the embedding level of 
  the last entry on the directional status stack. (This is a natural 
  slip to make if you think in terms of an optimized implementation that 
  stores each component of the top of the directional status stack in a 
  variable, as suggested in 3.3.2.)
 
  James
 
 In general, I heartily dislike specifications that just narrate a 
 particular implementation...

I cannot agree more.

In fact, my main gripe about the UBA additions in 6.3 is that some of
their crucial parts are not formally defined, except by an algorithm
that narrates a specific implementation.  The two worst examples of
that are the definitions of the isolating run sequence and of the
bracket pair.  I didn't ask about those because I managed to figure
them out, but it took many readings of the corresponding parts of the
document.  It is IMO a pity that the two main features added in 6.3
are based on definitions that are so hard to penetrate, and which
actually all but force you to use the specific implementation
described by the document.

My working definition that replaces BD13 is this:

  An isolating run sequence is the maximal sequence of level runs of
  the same embedding level that can be obtained by removing all the
  characters between an isolate initiator and its matching PDI (or
  paragraph end, if there is no matching PDI) within those level runs.
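
For comparison, the following Python sketch builds isolating run sequences from level runs, roughly the way BD13 describes it. It is a sketch under assumptions: all function names are hypothetical, the levels are taken to be the output of rules X1-X8 (so that a matching PDI always starts a level run), and characters removed by X9 are assumed to be absent.

```python
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}

def match_isolates(types):
    """Map each isolate initiator index to the index of its matching PDI."""
    match, stack = {}, []
    for i, t in enumerate(types):
        if t in ISOLATE_INITIATORS:
            stack.append(i)
        elif t == "PDI" and stack:
            match[stack.pop()] = i
    return match

def level_runs(levels):
    """Maximal runs of equal level, as half-open (start, end) index pairs."""
    runs, start = [], 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[start]:
            runs.append((start, i))
            start = i
    return runs

def isolating_run_sequences(types, levels):
    match = match_isolates(types)
    matched_pdis = set(match.values())
    runs = level_runs(levels)
    run_starting_at = {start: (start, end) for start, end in runs}
    sequences = []
    for run in runs:
        if run[0] in matched_pdis:
            continue                      # continues an earlier sequence
        seq = [run]
        # While the last run ends with an initiator that has a matching
        # PDI, the run that starts with that PDI joins the sequence.
        while seq[-1][1] - 1 in match:
            seq.append(run_starting_at[match[seq[-1][1] - 1]])
        sequences.append(seq)
    return sequences

# text1 RLI text2 PDI text3, one character per piece of text
types = ["L", "RLI", "R", "PDI", "L"]
levels = [0, 0, 1, 0, 0]
result = isolating_run_sequences(types, levels)
```

In the example the two level-0 runs around the isolate end up in one sequence, and the isolated text forms a sequence of its own.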

As for bracket pair (BD16), I'm really amazed that a concept as easy
and widely known/used as this would need such an obscure definition
that must have an algorithm as its necessary part.  How about this
instead:

  A bracket pair is a pair of an opening paired bracket and a closing
  paired bracket characters within the same isolating run sequence,
  such that the Bidi_Paired_Bracket property value of the former
  character or its canonical equivalent equals the latter character or
  its canonical equivalent, and all the opening and closing bracket
  characters in between these two are balanced.

Then we could use the algorithm to explain what it means for brackets
to be balanced (for those readers who somehow don't already know
that).

Again, thanks for clarifying these subtle issues.  I can now proceed
to updating the Emacs bidirectional display with the changes in
Unicode 6.3.



Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries

2014-04-21 Thread William_J_G Overington
Glyphs designed for the internationalization of the web-based on-line shops of 
museums and art galleries

Imagine please if museum and art gallery websites each were to have an 
international webpage in its on-line shop.

If there were on the webpage colourful symbols, one each for Surname, Forename, 
Card number and so on and the end-user could display text in his or her own 
language by displaying the appropriate read-out label next to each symbol, thus 
localizing the web page, then that could be very helpful.

I have produced designs for nine symbols. There are two glyphs for each symbol, 
one colourful and one monochrome. The symbols are octagonal, using not quite a 
regular octagon. In the monochrome glyphs there is a border around the edge, 
yet in the colourful glyphs there is no border. The colourful glyphs are 
displayed in blue and orange, the idea being that the effect to the viewer is 
of blue upon an orange background.

The designs are influenced by heraldry to some extent.

This is because I consider Surname to be the most important, so I used a 
heraldic chief.

Then for Forename I used a pale as Forename is different from Surname yet 
accompanies Surname to form a name.

A bar is used for Address.

Name as on card may be different from Forename concatenated with a space and 
Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for 
the design not being a union of a chief and one pale.

Two bars are used for Card number.

Card start date and Card expiry date seemed like brackets, so that inspired 
the designs.

Card security code is just a design so as to be different from the other 
designs yet not use any diagonal shapes.

Delivery address is included to allow for the possibility of sending a gift 
directly to someone who lives at another address.

I am hoping to attach images showing the designs to other posts in this thread.

William Overington

21 April 2014




Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries

2014-04-21 Thread William_J_G Overington
 I am hoping to attach images showing the designs to other posts in this 
 thread.

Please find attached an image of the designs of the colourful glyphs.

William Overington

21 April 2014


Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries

2014-04-21 Thread William_J_G Overington
 I am hoping to attach images showing the designs to other posts in this 
 thread.

Please find attached an image of the designs of the monochrome glyphs.

William Overington

21 April 2014


Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries

2014-04-21 Thread Charlie Ruland ☘
I am sorry, but this doesn’t look like internationalization. Rather it 
seems like another attempt by the British to force their culture upon 
the rest of the world. The richness of world-wide naming conventions for 
people is simply ignored, Putin Vladimir Vladimirovič won’t be able to 
use his full name (let alone in the order required), and this will lead 
to World War III.


William J. G. Overington, please admit that others know so much more 
about internationalization than you do, and stop these imperialist 
off-topic activities.


Charlie Ruland ☘





Re: The application of localized read-out labels

2014-04-21 Thread William_J_G Overington
Doug Ewell d...@ewellic.org wrote:

 It's labeled prominently as a thought experiment, which means there is no 
 expectation that anyone will implement the format or software which reads it, 
 only think about what would happen if it were implemented.

Well, it states as follows.

quote

This is a thought experiment at present.

Automated localization would be by having a file readouts.dat available. In the 
thought experiment the file is a UTF-16 text file, such as can be saved from 
the WordPad program by selecting saving as a Unicode Text Document.

end quote

My reason for putting "This is a thought experiment at present." was that the 
format has not been tested by me in practical application and is only 
theoretically based at the present time, yet I am hoping that the situation may 
change and that the format might become implemented in practice by someone and 
become widely used; or maybe that the publication of the format will act as a 
catalyst to someone publishing a format that is accepted, so that the end 
result of a standardized format is achieved.

 I actually read through the document, 18-point body type and all, before 
 noticing this key point.

Thank you for reading through the document.

http://en.wikipedia.org/wiki/Thought_experiment

http://en.wikipedia.org/wiki/John_Searle

http://en.wikipedia.org/wiki/Philosophy_of_language



http://en.wikipedia.org/wiki/Thought_experiment

http://en.wikipedia.org/wiki/Backcasting

William Overington

21 April 2014



Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 1:33 AM, Eli Zaretskii wrote:

Date: Sun, 20 Apr 2014 23:03:20 -0700
From: Asmus Freytag asm...@ix.netcom.com
CC: Eli Zaretskii e...@gnu.org, unicode@unicode.org,
  Kenneth Whistler k...@unicode.org


 Note that the current embedding level is not changed by this rule.

 What does this last sentence mean by the current embedding level?
 The first bullet of X6 mandates that the current character’s
 embedding level _is_ changed by this rule, so what other current
 embedding level is alluded to here?

 I'm punting on that one - can someone else answer this?


I assume current embedding level here meant the embedding level of
the last entry on the directional status stack. (This is a natural
slip to make if you think in terms of an optimized implementation that
stores each component of the top of the directional status stack in a
variable, as suggested in 3.3.2.)

James


In general, I heartily dislike specifications that just narrate a
particular implementation...

I cannot agree more.

In fact, my main gripe about the UBA additions in 6.3 is that some of
their crucial parts are not formally defined, except by an algorithm
that narrates a specific implementation.  The two worst examples of
that are the definitions of the isolating run sequence and of the
bracket pair.  I didn't ask about those because I managed to figure
them out, but it took many readings of the corresponding parts of the
document.  It is IMO a pity that the two main features added in 6.3
are based on definitions that are so hard to penetrate, and which
actually all but force you to use the specific implementation
described by the document.

My working definition that replaces BD13 is this:

   An isolating run sequence is the maximal sequence of level runs of
   the same embedding level that can be obtained by removing all the
   characters between an isolate initiator and its matching PDI (or
   paragraph end, if there is no matching PDI) within those level runs.

As for bracket pair (BD16), I'm really amazed that a concept as easy
and widely known/used as this would need such an obscure definition
that must have an algorithm as its necessary part.  How about this
instead:

   A bracket pair is a pair of an opening paired bracket and a closing
   paired bracket characters within the same isolating run sequence,
   such that the Bidi_Paired_Bracket property value of the former
   character or its canonical equivalent equals the latter character or
   its canonical equivalent, and all the opening and closing bracket
   characters in between these two are balanced.

Then we could use the algorithm to explain what it means for brackets
to be balanced (for those readers who somehow don't already know
that).

Again, thanks for clarifying these subtle issues.  I can now proceed
to updating the Emacs bidirectional display with the changes in
Unicode 6.3.



FWIW here is the restatement of BD16 that I used for myself (and that I put
into the source comments of the sample Java implementation):

// The following is a restatement of BD 16 using non-algorithmic language.
//
// A bracket pair is a pair of characters consisting of an opening
// paired bracket and a closing paired bracket such that the
// Bidi_Paired_Bracket property value of the former equals the latter,
// subject to the following constraints.
// - both characters of a pair occur in the same isolating run sequence
// - the closing character of a pair follows the opening character
// - any bracket character can belong at most to one pair, the earliest possible one
// - any bracket character not part of a pair is treated like an ordinary character
// - pairs may nest properly, but their spans may not overlap otherwise
// Bracket characters with canonical decompositions are supposed to be treated
// as if they had been normalized, to allow normalized and non-normalized text
// to give the same result.

Your language is more concise, but you may compare for differences.
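
For readers who want to check either restatement against the stack-based procedure, here is a minimal Python sketch of that procedure. It is an illustration under assumptions: bracket_pairs and the PAIRS table are hypothetical, the table is a tiny subset standing in for the Bidi_Paired_Bracket data, the input is taken to be a single isolating run sequence, and canonical equivalents and any stack-size limit are not handled.

```python
# Opening bracket -> closing bracket (illustrative subset only).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def bracket_pairs(text):
    """Return the bracket pairs in text as sorted (open_pos, close_pos) tuples."""
    stack = []   # entries: (closing bracket expected, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack down for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    del stack[i:]   # pop through the matched entry
                    break
            # No match: the closer is treated as an ordinary character.
    return sorted(pairs)

nested = bracket_pairs("a(b[c]d)e")
overlapping = bracket_pairs("([)]")
```

Properly nested brackets yield both pairs, while in the overlapping case only the parentheses pair up and both square brackets are left unpaired, which agrees with the constraint that spans may nest but not otherwise overlap.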

A./


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 12:55 AM, Eli Zaretskii wrote:

in some places, I concur with you that the wording could be improved
and that such improved wording should be proposed to the UTC (or its
editorial committee) for incorporation into a future update.

How do we do that?


You file a problem report using the contact form.

A./


RE: The application of localized read-out labels

2014-04-21 Thread Doug Ewell
William_J_G Overington wjgo underscore 10009 at btinternet dot com
wrote:

 My reason for putting This is a thought experiment at present. was
 that the format has not been tested by me in practical application and
 is only theoretically based at the present time,

It's not, of course. It's specified in enough detail that conformant
files could be created, and consumed by an application.

 yet I am hoping that the situation may change and that the format
 might become implemented in practice by someone and become widely
 used; or maybe that the publication of the format will act as a
 catalyst to someone publishing a format that is accepted, so that the
 end result of a standardized format is achieved.

It could be argued that this is at least part of the hypothesis for the
experiment. The expected result, not quite stated, is that the format
will in fact be used, or will in fact stimulate the creation of a
similar format.

Because, of course, if there is no hypothesis, then this is neither a
Gedankenexperiment nor any other kind of experiment, just an exercise in
creating a file format, which is engineering.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell




Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Philippe Verdy
There are some cases where these rules will not be clear enough. Look at
the following where overlaps do occur; but directionality still matters:

This is an [«] example [»] for demonstration only.

There are two parsings possible if you just consider a hierarchic layout
where overlaps are disabled:

1. "This is an [...] for demonstration only.", embedding «...», itself
embedding "] example [" (here the square brackets match externally)

2. "This is an [...] example [...] for demonstration only.", embedding two
spans for « and » separately (they also pair externally)

Now suppose that the term "example" is translated into Arabic: it is not very
clear how the UBA will work while preserving the correct pairing direction
of the 3 pairs (one pair is «...», there are two pairs for [...]).
Still, all 3 pairs have a coherent direction that Bidi reordering or glyph
mirroring should not mix.

I see only one solution to tag such text so that it will behave correctly:
either the two pairs of square brackets or the pair of guillemets should be
encoded with isolated Bidi overrides. But then what happens to the
ordering of the surrounding text?

There should be a stable way to encode this case so that the UBA will still
work in preserving the correct reading order, the expected semantics and
orientation of pairs, and the fact that the guillemets are effectively not
really embedding the brackets, but the translated word "example".

There are several ways to use Bidi-override or Bidi-embedding controls; I
don't know which one is better but all of them should still work with UBA.
I just hope that the complex cases of the brackets in the middle (]...[)
can be handled gracefully.

In my opinion, this would require embedding and isolating each square
bracket, so that they will no longer match together (externally they are
treated as symbols with transparent direction). But how do we ensure that
the sequence [«] will still occur before the RTL (Arabic) "example" word,
followed by the sequence [»], and that the rest of the sentence ("for
demonstration only") will still occur in the correct order? We also have to
embed/isolate the "example", or the whole sequence "[«] example [»]", so
that the main sentence "This is an ... for demonstration only" will still
have a coherent reading direction.

Such cases are not so exceptional because they occur to represent two
distinct parallel readings of the same text, where the reading for one
kind of pair simply treats the other pairs as transparent and ignores them.

It should be an interesting case to investigate for validating UBA
algorithms in a conformance test case.



Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

Philippe,

I fail to understand how your post contributes to the topic.

The issue was unclear wording of the specification, not deficiencies in 
the UBA or the PBA in general.


Let's keep this discussion limited to issues of wording for the 
*existing* specification. Feel free to start a new discussion about 
something else under a new subject.


A./

On 4/21/2014 9:18 AM, Philippe Verdy wrote:
There are some cases where these rules will not be clear enough. Look 
at the following where overlaps do occur; but directionality still 
matters:


This is an [«] example [»] for demonstration only.

There are two parsings possible if you just consider a hierarchic 
layout where overlaps are disabled:


1. This is an [...] for demonstration only., embedding «...», 
itself embedding ] example [ (here the square brackets match externally)


2. This is an [...] example [...] for demonstration only., embedding 
two spans for « and » separately (they also pair externally)


Now suppose that the term "example" is translated into Arabic: it is not 
very clear how the UBA will work while preserving the correct pairing 
direction of the 3 pairs (one pair is «...», there are two pairs for 
[...]). Still, all 3 pairs have a coherent direction that 
Bidi reordering or glyph mirroring should not mix.


I see only one solution to tag such text so that it will behave 
correctly: either the two pairs of square brackets or the pair of 
guillemets should be encoded with isolated Bidi overrides. But then 
what happens to the ordering of the surrounding text?


There should be a stable way to encode this case so that UBA will 
still work in preserving the correct reding order, and the expected 
semantics and orientation of pairs and the fact that the guillemets 
are effectively not really embedding the brackets, but the translated 
word example.


There are several ways to use Bidi-override or Bidi-embedding 
controls; I don't know which one is better but all of them should 
still work with UBA. I just hope that the complex cases of the 
brackets in the middle (]...[) can be handled gracefully.


My opinion would require embedding and isolating the each square 
bracket, they will no longer match together (externally they are 
treated as symbols with transparent direction, but how we ensure that 
the sequence [«] will still occur before the RTL (Arabic) example 
word followed by the sequence [»] and that the rest of the sentence 
(for demonstration only) will still occur in the correct order : we 
also have to embed/isolate the example, or the whole sequence [«] 
example [»] so that the main sentence This is an ... for 
demonstration only will stil have a coherent reading direction.


Such cases are not so exceptional because they occur to represent two 
distinct parallel readings of te same text, where in one reading for 
one kind of pairs will simply treat the other pairs as ignored 
transparently.


It should be an interesting case to investigate for validating UBA 
algorithms in a conformance test case.



2014-04-21 16:32 GMT+02:00 Asmus Freytag asm...@ix.netcom.com:


On 4/21/2014 1:33 AM, Eli Zaretskii wrote:

Date: Sun, 20 Apr 2014 23:03:20 -0700
From: Asmus Freytag asm...@ix.netcom.com
CC: Eli Zaretskii e...@gnu.org, unicode@unicode.org,
  Kenneth Whistler k...@unicode.org


 Note that the current embedding level is not changed by this rule.

 What does this last sentence mean by the current embedding level?
 The first bullet of X6 mandates that the current character’s
 embedding level _is_ changed by this rule, so what other current
 embedding level is alluded to here?

 I'm punting on that one - can someone else answer this?


I assume current embedding level here meant the embedding level of
the last entry on the directional status stack. (This is a natural
slip to make if you think in terms of an optimized implementation that
stores each component of the top of the directional status stack in a
variable, as suggested in 3.3.2.)

James
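
James's reading can be illustrated with a minimal sketch. It assumes the directional status stack is a plain list whose last entry is the "top", and the names StatusEntry and apply_x6 are hypothetical, not taken from UAX#9 or any reference implementation. The point it shows is that X6 only reads the top entry: the character's level is assigned, but the level stored on the stack (the "current embedding level" in the optimized-implementation sense) is left alone.

```python
from dataclasses import dataclass


@dataclass
class StatusEntry:
    level: int      # embedding level of this stack entry
    override: str   # directional override status: "neutral", "L" or "R"
    isolate: bool   # directional isolate status


# Bidi classes that rule X6 explicitly does not apply to.
X6_EXCLUDED = {"B", "BN", "RLE", "LRE", "RLO", "LRO", "PDF",
               "RLI", "LRI", "FSI", "PDI"}


def apply_x6(stack, bidi_class):
    """Return (level, resolved_class) for one character.

    The stack is only read here, never pushed to or popped from.
    """
    if bidi_class in X6_EXCLUDED:
        raise ValueError("X6 does not apply to " + bidi_class)
    top = stack[-1]
    resolved = bidi_class if top.override == "neutral" else top.override
    return top.level, resolved


stack = [StatusEntry(0, "neutral", False), StatusEntry(1, "R", False)]
print(apply_x6(stack, "L"))   # the character gets the top entry's level
print(stack[-1].level)        # the stack's own level is untouched
```

An implementation that caches the top entry's fields in local variables would naturally call the cached level "the current embedding level", which is presumably where the wording in the note came from.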


In general, I heartily dislike specifications that just narrate a
particular implementation...

I cannot agree more.

In fact, my main gripe about the UBA additions in 6.3 are that some of
their crucial parts are not formally defined, except by an algorithm
that narrates a specific implementation.  The two worst examples of
that are the definitions of the isolating run sequence and of the
bracket pair.  I didn't ask about those because I succeeded to figure
them out, but it took many readings of the corresponding parts of the
document.  It is IMO a pity that the two main features added in 6.3
are based on definitions that are so hard to penetrate, and which
actually all but force you to use the specific implementation

Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Doug Ewell
From: Asmus Freytag asmusf at ix dot netcom dot com wrote:

 In general, I heartily dislike specifications that just narrate a
 particular implementation...

I agree completely. I see this with CLDR as well; there is a more or
less implicit assumption that I will be using ICU to implement whatever
is being described. I don't care how robust and well-tested a wheel is;
as a developer, I should be able to use the specification to reinvent it
if I like.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Philippe Verdy
It is on topic because the proposed description attempts to explain how
paired brackets should match and how this will then affect the rendering
in bidirectional contexts. This is exactly the kind of thing that is
difficult, because the proposed description assumes that paired brackets are
organized hierarchically.

Quote: both characters of a pair occur in the same isolating run sequence
(does not work here: sequences are not fully isolated)
Quote: any bracket character can belong at most to one pair, the earliest
possible one (does not work here: this is not the earliest possible)



Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 11:23 AM, Philippe Verdy wrote:
It is on topic because the proposed description attempts to explain 
how paired brackets should match and how this will then affect the 
rendering in bidirectional contexts. This is exactly the kind of 
thing that is difficult, because the proposed description assumes 
that paired brackets are organized hierarchically.


Quote: both characters of a pair occur in the same isolating run 
sequence (does not work here: sequences are not fully isolated)
Quote: any bracket character can belong at most to one pair, the 
earliest possible one (does not work here: this is not the earliest 
possible)


That's OK, it's a limitation of the algorithm, not the description.

In other words, the algorithm can help set a better directionality 
for paired (!) brackets, and those are the ones that nest properly.


What Eli brought to our attention is that the description of this 
algorithm is suboptimal - whether the algorithm could or should be 
improved is a separate matter.


A./

PS: I think it is unlikely that the UTC will be interested in 
substantial changes to the algorithm, but it should be interested in 
allowing the specification to be less dependent on the sample 
implementation.




Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 11:14 AM, Doug Ewell wrote:

From: Asmus Freytag asmusf at ix dot netcom dot com wrote:


In general, I heartily dislike specifications that just narrate a
particular implementation...

I agree completely. I see this with CLDR as well; there is a more or
less implicit assumption that I will be using ICU to implement whatever
is being described. I don't care how robust and well-tested a wheel is;
as a developer, I should be able to use the specification to reinvent it
if I like.


Well put. Also, by simply narrating an implementation the UTC deprives 
the reader of a clear higher-level description of the concept and the 
intended result. The original part of the bidi specification does a much 
better job in that regard. It's time to revisit the language for the 
additions and bring them up to snuff.


A./


Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries

2014-04-21 Thread Asmus Freytag

On 4/21/2014 2:47 AM, William_J_G Overington wrote:

I am hoping to attach images showing the designs to other posts in this thread.

Please find attached an image of the designs of the colourful glyphs.


The language I would use for my reaction to this, is just too colorful 
to reproduce here :)


A./


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Philippe Verdy
My intent was not to demonstrate a bug in the algorithm (I have not even
claimed that), but to make sure that less common usages of paired brackets
that do not obey a pure hierarchy (because these notations use different
types of brackets, they are not ambiguous) still preserve their left vs.
right (or open vs. close) semantics.

However, due to the way the algorithm is currently designed, distinct pairs
of brackets still need to be nested hierarchically, and this is not always
the case.

Such usages do not cause big problems in unidirectional texts, i.e. texts
using characters with the same strong direction, or characters with neutral
and weak directions. To allow them in bidirectional texts, we will
necessarily need to use bidi controls. However, these controls cannot be so
strong that they also break the embedding levels intended for each type of
bracket, even when the brackets do not match in pairs under the algorithm.
As they will then be treated in isolation (unpaired for the hierarchic
algorithm), they should still retain their intended RTL or LTR semantics
(notably their relative placement with the things they surround in
non-nested ways), without being affected by inconsistent mirroring.

The UBA test cases currently do not cover such uncommon cases, but only
cases with single isolated/unpaired brackets.

I therefore want to make sure that it will remain possible to write
notations without pure hierarchical nesting (for now they still don't work
at all; the result is already unpredictable, even with bidi controls).

Also, I am not thinking only of punctuation pairs but of any kind of textual
pairs, including XML element tags for example, or quotation marks
delimiting strings in programming languages, or begin/end keywords in
Pascal or Lua programs, or descriptive expressions in human languages
(e.g. [start singing] ... [end of song]), even if they are not concerned
by punctuation mirroring.

You could see these non-nested usages as interlinear or unstructured, but
in fact they do have a structure which should be preserved and not mixed
randomly by an algorithm unable to decipher their meaning, unless there is
some markup or controls saying how to treat these items. We should not even
have to use specific parsers for specific notations (like XML); this is a
more generic abstract problem for texts whose content and semantics are not
nested in a pure hierarchical tree but in subtrees with parallel branches,
and whose rendering will then need to preserve these structures.

My initial message contained a very minimal example of what is needed. I'd
like this sample case to be clearly supported in some way without ambiguity.

It will be important for things like songs, poetry, legal texts containing
citations, discussions about another text, threaded discussions, annotating
documents created collaboratively, versioning and showing diffs, and more
exceptionally for interlinear notations (including the inclusion of
translator notes, or notes started on one page and continued elsewhere,
possibly on another page, and containing their own sets of bracket pairs)...

In all these usages, the UBA (and the inferred effect on mirroring) could
cause havoc. And of course I do not want to define a new technical syntax
using references and identifiers, as in XML or JSON, to make these
structures explicit. For the UBA it will be enough if it preserves the
intended direction and mirroring type without having to make explicit which
bracket pairs with which other one (it should just preserve the start/end
or open/close semantics, leaving the rest to an upper-layer syntax if
needed for more ambiguous cases; a renderer will use any trick it wants to
exhibit this supplementary structure, such as font styles, colors,
decorations, or custom 2D layouts, as provided by a rich text format which
is out of scope of Unicode and the UBA). Only hierarchical structure is
supported in XML or JSON, but SGML (and HTML) already show that
non-hierarchical structures are also possible and are effectively used in
their supported content models.





Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 1:54 PM, Philippe Verdy wrote:
My intent was not to demonstrate a bug in the algorithm, I have not 
even claimed that, but to make sure that (less common) usages of 
paired brackets that do not obey to a pure hierarchy (because these 
notations use different type of brackets, they are not ambiguous) but 
still preserve their left vs. right (or open vs. close) semantic.

OK, so this has nothing to do with unclear text.

A./


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Ilya Zakharevich
On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:
 On 4/21/2014 1:54 PM, Philippe Verdy wrote:
 My intent was not to demonstrate a bug in the algorithm, I have
 not even claimed that, but to make sure that (less common) usages
 of paired brackets that do not obey to a pure hierarchy (because
 these notations use different type of brackets, they are not
 ambiguous) but still preserve their left vs. right (or open vs.
 close) semantic.

 OK, so this has nothing to do with unclear text.

Asmus, I cannot agree with this.  I think Philippe’s message is on topic.

  [Below, I completely ignore BIDI part of the specification, and
   concentrate ONLY on the parens match.  I do not understand why this
   question is interlaced with BIDI determination; I trust that it is.]

I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli
and you show to the problem of “parentheses match” (and I suspect this
because THAT is my feeling ;-).  You give two (IMO, informal) interpretations
of what the algorithm-based description says.  These two interpretations
are obviously non-compatible (or at least not necessarily clearly stated).

As Eli said it: “bracket pair … a concept as easy and widely known/used as
this would need such an obscure definition ”.  Just for background: the first
theorem on the “Applied Algebra” class taught by Yu.I.Manin was about
parentheses match (it stated that the proper match is unique as far as it
exists).  This statement is a (tiny) mess to prove, but at least it should
look very plausible to unwashed masses.  (One corollary is that “the
earliest possible one” from your interpretation is not actually needed.)

The problems appear when one wants to allow non-matching parentheses as
well as matched pairs.  [If one fixes Eli’s description so that “a pair”
and “matched” are complete synonyms, then] what Eli conveys is that
all non-matching parentheses MUST appear “on top level” only.  This is
workable (meaning the match is still unique).

Your approach gives a circular definition: to define which paren chars match
one must know which ones DO NOT match, and the recursion is not terminated.
This is exactly what Philippe’s example shows.



My understanding is that what Unicode is trying to do is to collect the best
practical ways to treat multi-language texts (without knowing fine details
about the languages actually used in the text).  It may be that what is
“well understood” today IS only the case where non-matched parens appear on
top-level only.

So one may ask: what will be the result of the CURRENT UNICODE parsing applied
to Philippe’s example?

  This is an [«] example [»] for demonstration only.

By Eli’s interpretation, it contains no matched parens.  In one reading of
your interpretation, the external-[] and guillemets would match, and
internal-][ would be non-matching ones.

If one could “show” that in the majority of cases that is what the writer’s
intent was, THEN your interpretation would be “the best
practical way to treat multi-language texts”, and it may be preferred to the
current algorithmic description.  THIS is why I think the message was on topic.

But this is all a very shaky ground…

Yours,
Ilya


RE: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Whistler, Ken
Ilya noted:

   [Below, I completely ignore BIDI part of the specification, and
    concentrate ONLY on the parens match.  I do not understand why this
    question is interlaced with BIDI determination; I trust that it is.]

Actually, it is, because the bracket-matching is really only interesting
in the cases where the boundaries of the isolating runs are in
question, and there are some directional differences in the runs.
The whole point of introducing the paired bracket complication was
to deal with edge cases for that, but...

 So one may ask: what will be the result of the CURRENT UNICODE parsing
 applied to Philippe’s example?

   This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with
a shorter example that contains the relevant units: a [«] b [»] c

Turn up the trace output to see what rule N0 is actually doing,
and you get the following. (Set your display wide enough to not wrap the
output lines, for best interpretation.)

Trace: Entering br_UBA_ResolvePairedBrackets [N0]
Trace: br_PushBracketStack, bracket=005D, pos=2
Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
Trace: br_PeekBracketStack, bracket=005D, pos=2
Appended pair: opening pos 2, closing pos 4
Trace: br_PopBracketStack,  #elements=1
Matched bracket
Trace: br_PushBracketStack, bracket=005D, pos=8
Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
Trace: br_PeekBracketStack, bracket=005D, pos=8
Appended pair: opening pos 8, closing pos 10
Trace: br_PopBracketStack,  #elements=1
Matched bracket
Trace: Entering br_SortPairList
Pair list:  {2,4} {8,10}
Append at end
Trace: Exiting br_SortPairList
Pair list:  {2,4} {8,10}
Debug: No strong direction between brackets
Debug: No strong direction between brackets
Current State: 14
  Text:       0061 0020 005B 00AB 005D 0020 0062 0020 005B 00BB 005D 0020 0063
  Bidi_Class: L    WS   ON   ON   ON   WS   L    WS   ON   ON   ON   WS   L
  Levels:     0    0    0    0    0    0    0    0    0    0    0    0    0
  Runs:LL

Because of the way the stack processing is defined, the first bracket pair
is [«] and the second bracket pair is [»]. The algorithm does not push down
potential matches while seeking for a largest outer pair to match. One
could – particularly if one is mathematically inclined – argue that that is
not the right way to do the matching, but it *is* the way the algorithm is
currently defined. And it is the way both of the bidi reference
implementations, all of the BidiCharacterTest.txt data, the ICU
implementation, the Microsoft implementation, and the Harfbuzz
implementation are defined, to the best of my knowledge. Other
implementations would have to be doing the same, or they would be failing
the conformance tests in BidiCharacterTest.txt.
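
The stack processing described above can be sketched in a few lines. This is a simplified illustration, not the reference code: the PAIRS table is a hypothetical three-entry stand-in for the Bidi_Paired_Bracket data, and canonical equivalents, the fixed stack size, and isolating-run-sequence boundaries are all ignored. Guillemets are simply absent from the table, so they are skipped like any other character.

```python
# Hypothetical, tiny subset of the paired-bracket data (opening -> closing).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def find_bracket_pairs(text):
    """Return sorted (opening_pos, closing_pos) pairs, BD16-style."""
    stack = []   # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack down for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    # Pop the match and everything pushed after it.
                    del stack[i:]
                    break
            # No match found: the closer is ignored and the stack is kept.
    return sorted(pairs)


print(find_bracket_pairs("a [«] b [»] c"))   # [(2, 4), (8, 10)]
print(find_bracket_pairs("( a [ b ) c ]"))   # [(0, 8)]
```

The first result agrees with the pair list {2,4} {8,10} in the trace. The second shows the "pop everything above the match" step: the unmatched [ is discarded when ) finds its (, so the later ] has nothing left to pair with.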



Note that for an all left-to-right run of text like this, with no isolating
runs and no embeddings, the implications of rule N0 are trivial and
non-interesting. The bracket matches don’t end up *doing* anything relevant
to the text reordering for bidi in this example. But once you start mixing
directions of text and adding embeddings and isolating runs, then things get
complicated in non-trivial ways for the output.
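
Why the matches do nothing here can be seen from a rough sketch of the direction decision that rule N0 makes for one pair. The assumptions are considerable: the function names are hypothetical, the input is a list of bidi classes as they stand after the W rules, EN and AN are counted as R, and the sketch ignores that a bracket resolved earlier becomes strong context for later pairs.

```python
def strong_dir(bidi_class):
    """Map a bidi class to 'L', 'R' or None (EN and AN count as R here)."""
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "EN", "AN"):
        return "R"
    return None


def n0_direction(classes, pair, embedding_dir, sos):
    """Direction to give both brackets of `pair`, or None to leave them."""
    opening, closing = pair
    inside = {strong_dir(c) for c in classes[opening + 1:closing]} - {None}
    if embedding_dir in inside:
        return embedding_dir          # strong type matching the embedding
    if inside:
        # Only the opposite direction occurs inside: look at the context
        # before the opening bracket, falling back to sos.
        opposite = "R" if embedding_dir == "L" else "L"
        context = sos
        for c in reversed(classes[:opening]):
            if strong_dir(c):
                context = strong_dir(c)
                break
        return opposite if context == opposite else embedding_dir
    return None                       # no strong type between the brackets


# Classes from the trace for: a [«] b [»] c
classes = ["L", "WS", "ON", "ON", "ON", "WS", "L",
           "WS", "ON", "ON", "ON", "WS", "L"]
print(n0_direction(classes, (2, 4), "L", "L"))    # None
print(n0_direction(classes, (8, 10), "L", "L"))   # None
```

Both calls return None, which corresponds to the two "No strong direction between brackets" lines in the trace: the only thing between each pair is a guillemet of class ON, so the brackets are left for the later neutral rules.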



--Ken






Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

Ilya,

I appreciate your taking the time to take apart Philippe's message. That 
aspect of it was not obvious to me.


A./

PS: more comments below

On 4/21/2014 4:41 PM, Ilya Zakharevich wrote:

On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:

On 4/21/2014 1:54 PM, Philippe Verdy wrote:

My intent was not to demonstrate a bug in the algorithm, I have
not even claimed that, but to make sure that (less common) usages
of paired brackets that do not obey to a pure hierarchy (because
these notations use different type of brackets, they are not
ambiguous) but still preserve their left vs. right (or open vs.
close) semantic.

OK, so this has nothing to do with unclear text.

Asmus, I cannot agree with this.  I think Philippe’s message is on topic.

   [Below, I completely ignore BIDI part of the specification, and
concentrate ONLY on the parens match.  I do not understand why this
question is interlaced with BIDI determination; I trust that it is.]

It really isn't.

The result of detecting pairs allows one to improve on assigning
directionality to the members of the pair, so that they would match
(as expected).

This works only for a (hopefully common) subset of all possible uses.

Like the overall bidi algorithm (UBA) the paired bracket algorithm (PBA)
is intended as a heuristic that frees the author from having to
explicitly declare directionality for every bit of text, by providing a
default directionality that should work with most text. Exceptional
cases then, and ideally only those, would need overrides and similar
mechanisms.


I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli
and you show to the problem of “parentheses match” (and I suspect this
because THAT is my feeling ;-).  You give two (IMO, informal) interpretations
of what the algorithm-based description says.  These two interpretations
are obviously non-compatible (or at least not necessarily clearly stated).

Eli and I both believe that a non-algorithmic definition should be
possible, and that it is preferred to the current algorithmic
definition. Not least, because with the algorithmic definition, it is
not possible for anyone, by inspection, to be sure that they understand
what the outcome would be. This is unacceptable, because authors of text
(and not only implementers of the PBA) need to be able to predict where
the heuristic fails and the text needs additional markup.

This is not a trivial point - not everybody creates text at an editor
where they can observe the results immediately and take corrective
actions. Text is also edited in environments that do not do bidi
processing (e.g. certain kinds of source format editing) or created as
result of program action. Knowing when to insert (and when not to
insert) bidi controls under program action would benefit from a
definition that can be read independently of the implementation of the
PBA.
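
For the "program action" case, the mechanical part of inserting controls is easy; a minimal sketch follows, using the isolate characters introduced in Unicode 6.3. The helper name isolate and its direction parameter are hypothetical. What the sketch deliberately does not answer is the harder question raised above, namely how a program decides that a given fragment needs the controls at all.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(fragment, direction=None):
    """Wrap `fragment` in an isolate initiator and a matching PDI.

    direction is "ltr", "rtl", or None; None uses FSI, which lets the
    first strong character of the fragment decide its direction.
    """
    opener = {"ltr": LRI, "rtl": RLI, None: FSI}[direction]
    return opener + fragment + PDI


word = "\u0645\u062b\u0627\u0644"   # an Arabic word standing in for "example"
message = "This is an [" + isolate(word) + "] for demonstration only."
print([hex(ord(c)) for c in isolate("x", "rtl")])
```

Because brackets pair only within one isolating run sequence, a bracket inside the isolated fragment can no longer pair with one outside it, which is one predictable consequence an author or generating program could rely on.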


As Eli said it: “bracket pair … a concept as easy and widely known/used as
this would need such an obscure definition ”.  Just for background: the first
theorem on the “Applied Algebra” class taught by Yu.I.Manin was about
parentheses match (it stated that the proper match is unique as far as it
exists).  This statement is a (tiny) mess to prove, but at least it should
look very plausible to unwashed masses.  (One corollary is that “the
earliest possible one” from your interpretation is not actually needed.)


( a [ b ) c ] ?

The PBA matches the () but not the [].

Some statement about earliest is needed, to select between () and [],
but my language contains a mistake.


The problems appear when one wants to allow non-matching parentheses as
well as matched pairs.  [If one fixes Eli’s description so that “a pair”
and “matched” are complete synonyms, then] what Eli conveys is that
all non-matching parentheses MUST appear “on top level” only.  This is
workable (meaning the match is still unique).



Eli's definition was:

  A bracket pair is a pair of an opening paired bracket and a closing
  paired bracket characters within the same isolating run sequence,
  such that the Bidi_Paired_Bracket property value of the former
  character or its canonical equivalent equals the latter character or
  its canonical equivalent, and all the opening and closing bracket
  characters in between these two are balanced.


Given

( a [ b ) c ] ?

his definition contains no bracket pair, but the example in UAX#9 says that
the () should form a pair.

The purpose of providing my wording was to do precisely the comparison you
have been attempting here, so we end up with language that is an actual (and
not merely an attempted) restatement of the algorithmic definition.


Your approach gives a circular definition: to define which paren chars match
one must know which ones DO NOT match, and the recursion is not terminated.
This is exactly what Philippe’s example shows.
Here's the text I supplied, with numbers added for discussion. It 
definitely 

Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Asmus Freytag

On 4/21/2014 5:44 PM, Whistler, Ken wrote:


 So one may ask: what will be the result of the CURRENT UNICODE parsing
 applied to Philippe’s example?



   This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with
a shorter example that contains the relevant units: a [«] b [»] c


I find it telling that this dispute can only be settled by showing
trace output - and not, as is normal, by looking at the wording
of the definition.

Really makes Eli's and my point that the cop-out of using an algorithm
to define the matching results in it being unpredictable to anyone
not running sample text through an implementation.


Because of the way the stack processing is defined, the first bracket pair
is [«] and the second bracket pair is [»]. The algorithm does not push down
potential matches while seeking a largest outer pair to match.
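
For readers following along without the reference code, the stack processing
described above can be sketched roughly as below. This is a minimal Python
sketch, not the Unicode bidi reference implementation: the PAIRS table is a
three-entry stand-in for the Bidi_Paired_Bracket data, the name bracket_pairs
is made up for illustration, and canonical equivalents and the stack-depth
limit of BD16 are left out. Positions are 1-based and skip spaces, as in the
examples in this thread; « and » are not in the table here and so count as
ordinary characters.

```python
# Sketch of BD16-style stack pairing; NOT the Unicode reference code.
# Assumption: PAIRS is a tiny stand-in for the Bidi_Paired_Bracket data.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def bracket_pairs(text):
    """Return sorted (open_pos, close_pos) pairs; 1-based, spaces not counted."""
    stack = []   # entries: (expected closing bracket, position of the opener)
    found = []
    pos = 0
    for ch in text:
        if ch == " ":
            continue
        pos += 1
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for an opener expecting ch.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    found.append((stack[i][1], pos))
                    del stack[i:]   # drop the match and everything above it
                    break
            # No match found: the closer is ignored, the stack is unchanged.
    return sorted(found)


print(bracket_pairs("a [«] b [»] c"))
print(bracket_pairs("( a [ b ) c ]"))
```

Under these assumptions the first call reports the pairs 2-4 and 6-8, and the
second reports only 1-5: the ) finds its ( below the [ on the stack, and the
[ is discarded along with it, so the later ] has nothing left to match.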



Rather than hiding this in the stack processing it would be possible
to express this approach in non-algorithmic language - as you have done
here. This is something that should be done.

A./


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Ilya Zakharevich
On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote:
 Here's the text I supplied, with numbers added for discussion. It
 definitely needs some editing, but the point of the exercise would be
 to see what:
 
 1.  A bracket pair is a pair of characters consisting of an opening
     paired bracket and a closing paired bracket such that the
     Bidi_Paired_Bracket property value of the former equals the latter,
     subject to the following constraints.
 
     a - both characters of a pair occur in the same isolating run sequence
     b - the closing character of a pair follows the opening character
     c - any bracket character can belong at most to one pair, the
         earliest possible one
     d - any bracket character not part of a pair is treated like an
         ordinary character
     e - pairs may nest properly, but their spans may not overlap otherwise
 
 2.  Bracket characters with canonical decompositions are supposed to be
     treated as if they had been normalized, to allow normalized and
     non-normalized text to give the same result.
 
 
 c) needs rewording, because it is not correct
 
 The BD16 examples show
 
   a ( b ) c ) d   2-4
   a ( b ( c ) d   4-6
 
 From that, it follows that it's not the earliest but the one with the
 smallest span.

Sorry, I do not see any definition here.  Just a collection of words
which looks like a definition, but only locally…

And I think I can even invent an example which I cannot parse using
your definition:

  1(  2[  3(  4]  5)  6)

Is looking-at-1 forcing match of 3-and-5?  Or what?
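
To make the ambiguity concrete, here is a small Python sketch that checks a
candidate set of pairs against the quoted constraints. It is only an
illustration under stated assumptions: satisfies_wording and MATCH are
made-up names, all brackets are assumed to sit in one isolating run sequence
(so constraint a holds trivially), and the "earliest possible one" clause of
c is deliberately not modelled, since its meaning is exactly what is unclear.
Positions are the labels 1 to 6 from the example above.

```python
# Hypothetical checker for constraints (b), (e) and the "at most one pair"
# half of (c) in the quoted wording. Not taken from UAX#9 or any library.
MATCH = {"(": ")", "[": "]"}


def satisfies_wording(brackets, pairs):
    """brackets: dict position -> bracket char; pairs: list of (open, close)."""
    used = [p for pair in pairs for p in pair]
    if len(used) != len(set(used)):
        return False            # some bracket belongs to more than one pair
    for o, c in pairs:
        if o >= c or MATCH.get(brackets[o]) != brackets[c]:
            return False        # closer must follow an opener of its own kind
    for o1, c1 in pairs:
        for o2, c2 in pairs:
            if o1 < o2 < c1 < c2:
                return False    # spans overlap without nesting
    return True


example = {1: "(", 2: "[", 3: "(", 4: "]", 5: ")", 6: ")"}
candidates = [[(1, 6), (3, 5)], [(1, 5), (2, 4)], [(1, 6), (2, 4)]]
for pairs in candidates:
    print(pairs, satisfies_wording(example, pairs))
```

All three candidate pairings pass this check, so without a precise reading of
"earliest possible one" the wording alone does not pick one of them.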

Thanks,
Ilya