Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/20/2014 6:54 PM, James Clark wrote:

> On Mon, Apr 21, 2014 at 2:58 AM, Asmus Freytag asm...@ix.netcom.com wrote:
>
>> On 4/20/2014 3:24 AM, Eli Zaretskii wrote:
>>
>>> Would someone please help understand the following subtleties and obscure language in the UBA document found at http://www.unicode.org/reports/tr9/? Thanks in advance.
>>>
>>> 3. Paragraph 3.3.2 says, under "Non-formatting characters":
>>>
>>>    X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, FSI, and PDI:
>>>    . Set the current character’s embedding level to the embedding level of the last entry on the directional status stack.
>>>    [...]
>>>    Note that the current embedding level is not changed by this rule.
>>>
>>> What does this last sentence mean by "the current embedding level"? The first bullet of X6 mandates that the current character’s embedding level _is_ changed by this rule, so what other "current embedding level" is alluded to here?
>>
>> I'm punting on that one - can someone else answer this?
>
> I assume "current embedding level" here meant the embedding level of the last entry on the directional status stack. (This is a natural slip to make if you think in terms of an optimized implementation that stores each component of the top of the directional status stack in a variable, as suggested in 3.3.2.)
>
> James

In general, I heartily dislike specifications that just narrate a particular implementation...

A./

___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: The application of localized read-out labels
The text of the first post in this thread was not recorded in the archive of the Unicode Public Email List. Maybe because there was an attachment to the post?

This post is so as to include a transcript of the text of that post in the archive of the Unicode Public Email List.

William Overington

21 April 2014

Transcript:

> William, the UTC is not in the business of creating file formats for localization data.
>
> Peter

Thank you for replying.

Feeling that a format for the particular application is important, I have now produced a format myself and published it.

Please find a copy attached. Posting the publication as an attachment here will also hopefully place it in the mailing list archives for long-term availability.

I have also sent a copy to the British Library for Legal Deposit.

The publication has the following title.

The format of the readouts.dat file suggested for possible use in the application of localized read-out labels

The file has the following file name.

The_format_of_the_readouts.dat_file_suggested_for_possible_use_in_the_application_of_localized_read-out_labels.pdf

William Overington

16 April 2014
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
> Date: Sun, 20 Apr 2014 12:58:23 -0700
> From: Asmus Freytag asm...@ix.netcom.com
>
> On 4/20/2014 3:24 AM, Eli Zaretskii wrote:
>> Would someone please help understand the following subtleties and obscure language in the UBA document found at http://www.unicode.org/reports/tr9/? Thanks in advance.
>
> Eli, I've tried to give you some explanations

Thanks!

> in some places, I concur with you that the wording could be improved and that such improved wording should be proposed to the UTC (or its editorial committee) for incorporation into a future update.

How do we do that?

> For details, see below.
>
>> 1. In paragraph 3.1.2, near its very end, we have this sentence (with my emphasis):
>>
>>    As rule X10 will specify, an isolating run sequence is the unit to which the rules following _it_ are applied, and the last character of one level run in the sequence is considered to be immediately followed by the first character of the next level run in the sequence during this phase of the algorithm.
>>
>> What does it mean here by "the rules following it"? Following what?
>
> That looks like a bad referent, but from context, this "it" must be X10

Ah, so simply saying "the following rules" or "rules following X10" would be enough.

> Bullet 1 could be changed to
>
>    . Create a stack for elements each consisting of a *code point* (Bidi_Paired_Bracket property value) and a text position. Initialize it to empty.
>
> to make things more clear. And a slight wording change might help the reader with item 2:
>
>    2. Compare the *code point for the* closing paired bracket being inspected or its canonical equivalent to the *code point* (Bidi_Paired_Bracket property value) in the current stack element.
>
> And, to continue
>
>    3. If the values match, meaning *the character being inspected and the character at the text position in the stack* form a bracket pair, then [...]

Right, this makes the description a whole lot more clear.

>    Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run sequences. For each sequence, [completely] apply each rule in the order in which they appear below. The order that one isolating run sequence is treated relative to another does not matter.
>
> I believe the above restatement expresses the same thing in fewer words.

It does, thanks.

>> 5. Rule N0 says:
>>
>>    . For each bracket-pair element in the list of pairs of text positions
>>      a. Inspect the bidirectional types of the characters enclosed within the bracket pair.
>>      b. If any strong type (either L or R) matching the embedding direction is found, set the type for both brackets in the pair to match the embedding direction.
>>
>> First, what is meant here by "strong type [...] matching the embedding direction"? Does the match here consider only the odd/even value of the current embedding level vs R/L type, in the sense that odd levels "match" R and even levels "match" L? Or does this mean some other kind of matching? Table 3, which is the only place that seems to refer to the issue, is not entirely clear, either:
>>
>>    e   The text ordering type (L or R) that matches the embedding level direction (even or odd).
>>
>> Again, the sense of the "match" here is not clear.
>
> even/odd --- R/L match, might be made more explicit

I agree this should be made more explicit, as this is a somewhat subtle issue that might trip the reader.

>> Next, what is meant here by "the characters enclosed within the bracket pair"? If the bracket pair encloses another bracket pair, which is inner to it, do the characters inside the inner pair count for the purposes of resolving the level of the outer pair?
>
> They do, so there's no need to change the text.

It might be a good idea to say that explicitly, e.g. as a note, or at least provide another example where the strong characters are only inside an inner bracket pair, which will send the same message to the reader.

Thanks again for the clarifications.
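The parity reading of "matching the embedding direction" confirmed above, together with the answer that characters inside an inner bracket pair do count for the outer pair, can be sketched in Python. This is only an illustration under stated assumptions, not text from UAX#9: the function names and the sample list of bidi types are hypothetical, and the sketch covers only step b of N0 as quoted in this thread, leaving out the rest of the rule.

```python
def embedding_direction(level):
    """Even embedding levels correspond to L, odd levels to R."""
    return "L" if level % 2 == 0 else "R"


def n0_step_b_applies(types, open_pos, close_pos, level):
    """Return True if any character strictly between the two brackets has
    the strong type that matches the embedding direction.

    The scan covers everything between the brackets, so characters inside
    an inner (nested) bracket pair count for the outer pair as well.
    """
    e = embedding_direction(level)
    return any(t == e for t in types[open_pos + 1:close_pos])


# Hypothetical bidi types for a text shaped like "( [ X ] )", where X is a
# strong R character that sits only inside the inner pair.
types = ["ON", "ON", "R", "ON", "ON"]
outer = (0, 4)
inner = (1, 3)

print(n0_step_b_applies(types, *outer, level=1))  # R matches an odd level
print(n0_step_b_applies(types, *outer, level=0))  # R does not match an even level
```

With these hypothetical types, the outer pair at an odd level sees the R character even though it is enclosed by the inner pair, which is the case the thread suggests adding as a note or example.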
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
> From: James Clark j...@jclark.com
> Date: Mon, 21 Apr 2014 08:54:34 +0700
> Cc: Eli Zaretskii e...@gnu.org, unicode@unicode.org, Kenneth Whistler k...@unicode.org
>
>>>    X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, FSI, and PDI:
>>>    . Set the current character’s embedding level to the embedding level of the last entry on the directional status stack.
>>>    [...]
>>>    Note that the current embedding level is not changed by this rule.
>>>
>>> What does this last sentence mean by "the current embedding level"? The first bullet of X6 mandates that the current character’s embedding level _is_ changed by this rule, so what other "current embedding level" is alluded to here?
>>
>> I'm punting on that one - can someone else answer this?
>
> I assume "current embedding level" here meant the embedding level of the last entry on the directional status stack.

Thanks, that was my guess as well, but I wanted to be sure.

IMO, the unfortunate wording here is that the same phrase ("current embedding level") was used just before the problematic sentence to mean something completely different. Having identical phrases close to one another always tricks readers into thinking they describe the same thing; when they don't, confusion sets in. So I would suggest rewording one or both of these references to the "current embedding level".

Btw, why is that note, about the current embedding level not being changed by X6, important? Why would someone mistakenly think the contrary?
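The two meanings of "current embedding level" that collide in X6 can be kept apart in a small Python sketch. It is an illustration only, under stated assumptions: `StatusEntry` and `apply_x6` are hypothetical names, a stack entry is reduced to its embedding level, and only the first bullet of X6 as quoted in this thread is modelled (the part elided with "[...]" is left out).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusEntry:
    """Simplified directional status stack entry (embedding level only)."""
    embedding_level: int


def apply_x6(stack, char_levels, index):
    """First bullet of X6: the character at `index` receives the embedding
    level of the last entry on the stack. The stack is only read here,
    never pushed, popped, or modified."""
    char_levels[index] = stack[-1].embedding_level


stack = [StatusEntry(0), StatusEntry(1)]  # hypothetical state inside one embedding
snapshot = list(stack)
char_levels = [None, None, None]          # three ordinary characters

for i in range(len(char_levels)):
    apply_x6(stack, char_levels, i)

print(char_levels)        # each character's own level was set
print(stack == snapshot)  # the level held on top of the stack did not change
```

Read this way, the note in X6 says that the level stored in the last stack entry stays as it was, while each character's own embedding level is what the rule assigns.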
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
> Date: Sun, 20 Apr 2014 23:03:20 -0700
> From: Asmus Freytag asm...@ix.netcom.com
> CC: Eli Zaretskii e...@gnu.org, unicode@unicode.org, Kenneth Whistler k...@unicode.org
>
> In general, I heartily dislike specifications that just narrate a particular implementation...

I cannot agree more. In fact, my main gripe about the UBA additions in 6.3 is that some of their crucial parts are not formally defined, except by an algorithm that narrates a specific implementation. The two worst examples of that are the definitions of the isolating run sequence and of the bracket pair. I didn't ask about those because I managed to figure them out, but it took many readings of the corresponding parts of the document. It is IMO a pity that the two main features added in 6.3 are based on definitions that are so hard to penetrate, and which actually all but force you to use the specific implementation described by the document.

My working definition that replaces BD13 is this:

   An isolating run sequence is the maximal sequence of level runs of the same embedding level that can be obtained by removing all the characters between an isolate initiator and its matching PDI (or paragraph end, if there is no matching PDI) within those level runs.

As for bracket pair (BD16), I'm really amazed that a concept as easy and widely known/used as this would need such an obscure definition that must have an algorithm as its necessary part. How about this instead:

   A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, and all the opening and closing bracket characters in between these two are balanced.

Then we could use the algorithm to explain what it means for brackets to be balanced (for those readers who somehow don't already know that).

Again, thanks for clarifying these subtle issues. I can now proceed to updating the Emacs bidirectional display with the changes in Unicode 6.3.
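The stack procedure that the thread describes for BD16 (push an opening bracket's counterpart together with its text position, then search the stack when a closing bracket is seen) can be sketched in Python to show what "balanced" means operationally. This is a minimal sketch under stated assumptions, not the reference implementation: `PAIRED` is a small hypothetical stand-in for the Bidi_Paired_Bracket property, `bracket_pairs` is a made-up name, the input is assumed to be a single isolating run sequence, and details of the real definition such as any limit on stack depth are left out. Canonical equivalents are handled by normalizing each character with `unicodedata.normalize`.

```python
import unicodedata

# Hypothetical stand-in for the Bidi_Paired_Bracket property: opening -> closing.
PAIRED = {"(": ")", "[": "]", "{": "}", "\u3008": "\u3009"}
CLOSERS = set(PAIRED.values())


def canon(ch):
    """Map a character to its canonical decomposition (e.g. U+2329 -> U+3008)."""
    return unicodedata.normalize("NFD", ch)


def bracket_pairs(text):
    """Return (open_pos, close_pos) pairs, sorted by opening position."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, raw in enumerate(text):
        ch = canon(raw)
        if ch in PAIRED:
            stack.append((PAIRED[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for the nearest matching opener.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    # Openers above the match can no longer be paired.
                    del stack[depth:]
                    break
            # A closer with no matching opener is ignored.
    return sorted(pairs)


print(bracket_pairs("a(b[c]d)e"))  # properly nested pairs
print(bracket_pairs("([)]"))       # overlapping spans: only one pair survives
```

In the second call the spans would overlap, so the sketch keeps only the pair formed first and treats the remaining brackets as ordinary characters.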
Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
Imagine please if museum and art gallery websites each were to have an international webpage in its on-line shop.

If there were on the webpage colourful symbols, one each for Surname, Forename, Card number and so on, and the end-user could display text in his or her own language by displaying the appropriate read-out label next to each symbol, thus localizing the web page, then that could be very helpful.

I have produced designs for nine symbols. There are two glyphs for each symbol, one colourful and one monochrome.

The symbols are octagonal, using not quite a regular octagon. In the monochrome glyphs there is a border around the edge, yet in the colourful glyphs there is no border. The colourful glyphs are displayed in blue and orange, the idea being that the effect to the viewer is of blue upon an orange background.

The designs are influenced by heraldry to some extent. This is because I consider Surname to be the most important, so I used a heraldic chief. Then for Forename I used a pale, as Forename is different from Surname yet accompanies Surname to form a name. A bar is used for Address. Name as on card may be different from Forename concatenated with a space and Surname, due to use of Mr Mrs Miss Ms etc and initials, hence the reason for the design not being a union of a chief and one pale. Two bars are used for Card number. Card start date and Card expiry date seemed like brackets, so that inspired the designs. Card security code is just a design so as to be different from the other designs yet not use any diagonal shapes. Delivery address is included to allow for the possibility of sending a gift directly to someone who lives at another address.

I am hoping to attach images showing the designs to other posts in this thread.

William Overington

21 April 2014
Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
> I am hoping to attach images showing the designs to other posts in this thread.

Please find attached an image of the designs of the colourful glyphs.

William Overington

21 April 2014
Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
> I am hoping to attach images showing the designs to other posts in this thread.

Please find attached an image of the designs of the monochrome glyphs.

William Overington

21 April 2014
Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
I am sorry, but this doesn’t look like internationalization. Rather it seems like another attempt by the British to force their culture upon the rest of the world. The richness of world-wide naming conventions for people is simply ignored, Putin Vladimir Vladimirovič won’t be able to use his full name (let alone in the order required), and this will lead to World War III.

William J. G. Overington, please admit that others know so much more about internationalization than you do, and stop these imperialist off-topic activities.

Charlie Ruland ☘
Re: The application of localized read-out labels
Doug Ewell d...@ewellic.org wrote:

> It's labeled prominently as a "thought experiment," which means there is no expectation that anyone will implement the format or software which reads it, only think about what would happen if it were implemented.

Well, it states as follows.

quote

This is a thought experiment at present.

Automated localization would be by having a file readouts.dat available. In the thought experiment the file is a UTF-16 text file, such as can be saved from the WordPad program by selecting saving as a Unicode Text Document.

end quote

My reason for putting "This is a thought experiment at present." was that the format has not been tested by me in practical application and is only theoretically based at the present time, yet I am hoping that the situation may change and that the format might become implemented in practice by someone and become widely used; or maybe that the publication of the format will act as a catalyst to someone publishing a format that is accepted, so that the end result of a standardized format is achieved.

> I actually read through the document, 18-point body type and all, before noticing this key point.

Thank you for reading through the document.

http://en.wikipedia.org/wiki/Thought_experiment
http://en.wikipedia.org/wiki/John_Searle
http://en.wikipedia.org/wiki/Philosophy_of_language
http://en.wikipedia.org/wiki/Backcasting

William Overington

21 April 2014
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 1:33 AM, Eli Zaretskii wrote:

> As for bracket pair (BD16), I'm really amazed that a concept as easy and widely known/used as this would need such an obscure definition that must have an algorithm as its necessary part. How about this instead:
>
>    A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, and all the opening and closing bracket characters in between these two are balanced.
>
> Then we could use the algorithm to explain what it means for brackets to be balanced (for those readers who somehow don't already know that).

FWIW here is the restatement of BD16 that I used for myself (and that I put into the source comments of the sample Java implementation):

// The following is a restatement of BD 16 using non-algorithmic language.
//
// A bracket pair is a pair of characters consisting of an opening
// paired bracket and a closing paired bracket such that the
// Bidi_Paired_Bracket property value of the former equals the latter,
// subject to the following constraints.
// - both characters of a pair occur in the same isolating run sequence
// - the closing character of a pair follows the opening character
// - any bracket character can belong at most to one pair, the earliest possible one
// - any bracket character not part of a pair is treated like an ordinary character
// - pairs may nest properly, but their spans may not overlap otherwise
// Bracket characters with canonical decompositions are supposed to be treated
// as if they had been normalized, to allow normalized and non-normalized text
// to give the same result.

Your language is more concise, but you may compare for differences.

A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 12:55 AM, Eli Zaretskii wrote:

>> in some places, I concur with you that the wording could be improved and that such improved wording should be proposed to the UTC (or its editorial committee) for incorporation into a future update.
>
> How do we do that?

You file a problem report using the contact form.

A./
RE: The application of localized read-out labels
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote:

> My reason for putting "This is a thought experiment at present." was that the format has not been tested by me in practical application and is only theoretically based at the present time,

It's not, of course. It's specified in enough detail that conformant files could be created, and consumed by an application.

> yet I am hoping that the situation may change and that the format might become implemented in practice by someone and become widely used; or maybe that the publication of the format will act as a catalyst to someone publishing a format that is accepted, so that the end result of a standardized format is achieved.

It could be argued that this is at least part of the hypothesis for the experiment. The expected result, not quite stated, is that the format will in fact be used, or will in fact stimulate the creation of a similar format.

Because, of course, if there is no hypothesis, then this is neither a Gedankenexperiment nor any other kind of experiment, just an exercise in creating a file format, which is engineering.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
There are some cases where these rules will not be clear enough. Consider the following, where overlaps do occur but directionality still matters:

   This is an [«] example [»] for demonstration only.

There are two possible parsings if you consider only a hierarchical layout in which overlaps are not allowed:

1. "This is an [...] for demonstration only.", embedding «...», itself embedding "] example [" (here the square brackets match externally)
2. "This is an [...] example [...] for demonstration only.", embedding two spans for « and » separately (they also pair externally)

Now suppose that the term "example" is translated into Arabic. It is not very clear how the UBA will work while preserving the correct pairing direction of the 3 pairs (one pair is «...», and there are two pairs for [...]). Still, all 3 pairs have a coherent direction that Bidi-reordering or glyph mirroring should not mix.

I see only one solution for tagging such text so that it behaves correctly: either the two pairs of square brackets or the pair of guillemets should be encoded with isolated Bidi overrides. But then what happens to the ordering of the surrounding text? There should be a stable way to encode this case so that the UBA still preserves the correct reading order, the expected semantics and orientation of the pairs, and the fact that the guillemets do not really embed the brackets but rather the translated word "example".

There are several ways to use Bidi-override or Bidi-embedding controls; I don't know which one is better, but all of them should still work with the UBA. I just hope that the complex case of the brackets in the middle (]...[) can be handled gracefully.

In my opinion this would require embedding and isolating each square bracket, so that they no longer match each other (externally they are treated as symbols with transparent direction). But how do we ensure that the sequence [«] still occurs before the RTL (Arabic) word for "example", followed by the sequence [»], and that the rest of the sentence ("for demonstration only") still occurs in the correct order? We would also have to embed or isolate the word "example", or the whole sequence "[«] example [»]", so that the main sentence "This is an ... for demonstration only" still has a coherent reading direction.

Such cases are not so exceptional, because they are used to represent two distinct parallel readings of the same text, where a reading based on one kind of pair simply treats the other pairs as transparent and ignores them. It would be an interesting case to investigate for validating UBA algorithms in a conformance test case.
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
Philippe,

I fail to understand how your post contributes to the topic. The issue was unclear wording of the specification, not deficiencies in the UBA or the PBA in general.

Let's keep this discussion limited to issues of wording for the *existing* specification. Feel free to start a new discussion about something else under a new subject.

A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
From: Asmus Freytag asmusf at ix dot netcom dot com wrote: In general, I heartily dislike specifications that just narrate a particular implementation... I agree completely. I see this with CLDR as well; there is a more or less implicit assumption that I will be using ICU to implement whatever is being described. I don't care how robust and well-tested a wheel is; as a developer, I should be able to use the specification to reinvent it if I like. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
It is on topic because the proposed description attempts to explain how paired brackets should match and how this will then affect the rendering in bidirectional contexts. This is exactly the kind of thing that is difficult, because the proposed description assumes that paired brackets are organized hierarchically.

Quote: both characters of a pair occur in the same isolating run sequence (does not work here: sequences are not fully isolated)

Quote: any bracket character can belong at most to one pair, the earliest possible one (does not work here, this is not the earliest possible)

2014-04-21 19:48 GMT+02:00 Asmus Freytag asm...@ix.netcom.com:

Philippe, I fail to understand how your post contributes to the topic. The issue was unclear wording of the specification, not deficiencies in the UBA or the PBA in general. Let's keep this discussion limited to issues of wording for the *existing* specification. Feel free to start a new discussion about something else under a new subject. A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 11:23 AM, Philippe Verdy wrote:

It is on topic because the proposed description attempts to explain how paired brackets should match and how this will then affect the rendering in bidirectional contexts. This is exactly the kind of thing that is difficult, because the proposed description assumes that paired brackets are organized hierarchically.

Quote: both characters of a pair occur in the same isolating run sequence (does not work here: sequences are not fully isolated)

Quote: any bracket character can belong at most to one pair, the earliest possible one (does not work here, this is not the earliest possible)

That's OK, it's a limitation of the algorithm, not the description. In other words, the algorithm can help set a better directionality of paired (!) brackets, and those are the ones that nest properly. What Eli brought to our attention is that the description of this algorithm is suboptimal; whether the algorithm could or should be improved is a separate matter.

A./

PS: I think it is unlikely that the UTC will be interested in substantial changes to the algorithm, but it should be interested in allowing the specification to be less dependent on the sample implementation.
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 11:14 AM, Doug Ewell wrote: From: Asmus Freytag asmusf at ix dot netcom dot com wrote: In general, I heartily dislike specifications that just narrate a particular implementation... I agree completely. I see this with CLDR as well; there is a more or less implicit assumption that I will be using ICU to implement whatever is being described. I don't care how robust and well-tested a wheel is; as a developer, I should be able to use the specification to reinvent it if I like. Well put. Also, by simply narrating an implementation the UTC deprives the reader of a clear higher-level description of the concept and the intended result. The original part of the bidi specification does a much better job in that regard. It's time to revisit the language for the additions and bring them up to snuff. A./
Re: Glyphs designed for the internationalization of the web-based on-line shops of museums and art galleries
On 4/21/2014 2:47 AM, William_J_G Overington wrote: I am hoping to attach images showing the designs to other posts in this thread. Please find attached an image of the designs of the colourful glyphs. The language I would use for my reaction to this, is just too colorful to reproduce here :) A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
My intent was not to demonstrate a bug in the algorithm (I have not even claimed that), but to make sure that (less common) usages of paired brackets that do not obey a pure hierarchy (because these notations use different types of brackets, they are not ambiguous) still preserve their left vs. right (or open vs. close) semantics.

However, due to the way the algorithm is currently designed, distinct pairs of brackets still need to be nested hierarchically, and this is not always the case. To allow such usages in bidirectional texts (they do not cause big problems in unidirectional texts, i.e. texts using characters with the same strong direction, or characters with neutral and weak directions), we will necessarily need to use bidi controls. However, these controls cannot be so strong that they also break the necessary embedding levels intended for each type of bracket, even when the brackets do not match in pairs under the algorithm. As they will then be treated in isolation (unpaired for the hierarchic algorithm), they should still retain their intended RTL or LTR semantics (notably their relative placement with respect to the things they surround in non-nested ways), without being affected by inconsistent mirroring.

The UBA test cases currently do not cover such uncommon cases, only cases with single isolated/unpaired brackets. I want to make sure that it will remain possible to write notations without pure hierarchical nesting (for now they still don't work at all; the result is already unpredictable, even with bidi controls).

Also, I'm not limited only to punctuation pairs but interested in any kind of textual pairs, including XML element tags, quotation marks delimiting strings in programming languages, begin/end keywords in Pascal or Lua programs, or descriptive expressions in human languages (e.g. [start singing] ... [end of song]), even if they are not concerned by punctuation mirroring.

You could see these non-nested usages as interlinear or unstructured, but in fact they do have a structure which should be preserved and not mixed randomly by an algorithm unable to decipher their meaning, unless there's some markup or controls saying how to treat these items. We should not even have to use specific parsers for specific notations (like XML); this is a more generic abstract problem for texts whose content and semantics are not nested in a pure hierarchical tree but in subtrees with parallel branches, and whose rendering will then need to preserve these structures.

My initial message contained a very minimal example of what is needed. I'd like this sample case to be clearly supported in some way without ambiguity. It will be important for things like songs, poetry, legal texts containing citations, discussions about another text, threaded discussions, annotating documents created collaboratively, versioning and showing diffs, and more exceptionally for interlinear notations (including the inclusion of translator notes, or notes started on one page and continued elsewhere, possibly on another page, and containing their own sets of bracket pairs)... In all these usages, the UBA (and the inferred effect on mirroring) could cause havoc.

And of course I do not want to define a new technical syntax using references and identifiers, as in XML or JSON, to make these structures explicit. For the UBA it will be enough if it preserves the intended direction and mirroring type without having to make explicit which bracket pairs with which other one. It should just preserve the start/end or open/close semantics, leaving the rest to an upper-layer syntax if needed for more ambiguous cases; a renderer will use any trick it wants to exhibit this supplementary structure, such as font styles, colors, decorations, or custom 2D layouts, as provided by a rich text format, which is out of scope of Unicode and the UBA.

Only hierarchical structure is supported in XML or JSON, but SGML (and HTML) already shows that non-hierarchical structures are also possible and are effectively used in their supported content models.
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 1:54 PM, Philippe Verdy wrote: My intent was not to demonstrate a bug in the algorithm (I have not even claimed that), but to make sure that (less common) usages of paired brackets that do not obey a pure hierarchy (because these notations use different types of brackets, they are not ambiguous) still preserve their left vs. right (or open vs. close) semantics. OK, so this has nothing to do with unclear text. A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:

On 4/21/2014 1:54 PM, Philippe Verdy wrote:

My intent was not to demonstrate a bug in the algorithm (I have not even claimed that), but to make sure that (less common) usages of paired brackets that do not obey a pure hierarchy (because these notations use different types of brackets, they are not ambiguous) still preserve their left vs. right (or open vs. close) semantics.

OK, so this has nothing to do with unclear text.

Asmus, I cannot agree with this. I think Philippe’s message is on topic.

[Below, I completely ignore the BIDI part of the specification, and concentrate ONLY on the parens match. I do not understand why this question is interlaced with BIDI determination; I trust that it is.]

I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli and you show to the problem of “parentheses match” (and I suspect this because THAT is my feeling ;-). You give two (IMO, informal) interpretations of what the algorithm-based description says. These two interpretations are obviously non-compatible (or at least not necessarily clearly stated). As Eli said it: “bracket pair … a concept as easy and widely known/used as this would need such an obscure definition”.

Just for background: the first theorem in the “Applied Algebra” class taught by Yu. I. Manin was about parentheses match (it stated that the proper match is unique as far as it exists). This statement is a (tiny) mess to prove, but at least it should look very plausible to unwashed masses. (One corollary is that “the earliest possible one” from your interpretation is not actually needed.)

The problems appear when one wants to allow non-matching parentheses as well as matched pairs. [If one fixes Eli’s description so that “a pair” and “matched” are complete synonyms, then] what Eli conveys is that all non-matching parentheses MUST appear “on top level” only. This is workable (meaning the match is still unique).

Your approach gives a circular definition: to define which paren chars match one must know which ones DO NOT match, and the recursion is not terminated. This is exactly what Philippe’s example shows.

My understanding is that what Unicode is trying to do is to collect the best practical ways to treat multi-Language texts (without knowing fine details about the languages actually used in the text). It may be that what is “well understood” today IS only the case where non-matched parens appear on top level only.

So one may ask: what will be the result of the CURRENT UNICODE parsing applied to Philippe’s example?

This is an [«] example [»] for demonstration only.

By Eli’s interpretation, it contains no matched parens. In one reading of your interpretation, the external-[] and guillemets would match, and internal-][ would be non-matching ones. If one could “show” that in the majority of cases that is what the writer’s intent was, THEN your interpretation would be “the best practical ways to treat multi-Language texts”, and it may be preferred to the current algorithmic description.

THIS is why I think the message was on topic. But this is all very shaky ground…

Yours, Ilya
RE: Unclear text in the UBA (UAX#9) of Unicode 6.3
Ilya noted:

[Below, I completely ignore BIDI part of the specification, and concentrate ONLY on the parens match. I do not understand why this question is interlaced with BIDI determination; I trust that it is.]

Actually, it is, because the bracket-matching is really only interesting in the cases where the boundaries of the isolating runs are in question, and there are some directional differences in the runs. The whole point of introducing the paired bracket complication was to deal with edge cases for that, but...

So one may ask: what will be the result of the CURRENT UNICODE parsing applied to Philippe’s example?

This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with a shorter example that contains the relevant units:

a [«] b [»] c

Turn up the trace output to see what rule N0 is actually doing, and you get the following. (Set your display wide enough to not wrap the output lines, for best interpretation.)

    Trace: Entering br_UBA_ResolvePairedBrackets [N0]
    Trace: br_PushBracketStack, bracket=005D, pos=2
    Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
    Trace: br_PeekBracketStack, bracket=005D, pos=2
    Appended pair: opening pos 2, closing pos 4
    Trace: br_PopBracketStack, #elements=1
    Matched bracket
    Trace: br_PushBracketStack, bracket=005D, pos=8
    Trace: br_PeekBracketStack, stack=00614808, top=00614810, tsptr=00614810
    Trace: br_PeekBracketStack, bracket=005D, pos=8
    Appended pair: opening pos 8, closing pos 10
    Trace: br_PopBracketStack, #elements=1
    Matched bracket
    Trace: Entering br_SortPairList
    Pair list: {2,4} {8,10}
    Append at end
    Trace: Exiting br_SortPairList
    Pair list: {2,4} {8,10}
    Debug: No strong direction between brackets
    Debug: No strong direction between brackets
    Current State: 14
    Text:       0061 0020 005B 00AB 005D 0020 0062 0020 005B 00BB 005D 0020 0063
    Bidi_Class: L    WS   ON   ON   ON   WS   L    WS   ON   ON   ON   WS   L
    Levels:     0    0    0    0    0    0    0    0    0    0    0    0    0
    Runs:       LL

Because of the way the stack processing is defined, the first bracket pair is [«] and the second bracket pair is [»]. The algorithm does not push down potential matches while seeking a largest outer pair to match.

One could, particularly if one is mathematically inclined, argue that that is not the right way to do the matching, but it *is* the way the algorithm is currently defined. And it is the way both of the bidi reference implementations, all of the BidiCharacterTest.txt data, the ICU implementation, the Microsoft implementation, and the Harfbuzz implementation are defined, to the best of my knowledge. Other implementations would have to be doing the same, or they would be failing the conformance tests in BidiCharacterTest.txt.

Note that for an all left-to-right run of text like this, with no isolating runs and no embeddings, the implications of rule N0 are trivial and non-interesting. The bracket matches don’t end up *doing* anything relevant to the text reordering for bidi in this example. But once you start mixing directions of text and adding embeddings and isolating runs, then things get complicated in non-trivial ways for the output.

--Ken
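
The stack behaviour described in this message can be sketched in a few lines of Python. This is only a minimal sketch of the pairing step, not the reference implementation: the names `BRACKET_PAIRS` and `bracket_pairs` are made up for illustration, the three-entry table is a stand-in for the Bidi_Paired_Bracket property data, and the sketch assumes a single isolating run sequence while ignoring canonical equivalents and the fixed stack size. Positions are 0-based offsets into the string, as in the trace.

```python
# Hypothetical, abbreviated stand-in for the Bidi_Paired_Bracket property;
# the real property data covers many more characters.
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(BRACKET_PAIRS.values())


def bracket_pairs(text):
    """Return sorted (open_pos, close_pos) pairs for one isolating run sequence."""
    stack = []   # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in BRACKET_PAIRS:
            stack.append((BRACKET_PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for the nearest
            # opener that expects this closing bracket.
            for depth in range(len(stack) - 1, -1, -1):
                expected, open_pos = stack[depth]
                if expected == ch:
                    pairs.append((open_pos, pos))
                    del stack[depth:]  # pop through the matched entry, inclusive
                    break
            # If nothing matched, the closer stays unpaired and the stack
            # is left untouched.
    return sorted(pairs)


print(bracket_pairs("a [«] b [»] c"))  # the shortened example from the trace
print(bracket_pairs("(a[b)c]"))        # overlapping brackets
```

For the shortened example this yields the two pairs at positions 2-4 and 8-10, matching the pair list in the trace. For the overlapping case only the round brackets are paired: the closing parenthesis finds its opener below the pending square bracket, and popping through that entry discards the square bracket as well.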
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
Ilya, I appreciate your taking the time to take apart Philippe's message. That aspect of it was not obvious to me.

A./

PS: more comments below

On 4/21/2014 4:41 PM, Ilya Zakharevich wrote:

On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:

On 4/21/2014 1:54 PM, Philippe Verdy wrote:

My intent was not to demonstrate a bug in the algorithm (I have not even claimed that), but to make sure that (less common) usages of paired brackets that do not obey a pure hierarchy (because these notations use different types of brackets, they are not ambiguous) still preserve their left vs. right (or open vs. close) semantics.

OK, so this has nothing to do with unclear text.

Asmus, I cannot agree with this. I think Philippe’s message is on topic. [Below, I completely ignore BIDI part of the specification, and concentrate ONLY on the parens match. I do not understand why this question is interlaced with BIDI determination; I trust that it is.]

It really isn't. The result of detecting pairs allows one to improve on assigning directionality to the members of the pair, so that they would match (as expected). This works only for a (hopefully common) subset of all possible uses. Like the overall bidi algorithm (UBA), the paired bracket algorithm (PBA) is intended as a heuristic that frees the author from having to explicitly declare directionality for every bit of text, by providing a default directionality that should work with most text. Exceptional cases then, and ideally only those, would need overrides and similar mechanisms.

I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli and you show to the problem of “parentheses match” (and I suspect this because THAT is my feeling ;-). You give two (IMO, informal) interpretations of what the algorithm-based description says. These two interpretations are obviously non-compatible (or at least not necessarily clearly stated).

Eli and I both believe that a non-algorithmic definition should be possible, and that it is preferred to the current algorithmic definition. Not least because, with the algorithmic definition, it is not possible for anyone, by inspection, to be sure that they understand what the outcome would be. This is unacceptable, because authors of text (and not only implementers of the PBA) need to be able to predict where the heuristic fails and the text needs additional markup.

This is not a trivial point: not everybody creates text at an editor where they can observe the results immediately and take corrective actions. Text is also edited in environments that do not do bidi processing (e.g. certain kinds of source format editing) or created as a result of program action. Knowing when to insert (and when not to insert) bidi controls under program action would benefit from a definition that can be read independently of the implementation of the PBA.

As Eli said it: “bracket pair … a concept as easy and widely known/used as this would need such an obscure definition”. Just for background: the first theorem on the “Applied Algebra” class taught by Yu.I.Manin was about parentheses match (it stated that the proper match is unique as far as it exists). This statement is a (tiny) mess to prove, but at least it should look very plausible to unwashed masses. (One corollary is that “the earliest possible one” from your interpretation is not actually needed.)

( a [ b ) c ] ?

The PBA matches the () but not the []. Some statement about earliest is needed, to select between () and [], but my language contains a mistake.

The problems appear when one wants to allow non-matching parentheses as well as matched pairs. [If one fixes Eli’s description so that “a pair” and “matched” are complete synonyms, then] what Eli conveys is that all non-matching parentheses MUST appear “on top level” only. This is workable (meaning the match is still unique).

Eli's definition was:

A bracket pair is a pair of an opening paired bracket and a closing paired bracket characters within the same isolating run sequence, such that the Bidi_Paired_Bracket property value of the former character or its canonical equivalent equals the latter character or its canonical equivalent, and all the opening and closing bracket characters in between these two are balanced.

Given

( a [ b ) c ] ?

his definition yields no bracket pair, but the example in UAX#9 says that the () should form a pair.

The purpose of providing my wording was to do precisely the comparison you have been attempting here, so we end up with language that is an actual (and not merely an attempted) restatement of the algorithmic definition.

Your approach gives a circular definition: to define which paren chars match one must know which ones DO NOT match, and the recursion is not terminated. This is exactly what Philippe’s example shows.

Here's the text I supplied, with numbers added for discussion. It definitely
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On 4/21/2014 5:44 PM, Whistler, Ken wrote:

So one may ask: what will be the result of the CURRENT UNICODE parsing applied to Philippe’s example? This is an [«] example [»] for demonstration only.

That is easily answered. Let's crank up the bidi reference code with a shorter example that contains the relevant units: a [«] b [»] c

I find it telling that this dispute can only be settled by showing trace output, and not, as is normal, by looking at the wording of the definition. It really makes Eli's and my point: the cop-out of using an algorithm to define the matching results in it being unpredictable to anyone not running sample text through an implementation.

Because of the way the stack processing is defined, the first bracket pair is [«] and the second bracket pair is [»]. The algorithm does not push down potential matches while seeking a largest outer pair to match.

Rather than hiding this in the stack processing, it would be possible to express this approach in non-algorithmic language, as you have done here. This is something that should be done.

A./
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote:

Here's the text I supplied, with numbers added for discussion. It definitely needs some editing, but the point of the exercise would be to see what:

1. A bracket pair is a pair of characters consisting of an opening paired bracket and a closing paired bracket such that the Bidi_Paired_Bracket property value of the former equals the latter, subject to the following constraints.
   a - both characters of a pair occur in the same isolating run sequence
   b - the closing character of a pair follows the opening character
   c - any bracket character can belong at most to one pair, the earliest possible one
   d - any bracket character not part of a pair is treated like an ordinary character
   e - pairs may nest properly, but their spans may not overlap otherwise
2. Bracket characters with canonical decompositions are supposed to be treated as if they had been normalized, to allow normalized and non-normalized text to give the same result.

c) needs rewording, because it is not correct. The BD16 examples show

   a ( b ) c ) d    2-4
   a ( b ( c ) d    4-6

From that, it follows that it's not the earliest but the one with the smallest span.

Sorry, I do not see any definition here. Just a collection of words which looks like a definition, but only locally… And I think I can even invent an example which I cannot parse using your definition:

1( 2[ 3( 4] 5) 6)

Is looking-at-1 forcing match of 3-and-5? Or what?

Thanks, Ilya
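
The ambiguity raised by this last example can be made concrete by checking candidate pairings against the quoted constraints. The following is a minimal sketch under stated assumptions: `satisfies_constraints` and `BRACKET_PAIRS` are hypothetical names, the table is a small stand-in for the Bidi_Paired_Bracket property, constraint a is assumed to hold (one isolating run sequence), and only the "at most one pair" half of constraint c is checked, since the "earliest possible" half is the part under dispute. Positions are 1-based to match the labels in the example.

```python
from itertools import combinations

# Hypothetical, abbreviated stand-in for the Bidi_Paired_Bracket property.
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}


def satisfies_constraints(text, pairs):
    """Check 1-based (open, close) pairs against constraints b, c (uniqueness only) and e."""
    used = set()
    for opener, closer in pairs:
        if not opener < closer:                       # b: closer follows opener
            return False
        if BRACKET_PAIRS.get(text[opener - 1]) != text[closer - 1]:
            return False                              # bracket types must correspond
        if opener in used or closer in used:          # c: at most one pair per bracket
            return False
        used.update((opener, closer))
    for (o1, c1), (o2, c2) in combinations(pairs, 2):
        if o1 < o2 < c1 < c2 or o2 < o1 < c2 < c1:    # e: spans may nest, not overlap
            return False
    return True


text = "([(]))"  # the example above with the position labels removed
print(satisfies_constraints(text, [(1, 5), (2, 4)]))  # True
print(satisfies_constraints(text, [(1, 6), (3, 5)]))  # True
print(satisfies_constraints(text, [(2, 4), (3, 5)]))  # False: the spans overlap
```

Two different pairings pass, so constraints a, b, d and e together with the uniqueness half of c do not single out one result; some tie-breaking rule is still needed. Tracing the stack procedure described earlier in the thread by hand appears to give 1-5 and 2-4, leaving brackets 3 and 6 unpaired.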