Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

Asmus Freytag Mon, 21 Apr 2014 12:00:08 -0700

On 4/21/2014 11:23 AM, Philippe Verdy wrote:

It is on topic because the proposed description attempts to explain how paired brackets should match and how this will then affect the rendering in bidirectional contexts. This is exactly the kind of thing that is difficult because the proposed description assumes that paired brackets are organized hierarchically.
Quote: "both characters of a pair occur in the same isolating run sequence" (does not work here: the sequences are not fully isolated)
Quote: "any bracket character can belong at most to one pair, the earliest possible one" (does not work here: this is not the earliest possible)


That's OK, it's a limitation of the algorithm, not the description.

In other words, the algorithm can help set a better directionality of paired (!) brackets, and those are the ones that nest properly.

What Eli brought to our attention is that the description of this algorithm is suboptimal - whether the algorithm could or should be improved is a separate matter.

A./

PS: I think it is unlikely that the UTC will be interested in substantial changes to the algorithm, but it should be interested in allowing the specification to be less dependent on the sample implementation.

2014-04-21 19:48 GMT+02:00 Asmus Freytag <asm...@ix.netcom.com<mailto:asm...@ix.netcom.com>>:


    Philippe,

    I fail to understand how your post contributes to the topic.

    The issue was unclear wording of the specification, not
    deficiencies in the UBA or the PBA in general.

    Let's keep this discussion limited to issues of wording for the
    *existing* specification. Feel free to start a new discussion
    about something else under a new subject.

    A./


    On 4/21/2014 9:18 AM, Philippe Verdy wrote:

    There are some cases where these rules will not be clear enough.
    Look at the following where overlaps do occur; but directionality
    still matters:

    "This is an [«] example [»] for demonstration only."

    There are two parsings possible if you just consider a hierarchic
    layout where overlaps are disabled:

    1. "This is an [...] for demonstration only.", embedding "«...»",
    itself embedding "] example [" (here the square brackets match
    externally)

    2. "This is an [...] example [...] for demonstration only.",
    embedding two spans for "«" and "»" separately (they also pair
    externally)

    Now suppose that the term "example" is translated into Arabic: It
    is not very clear how the UBA will work while preserving the
    correct pairing direction of the 3 pairs (one pair is "«...»",
    there are two pairs for "[...]"). Still all 3 pairs have a
    coherent direction that Bidi-reordering or glyph mirroring should
    not mix.

    I see only one solution to tag such text so that it will behave
    correctly: either the two pairs of square brackets or the pair of
    guillemets should be encoded with isolated Bidi overrides. But
    then what is happening to the ordering of the surrounding text?

    There should be a stable way to encode this case so that UBA will
    still work in preserving the correct reading order, and the
    expected semantics and orientation of pairs and the fact that the
    guillemets are effectively not really embedding the brackets, but
    the translated word "example".

    There are several ways to use Bidi-override or Bidi-embedding
    controls; I don't know which one is better but all of them should
    still work with UBA. I just hope that the complex cases of the
    brackets in the middle ("]...[") can be handled gracefully.

    In my opinion, this would require embedding and isolating each
    square bracket, so that they no longer match each other
    (externally they are treated as symbols with transparent
    direction). But how do we then ensure that the sequence "[«]"
    still occurs before the RTL (Arabic) "example" word, followed by
    the sequence "[»]", and that the rest of the sentence ("for
    demonstration only") still occurs in the correct order? We would
    also have to embed/isolate the "example", or the whole sequence
    "[«] example [»]", so that the main sentence "This is an ... for
    demonstration only" still has a coherent reading direction.

    Such cases are not so exceptional, because they occur when
    representing two distinct parallel readings of the same text,
    where the reading for one kind of pair simply treats the other
    pairs as transparent, i.e. ignores them.

    It should be an interesting case to investigate for validating
    UBA algorithms in a conformance test case.


    2014-04-21 16:32 GMT+02:00 Asmus Freytag <asm...@ix.netcom.com
    <mailto:asm...@ix.netcom.com>>:

        On 4/21/2014 1:33 AM, Eli Zaretskii wrote:

        Date: Sun, 20 Apr 2014 23:03:20 -0700
        From: Asmus Freytag<asm...@ix.netcom.com>  <mailto:asm...@ix.netcom.com>
        CC: Eli Zaretskii<e...@gnu.org>  <mailto:e...@gnu.org>,unicode@unicode.org  <mailto:unicode@unicode.org>,
          Kenneth Whistler<k...@unicode.org>  <mailto:k...@unicode.org>

                 Note that the current embedding level is not changed by this rule.

             What does this last sentence mean by "the current embedding level"?
             The first bullet of X6 mandates that "the current character’s
             embedding level" _is_ changed by this rule, so what other "current
             embedding level" is alluded to here?

             I'm punting on that one - can someone else answer this?


        I assume "current embedding level" here meant "the embedding level of
        the last entry on the directional status stack". (This is a natural
        slip to make if you think in terms of an optimized implementation that
        stores each component of the top of the directional status stack in a
        variable, as suggested in 3.3.2.)

        James

        In general, I heartily dislike "specifications" that just narrate a
        particular implementation...

        I cannot agree more.

        In fact, my main gripe about the UBA additions in 6.3 is that some of
        their crucial parts are not formally defined, except by an algorithm
        that narrates a specific implementation.  The two worst examples of
        that are the "definitions" of the isolating run sequence and of the
        bracket pair.  I didn't ask about those because I managed to figure
        them out, but it took many readings of the corresponding parts of the
        document.  It is IMO a pity that the two main features added in 6.3
        are based on definitions that are so hard to penetrate, and which
        actually all but force you to use the specific implementation
        described by the document.

        My working definition that replaces BD13 is this:

           An isolating run sequence is the maximal sequence of level runs of
           the same embedding level that can be obtained by removing all the
           characters between an isolate initiator and its matching PDI (or
           paragraph end, if there is no matching PDI) within those level runs.
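
A minimal Python sketch of this working definition may help when comparing it against BD13. It assumes the input is one paragraph whose bidi classes and embedding levels have already been resolved by the explicit-level rules; the function names (`match_isolates`, `level_runs`, `isolating_run_sequences`) and the list-of-strings input format are hypothetical and are not taken from UAX #9 or from the sample implementation.

```python
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def match_isolates(types):
    """Map each isolate initiator index to the index of its matching PDI."""
    matching_pdi = {}
    open_initiators = []  # initiators still waiting for a PDI
    for i, t in enumerate(types):
        if t in ISOLATE_INITIATORS:
            open_initiators.append(i)
        elif t == "PDI" and open_initiators:
            matching_pdi[open_initiators.pop()] = i
    return matching_pdi


def level_runs(levels):
    """Split positions into maximal runs of equal embedding level."""
    runs = []
    for i, level in enumerate(levels):
        if runs and levels[runs[-1][-1]] == level:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs


def isolating_run_sequences(types, levels):
    """Chain level runs across matched isolate initiator / PDI pairs.

    Assumption: the levels are consistent with the explicit-level rules,
    so a run that ends in a matched initiator is continued by a run
    that starts with the matching PDI.
    """
    matching_pdi = match_isolates(types)
    matched_pdis = set(matching_pdi.values())
    runs = level_runs(levels)
    run_starting_at = {run[0]: run for run in runs}
    sequences = []
    for run in runs:
        if run[0] in matched_pdis:
            continue  # already attached to the sequence of its initiator
        sequence = list(run)
        while sequence[-1] in matching_pdi:
            sequence.extend(run_starting_at[matching_pdi[sequence[-1]]])
        sequences.append(sequence)
    return sequences
```

For the classes `["L", "RLI", "R", "PDI", "L"]` with levels `[0, 0, 1, 0, 0]`, the sketch returns `[[0, 1, 3, 4], [2]]`: the isolated character drops out of the outer sequence and forms a sequence of its own.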

        As for bracket pair (BD16), I'm really amazed that a concept as easy
        and widely known/used as this would need such an obscure definition
        that must have an algorithm as its necessary part.  How about this
        instead:

           A bracket pair is a pair of an opening paired bracket and a closing
           paired bracket characters within the same isolating run sequence,
           such that the Bidi_Paired_Bracket property value of the former
           character or its canonical equivalent equals the latter character or
           its canonical equivalent, and all the opening and closing bracket
           characters in between these two are balanced.

        Then we could use the algorithm to explain what it means for brackets
        to be balanced (for those readers who somehow don't already know
        that).
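
For readers who want the stack-based procedure spelled out, the following Python sketch pairs brackets within a single isolating run sequence. It is an illustration under stated assumptions, not the sample implementation: `PAIRED_BRACKETS` is a deliberately tiny, hypothetical stand-in for the Bidi_Paired_Bracket data, the function name `bracket_pairs` is invented, and canonical equivalents and any stack depth limit are not modeled.

```python
# Hypothetical, tiny stand-in for the Bidi_Paired_Bracket data:
# opening bracket -> closing bracket.
PAIRED_BRACKETS = {"(": ")", "[": "]", "{": "}"}
CLOSING_BRACKETS = set(PAIRED_BRACKETS.values())


def bracket_pairs(text):
    """Return (open_index, close_index) pairs, sorted by opening position."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRED_BRACKETS:
            stack.append((PAIRED_BRACKETS[ch], pos))
        elif ch in CLOSING_BRACKETS:
            # Search from the top of the stack downwards for the nearest
            # opener that expects this closing bracket.
            for depth in range(len(stack) - 1, -1, -1):
                expected, open_pos = stack[depth]
                if expected == ch:
                    pairs.append((open_pos, pos))
                    # Drop the match and everything above it; the openers
                    # discarded here remain unpaired.
                    del stack[depth:]
                    break
            # If nothing matched, the closing bracket stays unpaired.
    return sorted(pairs)
```

With this sketch, `bracket_pairs("a(b[c)d]")` yields `[(1, 5)]`: the square brackets overlap the parentheses and are left unpaired. For the `"[«] example [»]"` string quoted earlier in this thread it yields `[(0, 2), (12, 14)]`, because the guillemets are not in the table; as far as I can tell they are not paired brackets in the Unicode data either.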

        Again, thanks for clarifying these subtle issues.  I can now proceed
        to updating the Emacs bidirectional display with the changes in
        Unicode 6.3.

        FWIW here is the restatement of BD16 that I used for myself
        (and that I put into the source comments of the sample Java
        implementation):

            // The following is a restatement of BD 16 using non-algorithmic language.
            //
            // A bracket pair is a pair of characters consisting of an opening
            // paired bracket and a closing paired bracket such that the
            // Bidi_Paired_Bracket property value of the former equals the latter,
            // subject to the following constraints.
            // - both characters of a pair occur in the same isolating run sequence
            // - the closing character of a pair follows the opening character
            // - any bracket character can belong at most to one pair, the earliest possible one
            // - any bracket character not part of a pair is treated like an ordinary character
            // - pairs may nest properly, but their spans may not overlap otherwise

            // Bracket characters with canonical decompositions are supposed to be treated
            // as if they had been normalized, to allow normalized and non-normalized text
            // to give the same result.
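
The remark about canonical decompositions can be illustrated with Python's standard `unicodedata` module. This is only a sketch: the helper names `canonical_bracket` and `same_bracket` are hypothetical, and it assumes that comparing the NFD forms of single characters is an adequate stand-in for "treated as if they had been normalized".

```python
import unicodedata


def canonical_bracket(ch):
    """Return the canonical (NFD) form of a single bracket character."""
    return unicodedata.normalize("NFD", ch)


def same_bracket(a, b):
    """Compare two bracket characters up to canonical equivalence."""
    return canonical_bracket(a) == canonical_bracket(b)


# U+2329 LEFT-POINTING ANGLE BRACKET decomposes canonically to
# U+3008 LEFT ANGLE BRACKET, so the two compare as the same bracket.
print(same_bracket("\u2329", "\u3008"))  # True
print(same_bracket("(", "\u3008"))       # False
```

Applying such a comparison when matching a closing bracket against the stack lets U+2329 pair with U+3009 in the same way that U+3008 does.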

        Your language is more concise, but you may compare for differences.

        A./

        _______________________________________________
        Unicode mailing list
        Unicode@unicode.org <mailto:Unicode@unicode.org>
        http://unicode.org/mailman/listinfo/unicode
