Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

Asmus Freytag Wed, 23 Apr 2014 09:24:42 -0700

On 4/23/2014 12:35 AM, Ilya Zakharevich wrote:

On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote:

if you read UAX#9, the way the algorithm works is by pushing openers
on a stack, then, on finding the first closer, going down the stack
and attempting to locate a match, then, on finding a match,
discarding any enclosed openers, on not finding a match, discarding
the closer.

I think I LOVE this definition.  Simple, beautiful, and IMO following
people’s expectations very closely.

I hadn't intended it as a definition, but let's see how it would work asone. The "stack" isn't necessary:


   The algorithm works by finding the first closer, looking back
   and attempting to locate a match, then, on finding the nearest match,
   discarding any enclosed openers, otherwise on not finding a match, discarding
   the closer.

That an implementation uses a "stack" to avoid having to "look back" isa detail that has no place in a well-crafted definition.

It leaves a few things unstated - that all has to take place inside thesame isolating run, and, that, on reaching the end, only the matchedpairs are remembered (all unmatched openers are discarded). (And we needto say what "discarded" means).

While it's a statement of an algorithm, it's not obfuscatory, and it'srelatively easy to consider alternate implementation strategies.


Here is what “theoretizing” gives:

  a parsing is good if it satisfies all conditions below:

    0) Some delimiters in the string are marked as “non-matching”; the rest
       is broken into disjoint “matched” pairs;

    MATCH) A “matched” pair consists of an open-delimiter and matching close-
           delimiter (in this order in the string).

    NEST) “Matched” pairs are properly nested (meaning that 2 pairs cannot be
          positioned as Open1 Open2 Close1 Close2 in the string order).

    MINLEN) “Inside” a “matched” pair, every delimiter which could match 
elements
            of the pair but is marked as “non-matching” must nest inside
            some deeper-nested “matched” pair.

(I hope that the meaning of the word “inside” in MINLEN is clear.)

    GREED) Given any close-delimiter marked as “non-matching”, its
           pre-context does not contain any open-delimiter which could
           match it.

      Here pre-context of a position is a concatenation of substrings of the
      initial string:
      • Take the most deeply nested “matched pair” containing the position
        (if none, the whole string);
      • take the part of the string inside this pair AND before the position;
      • remove all “matched” pairs completely contained insidde this substring
        together with what they enclose.


This is a very nice formal definition. I'm surprised that your "GREED"
statement needs such a complex auxiliary concept (pre-context).

Can you explain why, if you make "pre-context" simply the part of the
whole string that precedes the unmatched close-delimiter, the words
"which could match it" are insufficient?

Any opener, that's inside a matched pair entirely in the pre-context
(my def) would be ineligible because of NEST, so you don't have
to "remove" it from the pre-context (your def).

Is there something I'm missing?


Ilya

P.S.  Judging by another message of yours, for you “theoretizing” is a
       4-letter word…  Oh well…

It can be - not in the sense you used it in this post.

A./

_______________________________________________
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

Reply via email to