Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Steffen Nurpmeso via austin-group-l at The Open Group Mon, 23 Sep 2024 15:41:17 -0700

Geoff Clare wrote in
 <ZvGKHOb0E3IZ5Y4Q@localhost>:
 |Steffen Nurpmeso wrote, on 20 Sep 2024:
 ..
 |> Steffen Nurpmeso via austin-group-l at The Open Group wrote in
 |>  <20240916152314.Do4nw6pA@steffen%sdaoden.eu>:
 |[...]
 |>|  It turns out the POSIX standard is ambiguous about this situation.
 |>|  The grammar in the standard for concatenated regular expressions is a
 |>|  left-associative grammar.  However, there is an example in the
 |>|  rationale (not officially part of the standard...) that assumes
 |>|  concatenation is right-associative.
 |> 
 |> I was about to open a clarification issue, but could not find the
 |> quoted rationale; yet i got my hands on Mike Haertel's email
 |> address and thought i would ask him.
 |> He was so nice to answer and he says that the above arose from
 |> memory from something read in the past, and that he is unable to
 |> find the exact quote now.
 ...
 |I assume the example Mike refers to above is this in A.9.1:
 |
 |    For example, in the ERE "(a.*b)(a.*b)", the two identical
 |    subexpressions would match four and six characters, respectively,
 |    of accbaccccb.


That on a Monday.  But, yes, this is indeed right-associative in
that "matching will fail as such unless the first match is
terminated (at some point)".
In my personal regular expression vocabulary i would not call this
right-associative, i would call it "look-ahead" and "assert
a match is at all possible".

To continue in my own words: regular expressions are complex
state machines that may (depending on the implementation) support
forward and backward assertions, random access to earlier capture
groups (back-references), recursion into (named) subpatterns, and
even more.

Maybe the above can be extended with an additional example

  echo 'accbccbaccccb' |
    perl -e '$i=<STDIN>;if($i =~ "(a.*b)(a.*b)"){print "i<$i>; 1<$1> 2<$2>\n"}
      else{print "no match\n"}'
  i<accbccbaccccb
  >; 1<accbccb> 2<accccb>

  echo 'accbccababaaccccbabab' |
    perl -e '$i=<STDIN>;if($i =~ "(a.*b)(a.*b)(a.*b)"){print "i<$i>; 1<$1> 2<$2> 3<$3>\n"}
      else{print "no match\n"}'
  i<accbccababaaccccbabab
  >; 1<accbccababaaccccb> 2<ab> 3<ab>

so that it can be seen that each group in fact takes the longest
("most greedy") match that still satisfies the conditions of
"a match" as such.
I would not call this right-associative.
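
As a cross-check of the two Perl runs above, here is a minimal sketch in
Python.  It assumes that Python's re module, a backtracking engine like
Perl's and not a POSIX matcher, is an acceptable stand-in for these
particular patterns; the helper name show() is made up for illustration.

```python
import re

def show(pattern, text):
    """Return the capture groups of the first match, or None if none."""
    m = re.search(pattern, text)
    return m.groups() if m else None

# Same inputs as the Perl one-liners, without the trailing newline.
two = show(r"(a.*b)(a.*b)", "accbccbaccccb")
three = show(r"(a.*b)(a.*b)(a.*b)", "accbccababaaccccbabab")

print(two)    # ('accbccb', 'accccb')
print(three)  # ('accbccababaaccccb', 'ab', 'ab')
```

The first group gives back only as much as is needed for the remaining
groups to match at all.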

Here i want to mention that my email to Mike Haertel said

  I would like to open an issue to clarify the desired behaviour,
  noting that for example tre (REG_RIGHT_ASSOC compilation option)
  says

    By default, concatenation is left associative in TRE, as per the
    grammar given in the [POSIX] of Std 1003.1-2001 (POSIX).

and in fact says

   REG_RIGHT_ASSOC
          By default, concatenation is left associative in TRE, as per
          the grammar given in the base specifications on regular
          expressions of Std 1003.1-2001 (POSIX). This flag flips
          associativity of concatenation to right associative.
          Associativity can have an effect on how a match is divided
          into submatches, but does not change what is matched by the
          entire regexp.

and the code does, not that i understand it,

  #ifdef REG_RIGHT_ASSOC
            if (ctx->cflags & REG_RIGHT_ASSOC)
              {
                /* Right associative concatenation. */
                STACK_PUSHX(stack, voidptr, result);
                STACK_PUSHX(stack, int, PARSE_POST_CATENATION);
                STACK_PUSHX(stack, int, PARSE_CATENATION);
                STACK_PUSHX(stack, int, PARSE_PIECE);
              }
            else
  #endif /* REG_RIGHT_ASSOC */
              {
                /* Default case, left associative concatenation. */
                STACK_PUSHX(stack, int, PARSE_CATENATION);
                STACK_PUSHX(stack, voidptr, result);
                STACK_PUSHX(stack, int, PARSE_POST_CATENATION);
                STACK_PUSHX(stack, int, PARSE_PIECE);
              }

anyway i want to reiterate the words

  does not change what is matched by the entire regexp.

The POSIX example however cannot be matched at all unless the
first match is terminated, which is, to say it again, not what
i call right-associative.
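
To make that point concrete, the following Python sketch enumerates by
brute force every way a substring can be cut into adjacent pieces that
each match "a.*b".  The names all_splits() and posix_choice() are
hypothetical helpers, and posix_choice() encodes only one reading of
the 9.1 rule (leftmost start, then longest whole match, then longest
subpatterns from left to right); it is not taken from any
implementation.

```python
import re

PIECE = re.compile(r"a.*b")

def all_splits(text, n):
    """Yield (start, parts) for every way to cut a substring of text,
    beginning at start, into n adjacent parts that each match PIECE."""
    def from_pos(pos, k):
        if k == 0:
            yield ()
            return
        for end in range(pos + 1, len(text) + 1):
            if PIECE.fullmatch(text, pos, end):
                for rest in from_pos(end, k - 1):
                    yield (text[pos:end],) + rest
    for start in range(len(text)):
        for parts in from_pos(start, n):
            yield start, parts

def posix_choice(text, n):
    """Pick leftmost start, then longest whole match, then longest
    parts from left to right (raises ValueError if nothing matches)."""
    return max(
        all_splits(text, n),
        key=lambda sp: (-sp[0], len("".join(sp[1])), [len(p) for p in sp[1]]),
    )[1]

# The A.9.1 example admits exactly one split: four and six characters.
print(list(all_splits("accbaccccb", 2)))  # [(0, ('accb', 'accccb'))]

# The three-group example admits several, so a rule is needed to choose.
print(posix_choice("accbccababaaccccbabab", 3))
```

For the rationale example there is nothing to choose between, so it
cannot by itself distinguish left from right associativity; the
three-group input is the kind of case where a tie-breaking rule
matters.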

*But* there are people who spent real time with the implementation
side of regular expressions, and/or with their standardization,
whereas i (very much mostly) only come from the user side.

 |I think this is required by the normative text (elsewhere than the
 |grammar), not assumed by the example as Mike says.  The relevant text
 |is in the definition of "matched" in 9.1:
 |
 |    Consistent with the whole match being the longest of the leftmost
 |    matches, each subpattern, from left to right, shall match the
 |    longest possible string.

Yes, that is good.

 |and it goes on to give an example:
 |
 |    For example, matching the BRE "\(.*\).*" against "abcdef", the
 |    subexpression "(\1)" is "abcdef"

And really, that paragraph deals only with successful matches;
even 'and matching the BRE "\(a*\)*" against "bc", the
subexpression "(\1)" is the null string' is one.  This text is,
like shell field splitting etc., nothing for the occasional
"standard text hopper"; it can truly be read in full context
only.
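
The first quoted example from 9.1 can be tried out with a short Python
sketch.  Python has no BRE syntax, so "\(.*\).*" is written as
"(.*).*" here, and the re module is not a POSIX matcher; it is assumed
to agree with the standard only for this simple greedy case.

```python
import re

# ERE-style spelling of the BRE "\(.*\).*" from the 9.1 example.
m = re.match(r"(.*).*", "abcdef")
first_group = m.group(1)

print(repr(first_group))  # 'abcdef'
```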

Thank you,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
