Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Steffen Nurpmeso via austin-group-l at The Open Group Wed, 25 Sep 2024 16:26:51 -0700

Geoff Clare via austin-group-l at The Open Group wrote in
 <ZvJ-YjQWnGfP9u7T@localhost>:
 |Steffen Nurpmeso wrote, on 24 Sep 2024:
 |> Geoff Clare wrote in
 |>  <ZvGKHOb0E3IZ5Y4Q@localhost>:
 |> 
 |>|I think this is required by the normative text (elsewhere than the
 |>|grammar), not assumed by the example as Mike says.  The relevant text
 |>|is in the definition of "matched" in 9.1:
 |>|
 |>|    Consistent with the whole match being the longest of the leftmost
 |>|    matches, each subpattern, from left to right, shall match the
 |>|    longest possible string.
 |> 
 |> Yes, that is good.
 |> 
 |>|and it goes on to give an example:
 |>|
 |>|    For example, matching the BRE "\(.*\).*" against "abcdef", the
 |>|    subexpression "(\1)" is "abcdef"
 |> 
 |> And really in that paragraph there are only successful matches,
 |> even 'and matching the BRE "\(a*\)*" against "bc", the
 |> subexpression "(\1)" is the null string' is so.  This text is,
 |> like shell field splitting etc, nothing for the occasional
 |> "standard text hopper", but can truly be read in full context
 |> only.
 |
 |Thinking some more about that text, I see a problem. Since it
 |specifically talks about subpatterns, it could be read as implying
 |that subpatterns are maximised at the expense of parts that are not
 |in subpatterns.  Modifying the example to matching ".*\(.*\)" against


Well .. "no" i say now and today.  Maybe the paragraph is just
fine, and always has been (in this form).

 |"abcdef", this interpretation would mean that the subpattern matches
 |the longest possible string, which is "abcdef", with the initial ".*"
 |matching nothing.  However, all the implementations I tried give the
 |expected null match for the subpattern.

Modified for REG_EXTENDED, yes:

  #?0|kent:tmp$ ./p-c '.*(.*)' abcdef
  0: 0/6
          <abcdef>
  1: 6/6
          <>
  #?0|kent:tmp$ ./p-pcre2 '.*(.*)' abcdef
  0: 0/6
          <abcdef>
  1: 6/6
          <>
  #?0|kent:tmp$ ./p-tre '.*(.*)' abcdef
  MINIINININI  0 mini=0
  MINIINININI  0 mini=0
  HAHAHAH
  0: 0/6
          <abcdef>
  1: 6/6
          <>
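
The same offsets can be reproduced without the p-* helpers.  A
minimal sketch, assuming Python's "re" module as a stand-in for a
backtracking engine in the style of PCRE2 (it is not POSIX
regcomp(), so it says nothing about what 9.1 requires); the helper
name spans() is made up for illustration:

```python
import re


def spans(pattern, subject):
    """Return (start, end, text) for the whole match and for each group."""
    m = re.search(pattern, subject)
    if m is None:
        return []
    return [(m.start(i), m.end(i), m.group(i)) for i in range(m.re.groups + 1)]


# ERE-style spelling of Geoff's modified example: leading ".*", then a group.
for i, (so, eo, text) in enumerate(spans(r".*(.*)", "abcdef")):
    print(f"{i}: {so}/{eo}")
    print(f"        <{text}>")
```

Here the leading ".*" takes all six characters and the group is left
with the null string at offset 6, which is what the three
transcripts above report.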

Also, compare for example this snippet of "man perlre"

 Alternatives are tried from left to right, so the first alternative
 found for which the entire expression matches, is the one that is
 chosen. This means that alternatives are not necessarily greedy. For
 example: when matching "foo|foot" against "barefoot", only the "foo"
 part will match, as that is the first alternative tried, and it
 successfully matches the target string. (This might not seem important,
 but it is important when you are capturing matched text using
 parentheses.)

with 9.1 (p 179 bottom / 180 top),

  If the pattern permits a variable number of matching characters
  and thus there is more than one such sequence starting at that
  point, the longest such sequence is matched. For example, the
  BRE "bb*" matches the second to fourth characters of the string
  "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten
  characters of the string "weeknights".
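
The difference between the two quoted rules can be sketched as
follows.  This assumes Python's "re" module, which tries
alternatives left to right like Perl does; longest_alternative() is
a hypothetical helper that models only the "longest such sequence"
wording of 9.1 for plain literal alternatives, it is not a regex
engine:

```python
import re


def first_alternative(pattern, subject):
    """Ordered alternation: the first alternative that matches wins."""
    m = re.search(pattern, subject)
    return None if m is None else (m.start(), m.end(), m.group())


def longest_alternative(alternatives, subject):
    """Leftmost-longest choice among literal alternatives (illustration only)."""
    for start in range(len(subject)):
        hits = [a for a in alternatives if subject.startswith(a, start)]
        if hits:
            longest = max(hits, key=len)
            return (start, start + len(longest), longest)
    return None


print(first_alternative(r"foo|foot", "barefoot"))           # (4, 7, 'foo')
print(longest_alternative(["foo", "foot"], "barefoot"))     # (4, 8, 'foot')
```

Both start at the same leftmost position; they differ only in how
far the match extends.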

What happens is

  $ ./p-c '(wee|week)(knights|night)' weeknights
  0: 0/10
          <weeknights>
  1: 0/3
          <wee>
  2: 3/10
          <knights>
  $ ./p-tre '(wee|week)(knights|night)' weeknights
  HAHAHAH
  0: 0/10
          <weeknights>
  1: 0/3
          <wee>
  2: 3/10
          <knights>
  $ ./p-pcre2 '(wee|week)(knights|night)' weeknights
  0: 0/10
          <weeknights>
  1: 0/3
          <wee>
  2: 3/10
          <knights>

That is not what i would truly expect from "matches all ten
characters" with respect to "match".  If i have subpatterns, then
"matching" means "i have data" for them, whatever.  Maybe it would
make sense to refer explicitly to \1 being "wee", as that is not
"the longest possible" match for that subpattern here.  Other than
that, you know, that is a large field.
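
Why \1 ends up as "wee" can be checked by hand.  A minimal sketch,
again assuming Python's "re" module for the backtracking side;
candidate_matches() is a hypothetical helper that simply tries every
pairing of the literal alternatives at offset 0:

```python
import re
from itertools import product

SUBJECT = "weeknights"


def candidate_matches(first, second, subject):
    """All (group1, group2) pairs whose concatenation matches at offset 0."""
    return [(a, b) for a, b in product(first, second)
            if subject.startswith(a + b)]


candidates = candidate_matches(["wee", "week"], ["knights", "night"], SUBJECT)
print(candidates)   # [('wee', 'knights'), ('week', 'night')]

# Longest overall match: only "wee" + "knights" covers all ten characters.
print(max(candidates, key=lambda pair: len(pair[0]) + len(pair[1])))

# An ordered-alternation engine agrees here, but only because of the order
# in which the alternatives are written; swapping them gives nine characters.
print(re.match(r"(wee|week)(knights|night)", SUBJECT).groups())
print(re.match(r"(week|wee)(night|knights)", SUBJECT).group())
```

So the longer alternative "week" for \1 is only available together
with "night", which gives a shorter whole match; the whole match
being longest takes priority over \1 being longest.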

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
