Geoff Clare wrote in
 <ZvPGnLMvGkqL0a3U@localhost>:
 |Steffen Nurpmeso wrote, on 24 Sep 2024:
 |> Austin Group Bug Tracker via austin-group-l at The Open Group wrote in
 |>  <ea7880ce68ba63cac427845dc4029...@austingroupbugs.net>:
 |>  ...
 |>|https://austingroupbugs.net/view.php?id=1857
 |>  ...
 |>| (0006881) geoffclare (manager) - 2024-09-24 10:46
 |>| https://austingroupbugs.net/view.php?id=1857#c6881
 |>  ...
 |> I have not yet read this completely, but from a glance i see
 |>
 |>   the ERE "(aaa??)*" matches only the first four characters of the
 |>   string "aaaaa", not all five, because in order to match all
 |>   five, "a??" would match with length one instead of zero
 |>
 |> and that felt not right:
 |>
 |>   echo 'aaaaa' |
 |>   perl -e '$i=<STDIN>;if($i =~ "(aaa??)*"){print "i<$i>; 1<$1> \
 |>      2<$2> 3<$3>\n"}else{print "no match\n"}'
 |>   i<aaaaa
 |>>; 1<aa> 2<> 3<>
 |>
 |> It matches only two.
 |
 |Looks like a bug in perl.  Each repetition of "aaa??" matches "aa"
 |because the "??" is non-greedy, but the "*" is greedy so should
 |match the longest string of repeated "aa" as possible.
 |
 |With regcomp() on macOS it matches "aaaa", as expected. (That example
 |is straight from the macOS re_format(7) man page, but I also tested it
 |to make sure.)

So ok then i really had to look, and all i can say is "yes!", there
are plenty of bugs everywhere, as can be verified with the
following code snippet

  /* gcc -W -Wall -o p-tre preu.c -DXTRE -ltre */
  #ifdef XTRE
  # define X(Y) tre_ ## Y
  # include <tre.h>
  /* gcc -W -Wall -o p-pcre2 preu.c -DXPCRE2 -lpcre2-posix */
  #elif defined XPCRE2
  # define X(Y) pcre2_ ## Y
  # include <pcre2posix.h>
  /* gcc -W -Wall -o p-c preu.c */
  #else
  # define X(Y) Y
  # include <regex.h>
  #endif

  #include <stdio.h>

  /* Usage: prog ERE STRING */
  int main(int argc, char **argv){
     regmatch_t remt[8];
     regex_t ret;

     if(argc != 3)
        return 64;
     if(X(regcomp)(&ret, argv[1], REG_EXTENDED))
        return 65;

     if(X(regexec)(&ret, argv[2], sizeof(remt)/sizeof(remt[0]), &remt[0], 0))
        printf("no match\n");
     else{
        size_t i;

        /* 0 is the overall match, 1.. are the groups; stay within remt[] */
        for(i = 0; i <= ret.re_nsub && i < sizeof(remt)/sizeof(remt[0]); ++i){
           printf("%zu: %ld/%ld\n", i, (long)remt[i].rm_so,
              (long)remt[i].rm_eo);
           if(remt[i].rm_so != -1)
              printf("\t<%.*s>\n", (int)(remt[i].rm_eo - remt[i].rm_so),
                 &argv[2][(unsigned long)remt[i].rm_so]);
        }
     }

     X(regfree)(&ret);
     return 0;
  }

Doing that reveals that the tre library (as of HEAD of [1]) is
*completely* broken in (at least) respect to _UNGREEDY matching,
and that libpcre2 10.44 as of [2] comes across like so:

  $ ./p-pcre2 '(aaa??)*' 'aaaaac'
  0: 0/4
          <aaaa>
  1: 2/4
          <aa>

which exactly mirrors the perl(1) outcome, and please look at the
indices, too: group 1 only reports its last repetition.
Your usage of the asterisk ("star") outside of the parenthesis does
not actually multiply the content of the match group, as Harald
van Dijk has stated in another message i have already seen.

  $ ./p-pcre2 '(aaa??)*' 'aaaaaac'
  0: 0/6
          <aaaaaa>
  1: 4/6
          <aa>

  $ ./p-tre '(aaa??)*' 'aaaaaac'
  MINIINININI 0 mini=0
  MINIINININI 1 mini=1 rest:?
  MINIINININI 0 mini=0
  HAHAHAH
  0: 0/0
          <>
  1: -1/-1
  ^ Totally borked somewhere below tre_ast_new_iter(), i have not
    looked further.

But it seems Dag-Erling Smørgrav of FreeBSD has actually started to
have a look into tre, just a couple of months ago!! ..
And two reported UNGREEDY issues have already been marked by him
(after a decade of existence) as bugs, back in July.
I want to reiterate that i opened the POSIX issue in 2013, by then
the sun must have been shining.
Anyhow, i think Dag-Erling is also listening here.

  $ ./p-c '(aaa??)*' 'aaaaaac'
                ^ Shouldn't this report an error?
  0: 0/6
          <aaaaaa>
  1: 3/6
          <aaa>

  $ ./p-c '(aaa?)*' 'aaaaaac'
  0: 0/6
          <aaaaaa>
  1: 3/6
          <aaa>

  $ ./p-tre '(aaa?)*' 'aaaaaac'
  MINIINININI 0 mini=0
  MINIINININI 0 mini=0
  HAHAHAH
  0: 0/6
          <aaaaaa>
  1: 3/6
          <aaa>
  ^ Works without mini(mal)==UNGREEDY.
    (tre has only very few tests.)

  $ ./p-pcre2 '(aaa?)*' 'aaaaaac'
  0: 0/6
          <aaaaaa>
  1: 3/6
          <aaa>

I am stunned, but not surprised, actually.  Ha-ha.
Anyhow, Apple is of course wrong when they do it like that, and
perl and libpcre2 are right.

Regarding all the other stuff, *my* opinion is that *if* i as
a user explicitly attach ? as an ungreedy/minimalizing modifier to
a regular expression, then i want it to be honoured.
The same if i set REG_MINIMAL (REG_UNGREEDY) and suffix ? for the
opposite.
And if that counteracts some other thing, then, because of the
nature of regular expressions, the explicit "rule-changing"
modifier has to take preference over the default, because there is
no other way to adjust the default behaviour.

  [1] https://github.com/laurikari/tre.git
  [2] https://www.pcre.org

 |Geoff Clare <g.cl...@opengroup.org>
 |The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
 --End of <ZvPGnLMvGkqL0a3U@localhost>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
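
P.S.: a minimal sketch, separate from the test program above, for
checking in isolation that a starred group reports only the offsets
of its last repetition.  It sticks to plain <regex.h> and the greedy
ERE "(aaa?)*", because the runs above show that implementations
disagree on "??".  The helper name ere_span() and the DEMO_MAIN
switch are made up for this illustration.

```c
/* cc -W -Wall -DDEMO_MAIN -o lastiter lastiter.c */
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the message above): compile the ERE pat,
 * match it against subj, and store the offsets of submatch grp.
 * Returns 0 on match, 1 on no match, -1 on error or bad group index. */
int ere_span(const char *pat, const char *subj, size_t grp, long *so, long *eo){
   regmatch_t remt[8];
   regex_t ret;
   int rv;

   if(grp >= sizeof(remt)/sizeof(remt[0]))
      return -1;
   if(regcomp(&ret, pat, REG_EXTENDED))
      return -1;

   if(grp > ret.re_nsub)
      rv = -1; /* The pattern has no such group */
   else if(regexec(&ret, subj, sizeof(remt)/sizeof(remt[0]), remt, 0))
      rv = 1;
   else{
      *so = (long)remt[grp].rm_so;
      *eo = (long)remt[grp].rm_eo;
      rv = 0;
   }

   regfree(&ret);
   return rv;
}

#ifdef DEMO_MAIN
int main(void){
   long so, eo;
   size_t i;

   /* Group 0 is the overall match, group 1 the starred parenthesis */
   for(i = 0; i <= 1; ++i)
      if(ere_span("(aaa?)*", "aaaaaac", i, &so, &eo) == 0)
         printf("%zu: %ld/%ld\n", i, so, eo);
   return 0;
}
#endif
```

Going by the p-c run above, this should report 0/6 for the overall
match and 3/6 for group 1 with that system's <regex.h>; where exactly
group 1 starts may differ with other implementations.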