Branch: refs/heads/yves/curlyx_curlym Home: https://github.com/Perl/perl5 Commit: 074a4d7419ab6f9e9292b4b46cd482f7662d8ed2 https://github.com/Perl/perl5/commit/074a4d7419ab6f9e9292b4b46cd482f7662d8ed2 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths: M regcomp_debug.c M regcomp_study.c M t/re/pat_re_eval.t Log Message: ----------- regcomp_study.c - disable CURLYX optimizations when EVAL has been seen anywhere Historically we disabled CURLYX optimizations when they *contained* an EVAL, on the assumption that the optimization might affect how many times, etc, the eval was called. However, this is also true for CURLYX with evals *afterwards*. If the CURLYN or CURLYM optimization can prune off the search space, then an eval afterwards will be affected. An when you take into account GOSUB, it means that an eval in front might be affected by an optimization after it. So for now we disable CURLYN and CURLYM in any pattern with an EVAL. Commit: 33ce983f184eace1dea679a0349a5be2c5137484 https://github.com/Perl/perl5/commit/33ce983f184eace1dea679a0349a5be2c5137484 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M MANIFEST M ext/XS-APItest/APItest.xs A ext/XS-APItest/t/savestack.t M regexec.c Log Message: ----------- regexec.c - fix memory leak in EVAL. EVAL was calling regcppush twice per invocation, once before executing the callback, and once after. But not regcppop'ing twice. So each time we would accumulate an extra "frame" of data. This is/was hidden somewhat by the way we eventually "blow" the stack, so the extra data was just thrown away at the end. This removes the second set of pushes so that the save stack stays a stable size as it unwinds from each failed eval. We also weren't cleaning up after a (?{...}) when we failed to match to its right. This unwinds the stack and restores the parens properly. This adds tests to check how the save stack grows during patterns using (?{ ... }) and (??{ ... }) and ensure that when we backtrack and re-execute the EVAL again it cleans up the stack as it goes. Commit: 64b1e8b2393d4733b9414a37db12d5842d97eaf4 https://github.com/Perl/perl5/commit/64b1e8b2393d4733b9414a37db12d5842d97eaf4 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M MANIFEST M regexec.c M t/re/pat_re_eval.t M t/re/re_tests M t/re/regexp.t A t/re/regexp_normal.t Log Message: ----------- regexec.c - fix accept in CURLYX/WHILEM construct. The ACCEPT logic didnt know how to handle WHILEM, which for some reason does not have a next_off defined. I am not sure why. This was revealed by forcing CURLYX optimisations off. This includes a patch to test what happens if we embed an eval group in the tests run by regexp.t when run via regexp_normal.t, which disabled CURLYX -> CURLYN and CURLYM optimisations and revealed this issue. This adds t/re/regexp_normal.t which test "normalized" forms of the patterns in t/re/re_tests by munging them in various ways to see if they still behave as expected. For instance injecting a (?{}) can disable optimisations. Commit: b3655a49afabb87ebf787310f36f4a744146ef55 https://github.com/Perl/perl5/commit/b3655a49afabb87ebf787310f36f4a744146ef55 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M regcomp.c M regcomp.h M regcomp_internal.h M t/re/pat.t M t/re/reg_mesg.t Log Message: ----------- regcomp.c - increase size of CURLY nodes so the min/max is a I32 This allows us to resolve a test inconsistency between CURLYX and CURLY and CURLYM. We use I32 because the existing count logic uses -1 and this keeps everything unsigned compatible. Commit: a492304b44435a9dced29fa724da65a31b3aabd8 https://github.com/Perl/perl5/commit/a492304b44435a9dced29fa724da65a31b3aabd8 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M pod/perldebguts.pod M pp_ctl.c M regcomp.c M regcomp.h M regcomp.sym M regcomp_debug.c M regexec.c M regexp.h M regnodes.h M t/re/pat.t M t/re/pat_rt_report.t M t/re/re_tests Log Message: ----------- regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE) CURLYX doesn't reset capture buffers properly. It is possible for multiple buffers to be defined at once with values from different iterations of the loop, which doesn't make sense really. An example is this: "foobarfoo"=~/((foo)|(bar))+/ after this matches $1 should equal $2 and $3 should be undefined, or $1 should equal $3 and $2 should be undefined. Prior to this patch this would not be the case. The solution that this patches uses is to introduce a form of "layered transactional storage" for paren data. The existing pair of start/end data for capture data is extended with a start_new/end_new pair. When the vast majority of our code wants to check if a given capture buffer is defined they first check "start_new/end_new", if either is -1 then they fall back to whatever is in start/end. When a capture buffer is CLOSEd the data is written into the start_new/end_new pair instead of the start/end pair. When a CURLYX loop is executing and has matched something (at least one "A" in /A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new data by writing it into start/end. When we begin a new iteration of the loop we clear the start_new/end_new pairs that are contained by the loop, by setting them to -1. If the loop fails then we roll back as we used to. If the loop succeeds we continue. When we hit an END block we commit everything. Consider the example above. We start off with everything set to -1. $1 = (-1,-1):(-1,-1) $2 = (-1,-1):(-1,-1) $3 = (-1,-1):(-1,-1) In the first iteration we have matched "foo" and end up with this: $1 = (-1,-1):( 0, 3) $2 = (-1,-1):( 0, 3) $3 = (-1,-1):(-1,-1) We commit the results of $2 and $3, and then clear the new data in the beginning of the next loop: $1 = (-1,-1):( 0, 3) $2 = ( 0, 3):(-1,-1) $3 = (-1,-1):(-1,-1) We then match "bar": $1 = (-1,-1):( 0, 3) $2 = ( 0, 3):(-1,-1) $3 = (-1,-1):( 3, 7) and then commit the result and clear the new data: $1 = (-1,-1):( 0, 3) $2 = (-1,-1):(-1,-1) $3 = ( 3, 7):(-1,-1) and then we match "foo" again: $1 = (-1,-1):( 0, 3) $2 = (-1,-1):( 7,10) $3 = ( 3, 7):(-1,-1) And we then commit. We do a regcppush here as normal. $1 = (-1,-1):( 0, 3) $2 = ( 7,10):( 7,10) $3 = (-1,-1):(-1,-1) We then clear it again, but since we don't match when we regcppop we store the buffers back to the above layout. When we finally hit the END buffer we also do a commit as well on all buffers, including the 0th (for the full match). Fixes GH Issue #18865, and adds tests for it and other things. Commit: 17c4b856ffebb88eacef9fbd401c1b04d4c4ead6 https://github.com/Perl/perl5/commit/17c4b856ffebb88eacef9fbd401c1b04d4c4ead6 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M pod/perldebguts.pod M regcomp.c M regcomp.h M regcomp.sym M regcomp_debug.c M regcomp_trie.c M regexec.c M regexp.h M regnodes.h M t/re/re_tests Log Message: ----------- regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the same time. When a branch fails it should reset any capture buffers that might be touched by its branch. We change BRANCH and BRANCHJ to store the number of parens before the branch, and the number of parens after the branch was completed. When a BRANCH operation fails, we clear the buffers it contains before we continue on. It is a bit more complex than it should be because we have BRANCHJ and BRANCH. (One of these days we should merge them together.) This is also made somewhat more complex because TRIE nodes are actually branches, and may need to track capture buffers also, at two levels. The overall TRIE op, and for jump tries especially where we emulate the behavior of branches. So we have to do the same clearing logic if a trie branch fails as well. Commit: ea97a3acd57cfd3c6a79ac890d0cee370dd8dbe2 https://github.com/Perl/perl5/commit/ea97a3acd57cfd3c6a79ac890d0cee370dd8dbe2 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M pod/perldelta.pod M pod/perlre.pod M regcomp.c M regcomp.h M regcomp_debug.c M regcomp_internal.h M regcomp_study.c M regexec.c M regnodes.h M t/re/pat_re_eval.t M t/re/pat_rt_report.t M toke.c Log Message: ----------- regcomp.c - add optimistic eval This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and (??{ ... }). The only difference being that the star variants are "optimisitic" and are defined to never disable optimisations. This is especially relevant now that use of (?{ ... }) prevents important optimisations anywhere in the pattern, instead of the older and inconsistent rules where it only affected the parts that contained the EVAL. It is also very useful for injecting debugging style expressions to the pattern to understand what the regex engine is actually doing. The older style (?{ ... }) variants would change the regex engines behavior, meaning this was not as effective a tool as it could have been. Commit: a164ff889826685f3daf1324ba685a27420a440e https://github.com/Perl/perl5/commit/a164ff889826685f3daf1324ba685a27420a440e Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M pod/perldelta.pod Log Message: ----------- perldelta - add note about regex engine changes capture buffer semantics should now be consistent. Commit: a3c622c31289bbb859624f22e8fffb4917e64cf9 https://github.com/Perl/perl5/commit/a3c622c31289bbb859624f22e8fffb4917e64cf9 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M regexec.c M regexp.h M t/re/re_tests Log Message: ----------- regexec.c - incredibly inefficient solution to backref problem Backrefs to unclosed parens inside of a quantified group were not being properly handled, which revealed we are not unrolling the paren state properly on failure and backtracking. Much of the code assumes that when we execute a "conditional" operation (where more than one thing could match) that we need not concern ourself with the paren state unless the conditional operation itself represents a paren, and that generally opcodes only needed to concern themselves with parens to their right. When you exclude backrefs from the equation this is broadly reasonable (i think), as on failure we typically dont care about the state of the paren buffers. They either get reset as we find a new different accepting pathway, or their state is irrelevant if the overal match is rejected (eg it fails). However backreferences are different. Consider the following pattern from the tests "xa=xaaa" =~ /^(xa|=?\1a){2}\z/ in the first iteration through this the first branch matches, and in fact because the \1 is in the second branch it can't match on the first iteration at all. After this $1 = "xa". We then perform the second iteration. "xa" does not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets up a backtracking action to not match if the rest of the pattern does not match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in the string, we exit the quantifier loop with $1 = "=xaa" and match \z against the remaining "a" in the pattern, and fail. Here is where things go wrong in the old code, we unwind to the outer loop, but we do not unwind the paren state. We then unwind further into the 2nds iteration of the loop, to the '=?' where we then try to match the tail with the quantifier matching the empty string. We then match the old $1 (which was not unwound) as "=xaa", and then the "a" matches, and we are the end of the string and we have incorrectly accpeted this string as matching the pattern. What should have happend was when the \1 was resolved the second time it should have returned the same string as it did when the =? matched '=', which then would have resulted in the tail matching again, and etc, eventually unwinding the entire pattern when the second iteration failed entirely. This patch is very crude. It simple pushes the state of the parens and creates and unwind point for every case where we do a transition to a B or _next operation, and we make the corresponding _next_fail do the appropriate unwinding. The objective was to achieve correctness and then work towards making it more efficient. We almost certainly overstore items on the stack. In a future patch we can perhaps keep track of the unclosed parens before the relevant operators and make sure that they are properly pushed and unwound at the correct times. Commit: 4978499a9c823243b7e0039a52e6aa98789b1fb8 https://github.com/Perl/perl5/commit/4978499a9c823243b7e0039a52e6aa98789b1fb8 Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M regcomp.c M regcomp.h M regcomp.sym M regcomp_internal.h M regexec.c M regexp.h M regnodes.h Log Message: ----------- regexec.c - make REF into a backtracking state This way we can do the required paren restoration only when it is in use. When we match a REF type node which is potentially a reference to an unclosed paren we push the match context information, currently for "everything", but in a future patch we can teach it to be more efficient by adding a new parameter to the REF regop to track which parens it should save. This converts the backtracking changes from the previous commit, so that it is run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which is by default 0. We don't make the new fields in the struct conditional as the stack frames are large and our changes don't make any real difference and it keeps things simpler to not have conditional members, especially since some of the structures have to line up with each other. If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means something is sensitive to us not necessarily restoring the parens properly on failure. We make some assumptions that the paren state after a failing state will be corrected by a future successful state, or that the state of the parens is irrelevant as we will fail anyway. This can be made not true by EVAL, backrefs, and potentially some other scenarios. Thus I have left this inefficient logic in place but guarded by the flag. Commit: 3667fb5504f16c093961a5b14c0a8e35c54ea2bc https://github.com/Perl/perl5/commit/3667fb5504f16c093961a5b14c0a8e35c54ea2bc Author: Yves Orton <demer...@gmail.com> Date: 2023-01-15 (Sun, 15 Jan 2023) Changed paths: M pod/perldebguts.pod M regcomp.c M regcomp.h M regcomp.sym M regcomp_debug.c M regcomp_study.c M regcomp_trie.c M regexec.c M reginline.h M regnodes.h Log Message: ----------- regex engine - simplify regnode structures and make them consistent This eliminates the regnode_2L data structure, and merges it with the older regnode_2 data structure. At the same time it makes each "arg" property of the various regnode types that have one be consistently structured as an anonymous union like this: union { U32 arg1u; I32 arg2i; struct { U16 arg1a; U16 arg1b; }; }; We then expose four macros for accessing each slot: ARG1u() ARG1i() and ARG1a() and ARG1b(). Code then explicitly designates which they want. The old logic used ARG() to access an U32 arg1, and ARG1() to access an I32 arg1, which was confusing to say the least. The regnode_2L structure had a U32 arg1, and I32 arg2, and the regnode_2 data strucutre had two I32 args. With the new set of macros we use the regnode_2 for both, and use the appropriate macros to show whether we want to signed or unsigned values. This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit args". However as each slot can also store two U16s, a regnode_3 can hold up to 6 U16s, or as 3 I32's, or a combination. For instance the CURLY style nodes use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count and ARG3a() and ARG3b() for parens before and inside the quantifier. Compare: https://github.com/Perl/perl5/compare/f7f5b5acb14e...3667fb5504f1