Branch: refs/heads/yves/curlyx_curlym
Home: https://github.com/Perl/perl5

Commit: 32c009ba5d904b97fa291aa857234dd663694b2c
https://github.com/Perl/perl5/commit/32c009ba5d904b97fa291aa857234dd663694b2c
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)
Changed paths:
  M t/re/re_tests

Log Message:
-----------
t/re/re_tests - extend test to show more buffers

This is a tricky test; showing more buffers makes it a bit easier to
understand if you break it. (Guess what I did?)

Commit: a560ea0be847f8d00ecae70b4894fd3fe7165737
https://github.com/Perl/perl5/commit/a560ea0be847f8d00ecae70b4894fd3fe7165737
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp.c
  M regcomp.h
  M regcomp_internal.h
  M t/re/pat.t
  M t/re/reg_mesg.t

Log Message:
-----------
regcomp.c - increase size of CURLY nodes so the min/max is an I32

This allows us to resolve a test inconsistency between CURLYX and CURLY
and CURLYM. We use I32 because the existing count logic uses -1 and
this keeps everything unsigned compatible.

Commit: cd38d640c233998e5a998a6f53ff668369fd3168
https://github.com/Perl/perl5/commit/cd38d640c233998e5a998a6f53ff668369fd3168
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp_internal.h
  M regcomp_study.c

Log Message:
-----------
regcomp_study.c - Add a way to disable CURLYX optimisations

Also break up the condition so there is one condition per line, which
makes it more readable, and fold repeated binary tests together. This
makes it more obvious what the expression is doing.

Commit: 76e1f20f1d80d8d1bca7e1a4b7410dfe21354764
https://github.com/Perl/perl5/commit/76e1f20f1d80d8d1bca7e1a4b7410dfe21354764
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp_debug.c
  M regcomp_study.c
  M t/re/pat_re_eval.t

Log Message:
-----------
regcomp_study.c - disable CURLYX optimizations when EVAL has been seen anywhere

Historically we disabled CURLYX optimizations when they *contained* an
EVAL, on the assumption that the optimization might affect how many
times, etc, the eval was called. However, this is also true for CURLYX
with evals *afterwards*.
If the CURLYN or CURLYM optimization can prune off the search space,
then an eval afterwards will be affected. And when you take into
account GOSUB, it means that an eval in front might be affected by an
optimization after it. So for now we disable CURLYN and CURLYM in any
pattern with an EVAL.

Commit: 995106349af81a044b298cf9c93b5903acf4670c
https://github.com/Perl/perl5/commit/995106349af81a044b298cf9c93b5903acf4670c
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regexec.c

Log Message:
-----------
regexec.c - rework CLOSE_CAPTURE() macro to take a rex argument

This allows it to be used in contexts where rex isn't set up under this
name.

Commit: d8f65a38e2cd399eb371be91874931737919938b
https://github.com/Perl/perl5/commit/d8f65a38e2cd399eb371be91874931737919938b
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp.c
  M regcomp.h

Log Message:
-----------
regcomp.h - get rid of EXTRA_STEP defines

They are unused these days.

Commit: 568942115c3335bae354da4a9a9e7d8f89eeeaee
https://github.com/Perl/perl5/commit/568942115c3335bae354da4a9a9e7d8f89eeeaee
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp.c

Log Message:
-----------
regcomp.c - add whitespace to binary operation

The tight & is hard to read.

Commit: 9998e79469c31c02a4a7fb5b394df5c57d6a299e
https://github.com/Perl/perl5/commit/9998e79469c31c02a4a7fb5b394df5c57d6a299e
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regcomp_trie.c

Log Message:
-----------
regcomp_trie.c - use the indirect types so we are safe to changes

We shouldn't assume that a TRIEC is a regcomp_charclass. We have a
per-opcode type exactly for this type of use, so let's use it.
Commit: b223d00a98b4a766af19d523037f9b6a8789f43c
https://github.com/Perl/perl5/commit/b223d00a98b4a766af19d523037f9b6a8789f43c
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M pod/perldebguts.pod
  M pp_ctl.c
  M regcomp.c
  M regcomp.h
  M regcomp.sym
  M regcomp_debug.c
  M regexec.c
  M regexp.h
  M regnodes.h
  M t/re/pat.t
  M t/re/pat_rt_report.t
  M t/re/re_tests

Log Message:
-----------
regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE)

CURLYX doesn't reset capture buffers properly. It is possible for
multiple buffers to be defined at once with values from different
iterations of the loop, which doesn't really make sense. An example is
this:

    "foobarfoo"=~/((foo)|(bar))+/

After this matches, $1 should equal $2 and $3 should be undefined, or $1
should equal $3 and $2 should be undefined. Prior to this patch this
would not be the case.

The solution that this patch uses is to introduce a form of "layered
transactional storage" for paren data. The existing pair of start/end
data for capture data is extended with a start_new/end_new pair. When
the vast majority of our code wants to check if a given capture buffer
is defined, it first checks start_new/end_new; if either is -1 then it
falls back to whatever is in start/end.

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in /A*B/
-- thus actually in WHILEM) it "commits" the start_new/end_new data by
writing it into start/end. When we begin a new iteration of the loop we
clear the start_new/end_new pairs that are contained by the loop, by
setting them to -1. If the loop fails then we roll back as we used to.
If the loop succeeds we continue. When we hit an END block we commit
everything.

Consider the example above. We start off with everything set to -1.

    $1 = (-1,-1):(-1,-1)
    $2 = (-1,-1):(-1,-1)
    $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

    $1 = (-1,-1):( 0, 3)
    $2 = (-1,-1):( 0, 3)
    $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data in the
beginning of the next loop:

    $1 = (-1,-1):( 0, 3)
    $2 = ( 0, 3):(-1,-1)
    $3 = (-1,-1):(-1,-1)

We then match "bar":

    $1 = (-1,-1):( 0, 3)
    $2 = ( 0, 3):(-1,-1)
    $3 = (-1,-1):( 3, 7)

and then commit the result and clear the new data:

    $1 = (-1,-1):( 0, 3)
    $2 = (-1,-1):(-1,-1)
    $3 = ( 3, 7):(-1,-1)

and then we match "foo" again:

    $1 = (-1,-1):( 0, 3)
    $2 = (-1,-1):( 7,10)
    $3 = ( 3, 7):(-1,-1)

And we then commit. We do a regcppush here as normal.

    $1 = (-1,-1):( 0, 3)
    $2 = ( 7,10):( 7,10)
    $3 = (-1,-1):(-1,-1)

We then clear it again, but since we don't match, when we regcppop we
restore the buffers to the above layout.

When we finally hit the END block we also commit all buffers, including
the 0th (for the full match).

Fixes GH Issue #18865, and adds tests for it and other things.

Commit: 2e24dc304c7a02911e93910c92ba8482d1d028eb
https://github.com/Perl/perl5/commit/2e24dc304c7a02911e93910c92ba8482d1d028eb
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regexec.c

Log Message:
-----------
fixup for branch reset

Commit: 07aeb06d7eec57efe5866062fc9eee778630d5e9
https://github.com/Perl/perl5/commit/07aeb06d7eec57efe5866062fc9eee778630d5e9
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M MANIFEST
  M t/re/regexp.t
  A t/re/regexp_normal.t

Log Message:
-----------
t/re/regexp_normal.t - test "normalized" forms of patterns

This looks for discrepancies between different ways of writing a
pattern.
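
The "layered transactional storage" described in commit b223d00a above
can be sketched as a small simulation. This is only an illustrative
model in Python, not the regexec.c code: the names Paren, close,
clear_new, commit, span and run_loop are hypothetical, backtracking and
regcppush/regcppop are not modelled, and $1 is committed and cleared the
same way as $2 and $3, which is an assumption. Offsets are computed
directly from the subject string, so "bar" lands at (3,6) and the second
"foo" at (6,9) rather than at the figures used in the walkthrough.

```python
from dataclasses import dataclass

UNSET = -1  # the log message uses -1 for "not set"


@dataclass
class Paren:
    start: int = UNSET      # committed layer (start/end)
    end: int = UNSET
    start_new: int = UNSET  # pending layer (start_new/end_new)
    end_new: int = UNSET

    def close(self, start, end):
        # CLOSE writes into the pending layer, not the committed one.
        self.start_new, self.end_new = start, end

    def clear_new(self):
        # Done at the start of each loop iteration.
        self.start_new = self.end_new = UNSET

    def commit(self):
        # Copy the pending layer over the committed one, even when the
        # pending layer is unset; this is what drops stale captures.
        self.start, self.end = self.start_new, self.end_new

    def span(self):
        # Readers check the pending layer first, then fall back.
        if UNSET in (self.start_new, self.end_new):
            if UNSET in (self.start, self.end):
                return None
            return (self.start, self.end)
        return (self.start_new, self.end_new)


def run_loop(subject, alternatives):
    """Repeat a ((A)|(B)|...)+ style group of literals from offset 0."""
    parens = {n: Paren() for n in range(1, len(alternatives) + 2)}
    pos = 0
    while True:
        for p in parens.values():
            p.clear_new()
        for group, literal in enumerate(alternatives, start=2):
            # Empty literals are skipped so the loop always advances.
            if literal and subject.startswith(literal, pos):
                end = pos + len(literal)
                parens[group].close(pos, end)  # inner group, e.g. $2
                parens[1].close(pos, end)      # enclosing group $1
                break
        else:
            break  # no alternative matched; committed layer stands
        for p in parens.values():
            p.commit()
        pos = end
    return {n: p.span() for n, p in parens.items()}
```

In this model, run_loop("foobarfoo", ["foo", "bar"]) leaves $1 and $2
both on the final "foo" and $3 unset, which is the outcome the log
message asks for.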
Commit: 9936eb18bed72b3dbdd5c791da2833b8762ffb8a
https://github.com/Perl/perl5/commit/9936eb18bed72b3dbdd5c791da2833b8762ffb8a
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M pod/perldebguts.pod
  M regcomp.c
  M regcomp.h
  M regcomp.sym
  M regcomp_debug.c
  M regcomp_trie.c
  M regexec.c
  M regexp.h
  M regnodes.h
  M t/re/re_tests

Log Message:
-----------
regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the
same time. When a branch fails it should reset any capture buffers that
might be touched by its branch.

We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When a
BRANCH operation fails, we clear the buffers it contains before we
continue on.

It is a bit more complex than it should be because we have BRANCHJ and
BRANCH. (One of these days we should merge them together.)

This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers also, at two levels: the
overall TRIE op, and, for jump tries especially, where we emulate the
behavior of branches. So we have to do the same clearing logic if a trie
branch fails as well.

Commit: c569fc6235dbd9effcafdd13864b5ef7b396efd3
https://github.com/Perl/perl5/commit/c569fc6235dbd9effcafdd13864b5ef7b396efd3
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M pod/perldelta.pod
  M pod/perlre.pod
  M regcomp.c
  M regcomp.h
  M regcomp_debug.c
  M regcomp_internal.h
  M regcomp_study.c
  M regexec.c
  M regnodes.h
  M t/re/pat_re_eval.t
  M t/re/pat_rt_report.t
  M toke.c

Log Message:
-----------
regcomp.c - add optimistic eval

This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and
(??{ ... }). The only difference is that the star variants are
"optimistic" and are defined to never disable optimisations.
This is especially relevant now that use of (?{ ... }) prevents
important optimisations anywhere in the pattern, instead of the older
and inconsistent rules where it only affected the parts that contained
the EVAL.

It is also very useful for injecting debugging-style expressions into
the pattern to understand what the regex engine is actually doing. The
older style (?{ ... }) variants would change the regex engine's
behavior, meaning this was not as effective a tool as it could have
been.

Commit: 7351c48f24377d6681942ff8f6bcb0776dab68b4
https://github.com/Perl/perl5/commit/7351c48f24377d6681942ff8f6bcb0776dab68b4
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M regexec.c
  M t/re/pat_re_eval.t
  M t/re/regexp.t

Log Message:
-----------
regexec.c - fix accept in CURLYX/WHILEM construct.

The ACCEPT logic didn't know how to handle WHILEM, which for some reason
does not have a next_off defined. I am not sure why. This was revealed
by forcing CURLYX optimisations off.

This includes a patch to test what happens if we embed an eval group in
the tests run by regexp.t when run via regexp_normal.t, which disabled
CURLYX -> CURLYN and CURLYM optimisations and revealed this issue.

Commit: 6513f007a25a2d370de568871b4fc07b4a8094dd
https://github.com/Perl/perl5/commit/6513f007a25a2d370de568871b4fc07b4a8094dd
Author: Yves Orton <demer...@gmail.com>
Date: 2023-01-12 (Thu, 12 Jan 2023)

Changed paths:
  M pod/perldelta.pod

Log Message:
-----------
perldelta - add note about regex engine changes

Capture buffer semantics should now be consistent.

Compare: https://github.com/Perl/perl5/compare/ab4ad8dbfee3...6513f007a25a