Attached is as far as I've gotten with fixing depesz's complaint about
backrefs embedded within larger quantified expressions (the complaint
being that only the last match of the backref is checked properly).
This is per the analysis I posted at
https://sourceforge.net/tracker/index.php?func=detailaid=1115587group_id=10894atid=110894
to the effect that the engine really has to have an iteration subre
type.
The patch is incomplete because I have not written the code yet for the
shortest-match-first case, only longest-match-first. Within that
restriction, it seems to work, though I have not yet tried the Tcl
regression tests on it.
I have to set this aside now and go focus on release tasks (like writing
release notes), so it's not going to be done in time to include in the
upcoming back-branch releases. I have mixed feelings about whether to
treat it as a back-patchable bug fix when it's done ... it's certainly a
bug fix but the odds of introducing new issues seem higher than average.
So maybe it should only go into HEAD anyway. Thoughts?
regards, tom lane
diff --git a/src/backend/regex/README b/src/backend/regex/README
index 3fd58c000119a24d61dff594baec28a27a4437f0..89ba6a62ea2f70bfda729f03efedba4b9b6fce9b 100644
*** a/src/backend/regex/README
--- b/src/backend/regex/README
*** consists of a tree of sub-expressions (
*** 102,116
either plain regular expressions (which are executed as DFAs in the manner
described above) or back-references (which try to match the input to some
previous substring). Non-leaf nodes are capture nodes (which save the
! location of the substring currently matching their child node) or
! concatenation or alternation nodes. At execution time, the executor
! recursively scans the tree. At concatenation or alternation nodes,
! it considers each possible alternative way of matching the input string,
! ie each place where the string could be split for a concatenation, or each
! child node for an alternation. It tries the next alternative if the match
! fails according to the child nodes. This is exactly the sort of
! backtracking search done by a traditional NFA regex engine. If there are
! many tree levels it can get very slow.
But all is not lost: we can still be smarter than the average pure NFA
engine. To do this, each subre node has an associated DFA, which
--- 102,116
either plain regular expressions (which are executed as DFAs in the manner
described above) or back-references (which try to match the input to some
previous substring). Non-leaf nodes are capture nodes (which save the
! location of the substring currently matching their child node),
! concatenation, alternation, or iteration nodes. At execution time, the
! executor recursively scans the tree. At concatenation, alternation, or
! iteration nodes, it considers each possible alternative way of matching the
! input string, that is each place where the string could be split for a
! concatenation or iteration, or each child node for an alternation. It
! tries the next alternative if the match fails according to the child nodes.
! This is exactly the sort of backtracking search done by a traditional NFA
! regex engine. If there are many tree levels it can get very slow.
But all is not lost: we can still be smarter than the average pure NFA
engine. To do this, each subre node has an associated DFA, which
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 6b80140e90940b4a348c342090e26fdd0bc82c8f..487d0670d2d9c6ae4b9fc5e500765d136c4e441d 100644
*** a/src/backend/regex/regcomp.c
--- b/src/backend/regex/regcomp.c
*** parseqatom(struct vars * v,
*** 1036,1046
/*--
* Prepare a general-purpose state skeleton.
*
! * --- [s] ---prefix--- [begin] ---atom--- [end] rest--- [rp]
! * / /
! * [lp] [s2] bypass-
*
! * where bypass is an empty, and prefix is some repetitions of atom
*--
*/
s = newstate(v-nfa); /* first, new endpoints for the atom */
--- 1038,1054
/*--
* Prepare a general-purpose state skeleton.
*
! * In the no-backrefs case, we want this:
*
! * [lp] --- [s] ---prefix--- [begin] ---atom--- [end] ---rest--- [rp]
! *
! * where prefix is some repetitions of atom. In the general case we need
! *
! * [lp] --- [s] ---iterator--- [s2] ---rest--- [rp]
! *
! * where the iterator wraps around [begin] ---atom--- [end]
! *
! * We make the s state here for both cases; s2 is made below if needed
*--
*/
s = newstate(v-nfa); /* first, new endpoints for the atom */
*** parseqatom(struct vars * v,
*** 1051,1061
NOERR();
atom-begin = s;
atom-end = s2;
! s = newstate(v-nfa); /* and spots for prefix and bypass */
! s2 = newstate(v-nfa);
NOERR();
EMPTYARC(lp, s);
- EMPTYARC(lp, s2);
NOERR();
/* break