[HACKERS] WIP: proof of concept patch for fixing quantified regex backrefs

2012-02-22 Thread Tom Lane
Attached is as far as I've gotten with fixing depesz's complaint about
backrefs embedded within larger quantified expressions (the complaint
being that only the last match of the backref is checked properly).
This is per the analysis I posted at
https://sourceforge.net/tracker/index.php?func=detailaid=1115587group_id=10894atid=110894
to the effect that the engine really has to have an iteration subre
type.

The patch is incomplete because I have not written the code yet for the
shortest-match-first case, only longest-match-first.  Within that
restriction, it seems to work, though I have not yet tried the Tcl
regression tests on it.

I have to set this aside now and go focus on release tasks (like writing
release notes), so it's not going to be done in time to include in the
upcoming back-branch releases.  I have mixed feelings about whether to
treat it as a back-patchable bug fix when it's done ... it's certainly a
bug fix but the odds of introducing new issues seem higher than average.
So maybe it should only go into HEAD anyway.  Thoughts?

regards, tom lane


diff --git a/src/backend/regex/README b/src/backend/regex/README
index 3fd58c000119a24d61dff594baec28a27a4437f0..89ba6a62ea2f70bfda729f03efedba4b9b6fce9b 100644
*** a/src/backend/regex/README
--- b/src/backend/regex/README
*** consists of a tree of sub-expressions (
*** 102,116 
  either plain regular expressions (which are executed as DFAs in the manner
  described above) or back-references (which try to match the input to some
  previous substring).  Non-leaf nodes are capture nodes (which save the
! location of the substring currently matching their child node) or
! concatenation or alternation nodes.  At execution time, the executor
! recursively scans the tree.  At concatenation or alternation nodes,
! it considers each possible alternative way of matching the input string,
! ie each place where the string could be split for a concatenation, or each
! child node for an alternation.  It tries the next alternative if the match
! fails according to the child nodes.  This is exactly the sort of
! backtracking search done by a traditional NFA regex engine.  If there are
! many tree levels it can get very slow.
  
  But all is not lost: we can still be smarter than the average pure NFA
  engine.  To do this, each subre node has an associated DFA, which
--- 102,116 
  either plain regular expressions (which are executed as DFAs in the manner
  described above) or back-references (which try to match the input to some
  previous substring).  Non-leaf nodes are capture nodes (which save the
! location of the substring currently matching their child node),
! concatenation, alternation, or iteration nodes.  At execution time, the
! executor recursively scans the tree.  At concatenation, alternation, or
! iteration nodes, it considers each possible alternative way of matching the
! input string, that is each place where the string could be split for a
! concatenation or iteration, or each child node for an alternation.  It
! tries the next alternative if the match fails according to the child nodes.
! This is exactly the sort of backtracking search done by a traditional NFA
! regex engine.  If there are many tree levels it can get very slow.
  
  But all is not lost: we can still be smarter than the average pure NFA
  engine.  To do this, each subre node has an associated DFA, which
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 6b80140e90940b4a348c342090e26fdd0bc82c8f..487d0670d2d9c6ae4b9fc5e500765d136c4e441d 100644
*** a/src/backend/regex/regcomp.c
--- b/src/backend/regex/regcomp.c
*** parseqatom(struct vars * v,
*** 1036,1046 
  	/*--
  	 * Prepare a general-purpose state skeleton.
  	 *
! 	 *	  --- [s] ---prefix--- [begin] ---atom--- [end] rest--- [rp]
! 	 *	 /			  /
! 	 * [lp]  [s2] bypass-
  	 *
! 	 * where bypass is an empty, and prefix is some repetitions of atom
  	 *--
  	 */
  	s = newstate(v-nfa);		/* first, new endpoints for the atom */
--- 1038,1054 
  	/*--
  	 * Prepare a general-purpose state skeleton.
  	 *
! 	 * In the no-backrefs case, we want this:
  	 *
! 	 * [lp] --- [s] ---prefix--- [begin] ---atom--- [end] ---rest--- [rp]
! 	 *
! 	 * where prefix is some repetitions of atom.  In the general case we need
! 	 *
! 	 * [lp] --- [s] ---iterator--- [s2] ---rest--- [rp]
! 	 *
! 	 * where the iterator wraps around [begin] ---atom--- [end]
! 	 *
! 	 * We make the s state here for both cases; s2 is made below if needed
  	 *--
  	 */
  	s = newstate(v-nfa);		/* first, new endpoints for the atom */
*** parseqatom(struct vars * v,
*** 1051,1061 
  	NOERR();
  	atom-begin = s;
  	atom-end = s2;
! 	s = newstate(v-nfa);		/* and spots for prefix and bypass */
! 	s2 = newstate(v-nfa);
  	NOERR();
  	EMPTYARC(lp, s);
- 	EMPTYARC(lp, s2);
  	NOERR();
  
  	/* break 

Re: [HACKERS] WIP: proof of concept patch for fixing quantified regex backrefs

2012-02-22 Thread Robert Haas
On Wed, Feb 22, 2012 at 11:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Attached is as far as I've gotten with fixing depesz's complaint about
 backrefs embedded within larger quantified expressions (the complaint
 being that only the last match of the backref is checked properly).
 This is per the analysis I posted at
 https://sourceforge.net/tracker/index.php?func=detailaid=1115587group_id=10894atid=110894
 to the effect that the engine really has to have an iteration subre
 type.

 The patch is incomplete because I have not written the code yet for the
 shortest-match-first case, only longest-match-first.  Within that
 restriction, it seems to work, though I have not yet tried the Tcl
 regression tests on it.

 I have to set this aside now and go focus on release tasks (like writing
 release notes), so it's not going to be done in time to include in the
 upcoming back-branch releases.  I have mixed feelings about whether to
 treat it as a back-patchable bug fix when it's done ... it's certainly a
 bug fix but the odds of introducing new issues seem higher than average.
 So maybe it should only go into HEAD anyway.  Thoughts?

+1 for just doing it in HEAD.  This strikes me as the sort of thing
that has a higher-than-average chance of breaking applications in
subtle ways, and that's exactly the sort of thing that we don't want
to do in a minor release, especially in older branches where people
are presumably not upgrading precisely because what they have today
works for them.  If someone really needs the fix, they can always
back-port it themselves... I don't think that code has changed much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers