Michael Glaesemann <michael.glaesem...@myyearbook.com> writes: > We came across a regexp that takes very much longer than expected.
I poked into this a little bit. What seems to be happening is that the use of non-greedy quantifiers plus backreferences turns off most of the optimization that the regexp engine usually does, leaving the RE as a tree of possibilities that is explored in a fairly dumb fashion. In particular, there is a loop in crevdissect() that tries to locate a feasible division point between two concatenated sub-patterns, and for each try, a recursive call to crevdissect() tries to locate a feasible division point between *its* two sub-patterns, resulting in O(N^2) runtime. With a not very pleasant constant factor, too, because of repeated startup/shutdown costs for the DFA matching engine. I found a possible patch that seems to improve matters: AFAICS the DFA matching is independent of the checking that cdissect() and friends do, so we can apply that check first instead of second. This brings the runtime down from minutes to sub-second on my machine. However I don't have a whole lot of faith either in the correctness of this change, or that it might not make some other cases slower instead of faster. Has anyone got a reasonably messy collection of regexps they'd like to try this patch on? BTW, I filed the problem upstream with the Tcl project https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894 but I'm not going to hold my breath waiting for useful advice from them. Since Henry Spencer dropped off the net, there doesn't seem to be anybody there who understands that code much better than we do. Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls in there ... regards, tom lane
Index: src/backend/regex/regexec.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/regex/regexec.c,v retrieving revision 1.27 diff -c -r1.27 regexec.c *** src/backend/regex/regexec.c 15 Oct 2005 02:49:24 -0000 1.27 --- src/backend/regex/regexec.c 30 Jan 2010 06:53:13 -0000 *************** *** 804,820 **** for (;;) { /* try this midpoint on for size */ ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! longest(v, d2, mid, end, (int *) NULL) == end && ! (er = cdissect(v, t->right, mid, end)) == ! REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) { ! freedfa(d); ! freedfa(d2); ! return er; } /* that midpoint didn't work, find a new one */ --- 804,821 ---- for (;;) { /* try this midpoint on for size */ ! if (longest(v, d2, mid, end, (int *) NULL) == end) { ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! (er = cdissect(v, t->right, mid, end)) == REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) ! { ! freedfa(d); ! freedfa(d2); ! return er; ! } } /* that midpoint didn't work, find a new one */ *************** *** 904,920 **** for (;;) { /* try this midpoint on for size */ ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! longest(v, d2, mid, end, (int *) NULL) == end && ! (er = cdissect(v, t->right, mid, end)) == ! REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) { ! freedfa(d); ! freedfa(d2); ! return er; } /* that midpoint didn't work, find a new one */ --- 905,922 ---- for (;;) { /* try this midpoint on for size */ ! if (longest(v, d2, mid, end, (int *) NULL) == end) { ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! (er = cdissect(v, t->right, mid, end)) == REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) ! { ! freedfa(d); ! freedfa(d2); ! return er; ! } } /* that midpoint didn't work, find a new one */
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers