Hi,

webrev: http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev/

Backtracking is the powerful and popular feature provided by Java regex engine when the regex pattern contains optional quantifiers or alternations, in which the regex engine can return to the previous saved state (backtracking) to continue matching/finding other alternative when the current try fails. However it is also a "well known" issue that some not-well-written regex might trigger "exponential backtraking" in situation like there is no possible match can be found after exhaustive search, in the worst case the regex engine will just keep going forever, hangup the vm. We have various reports in the past as showed in RegexFun.java <http://cr.openjdk.java.net/%7Esherman/regexBacktracking/RegexFun.java> [1] for this
issue. For example following is the one reported in  JDK-6693451.

    regex:  "^(\\s*foo\\s*)*$"
input: "foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo fo",

In which the "group" (\\s*foo\\s*)*$" can alternatively match any one of the "foo", "foo ", " foo" and " foo ". The regex works fine and runs normally if it can find a match (replace the last "fo" in the input to "foo" for example). But in this case because the input text ends with a "fo", the regex actually can never match the input, it always fails at last one, nothing in the regex can match the "fo". Given the nature of the "greedy quantifier", the engine will be backtracking/bouncing among these 4 choices at each/every iteration/level and keep searching to the end of the input exhaustively, fail, backtrack, and try again recursively. Based on the number
of "foo" in the input, it can run forever until the end of the world.

There are lots of suggestions on how to avoid this kind of runaway regex. One is to use the possessive quantifier, if possible, to replace the greedy quantifier ( * -> *+ for example ) to workaround this kind of catastrophic backtracking, and it normally
meets the real need and makes the exponential backtracking go away.

The observation of these exponential backtracking cases indicates that most of these "exponential backtracking" however are actually not necessary in most cases. Even we can't solve all of these catastrophic backtracking, it might be possible, with a small
cost, we can actually workaround most of the "cases", in particular

(1) when the "exponential backtracking" happens at the top level group + greedy
quantifier, and
(2) there is no group ref in the pattern.

The "group + greedy repetition" in j.u.regex is implemented via three nodes, "Prolog + Loop + body node", in wihch the body takes care of (\\s*foo\\s*), the Loop takes care of the looping for "*", and the Prolog, as its name, just a starter to kick of the looping
(read source for more details).

The normal matching sequence works as

    Prolog.match() ->  Prolog.loop.matchInit()

        -> body.match() -> loop.match()
        -> body.match() -> loop.match()
        -> body.match() -> loop.match()
        ...
        -> body.match() -> loop.match()  -> loop.next.match()


The "body.match() -> loop.match -> body.match() -> loop.match() ..." is looping on the
runtime call stack "recursively" (the body.next is pointing to loop).

If we take the "recursive looping" part that we are interested (we only care about the "count >= cmin" situation, as it is where the exponential backtracking happens) out of
Loop.match(), the loop.match is as simple as

    boolean match(Matcher matcher, int i, CharSequence seq) {

---> Before
                    boolean b = body.match(matcher, i, seq);
                    // If match failed we must backtrack, so
                    // the loop count should NOT be incremented
                    if (b)
                        return true;
                    matcher.locals[countIndex] = count;
---> After
                    return next.match(matcher, i, seq);
    }

It appears that each of the repetitive "body.match() + loop.match()" looping (though actually recursive on the runtime stack) actually is matching the input text kind of "linearly" in the order of loop counter, if this is a "top level" group repetition (not an inner group inside
another repetitive group)

        ... foo foo foo foo foo foo foo...
            |_ i = n,
         (body-loop)(body-loop)(body-loop).... ->next()

given the nature of its "greediness", the engine will exhausts all the possible match of "body+loop+ loop.next" from any given starting position "i". We can actually assume that if we have failed to match A "body + loop + its down-stream matching" last time at position "i", next time if we come back here again (after backtracking the "up stream" and come down here again), and if we are at exactly the same position "i", we no longer need to try it again, the result will not change (failed). With the assumption that we DO NOT have a group ref in the pattern (the result/content of the group ref can change, if the down-stream contains the group ref, we can't assume the result going forward will
be the same, even start at the same position "i").

for example for the above sample, when we backtrack to loop counter "n", we can dance
among 4 choices

    "foo", " foo", "foo " and " foo ",

but when we pick one and more on to the next iteration/loop at n + 1, the only possible choice is either "foo..." or " foo" (with a leading space character), if we have tried either of them (or both) last time and it failed, we no longer need to try the same "down stream" again . And the good thing is that this "position-lookup-table" can be shared by all cmin <= n <= cmax iterations, again, because the nature of greediness. In last failed
try, the engine has tried/exhausted all possible combinations of

    body->loop->body->loop...body->loop->next

at "i", including the possilble exhaustive backtracking done by the embedded inner nodes of this group. So we are here at the same position 'i" again, the result will be the same.

So here is my propoased workaround solution.

We introduce in a "hashmap" (home-made hashmap for "int", supposed to be cheap and faster) to memorize all the tried-and-failed position "i", and check this hashmap before starting a new
round of loop.

    boolean match(Matcher matcher, int i, CharSequence seq) {

----------------
        if (posIndex != -1 &&
            matcher.localsPos[posIndex].contains(i)) {
            return next.match(matcher, i, seq);
        }
----------------
        boolean b = body.match(matcher, i, seq);
        // If match failed we must backtrack, so
        // the loop count should NOT be incremented
        if (b)
            return true;
        matcher.locals[countIndex] = count;

----------------
         if (posIndex != -1) {
             matcher.localsPos[posIndex].add(i);
         }
----------------
         return next.match(matcher, i, seq);
    }

This makes most of the exponential backtracking failures in RegexFun.java [1] go away,
except the two issues.

(1) the exponential backtracking starts/happens at the second inner group level, in
which the proposed change might help, but does not make it go away

    regex: "(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*"
input: "hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchicchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihichicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccchchhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihhichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihihiihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci"

(2) it's a {n, m} greedy case. This one probably can be handled the same way as the */+
 but need a little more testing.

    regex: "(\\w|_){1,64}@"
    input: "______________________________"

There are another kind of "abusing" case that overly and repeatitively uses ".*" or ".+", such as
reported in jdk-5014450
http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun3.java

regex: "^\\s*sites\\[sites\\.length\\+\\+\\]\\s*=\\s*new\\s+Array\\(.*" +
           "\\s*//\\s*(\\d+).*" +
           "\\s*//\\s*([^-]+).*" +
           "\\s*//\\s*([^-]+).*" +
           "\\s*//\\s*([^-]+).*" +
           "/(?sui).*$"
    text   "\tsites[sites.length++] = new Array(\n" +
           "// 1079193366647 - creation time\n" +
"// 1078902678663 1078852539723 1078753482632 0 0 0 0 0 0 0 0 0 0 0 - creation time last 14 days\n" +
           "// 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - bad\n" +
"// 0.0030 0.0080 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -\n\n",

The performance can be much/much better if we can apply the similar optimization for top-level
dot-greedy-curly nodes as well, as showed at

http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev2/

But I'm a little concerned the possibility that the extra checks might slowdown the normal case
and just wonder if it is worth the cost. So leave it out for now.

Another optimization included is for the "CharProperty + Curly" combination, normally this is what you get for the for regex construct .*, \w* \s*. The new GreedyCharProperty node now takes advantage of the fact that we are iterating char/codepoint by char/codepoint, the matching
implementation is more smooth and much faster.

Anyway, please help review and comment.

thanks,
Sherman

[1] http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun.java

Reply via email to