RFR: Regex exponential backtracking issue

Xueming Shen Thu, 03 Mar 2016 22:44:30 -0800

Hi,

webrev: http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev/

Backtracking is the powerful and popular feature provided by Java regexenginewhen the regex pattern contains optional quantifiers or alternations, inwhich theregex engine can return to the previous saved state (backtracking) tocontinuematching/finding other alternative when the current try fails. Howeverit is alsoa "well known" issue that some not-well-written regex might trigger"exponentialbacktraking" in situation like there is no possible match can be foundafter exhaustivesearch, in the worst case the regex engine will just keep going forever,hangup thevm. We have various reports in the past as showed in RegexFun.java<http://cr.openjdk.java.net/%7Esherman/regexBacktracking/RegexFun.java>[1] for this

issue. For example following is the one reported in  JDK-6693451.

    regex:  "^(\\s*foo\\s*)*$"

input: "foo foo foo foo foo foo foo foo foo foo foo foo foo foofoo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foofoo foo foo foo foo foo foo fo",

In which the "group" (\\s*foo\\s*)*$" can alternatively match any one ofthe "foo","foo ", " foo" and " foo ". The regex works fine and runs normally if itcan find amatch (replace the last "fo" in the input to "foo" for example). But inthis casebecause the input text ends with a "fo", the regex actually can nevermatch theinput, it always fails at last one, nothing in the regex can match the"fo". Given thenature of the "greedy quantifier", the engine will bebacktracking/bouncing amongthese 4 choices at each/every iteration/level and keep searching to theend of theinput exhaustively, fail, backtrack, and try again recursively. Based onthe number

of "foo" in the input, it can run forever until the end of the world.

There are lots of suggestions on how to avoid this kind of runawayregex. One is touse the possessive quantifier, if possible, to replace the greedyquantifier ( * -> *+for example ) to workaround this kind of catastrophic backtracking, andit normally

meets the real need and makes the exponential backtracking go away.

The observation of these exponential backtracking cases indicates thatmost of these"exponential backtracking" however are actually not necessary in mostcases. Even wecan't solve all of these catastrophic backtracking, it might bepossible, with a small

cost, we can actually workaround most of the "cases", in particular

(1) when the "exponential backtracking" happens at the top level group +greedy

quantifier, and
(2) there is no group ref in the pattern.

The "group + greedy repetition" in j.u.regex is implemented via threenodes, "Prolog +Loop + body node", in wihch the body takes care of (\\s*foo\\s*), theLoop takes careof the looping for "*", and the Prolog, as its name, just a starter tokick of the looping

(read source for more details).

The normal matching sequence works as

    Prolog.match() ->  Prolog.loop.matchInit()

        -> body.match() -> loop.match()
        -> body.match() -> loop.match()
        -> body.match() -> loop.match()
        ...
        -> body.match() -> loop.match()  -> loop.next.match()

The "body.match() -> loop.match -> body.match() -> loop.match() ..." islooping on the

runtime call stack "recursively" (the body.next is pointing to loop).

If we take the "recursive looping" part that we are interested (we onlycare about the"count >= cmin" situation, as it is where the exponential backtrackinghappens) out of

Loop.match(), the loop.match is as simple as

    boolean match(Matcher matcher, int i, CharSequence seq) {

---> Before
                    boolean b = body.match(matcher, i, seq);
                    // If match failed we must backtrack, so
                    // the loop count should NOT be incremented
                    if (b)
                        return true;
                    matcher.locals[countIndex] = count;
---> After
                    return next.match(matcher, i, seq);
    }

It appears that each of the repetitive "body.match() + loop.match()"looping (though actuallyrecursive on the runtime stack) actually is matching the input textkind of "linearly" in theorder of loop counter, if this is a "top level" group repetition (not aninner group inside

another repetitive group)

        ... foo foo foo foo foo foo foo...
            |_ i = n,
         (body-loop)(body-loop)(body-loop).... ->next()

given the nature of its "greediness", the engine will exhausts all thepossible match of"body+loop+ loop.next" from any given starting position "i". We canactually assume thatif we have failed to match A "body + loop + its down-stream matching"last time atposition "i", next time if we come back here again (after backtrackingthe "up stream"and come down here again), and if we are at exactly the same position"i", we no longerneed to try it again, the result will not change (failed). With theassumption that we DONOT have a group ref in the pattern (the result/content of the group refcan change, ifthe down-stream contains the group ref, we can't assume the result goingforward will

be the same, even start at the same position "i").

for example for the above sample, when we backtrack to loop counter "n",we can dance

among 4 choices

    "foo", " foo", "foo " and " foo ",

but when we pick one and more on to the next iteration/loop at n + 1,the only possiblechoice is either "foo..." or " foo" (with a leading space character), ifwe have tried either ofthem (or both) last time and it failed, we no longer need to try thesame "down stream"again . And the good thing is that this "position-lookup-table" can beshared by allcmin <= n <= cmax iterations, again, because the nature of greediness.In last failed

try, the engine has tried/exhausted all possible combinations of

    body->loop->body->loop...body->loop->next

at "i", including the possilble exhaustive backtracking done by theembedded inner nodesof this group. So we are here at the same position 'i" again, the resultwill be the same.


So here is my propoased workaround solution.

We introduce in a "hashmap" (home-made hashmap for "int", supposed to becheap and faster)to memorize all the tried-and-failed position "i", and check thishashmap before starting a new

round of loop.

    boolean match(Matcher matcher, int i, CharSequence seq) {

----------------
        if (posIndex != -1 &&
            matcher.localsPos[posIndex].contains(i)) {
            return next.match(matcher, i, seq);
        }
----------------
        boolean b = body.match(matcher, i, seq);
        // If match failed we must backtrack, so
        // the loop count should NOT be incremented
        if (b)
            return true;
        matcher.locals[countIndex] = count;

----------------
         if (posIndex != -1) {
             matcher.localsPos[posIndex].add(i);
         }
----------------
         return next.match(matcher, i, seq);
    }

This makes most of the exponential backtracking failures inRegexFun.java [1] go away,

except the two issues.

(1) the exponential backtracking starts/happens at the second innergroup level, in

which the proposed change might help, but does not make it go away

    regex: "(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*"

input:"hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchicchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihichicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccchchhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihhichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihihiihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci"

(2) it's a {n, m} greedy case. This one probably can be handled the sameway as the */+

 but need a little more testing.

    regex: "(\\w|_){1,64}@"
    input: "______________________________"

There are another kind of "abusing" case that overly and repeatitivelyuses ".*" or ".+", such as

reported in jdk-5014450
http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun3.java

regex:"^\\s*sites\\[sites\\.length\\+\\+\\]\\s*=\\s*new\\s+Array\\(.*" +

           "\\s*//\\s*(\\d+).*" +
           "\\s*//\\s*([^-]+).*" +
           "\\s*//\\s*([^-]+).*" +
           "\\s*//\\s*([^-]+).*" +
           "/(?sui).*$"
    text   "\tsites[sites.length++] = new Array(\n" +
           "// 1079193366647 - creation time\n" +

"// 1078902678663 1078852539723 1078753482632 0 0 0 0 0 0 00 0 0 0 - creation time last 14 days\n" +

           "// 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - bad\n" +

"// 0.0030 0.0080 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 -\n\n",

The performance can be much/much better if we can apply the similaroptimization for top-level

dot-greedy-curly nodes as well, as showed at

http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev2/

But I'm a little concerned the possibility that the extra checks mightslowdown the normal case

and just wonder if it is worth the cost. So leave it out for now.

Another optimization included is for the "CharProperty + Curly"combination, normally thisis what you get for the for regex construct .*, \w* \s*. The newGreedyCharProperty node nowtakes advantage of the fact that we are iterating char/codepoint bychar/codepoint, the matching

implementation is more smooth and much faster.

Anyway, please help review and comment.

thanks,
Sherman

[1] http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun.java

RFR: Regex exponential backtracking issue

Reply via email to