In message <[EMAIL PROTECTED]>, =?iso-8859-1?q
?Janek=20Bogucki?= writes:
>Here is a reduced, standalone test class,
>Performance.java.

Thanks.  It's a lot easier to see what the pattern is now and what
match attempts are taking the most time.  This is the culprit:

>    /*
>     * enclosed content
>     */
>    "((\\s|.)*)" +

Alternations are expensive, but much more so when a subpattern has the
potential to match the entire remaining input on every backtrack.  Instead
of this you should use:

     "(.*)" +

and compile the pattern with Perl5Compiler.SINGLELINE_MASK so that
. will match newlines:

_pattern = compiler.compile(SERVER_ACTIVE_HTML_PATTERN,
                            Perl5Compiler.CASE_INSENSITIVE_MASK |
                            Perl5Compiler.SINGLELINE_MASK |
                            Perl5Compiler.READ_ONLY_MASK);

All the matches should now take less than can be measured with
System.currentTimeMillis().

The problem with writing regular expressions is that you often have
to know something about how they are implemented in order to avoid
writing inefficient ones.  For example, the expression you used would
be fine with the Awk package because AwkCompiler will reduce all equivalent
regular expressions to the same DFA (at the price that it only works with
8-bit character input).  But the Perl package (and Perl itself) treats
the expressions as NFAs and has to dynamically search the state space of
possible matches rather than relying on predetermined state transitions.
"Mastering Regular Expressions" by Jeffrey Friedl does a pretty good job
of explaining how regular expressions typically work and how to avoid
writing "slow" expressions.  The trick is to avoid constructs that will
cause a lot of backtracking.

daniel


Reply via email to