In message <[EMAIL PROTECTED]>, =?iso-8859-1?q ?Janek=20Bogucki?= writes: >Here is a reduced, standalone test class, >Performance.java.
Thanks. It's a lot easier to see what the pattern is now and what match attempts are taking the most time. This is the culprit: > /* > * enclosed content > */ > "((\\s|.)*)" + Alternations are expensive, but much more so when a subpattern has the potential to match the entire remaining input on every backtrack. Instead of this you should use: "(.*)" + and compile the pattern with Perl5Compiler.SINGLELINE_MASK so that . will match newlines: _pattern = compiler.compile(SERVER_ACTIVE_HTML_PATTERN, Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.SINGLELINE_MASK | Perl5Compiler.READ_ONLY_MASK); All the matches should now take less than can be measured with System.currentTimeMillis(). The problem with writing regular expressions is that you often have to know something about how they are implemented in order to avoid writing inefficient ones. For example, the expression you used would be fine with the Awk package because AwkCompiler will reduce all equivalent regular expressions to the same DFA (at the price that it only works with 8-bit character input). But the Perl package (and Perl itself) treats the expressions as NFAs and has to dynamically search the state space of possible matches rather than relying on predetermined state transitions. "Mastering Regular Expressions" by Jeffrey Friedl does a pretty good job of explaining how regular expressions typically work and how to avoid writing "slow" expressions. The trick is to avoid constructs that will cause a lot of backtracking. daniel