Greetings.
I've been experimenting with Perl5 REGEXs which cause previously
reported issues with matching, namely, infinite-loops, and
java.lang.stackOverflowException(s). I've found some fairly short
regular expressions and input which seem to demonstrate the danger of
particularly odd or poorly written REGEXs.
I don't have immediate access to PERL 5.0003, but I am using 5.005,
5.008, and 5.8.0. None of these versions of PERL produce exactly the
same results as the Jakarta-ORO matcher. Each REGEXP compiles, and
matches or does match in 6-30ms consistently (minus the odd match time
in the thousands of ms.)
I've tested this test program with jakarta-oro 2.0 -> 2.0.7, JRE 1.3.1
-> 1.4.1, and linux 6.2, 7.0, and 9.0, and the same results apply.
I realize that these are NOT well written regular expressions, but this
whole experiment came about by having some hand me back my Java classes,
saying that my code hangs. I debugged down to the Perl5Matcher.match()
call in my code, and then wrote my test program. When stuff started
blowing up, I started looking through the oro-dev archives, and I
thought I would contribute to the "DON'T DO THIS" list.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
ORO will generate the stackOverflow exception using the following:
INPUT STRING: "X0"
REGEX 1: "X(\d*)*X"
REGEX 2: "X(\d*)+X"
REGEX 3: "X(\d*\s*)*X"
REGEX 4: "X(\d*\s*)+X"
REGEX 5: "X([\d-]*\s*)*X"
REGEX 6: "X([\d-]*\s*)+X"
Minor mods, like "\s*" to "\s+" and "\d*" to "\d+" resolve this
problem. Also, changing the input string to "X0X" causes a match, but
changing it to "X0-" still causes some of the REGEXs to fail, while
others work just fine. See e-mail:
From: Pavel Sher
Subject: StackOverflowException
Date: Fri, 19 Sep 2003 10:00:43 -0700
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Here is an example of the output from the test program. What you
might notice is that once the input string has reached a certain
size, every new character appended to the input string doubles the
match time. See [Bug 6838], [Bug 14850], and [Bug 23132]:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
$ ./RegexLoop1.sh
RegexLoop Version 1 (Linux 2.2.16-22smp [i386]) (Sun Microsystems Inc.
JRE 1.4.0_01-b03) (Jakarta ORO 2.0.7)
Compiling 'X([\d-]+)+X' ... DONE!
Matching 'X' ... DONE! false (2ms)
Matching 'X-' ... DONE! false (0ms)
Matching 'X-1' ... DONE! false (0ms)
Matching 'X-1-' ... DONE! false (0ms)
Matching 'X-1-1' ... DONE! false (0ms)
Matching 'X-1-12' ... DONE! false (1ms)
Matching 'X-1-12-' ... DONE! false (1ms)
Matching 'X-1-12-1' ... DONE! false (79ms)
Matching 'X-1-12-12' ... DONE! false (4ms)
Matching 'X-1-12-123' ... DONE! false (25ms)
Matching 'X-1-12-123-' ... DONE! false (57ms)
Matching 'X-1-12-123-1' ... DONE! false (9ms)
Matching 'X-1-12-123-12' ... DONE! false (7ms)
Matching 'X-1-12-123-123' ... DONE! false (17ms)
Matching 'X-1-12-123-1234' ... DONE! false (30ms)
Matching 'X-1-12-123-1234-' ... DONE! false (73ms)
Matching 'X-1-12-123-1234-1' ... DONE! false (124ms)
Matching 'X-1-12-123-1234-12' ... DONE! false (236ms)
Matching 'X-1-12-123-1234-123' ... DONE! false (475ms)
Matching 'X-1-12-123-1234-1234' ... DONE! false (946ms)
Matching 'X-1-12-123-1234-12345' ... DONE! false (1905ms)
Matching 'X-1-12-123-1234-12345-' ... DONE! false (3861ms)
Matching 'X-1-12-123-1234-12345-1' ... DONE! false (8130ms)
Matching 'X-1-12-123-1234-12345-12' ... DONE! false (16789ms)
Matching 'X-1-12-123-1234-12345-123' ... DONE! false (32375ms)
Matching 'X-1-12-123-1234-12345-1234' ... DONE! false (64900ms)
Matching 'X-1-12-123-1234-12345-12345' ... DONE! false (132538ms)
Matching 'X-1-12-123-1234-12345-123456' ... DONE! false (263073ms)
Matching 'X-1-12-123-1234-12345-123456-' ... DONE! false (529350ms)
Matching 'X-1-12-123-1234-12345-123456-1' ... ^C
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
$ ./RegexLoop1.sh -bad
RegexLoop Version 1 (Linux 2.2.16-22smp [i386]) (Sun Microsystems Inc.
JRE 1.4.0_01-b03) (Jakarta ORO 2.0.7)
Compiling 'X([\d-]*)*X' ... DONE!
Matching 'X' ... DONE! false (4ms)
Matching 'X-' ... Exception in thread "main" java.lang.StackOverflowError
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1296)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1193)
at
org.apache.oro.text.regex.Perl5Matcher.__match(Perl5Matcher.java:1309)
etc..
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Here is my test program:
/*
----------------------------------------------------------------------------
-
* FILE: RegexLoop1.java
* AUTH: Nathan E. Fairchild
* DATE: Nov-14-2003
*
* Example program to illustrate the performance decay with a simple/bad
REGEXP
* and a special set of strings. Run with -bad, or -bad2 and watch it blow
up.
*
* Testing shell script:
*
* ---- RegexLoop1.sh
----------------------------------------------------------
* #!/bin/bash
*
* CLASSPATH=${JAKARTA_HOME}/jakarta-oro-2.0.X.jar;${JAVA_HOME}:./
*
* case $1 in
* -compile) ${JAVA_HOME}/bin/javac -classpath ${CLASSPATH}
RegexLoop1.java
* ;;
* *) ${JAVA_HOME}/bin/java -cp ${CLASSPATH} RegexLoop1 $@
* ;;
* esac
* exit $?
* ---- RegexLoop1.sh
----------------------------------------------------------
*
*
--------------------------------------------------------------------------
*/
import java.lang.String;
import java.util.Iterator;
import java.util.StringTokenizer;
// Used to pull info from Manifest in jar files
import java.util.jar.JarFile;
import java.util.jar.Attributes;
// The matching classes for this test
import org.apache.oro.text.regex.Perl5Pattern;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Compiler;
public class RegexLoop1 {
// We need a matcher to match the message against the pattern
private static final Perl5Matcher matcher = new Perl5Matcher();
// We need a compiler to generate the pattern from the REGEXP
private static final Perl5Compiler compiler = new Perl5Compiler();
// We need a pattern to hold the compiled REGEXP
private Perl5Pattern pattern = null;
// We need the problem REGEXPs
private static final String _regexps[] = {
"X([\\d-]+)+X",
"X([\\d-]*)*X", // java.lang.StackOverflowError
"X([\\d-]*)+X", // java.lang.StackOverflowError
};
// We need a message to excercise the bug (either one)
//private static final String _msg =
"X-1-11-111-1111-11111-111111-1111111-11111111";
private static final String _msg =
"X-1-12-123-1234-12345-123456-1234567-12345678";
/**
* Initializes the built-in REGEXP.
*
* @throws Exception because I don't really care about getting
specific.
*/
public RegexLoop1( int reNum ) throws Exception {
String re = _regexps[reNum];
System.err.print( "\nCompiling '" + re + "' ... " );
pattern = (Perl5Pattern)compiler.compile( re,
Perl5Compiler.SINGLELINE_MASK );
System.err.println( "DONE!\n" );
}
/**
* Return the target message string.
*/
public String getMsg() { return _msg; }
/**
* Test a match of the given msg against the built-in REGEXP.
*/
public boolean match( String msg ) {
System.err.print( "Matching '" + msg + "' ... " );
long startTime = System.currentTimeMillis();
boolean res = matcher.contains( msg, pattern );
long endTime = System.currentTimeMillis();
System.err.println( "DONE! " + res + " (" + (endTime -
startTime) + "ms)");
return res;
}
/**
* Self-document.
*/
public static void identify() {
// Display some basic runtime information
System.err.print( "\nRegexLoop Version 1" );
System.err.print( " (" + System.getProperty("os.name") );
System.err.print( " " + System.getProperty("os.version") );
System.err.print( " [" + System.getProperty("os.arch") +
"])" );
System.err.print( " (" + System.getProperty("java.vendor")
);
System.err.print( " JRE " +
System.getProperty("java.runtime.version") + ")" );
// Grab the path of the ORO matcher jar file
StringTokenizer paths = new StringTokenizer(
System.getProperty("java.class.path"),
System.getProperty("path.separator") );
String jarPath = null;
while( paths.hasMoreTokens() ) {
String p = paths.nextToken();
if( p.indexOf("jakarta-oro-") > -1) {
jarPath = p;
break;
}
}
// Query the manifest for title and version
if( jarPath != null ) {
try {
JarFile jar = new JarFile( jarPath );
Attributes attrs =
jar.getManifest().getAttributes("org/apache/oro");
String specTitle =
attrs.getValue(Attributes.Name.SPECIFICATION_TITLE);
String specVersion =
attrs.getValue(Attributes.Name.SPECIFICATION_VERSION);
System.err.print( " (" + specTitle + " " +
specVersion + ")" );
} catch( Exception e ) {
StringTokenizer parts = new StringTokenizer(
jarPath,
System.getProperty("file.separator") );
String jar = null;
while( parts.hasMoreTokens() ) {
jar = parts.nextToken();
}
System.err.print( " (" + jar + ") " );
}
}
System.err.println( "" );
}
/* MAIN
------------------------------------------------------------- */
public static void main( String args[] ) {
// Identify the working environment
identify();
// Check for the -bad command-line argument
int useRe = 0;
if( args.length > 0 ) {
if( "-bad".equals(args[0]) )
useRe = 1;
else if( "-bad2".equals(args[0]) )
useRe = 2;
}
// Create a new looper
RegexLoop1 looper = null;
try {
looper = new RegexLoop1( useRe );
} catch( Exception e ) {
System.err.println( e );
System.exit( 1 );
}
// Match strings of increasing size
int len = looper.getMsg().length();
for( int i = 1; i <= len; i++ ) {
String msg = looper.getMsg().substring(0, i);
looper.match( msg );
}
}
/* MAIN
------------------------------------------------------------- */
}
Here is some Perl output of the same REGEX matching:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
$ fail2.pl
Fail Version 2 (linux 2.2.16-22smp [i686]) (Perl 5.006)
Compiling 'X([\d-]+)+X' ... DONE!
Matching 'X' ... DONE! false (0.000031)
Matching 'X-' ... DONE! false (0.000006)
Matching 'X-1' ... DONE! false (0.000015)
Matching 'X-1-' ... DONE! false (0.000007)
Matching 'X-1-1' ... DONE! false (0.000006)
Matching 'X-1-12' ... DONE! false (0.000006)
Matching 'X-1-12-' ... DONE! false (0.000023)
Matching 'X-1-12-1' ... DONE! false (0.000008)
Matching 'X-1-12-12' ... DONE! false (0.000007)
Matching 'X-1-12-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-12' ... DONE! false (0.000007)
Matching 'X-1-12-123-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234' ... DONE! false (0.001725)
Matching 'X-1-12-123-1234-' ... DONE! false (0.000010)
Matching 'X-1-12-123-1234-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-123' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-1' ... DONE! false (0.000012)
Matching 'X-1-12-123-1234-12345-12' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-12345' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-1' ... DONE! false (0.003880)
Matching 'X-1-12-123-1234-12345-123456-12' ... DONE! false (0.000008)
Matching 'X-1-12-123-1234-12345-123456-123' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-1234' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-12345' ... DONE! false (0.000006)
Matching 'X-1-12-123-1234-12345-123456-123456' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567' ... DONE! false (0.001842)
Matching 'X-1-12-123-1234-12345-123456-1234567-' ... DONE! false (0.000015)
Matching 'X-1-12-123-1234-12345-123456-1234567-1' ... DONE! false (0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-12' ... DONE! false
(0.000006)
Matching 'X-1-12-123-1234-12345-123456-1234567-123' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-1234' ... DONE! false
(0.000024)
Matching 'X-1-12-123-1234-12345-123456-1234567-12345' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-123456' ... DONE! false
(0.000007)
Matching 'X-1-12-123-1234-12345-123456-1234567-1234567' ... DONE! false
(0.000021)
Matching 'X-1-12-123-1234-12345-123456-1234567-12345678' ... DONE! false
(0.000008)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-Nathan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]