Daniel,
I was not implying that the ORO classes are larger than they should be. I was
merely comparing the physical sizes of the regexp and oro jars. Although I
probably won't personally make use of separate "slice" jars, it makes a lot
of sense to offer that option for developers who only need one or two subsets
of the functionality oro offers (especially when more general text processing
is added).
Now, to supply the sample benchmark code that backs up my
"performance anecdotes"...
I compared the ORO Perl5Pattern/Perl5Matcher classes with the Regexp RE
class (although I also compared the Perl5Util and the RE classes with similar
results). I'm attaching a (slightly modified) copy of the code on which I
based my previous speed assertions. The modifications, aside from adding
documentation, amount to removing many of the patterns that I tested. This
was meant as a highly targeted test of regular expression match times
(making no comparison of substitution times) and it's possible that a more
varied selection of patterns will yield different results. I ran this code
using Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode) on an 800 MHz
PIII with 256 MB of RAM.
I believe that I am using both the Regexp and ORO packages in an efficient
manner, but please let me know if this is not the case.
Thanks,
Ed.
Disclaimer: The attached code was written as a quick test and does not
necessarily reflect my best judgment in coding practices. Also, I used
synchronized blocks in an effort to ensure that Java's dynamic class loading
didn't cause erroneous timing results. If this is completely off-base, please
let me know of a better approach.
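One commonly used alternative to the synchronized blocks is to run the
measured code untimed a number of times first, so that class loading and any
JIT compilation happen before the clock starts. The following is only a
minimal sketch of that idea, not part of the attached test: the class and
method names are hypothetical, it assumes a Java 8 or later runtime (lambdas,
System.nanoTime), and it uses java.util.regex as a stand-in so that it runs
without the Regexp or ORO jars.

```java
import java.util.regex.Pattern;

public class WarmupTimingSketch {

    /**
     * Runs the task warmupRuns times without timing it, then returns the
     * elapsed nanoseconds for a further timedRuns runs.
     * (Hypothetical helper, introduced here for illustration only.)
     */
    public static long timeAfterWarmup(Runnable task, int warmupRuns, int timedRuns) {
        // Untimed passes: classes used by the task are loaded here,
        // and the JIT gets a chance to compile the hot path.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Assumption: java.util.regex stands in for RE / Perl5Matcher.
        final Pattern digits = Pattern.compile("\\d+");
        final int[] hits = { 0 };
        Runnable task = () -> {
            if (digits.matcher("This 1 has a digit").find()) {
                hits[0]++;
            }
        };
        long nanos = timeAfterWarmup(task, 1000, 10000);
        System.out.println("10000 timed matches took "
                           + (nanos / 1000) + " microseconds");
    }
}
```

The same helper could be called once per package with a task that wraps the
RE or Perl5Matcher call, which keeps the warm-up identical for both sides.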
On Thu, 10 May 2001 "Daniel F. Savarese" <[EMAIL PROTECTED]> wrote:
>
>
> >Basically, the regexp package is smaller and has a reduced feature set.
> >In fact, the regexp package jar file is less than half the size of the oro
> >package jar.
>
> The feeling that jakarta-oro is large is a common misconception. The size
> of what used to be OROMatcher is very small. All you need for
> regular expressions is the org.apache.oro.text.regex package, not all
> of the other stuff. To alleviate this misconception, we're going to
> provide a jakarta-oro jar that has everything and then separate jars for
> strictly those slices that people want, roughly corresponding to the old
> OROMatcher, PerlTools, AwkTools, and TextTools packages.
>
> >Initially, regexp handles matching (and rejecting matches) more quickly. But,
> >after a few hundred matches, the time required by the regexp package
> >(especially in rejecting matches) increases considerably when compared to
> >the oro package.
>
> This is also another misconception, although not directly in relation to
> the regexp package. The jakarta-oro package has 4 different regular
> expression packages. So when you compare performance, you have to
> specify which one. Also, a lot of times people talk about jakarta-oro
> when they really mean the Perl5Util class, which is a convenience
> wrapper around the org.apache.oro.text.regex package. Perl5Util will
> always be slow (although we can improve its performance) because it
> does a higher level set of parsing so that you can use Perl-specific
> syntactic sugar like 's/foobar/barfoo/g' instead of the allegedly
> more cumbersome approach of directly using the org.apache.oro.text.regex
> classes. Furthermore, most people blatantly misuse the
> org.apache.oro.text.regex package by constantly reinstantiating
> Perl5Compiler and Perl5Matcher instances and constantly recompiling
> regular expressions. Hopefully this will stop after we write a new
> user's guide explaining how to make proper use of the package.
> A valid performance comparison can only be made by posting the code used
> to make the comparison. I don't know how you reached the assessment you
> made. All performance evaluation code is welcome on oro-dev because
> even though the primary goal for at least the Perl related stuff is to
> achieve compatibility with Perl, the secondary goal is to be as fast
> as possible within the constraints of Perl's regex syntax and Java's
> runtime performance.
>
> daniel
>
>
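The reuse pattern Daniel describes (compile each expression once, keep the
compiled pattern, and reuse it for every match) can be sketched as below.
This is an illustration under stated assumptions rather than ORO code: it
uses java.util.regex.Pattern as a stand-in so that it runs without the ORO
jar, and the class and method names are hypothetical. With ORO, the
equivalent would be to keep a single Perl5Compiler, a single Perl5Matcher,
and the compiled Pattern objects, as the attached test does.

```java
import java.util.regex.Pattern;

public class CompileOnceSketch {

    // Compiled once, up front, and reused for every input string.
    private static final Pattern WORD_MATCH =
        Pattern.compile("\\bmatch", Pattern.CASE_INSENSITIVE);

    /** Counts the inputs that contain a match, reusing the compiled pattern. */
    public static int countMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            if (WORD_MATCH.matcher(input).find()) {
                count++;
            }
        }
        return count;
    }

    /** The misuse: the same expression is recompiled for every input. */
    public static int countMatchesRecompiling(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            Pattern p = Pattern.compile("\\bmatch", Pattern.CASE_INSENSITIVE);
            if (p.matcher(input).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "This is a matching string",
            "This is a failing string",
        };
        System.out.println("reused pattern:     " + countMatches(inputs));
        System.out.println("recompiled pattern: " + countMatchesRecompiling(inputs));
    }
}
```

Both methods return the same count; the difference is only in how much
compilation work is repeated inside the loop.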
import org.apache.regexp.RE;
import org.apache.regexp.RESyntaxException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.PatternMatcherInput;
import org.apache.oro.text.regex.MalformedPatternException;
/**
* Compares the matching speed of the Jakarta Regexp and ORO packages
* using the org.apache.regexp.RE class and the
* org.apache.oro.text.regex.Perl5Pattern class.
*
* @author Ed Chidester
*/
public class TestREvPerl5Pattern {
/**
* An array of Perl-syntax regular expression strings for testing.
* This array should have the same number of elements as the
* {@link #matchString__ first matching string} array and the
* {@link #failString__ first failing string} array.
* Also, the number of elements in this array should equal the number of
* rows in the {@link #matchString2__ second matching string}
* array and the number of rows in the
* {@link #failString2__ second failing string} array.
*/
private static String [ ] reString__ = {
"/\\bmatch/i" ,
"/\\d+/" ,
// -------------------------------
// Insert more array elements here
// -------------------------------
};
/**
* An array of matching strings for testing.
* This array is used for the first round of timing test results in the
* {@link #main main testing method}.
* Each element will match against the corresponding element in
* the {@link #reString__ regular expression array}.
*/
private static String [ ] matchString__ = {
"This is a matching string" ,
"This 1 has a digit" ,
// -------------------------------
// Insert more array elements here
// -------------------------------
};
/**
* An array of failing strings for testing.
* This array is used for the first round of timing test results in the
* {@link #main main testing method}.
* Each element will fail to match against the corresponding element in
* the {@link #reString__ regular expression array}.
*/
private static String [ ] failString__ = {
"This is a failing string" ,
"This one doesn't have digits" ,
// -------------------------------
// Insert more array elements here
// -------------------------------
};
/**
* An array of multiple matching strings for testing.
* This array is used for the second round of timing tests in the
* {@link #main main testing method}.
* Each row holds Strings that will match the corresponding element
* in the {@link #reString__ regular expression array}.
*/
private static String [ ] [ ] matchString2__ = {
{ "This string matches" ,
"There is also a match in this string" ,
"How many matches should a match string have?" ,
"Many smokers use a match to light their cigarettes" ,
"This is a matching string" ,
} ,
{ "12345" ,
"867-5349" ,
"2 is not the loneliest number" ,
"This one's 4 you" ,
"This 1 has a digit" ,
} ,
// -------------------------------
// Insert more array elements here
// -------------------------------
};
/**
* An array of multiple failing strings for testing.
* This array is used for the second round of timing tests in the
* {@link #main main testing method}.
* Each row holds Strings that will not match the corresponding element
* in the {@link #reString__ regular expression array}.
*/
private static String [ ] [ ] failString2__ = {
{ "This string fails" ,
// "nonmatch" <--- RE doesn't work for '\b'
"There is also a fail in this string" ,
"How many failures should a failing string have?" ,
"Many smokers use a lighter to light their cigarettes" ,
"This is a failure-full string" ,
} ,
{ "one" ,
"phone number" ,
"two is not the loneliest number" ,
"This one's for you" ,
"This one has a digit" ,
} ,
// -------------------------------
// Insert more array elements here
// -------------------------------
};
/**
* <p>
* Main test method comparing the Regexp and ORO packages.
* There are two testing rounds. The first round retrieves matching and
* match failure timing results for both the Regexp and ORO packages
* (within the same loop). The second round retrieves separate match and
* failure times individually for both packages.
* </p>
*
* <p>
* Takes the number of cycles to run through as its command-line argument.
* If no command line argument is given, then the default of 100 cycles
* is used.
* </p>
*/
public static void main( String [ ] args ) {
//-------------------------------------------------------------------
// NOTE: Many blocks of code are not indented
// (synchronized and for loops)
// -- This is to keep subsequent lines of code from getting too long
//-------------------------------------------------------------------
int i;
int k;
int x;
int numCycles = 100;
int size = reString__.length;
int reMatchFlags;
int oroMatchFlags;
int firstIndex;
int lastIndex;
int finalIndex;
long reStartTime = 0;
long reStopTime = 0;
long oroStartTime = 0;
long oroStopTime = 0;
long reMatchStart2 = 0;
long reMatchStop2 = 0;
long oroMatchStart2 = 0;
long oroMatchStop2 = 0;
long reFailStart2 = 0;
long reFailStop2 = 0;
long oroFailStart2 = 0;
long oroFailStop2 = 0;
RE [] reObject = new RE [ size ];
Pattern [] oroObject = new Pattern [ size ];
Perl5Matcher oroMatcher;
Perl5Compiler oroCompiler;
if ( args.length == 1 ) {
try {
numCycles = Integer.parseInt( args[ 0 ] );
}
catch ( NumberFormatException e ) {
System.err.println( "Default of 100 used for number of cycles");
}
}
if ( matchString__.length < size ) {
size = matchString__.length;
}
if ( failString__.length < size ) {
size = failString__.length;
}
try {
// Initialize the Oro objects
oroMatcher = new Perl5Matcher( );
oroCompiler = new Perl5Compiler( );
synchronized ( reString__ ) {
synchronized ( matchString__ ) {
synchronized ( failString__ ) {
// ---------------------------------------------
// Initialize all the regular expression objects
// (both RE and ORO package objects)
// ---------------------------------------------
for ( i = 0 ; i < size ; i++ ) {
reMatchFlags = RE.MATCH_NORMAL;
oroMatchFlags = Perl5Compiler.DEFAULT_MASK;
firstIndex = reString__[ i ].indexOf( '/' );
lastIndex = reString__[ i ].lastIndexOf( '/' );
finalIndex = reString__[ i ].length( );
if ( lastIndex <= firstIndex ) {
System.err.println( "Error reading regular expression \""
+ reString__[ i ] + "\"" );
reObject[ i ] = new RE( reString__[ i ] );
lastIndex = reString__[ i ].length( );
}
else {
// // Testing printout...
// System.out.println( "Creating RE( \""
// + reString__[ i ].substring( firstIndex + 1 ,
// lastIndex )
// + "\" )" );
reObject[ i ] =
new RE( reString__[ i ].substring( firstIndex + 1 ,
lastIndex ) );
}
// Account for any global or case insensitive matches
for ( int j = lastIndex + 1 ; j < finalIndex ; j++ ) {
if ( reString__[ i ].charAt( j ) == 'i' ) {
// // Testing printout...
// System.out.println( "Case independent\t"
// + reString__[ i ] );
reMatchFlags |= RE.MATCH_CASEINDEPENDENT;
oroMatchFlags |= Perl5Compiler.CASE_INSENSITIVE_MASK;
}
else if ( reString__[ i ].charAt( j ) == 'm' ) {
// // Testing printout...
// System.out.println( "Multiline match\t"
// + reString__[ i ] );
reMatchFlags |= RE.MATCH_MULTILINE;
oroMatchFlags |= Perl5Compiler.MULTILINE_MASK;
}
// @UPDATE: As of Feb. 2001 (jakarta regexp 1.2)
// RE.MATCH_SINGLELINE is not allowed by
// RE.setMatchFlags
// else if ( reString__[ i ].charAt( j ) == 's' ) {
// reMatchFlags |= RE.MATCH_SINGLELINE;
// oroMatchFlags |= Perl5Compiler.SINGLELINE_MASK;
// }
else {
System.err.println( "Regular expression option \"/"
+ reString__[ i ].charAt( j )
+ "\" is being ignored" );
}
} // End for j
reObject[ i ].setMatchFlags( reMatchFlags );
oroObject[ i ] = oroCompiler.compile(
reString__[ i ].substring( (firstIndex + 1) , lastIndex ) ,
oroMatchFlags );
} // End for i
} // End synchronized failString__
} // End synchronized matchString__
} // End synchronized reString__
// ---------------------------
// Begin initial testing round
// ---------------------------
// Make sure that everything has caught up
// Runtime.getRuntime( ).exec( "sleep 10 " );
// Runtime.getRuntime( ).exec( "pause" );
synchronized ( matchString__ ) {
synchronized ( failString__ ) {
// Test the time taken by RE objects
reStartTime = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
if ( ! reObject[ i ].match( matchString__[ i ] ) ) {
System.err.println( "Error with RE match[ " + i + " ]" );
System.err.println( reString__[ i ] );
System.err.println( matchString__[ i ] );
}
if ( reObject[ i ].match( failString__[ i ] ) ) {
System.err.println( "Error with RE fail[ " + i + " ]" );
System.err.println( reString__[ i ] );
System.err.println( failString__[ i ] );
}
} // End for i
} // End for k
reStopTime = System.currentTimeMillis( );
} // End synchronized failString__
} // End synchronized matchString__
// Make sure that everything has caught up
// Runtime.getRuntime( ).exec( "sleep 10 " );
// Runtime.getRuntime( ).exec( "pause" );
synchronized ( matchString__ ) {
synchronized ( failString__ ) {
// Test the time taken by oro objects
oroStartTime = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
PatternMatcherInput pmiMatch =
new PatternMatcherInput( matchString__[ i ] );
PatternMatcherInput pmiFail =
new PatternMatcherInput( failString__[ i ] );
if ( ! oroMatcher.contains( pmiMatch , oroObject[ i ] ) ) {
System.err.println( "Error with Perl5 match[" + i + "]" );
System.err.println( reString__[ i ] );
System.err.println( matchString__[ i ] );
}
if ( oroMatcher.contains( pmiFail , oroObject[ i ] ) ) {
System.err.println( "Error with Perl5 fail[" + i + "]" );
System.err.println( reString__[ i ] );
System.err.println( failString__[ i ] );
}
} // End for i
} // End for k
oroStopTime = System.currentTimeMillis( );
} // End synchronized failString__
} // End synchronized matchString__
// --------------------------
// Begin second testing round
// --------------------------
synchronized ( matchString2__ ) {
reMatchStart2 = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
for ( x = 0 ; x < matchString2__[ i ].length ; x++ ) {
if ( ! reObject[ i ].match( matchString2__[ i ][ x ] ) ) {
System.err.println( "Error with RE match[ "
+ i + " ][ " + x + " ]" );
System.err.println( reString__[ i ] );
System.err.println( matchString2__[ i ][ x ] );
}
}
} // End for i
} // End for k
reMatchStop2 = System.currentTimeMillis( );
} // End of synchronized matchString2__
synchronized ( matchString2__ ) {
oroMatchStart2 = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
for ( x = 0 ; x < matchString2__[ i ].length ; x++ ) {
PatternMatcherInput pmiObj =
new PatternMatcherInput( matchString2__[ i ][ x ] );
if ( ! oroMatcher.contains( pmiObj , oroObject[ i ] ) ) {
System.err.println( "Error with Perl5 match[ "
+ i + " ][ " + x + " ]" );
System.err.println( reString__[ i ] );
System.err.println( matchString2__[ i ][ x ] );
}
}
} // End for i
} // End for k
oroMatchStop2 = System.currentTimeMillis( );
} // End synchronized matchString2__
synchronized ( failString2__ ) {
reFailStart2 = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
for ( x = 0 ; x < failString2__[ i ].length ; x++ ) {
if ( reObject[ i ].match( failString2__[ i ][ x ] ) ) {
System.err.println( "Error with RE fail[ "
+ i + " ][ " + x + " ]" );
System.err.println( reString__[ i ] );
System.err.println( failString2__[ i ][ x ] );
}
}
} // End for i
} // End for k
reFailStop2 = System.currentTimeMillis( );
} // End synchronized failString2__
synchronized ( failString2__ ) {
oroFailStart2 = System.currentTimeMillis( );
for ( k = 0 ; k < numCycles ; k++ ) {
for ( i = 0 ; i < size ; i++ ) {
for ( x = 0 ; x < failString2__[ i ].length ; x++ ) {
PatternMatcherInput pmiObj =
new PatternMatcherInput( failString2__[ i ][ x ] );
if ( oroMatcher.contains( pmiObj , oroObject[ i ] ) ) {
System.err.println( "Error with Perl5 fail[ "
+ i + " ][ " + x + " ]" );
System.err.println( reString__[ i ] );
System.err.println( failString2__[ i ][ x ] );
}
}
} // End for i
} // End for k
oroFailStop2 = System.currentTimeMillis( );
} // End synchronized failString2__
}
catch ( Exception e ) {
System.err.println( e + " caught while running test" );
e.printStackTrace( System.err );
System.exit( 1 );
}
// -----------------
// Print the results
// -----------------
System.out.println( "RE Objects took "
+ ( reStopTime - reStartTime )
+ " milliseconds" );
System.out.println( "Perl5Pattern Objects took "
+ ( oroStopTime - oroStartTime )
+ " milliseconds" );
System.out.println( "\n\n" );
System.out.println( "reMatch Time = "
+ ( reMatchStop2 - reMatchStart2 )
+ " milliseconds\n"
+ "reFail Time = "
+ ( reFailStop2 - reFailStart2 )
+ " milliseconds" );
System.out.println( "oroMatch Time = "
+ ( oroMatchStop2 - oroMatchStart2 )
+ " milliseconds\n"
+ "oroFail Time = "
+ ( oroFailStop2 - oroFailStart2 )
+ " milliseconds" );
} // End main method
} // TestREvPerl5Pattern class