Daniel,

I was not implying that the ORO classes are larger than they should be. I was 
merely comparing the physical sizes of the regexp and oro jars. Although I 
probably won't personally make use of separate "slice" jars, it makes a lot 
of sense to offer that option for developers who only need one or two subsets 
of the functionality oro offers (especially when more general text processing 
is added).

Now, to supply the sample benchmark code that backs up my 
"performance anecdotes"...

I compared the ORO Perl5Pattern/Perl5Matcher classes with the Regexp RE 
class (although I also compared the Perl5Util and the RE classes with similar 
results). I'm attaching a (slightly modified) copy of the code on which I 
based my previous speed assertions. The modifications, aside from adding 
documentation, amount to removing many of the patterns that I tested. This 
was meant as a highly targeted test of regular expression match times 
(making no comparison of substitution times) and it's possible that a more 
varied selection of patterns will yield different results. I ran this code 
using Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode) on an 800 MHz 
PIII with 256 MB of RAM.


I believe that I am using both the Regexp and ORO packages in an efficient
manner, but please let me know if this is not the case.


Thanks,

Ed.


Disclaimer: The attached code was written as a quick test and does not 
  necessarily reflect my best judgment in coding practices. Also, I used 
  synchronized blocks in an effort to ensure that Java's dynamic class loading 
  didn't cause erroneous timing results. If this is completely off-base, please 
  let me know of a better approach.
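
  One possible alternative to the synchronized blocks is an untimed warm-up
  pass, so that class loading and initial compilation happen before the clock
  starts. The following is only a minimal sketch of that idea, not the code
  behind the numbers above. It assumes java.util.regex as a stand-in for the
  RE and Perl5Matcher calls (so that it compiles without the Jakarta jars),
  and the names WarmupTimingSketch, countMatches, and timeCycles are
  hypothetical.

```java
import java.util.regex.Pattern;

/**
 * Sketch only: java.util.regex stands in for the RE and Perl5Matcher
 * calls so that this compiles without the Jakarta jars.
 */
final class WarmupTimingSketch {

    /** Counts how many inputs contain a match for the compiled pattern. */
    static int countMatches( Pattern pattern , String [ ] inputs ) {
        int hits = 0;
        for ( int i = 0 ; i < inputs.length ; i++ ) {
            if ( pattern.matcher( inputs[ i ] ).find( ) ) {
                hits++;
            }
        }
        return hits;
    }

    /**
     * Runs untimed warm-up cycles first, then returns the elapsed
     * nanoseconds for the timed cycles only.
     */
    static long timeCycles( Pattern pattern , String [ ] inputs ,
                            int warmupCycles , int timedCycles ) {
        int sink = 0;

        // Untimed: lets class loading and JIT compilation settle
        for ( int k = 0 ; k < warmupCycles ; k++ ) {
            sink += countMatches( pattern , inputs );
        }

        long start = System.nanoTime( );
        for ( int k = 0 ; k < timedCycles ; k++ ) {
            sink += countMatches( pattern , inputs );
        }
        long elapsed = System.nanoTime( ) - start;

        // Use the accumulated result so the loops are not optimized away
        if ( sink < 0 ) {
            System.err.println( sink );
        }
        return elapsed;
    }

    public static void main( String [ ] args ) {
        // The pattern is compiled once and reused for every match
        Pattern digits = Pattern.compile( "\\d+" );
        String [ ] inputs = { "This 1 has a digit" ,
                              "This one doesn't have digits" };

        long nanos = timeCycles( digits , inputs , 1000 , 1000 );
        System.out.println( "Timed cycles took " + nanos + " nanoseconds" );
    }
}
```

  The same shape should carry over to the attached test by swapping the body
  of countMatches for the RE or Perl5Matcher call being measured. How many
  warm-up cycles are enough is a guess and probably depends on the VM.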


On Thu, 10 May 2001 "Daniel F. Savarese" <[EMAIL PROTECTED]> wrote:
> 
> 
> >Basically, the regexp package is smaller and has a reduced feature set.
> >In fact, the regexp package jar file is less than half the size of the oro
> >package jar.
> 
> The feeling that jakarta-oro is large is a common misconception.  The size
> of what used to be OROMatcher is very small.  All you need for
> regular expressions is the org.apache.oro.text.regex package, not all
> of the other stuff.  To alleviate this misconception, we're going to
> provide a jakarta-oro jar that has everything and then separate jars for
> strictly those slices that people want, roughly corresponding to the old
> OROMatcher, PerlTools, AwkTools, and TextTools packages.
> 
> >Initially, regexp handles matching (and rejecting matches) more quickly. But,
> >after a few hundred matches, the time required by the regexp package
> >(especially in rejecting matches) increases considerably when compared to
> >the oro package.
> 
> This is also another misconception, although not directly in relation to
> the regexp package.  The jakarta-oro package has 4 different regular
> expression packages.  So when you compare performance, you have to
> specify which one.  Also, a lot of times people talk about jakarta-oro
> when they really mean the Perl5Util class, which is a convenience
> wrapper around the org.apache.oro.text.regex package.  Perl5Util will
> always be slow (although we can improve its performance) because it
> does a higher level set of parsing so that you can use Perl-specific
> syntactic sugar like 's/foobar/barfoo/g' instead of the allegedly
> more cumbersome approach of directly using the org.apache.oro.text.regex
> classes.  Furthermore, most people blatantly misuse the
> org.apache.oro.text.regex package by constantly reinstantiating
> Perl5Compiler and Perl5Matcher instances and constantly recompiling
> regular expressions.  Hopefully this will stop after we write a new
> user's guide explaining how to make proper use of the package.
> A valid performance comparison can only be made by posting the code used
> to make the comparison.  I don't know how you reached the assessment you
> made.  All performance evaluation code is welcome on oro-dev because
> even though the primary goal for at least the Perl related stuff is to
> achieve compatibility with Perl, the secondary goal is to be as fast
> as possible within the constraints of Perl's regex syntax and Java's
> runtime performance.
> 
> daniel
> 
> 

import org.apache.regexp.RE;
import org.apache.regexp.RESyntaxException;

import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.PatternMatcherInput;
import org.apache.oro.text.regex.MalformedPatternException;


/**
 * Compare the speed of the Jakarta regexp and oro packages
 * using the org.apache.regexp.RE class and the
 * org.apache.oro.text.regex.Perl5Pattern class.
 *
 * @author Ed Chidester
 */
public class TestREvPerl5Pattern {

    /**
     * An array of Perl-syntax regular expression strings for testing.
     * This array should have the same number of elements as the
     * {@link #matchString__ first matching string} array and the
     * {@link #failString__  first failing  string} array.
     * Also, the number of elements in this array should equal the number of
     * rows in the {@link #matchString2__ second matching string}
     * array and the number of rows in the
     * {@link #failString2__ second failing string} array.
     */
    private static String [ ] reString__    = {
            "/\\bmatch/i"                                          ,
            "/\\d+/"                                               ,
            // -------------------------------
            // Insert more array elements here
            // -------------------------------
        };

    /**
     * An array of matching strings for testing.
     * This array is used for the first round of timing test results in the
     * {@link #main main testing method}.
     * Each element will match against the corresponding element in
     * the {@link #reString__ regular expression array}.
     */
    private static String [ ] matchString__ = {
            "This is a matching string"                            ,
            "This 1 has a digit"                                   ,
            // -------------------------------
            // Insert more array elements here
            // -------------------------------
        };

    /**
     * An array of failing strings for testing.
     * This array is used for the first round of timing test results in the
     * {@link #main main testing method}.
     * Each element will fail to match against the corresponding element in
     * the {@link #reString__ regular expression array}.
     */
    private static String [ ] failString__  = {
            "This is a failing string"                             ,
            "This one doesn't have digits"                         ,
            // -------------------------------
            // Insert more array elements here
            // -------------------------------
        };

    /**
     * An array of multiple matching strings for testing.
     * This array is used for the second round of timing tests in the
     * {@link #main main testing method}.
     * Each row holds Strings that will match the corresponding element
     * in the {@link #reString__ regular expression array}.
     */
    private static String [ ] [ ] matchString2__ = {
            { "This string matches"                                ,
              "There is also a match in this string"               ,
              "How many matches should a match string have?"       ,
              "Many smokers use a match to light their cigarettes" ,
              "This is a matching string"                          ,
                } ,
            { "12345"                                              ,
              "867-5349"                                           ,
              "2 is not the loneliest number"                      ,
              "This one's 4 you"                                   ,
              "This 1 has a digit"                                 ,
                } ,
            // -------------------------------
            // Insert more array elements here
            // -------------------------------
        };

    /**
     * An array of multiple failing strings for testing.
     * This array is used for the second round of timing tests in the
     * {@link #main main testing method}.
     * Each row holds Strings that will not match the corresponding element
     * in the {@link #reString__ regular expression array}.
     */
    private static String [ ] [ ] failString2__  = {
            { "This string fails"                                       ,
              //               "nonmatch" <--- RE doesn't work for '\b'
              "There is also a fail  in this string"                    ,
              "How many failures should a failing string have?"         ,
              "Many smokers use a lighter to light their cigarettes"    ,
              "This is a failure-full string"                           ,
                } ,
            { "one"                                                     ,
              "phone number"                                            ,
              "two is not the loneliest number"                         ,
              "This one's for you"                                      ,
              "This one has a digit"                                    ,
                } ,
            // -------------------------------
            // Insert more array elements here
            // -------------------------------
        };

    /**
     * <p>
     * Main test method comparing the Regexp and ORO packages.
     * There are two testing rounds. The first round retrieves matching and
     * match failure timing results for both the Regexp and ORO packages
     * (within the same loop). The second round retrieves separate match and
     * failure times individually for both packages.
     * </p>
     *
     * <p>
     * Takes the number of cycles to run through as its command-line argument.
     * If no command line argument is given, then the default of 100 cycles
     * is used.
     * </p>
     */
    public static void main( String [ ] args ) {

        //-------------------------------------------------------------------
        // NOTE: Many blocks of code are not indented
        //       (synchronized and for loops)
        // -- This is to keep subsequent lines of code from getting too long
        //-------------------------------------------------------------------

        int        i;
        int        k;
        int        x;
        int        numCycles      = 100;
        int        size           = reString__.length;
        int        reMatchFlags;
        int        oroMatchFlags;
        int        firstIndex;
        int        lastIndex;
        int        finalIndex;
        long       reStartTime    = 0;
        long       reStopTime     = 0;
        long       oroStartTime   = 0;
        long       oroStopTime    = 0;
        long       reMatchStart2  = 0;
        long       reMatchStop2   = 0;
        long       oroMatchStart2 = 0;
        long       oroMatchStop2  = 0;
        long       reFailStart2   = 0;
        long       reFailStop2    = 0;
        long       oroFailStart2  = 0;
        long       oroFailStop2   = 0;
        RE      [] reObject       = new RE      [ size ];
        Pattern [] oroObject      = new Pattern [ size ];
        Perl5Matcher  oroMatcher;
        Perl5Compiler oroCompiler;

        if ( args.length == 1 ) {
            try {
                numCycles = Integer.parseInt( args[ 0 ] );
            }
            catch ( NumberFormatException e ) {
                System.err.println( "Default of 100 used for number of cycles");
            }
        }

        if ( matchString__.length < size ) {
            size = matchString__.length;
        }
        if ( failString__.length  < size ) {
            size = failString__.length;
        }

        try {
            // Initialize the Oro objects
            oroMatcher  = new Perl5Matcher( );
            oroCompiler = new Perl5Compiler( );

            synchronized ( reString__    ) {
            synchronized ( matchString__ ) {
            synchronized ( failString__  ) {

            // ---------------------------------------------
            // Initialize all the regular expression objects
            // (both RE and ORO package objects)
            // ---------------------------------------------
            for ( i = 0 ; i < size ; i++ ) {
                reMatchFlags  = RE.MATCH_NORMAL;
                oroMatchFlags = Perl5Compiler.DEFAULT_MASK;
                firstIndex    = reString__[ i ].indexOf(     '/' );
                lastIndex     = reString__[ i ].lastIndexOf( '/' );
                finalIndex    = reString__[ i ].length( );
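                if ( lastIndex <= firstIndex ) {
                    System.err.println( "Error reading regular expression \""
                                        + reString__[ i ] + "\"" );
                    reObject[ i ] = new RE( reString__[ i ] );
                    // Use the whole string for the ORO pattern as well, so
                    // that both packages compile the same expression
                    firstIndex = -1;
                    lastIndex  = reString__[ i ].length( );
                }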
                else {

                    // // Testing printout...
                    // System.out.println( "Creating RE( \""
                    //    + reString__[ i ].substring( firstIndex + 1 ,
                    //                                 lastIndex      )
                    //    + "\" )" );

                    reObject[ i ] =
                        new RE( reString__[ i ].substring( firstIndex + 1 ,
                                                      lastIndex      ) );
                }
                // Account for any global or case insensitive matches
                for ( int j = lastIndex + 1 ; j < finalIndex ; j++ ) {
                    if      ( reString__[ i ].charAt( j ) == 'i' ) {
                        // // Testing printout...
                        // System.out.println( "Case independent\t"
                        //                     + reString__[ i ] );

                        reMatchFlags  |= RE.MATCH_CASEINDEPENDENT;
                        oroMatchFlags |= Perl5Compiler.CASE_INSENSITIVE_MASK;
                    }
                    else if ( reString__[ i ].charAt( j ) == 'm' ) {
                        // // Testing printout...
                        // System.out.println( "Multiline match\t"
                        //                     + reString__[ i ] );

                        reMatchFlags  |= RE.MATCH_MULTILINE;
                        oroMatchFlags |= Perl5Compiler.MULTILINE_MASK;
                    }
                    // @UPDATE:  As of Feb. 2001 (jakarta regexp 1.2)
                    //           RE.MATCH_SINGLELINE is not allowed by
                    //           RE.setMatchFlags
                    // else if ( reString__[ i ].charAt( j ) == 's' ) {
                    //     reMatchFlags  |= RE.MATCH_SINGLELINE;
                    //     oroMatchFlags |= Perl5Compiler.SINGLELINE_MASK;
                    // }
                    else {
                        System.err.println( "Regular expression option \"/"
                                            + reString__[ i ].charAt( j )
                                            + "\" is being ignored" );
                    }
                } // End for j
                reObject[ i ].setMatchFlags( reMatchFlags );
                oroObject[ i ] = oroCompiler.compile(
                    reString__[ i ].substring( (firstIndex + 1) , lastIndex ) ,
                    oroMatchFlags                                             );
            } // End for i

            } // End synchronized failString__
            } // End synchronized matchString__
            } // End synchronized reString__

            // ---------------------------
            // Begin initial testing round
            // ---------------------------

            // Make sure that everything has caught up
            // Runtime.getRuntime( ).exec( "sleep 10 " );
            // Runtime.getRuntime( ).exec( "pause" );
            synchronized ( matchString__ ) {
            synchronized ( failString__  ) {

            // Test the time taken by RE objects
            reStartTime = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                if ( ! reObject[ i ].match( matchString__[ i ] ) ) {
                    System.err.println( "Error with RE match[ " + i + " ]" );
                    System.err.println( reString__[    i ] );
                    System.err.println( matchString__[ i ] );
                }
                if (   reObject[ i ].match( failString__[  i ] ) ) {
                    System.err.println( "Error with RE  fail[ " + i + " ]" );
                    System.err.println( reString__[   i ] );
                    System.err.println( failString__[ i ] );
                }
            } // End for i
            } // End for k
            reStopTime = System.currentTimeMillis( );
            } // End synchronized failString__
            } // End synchronized matchString__

            // Make sure that everything has caught up
            // Runtime.getRuntime( ).exec( "sleep 10 " );
            // Runtime.getRuntime( ).exec( "pause" );
            synchronized ( matchString__ ) {
            synchronized ( failString__  ) {

            // Test the time taken by oro objects
            oroStartTime = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                PatternMatcherInput pmiMatch =
                    new PatternMatcherInput( matchString__[ i ] );

                PatternMatcherInput pmiFail  =
                    new PatternMatcherInput( failString__[  i ] );

                if ( ! oroMatcher.contains( pmiMatch , oroObject[ i ] ) ) {

                    System.err.println( "Error with Perl5 match[" + i + "]" );
                    System.err.println( reString__[    i ] );
                    System.err.println( matchString__[ i ] );
                }
                if (   oroMatcher.contains( pmiFail  , oroObject[ i ] ) ) {

                    System.err.println( "Error with Perl5  fail[" + i + "]" );
                    System.err.println( reString__[   i ] );
                    System.err.println( failString__[ i ] );
                }
            } // End for i
            } // End for k
            oroStopTime = System.currentTimeMillis( );
            } // End synchronized failString__
            } // End synchronized matchString__

            // --------------------------
            // Begin second testing round
            // --------------------------

            synchronized ( matchString2__ ) {
            reMatchStart2 = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                for ( x = 0 ; x < matchString2__[ i ].length ; x++ ) {
                    if ( ! reObject[ i ].match( matchString2__[ i ][ x ] ) ) {
                        System.err.println( "Error with RE  match[ "
                                            + i + " ][ " + x + " ]"     );
                        System.err.println( reString__[     i ]         );
                        System.err.println( matchString2__[ i ][ x ]    );
                    }
                }
            } // End for i
            } // End for k
            reMatchStop2  = System.currentTimeMillis( );
            } // End of synchronized matchString2__

            synchronized ( matchString2__ ) {
            oroMatchStart2 = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                for ( x = 0 ; x < matchString2__[ i ].length ; x++ ) {
                    PatternMatcherInput pmiObj =
                        new PatternMatcherInput( matchString2__[ i ][ x ] );

                    if ( ! oroMatcher.contains( pmiObj , oroObject[ i ] ) ) {
   
                        System.err.println( "Error with Perl5  match[ "
                                            + i + " ][ " + x + " ]"     );
                        System.err.println( reString__[     i ]         );
                        System.err.println( matchString2__[ i ][ x ]    );
                    }
                }
            } // End for i
            } // End for k
            oroMatchStop2  = System.currentTimeMillis( );
            } // End synchronized matchString2__
   
            synchronized ( failString2__  ) {
            reFailStart2 = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                for ( x = 0 ; x < failString2__[ i ].length ; x++ ) {
                    if (   reObject[ i ].match( failString2__[ i ][ x ] ) ) {
                        System.err.println( "Error with RE  fail[ "
                                            + i + " ][ " + x + " ]"    );
                        System.err.println( reString__[    i ]         );
                        System.err.println( failString2__[ i ][ x ]    );
                    }
                }
            } // End for i
            } // End for k
            reFailStop2  = System.currentTimeMillis( );
            } // End synchronized failString2__
   
            synchronized ( failString2__  ) {
            oroFailStart2 = System.currentTimeMillis( );
            for ( k = 0 ; k < numCycles ; k++ ) {
            for ( i = 0 ; i < size ; i++ ) {
                for ( x = 0 ; x < failString2__[ i ].length ; x++ ) {
                    PatternMatcherInput pmiObj =
                        new PatternMatcherInput( failString2__[ i ][ x ] );
   
                    if (   oroMatcher.contains( pmiObj , oroObject[ i ] ) ) {
   
                        System.err.println( "Error with Perl5  fail[ "
                                            + i + " ][ " + x + " ]"    );
                        System.err.println( reString__[    i ]         );
                        System.err.println( failString2__[ i ][ x ]    );
                    }
                }
            } // End for i
            } // End for k
            oroFailStop2  = System.currentTimeMillis( );
            } // End synchronized failString2__

        }
        catch ( Exception e ) {
            System.err.println( e + " caught while running test" );
            e.printStackTrace( System.err );
            System.exit( 1 );
        }

        // -----------------
        // Print the results
        // -----------------

        System.out.println( "RE           Objects took "
                            + ( reStopTime - reStartTime )
                            + " milliseconds" );
        System.out.println( "Perl5Pattern Objects took "
                            + ( oroStopTime - oroStartTime )
                            + " milliseconds" );

        System.out.println( "\n\n" );
        System.out.println(   "reMatch  Time = "
                            + ( reMatchStop2 - reMatchStart2 )
                            + " milliseconds\n"
                            + "reFail   Time = "
                            + ( reFailStop2 - reFailStart2 )
                            + " milliseconds" );
        System.out.println(   "oroMatch Time = "
                            + ( oroMatchStop2 - oroMatchStart2 )
                            + " milliseconds\n"
                            + "oroFail  Time = "
                            + ( oroFailStop2 - oroFailStart2 )
                            + " milliseconds" );
    } // End main method
    
} // TestREvPerl5Pattern class
