Warning: this is a long message, a small paper really. The synopsis is
that I have created a simple regex compiler with multiple backends to
test proposed regex coding schemes. Details and test results are
below.

Hi all,

One of the upcoming decisions that needs to be made is on the design
of the Parrot regex engine. More specifically, (1) how will compiled
regexes be represented and (2) how will strings be matched against
compiled regexes?

I am relatively new to p6i, but gather there are at least three
approaches: 

   A. Compile regexes down to Parrot bytecode that forms an NFA
      representation of the regex. Calling the code with a string
      parameter matches the regex against that string. Compile using
      only core ops.

   B. Same scheme as (A), but use specialized rx ops as well as core
      ops.  

   C. Create a specialized compiler for regexes that produces a
      specialized abstract bytecode representing the regex. To match,
      feed the bytecode and the string to a specialized engine. Thus
      regexes would be an independent subsystem within parrot. This is
      the approach perl5 takes.

Three metrics one might use to evaluate the possibilities are size of
the bytecode representing the regex, execution speed, and ease of
design/maintenance of the regex compilation code.

Regarding ease of design/maintenance, schemes (A) and (B) have the
advantage of making regex/perl code integration easy compared to
(C). Some people have claimed that the perl5 regex code is complex, a
mess, and quite difficult to understand, so we should not make the
same mistake of implementing a scheme like (C) again. No doubt after
10 years of hacking, the perl5 regex code could use a major
house cleaning. But it is a mistake to think that a rewrite in parrot
bytecode will make the complexity magically disappear. Coding even the
simple implementation below is tricky and debugging the flow of a
recursive or stack-based iterative finite-state automaton will make
your head hurt. Adding code optimizations and metadata to match faster
just redoubles that complexity. So creating (A) or (B) will be a lot
of hard work and without good design and documentation, the resulting
code will be just as difficult to understand as the perl5 code is.
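To give a flavor of why even the simple case is tricky, here is a
minimal recursive backtracking matcher in plain C, modeled on Rob
Pike's well-known example. It handles only ^, $, ., * and literal
characters, and is an illustrative sketch, not any of the backends
described below:

```c
/* Minimal recursive backtracking matcher (after Rob Pike's classic
   example).  Illustrative only; supports ^, $, ., * and literals. */

static int matchhere(const char *re, const char *text);

/* matchstar: search for c*re at beginning of text */
static int matchstar(int c, const char *re, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: search for re at beginning of text */
static int matchhere(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1);
    return 0;
}

/* match: search for re anywhere in text */
int match(const char *re, const char *text)
{
    if (re[0] == '^')
        return matchhere(re + 1, text);
    do {    /* must look even if string is empty */
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

Even at this scale, the control flow through matchstar's loop and the
mutual recursion is easy to get wrong; add capturing, classes and
optimizations and the complexity compounds.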

But the results below are about bytecode size and execution speed.

To compare schemes (A), (B), and (C), I implemented each of them. I
created a common parser for the three schemes that recognizes the
basics: ^, $, ., a, \a, [a], [^a], ab, a|b, a*, a+, a? and (a).  The
parsed regex can be converted to one of three types of bytecode: 

   A. 'reg'  - Parrot bytecode, just core ops
   B. 'rx'   - Parrot bytecode, core and rx ops
   C. 'spen' - a specialized bytecode used by the execution engine in
               Henry Spencer's original regex package. His package
               formed the basis for the perl5 regex engine.

For the reg backend, I made use of assembly idioms used in Steve
Fink's Regex package. For the rx backend I used idioms suggested by
the rx pod.  For the spen backend, I generated bytecode that exactly
matched Spencer's bytecode, so that I could use his regexec routines
directly for execution.  Although not entirely bug-free, all three
backends pass the match/no-match portion of the Spencer test suite, so
they have a basic level of functionality. The code is available at

   http://keck.ucsf.edu/~kvale/simple_rx.tar.gz

and is meant to be unpacked in the languages/ directory.

To level the playing field for the three backends, I opted to include
no optimizations whatsoever: no regex rewriting, no 'must match'
strings, etc. Each backend is a straight translation of the regex.  I
did try to optimize emitted reg and rx bytecode for execution speed
and concision. Both parrot and the Spencer code were compiled with -O. 
We will compare with perl5 as well, which is most definitely not
fair, but it shows the goals we need to shoot for.

Here are bytecode sizes (in bytes) for various regexes:

regex                              reg      rx    spen

''                                 236     128      10
'a'                                300     172      12
'(bc+d$|ef*g.|h?i(j|k[^lmnop]))'  1860    1168     114

Overall, rx code is about 10-15 times as large as spen code and reg
code is 1.5-2 times as large as rx code.  The reason that reg and rx
codes are so much larger than spen code is that (1) they are complete
parrot subroutines, whereas spen code is just a set of instructions to
a separate engine, and (2) parrot bytecode is general purpose and thus
more verbose than a specialized coding scheme like that for spen.  rx
cuts down on code size relative to reg, thanks to specialized
instructions combining several tasks into one.

Execution time is perhaps the most crucial metric.  perl5 excels at
searching large amounts of text quickly; most users would not want to
give that up.

We'll model the execution time of a regex test as follows:

   t[i] = co + n * (lo + r[i])

with

   t[i] = execution time for the whole test of regex i
     co = a constant overhead, approximately independent of regex
      n = number of repetitions of the match
     lo = an overhead per match, approximately independent of regex
   r[i] = time taken by just the regex code for a single match

co is dependent on the test harness used and is uninteresting.  lo
takes into account the allocation, initialization and cleanup time
needed for any single regex match and is relevant for us; each regex
takes at the very least lo time. r[i] is the time taken to do the
actual match against a string and is dependent on both the string and
the regex.

We can solve for co by testing against a fixed regex and string, and
varying the number of repetitions. Given co, we can solve for lo by
keeping repetitions fixed and varying the size of a repetitive regex
and string. For instance, in each of the backends, matching regex
'aaa' against string 'aaa' will take about three times as long as
matching regex 'a' against string 'a'.  For a regex-string combination
that can be made repetitive, the model becomes

   t[i] = co + n * (lo + m*ro[i])

with m the number of repetitive elements in the regex and ro[i] the
per-unit matching time.
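As a sanity check on the algebra, the following C sketch recovers co,
lo and ro from timings generated by the model itself. The numbers fed
to it are synthetic, not measurements from the backends:

```c
/* Separate co, lo and ro under the model t = co + n*(lo + m*ro).
   Step 1: fix the regex (m = 1) and vary n; the slope of t vs n is
   lo + ro and the intercept is co.
   Step 2: fix n, vary m, and fit (t - co)/n vs m; the slope is ro
   and the intercept is lo. */

/* Model timing for parameters co, lo, ro, repetitions n, regex size m. */
static double model_t(double co, double lo, double ro, double n, double m)
{
    return co + n * (lo + m * ro);
}

/* Fit y = a + b*x exactly through two samples (x1,y1), (x2,y2). */
static void fit2(double x1, double y1, double x2, double y2,
                 double *a, double *b)
{
    *b = (y2 - y1) / (x2 - x1);
    *a = y1 - *b * x1;
}

/* Recover co, lo, ro from three timings: t at (n1, m=1), (n2, m=1)
   and (n1, m=3). */
static void recover(double t_n1, double t_n2, double n1, double n2,
                    double t_m3, double *co, double *lo, double *ro)
{
    double slope;                        /* = lo + ro at m = 1 */
    fit2(n1, t_n1, n2, t_n2, co, &slope);
    fit2(1.0, (t_n1 - *co) / n1, 3.0, (t_m3 - *co) / n1, lo, ro);
}
```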

Matching regex 'a' against string 'a' we get the following times on my
Lintel notebook: (All times are in seconds)

"a" =~ /a/:

backend   lo            ro             lo/ro     ro/spen ro

reg:      4.518e-06     1.935e-07      23.3      12.8
rx:       1.621e-06     5.387e-07       3.0      35.6
spen:     5.477e-07     1.509e-08      36.2       1.0
perl:     7.897e-07     1.267e-08      62.3       0.84

The fixed, per-match overhead lo is 3 times larger for rx than spen
and 8 times larger for reg than spen. The per-character match time ro
is 36 times larger for rx than for spen and 13 times larger for reg
than for spen. That the per-character time for rx is so large seems
counterintuitive to me. On the other hand, the rx_literal code

op rx_literal(in pmc, in str, inconst int) {
        UINTVAL i;
        RX_dUNPACK($1);
        if(string_length(rx->string) < rx->index+string_length($2)) {
                goto OFFSET($3);
        }
        /* This is faster than using substr--it avoids memory allocations. */
        for(i=0; i < string_length($2); i++) {
                if(   string_ord(rx->string, rx->index+(INTVAL)i) 
                   != string_ord($2, (INTVAL)i)) {
                        goto OFFSET($3);
                }
        }
        RxAdvanceX(rx, string_length($2));
        goto NEXT();
} 

computes several quantities dynamically, such as the length of the
literal match string, that are precomputed in the reg code...
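One plausible micro-optimization is to hoist the repeated
string_length($2) calls out of the loop. The sketch below shows the
same comparison with the length computed once; string_length and
string_ord here are plain-C stand-ins for Parrot's string functions,
not the real API:

```c
#include <string.h>

/* Plain-C stand-ins for Parrot's string_length/string_ord. */
static unsigned string_length(const char *s) { return (unsigned)strlen(s); }
static int string_ord(const char *s, unsigned i) { return (unsigned char)s[i]; }

/* Return 1 if lit matches in subject at position index, else 0. */
int literal_matches(const char *subject, unsigned index, const char *lit)
{
    unsigned i;
    unsigned litlen = string_length(lit);        /* hoisted: computed once */
    if (string_length(subject) < index + litlen)
        return 0;
    for (i = 0; i < litlen; i++)                 /* no per-iteration call */
        if (string_ord(subject, index + i) != string_ord(lit, i))
            return 0;
    return 1;
}
```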

Note that the overhead/match ratio lo/ro is large in the reg, spen and
perl routines. In the reg code, I think this is due to the allocation
of two PerlArrays for subexpression indices. In the spen and perl
code, it is due to data structure initialization and optimization
checks.

Here are the times for two other atoms:

"a" =~ /./
             lo              ro           ro/spen 
reg:         4.423e-06       2.18e-07     5.4
rx:          1.522e-06       1.907e-07    4.7
spen:        4.839e-07       4.063e-08    1.0
perl:        8.945e-07       2.81e-08     0.7

The disparities between spen and reg and rx are more modest here.

"a" =~ /[A-Za-z0-9_]/
             lo              ro           ro/spen
reg:         4.389e-06       9.487e-07     3.1
rx:          1.599e-06       1.184e-05    39.2
spen:        3.758e-07       3.014e-07     1.0
perl:        9.922e-07       3.75e-08      0.12

spen slows down by calling strchr, so reg is only three times slower.
Clearly, rx_oneof could be optimized for multi-character classes.

For the '+' quantifier, we get

"a"x1000 =~ /a+/

       lo + r
reg:   8.07e-04
rx:    8.46e-04
spen:  3.92e-06
perl:  7.56e-06

Above we found lo/ro ratios of 3-60, so for a string of length 1000
most of the time is spent in the match, not the overhead.  Both perl
and spen do a greedy match using pointer arithmetic in a tight C
loop. rx and reg both do a branch and a save to stack for each
iteration, which slows them down by an amazing factor of 200 relative
to spen. This extreme factor is partly due to a deficit in my code;
spen differentiates between simple and complex atoms in the presence
of * or + and optimizes the execution accordingly. If I did the same,
I could eliminate a stack save per iteration.
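The tight loop that spen and perl5 use for a simple atom under * or +
can be sketched in C as follows; greedy_count is a hypothetical
helper, not code from either engine:

```c
/* Greedy match of a simple single-character atom: advance a pointer
   as far as possible in one tight loop and record only the final
   count.  Backtracking then decrements the count, rather than
   popping one stack entry per character. */

/* Return the number of leading characters of text equal to c. */
static long greedy_count(const char *text, char c)
{
    const char *p = text;
    while (*p == c)        /* one compare and one increment per char */
        p++;
    return (long)(p - text);
}
```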

The following match induces a modest amount of backtracking:

("ab"x10000)."c" =~ /abc/

      r          r/spen
reg:  1.62e-02   2.2
rx:   2.52e-02   3.4
spen: 7.35e-03   1.0
perl: 2.47e-04   0.034

Recursion is typically slower than iteration with a stack, so that reg
and rx pull within a factor of 2-3 of spen. perl5 achieves a factor of
30 speedup by noting that abc is a required substring and searching in
an efficient fashion for that string.
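The effect of that optimization can be sketched with the C library
alone; perl5's actual scan is more sophisticated (a Boyer-Moore-style
search), and must_match_offset is a hypothetical helper:

```c
#include <string.h>

/* Must-match optimization sketch: if the regex requires a literal
   substring somewhere, scan for it before (or instead of) running
   the full engine.  Return its offset, or -1 if no match is
   possible anywhere in text. */
static long must_match_offset(const char *text, const char *lit)
{
    const char *hit = strstr(text, lit);
    return hit ? (long)(hit - text) : -1;
}
```

On the ("ab"x10000)."c" test, a single scan like this replaces 10000
failed attempts by the general engine.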

Let's now look at performance under heavy backtracking. We'll use a
'naughty' regex with nested indeterminate quantifiers, which generate
an exponential number of partitions of the string in their search for
a match.

"bbbbXcXaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/

            r
reg         6.64
rx          4.99
spen        3.89
perl        1.13e-04

In this test, rx is only 28% slower than spen. reg and rx have the
same flow control, so the advantage in rx must be in its private
integer stack, vs the general user stack that reg uses. perl5 uses
optimization to bypass exponential blowup and trounce the competition.
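If that guess is right, the win comes from something like the
following: pushing a raw integer onto a private stack is one store
and one increment, with no type tag or boxing as a general-purpose
stack entry would need. A minimal sketch (IntStack is hypothetical,
not the actual rx structure):

```c
/* Fixed-size private integer stack, in the spirit of the rx ops'
   backtracking stack.  No bounds checking, no type tags; a sketch,
   not the real implementation. */
#define RX_STACK_MAX 4096

typedef struct {
    int data[RX_STACK_MAX];
    int top;                 /* index of next free slot */
} IntStack;

static void istack_push(IntStack *s, int v) { s->data[s->top++] = v; }
static int  istack_pop(IntStack *s)         { return s->data[--s->top]; }
```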

From the data, we can make several conclusions:

 1. parrot bytecode is 10-25 times larger than the more specialized
    Spencer bytecode.

 2. Both the reg and rx backends produce code that is generally 3-35
    times slower than spen code fed into a specialized engine.

 3. rx ops can help reduce bytecode size, but often run slower than the
    equivalent core code.

 4. One area in which reg and rx are relatively competitive is
    backtracking. Iteration, coupled with a stack, provides efficient
    backtracking relative to a recursive approach.

 5. We already knew this, but all those optimizations in perl5 are a
    big win performance-wise over the Spencer engine.

Caveats:

 1. I've tried to write short, efficient code for the reg and rx
    backends. But I am new to Parrot assembly and could have done
    something seriously suboptimal. I welcome folks looking at and
    playing with the reg and rx code to see if there are faster idioms
    that can be machine generated.

 2. These are early days for parrot optimization, and it may be that
    parrot-bytecode-based regex engines can narrow the gap against
    specialized regex engines.

In my opinion, however, factors of 3-10 or more in execution time will
be hard to overcome. Both the code size and execution time data
strongly favor scheme (C), building an independent specialized
bytecode engine into parrot.

            -Mark
