RE: Regex bytecode bakeoff

Mark Kvale Mon, 06 May 2002 22:35:35 -0700

In the interests of brevity, I cut huge tracts of quoted text.

On Mon, 6 May 2002, Brent Dax wrote:


> (FYI, I'm the main author of Parrot's rx package.)
> 

I'm pleased to meet you.

> Mark Kvale:
> # Warning: this is a long message, a small paper really. The 
....
> # So creating (A) or (B) will be a lot of hard work and without 
> # good design and documentation, the resulting code will be 
> # just as difficult to understand as the perl5 code is.
> 
> They've tried to rewrite the code, and they always lose tons of speed in
> the process.  Perl's regex engine is coded to be fast, which makes it
> ugly.  The hope is that a completely different approach will keep us
> pretty close to Perl speed without all the cruft of Perl's engine.
> 

I suppose you are referring to projects like pcre and regexpp? Their
code looks clean by comparison, but they have implemented few
optimizations. Perhaps there is a correlation :-) 

> # But the results below are about bytecode size and execution speed.
> # 
....
> # 
> # Matching regex 'a' against string 'a' we get the following 
> # times on my Lintel notebook: (All times are in seconds)
> # 
> # "a" =~ /a/:
> # 
> # backend   lo            ro             lo/ro     ro/spen ro
> # 
> # reg:      4.518e-06     1.935e-07      23.3      12.8
> # rx:       1.621e-06     5.387e-07       3.0      35.6
> # spen:     5.477e-07     1.509e-08      36.2       1.0
> # perl:     7.897e-07     1.267e-08      62.3       0.84
> # 
> # The fixed, per match overheard lo is 3 times larger for rx 
> # than spen and 8 times larger for reg than spen. The per 
> # character match time ro is 36 times larger for rx than for 
> # spen and 13 times larger for reg than for spen. That the per 
> # character time for rx is so large seems counterintuitive to 
> # me. On the other hand, the rx_literal code
> # 
> # op rx_literal(in pmc, in str, inconst int) {
> #         UINTVAL i;
> #         RX_dUNPACK($1);
> #         if(string_length(rx->string) < rx->index+string_length($2)) {
> #                 goto OFFSET($3);
> #         }
> #         /* This is faster than using substr--it avoids memory 
> # allocations. */
> #         for(i=0; i < string_length($2); i++) {
> #                 if(   string_ord(rx->string, rx->index+(INTVAL)i) 
> #                    != string_ord($2, (INTVAL)i)) {
> #                         goto OFFSET($3);
> #                 }
> #         }
> #         RxAdvanceX(rx, string_length($2));
> #         goto NEXT();
> # } 
> 
> However, my current design plans for rx would make this faster.
> Basically, it would automagically use pointer arithmetic on fixed-width
> encodings and do various other dirty tricks to save time.  (To avoid
> tons of conditionals, it would use a vtable.)
> 

Yeah, encodings like utf8 inevitably slow down matching, with function
calls if nothing else. perl5 regex code is sprinkled liberally with
parallel code for utf8 and ASCII, a big win for the common ASCII
matching case.

> # computes dynamically several functions, such as length of the 
> # regex match string, that are precomputed in the reg code...
> # 
> # Note that the overhead/match ratio lo/ro is large in the reg, 
> # spen and perl routines. In the reg code, I think this is due 
> # to the allocation of two PerlArrays for subexpression 
> # indices. In the spen and perl code, it is due to data 
> # structure initialization and optimization checks.
> 
> It may be disabled in the current code, but rx allocates a pair of
> PerlArrays too.
> 
> # "a" =~ /[A-Za-z0-9_]/
> #              lo              ro           ro/spen
> # reg:         4.389e-06       9.487e-07     3.1
> # rx:          1.599e-06       1.184e-05    39.2
> # spen:        3.758e-07       3.014e-07     1.0
> # perl:        9.922e-07       3.75e-08      0.12
> # 
> # spen slows down by calling strchr, so reg is only three times 
> # slower. Clearly, rx_oneof could be optimized for 
> # multi-character classes.
> 
> rx_oneof is very, very slow because it compiles a bitmap of the
> character class each time, then uses the bitmap for a fast lookup.
> There are two opcodes, rx_makebitmap (?) and rx_oneof_bmp, which can be
> used to precompile the bitmap.
>

Thanks for the suggestion, I'll try those. But creating the bitmap
while executing a match would seem to unnecessarily slow things
down in certain cases. For constant character classes, it should be
possible for the compiler to precompute the bitmap and load it
directly. 

While I am requesting features, it would be nice to have a negated
claracter class op, and it would be nice to know the number of the
highest captured group.

> If you're testing these in a loop (which you almost certainly are),
> there's another trick you can use with rx.  rx_clearinfo can be used to
> reset all the fields in the regex's info structure without allocating
> anything; it might theoretically be possible to code up an entire (real
> world) program that only allocated one info structure.  From my
> experience coding and optimizing rx, memory allocations are killer.
> 

Already implemented, and a good deal faster than allocating and
deallocating each iteration.

> # For the '+' quantifier, we get
> # 
> # "a"x1000 =~ /a+/
> # 
> #        lo + r
> # reg:   8.07e-04
> # rx:    8.46e-04
> # spen:  3.92e-06
> # perl:  7.56e-06
> # 
> # Above we found lo/ro ratios of 3-60, so for a string of 
> # length 1000 most of the time is spent in the match, not the 
> # overhead.  Both perl and spen do a greedy match using pointer 
> # arithmetic in a tight C loop. rx and reg both do a branch and 
> 
> Something that wouldn't be possible in a well-behaved Parrot regex
> engine--a string's internals are supposed to be its own business.  (Of
> course, nobody ever said that the engine had to be well-behaved... :^) )
> 

perl5's engine is intimate with ASCII, utf8 encoding and
(mostly) deals with raw strings. Opaque objects just get in the way...

> # The following match induces a modest amount of backtracking:
> # 
> # ("ab"x10000)."c" =~ /abc/
> # 
> #       r          r/spen
> # reg:  1.62e-02   2.2
> # rx:   2.52e-02   3.4
> # spen: 7.35e-03   1.0
> # perl: 2.47e-04   0.034
> # 
> # Recursion is typically slower than iteration with a stack, so 
> # that reg and rx pull within a factor of 2-3 of spen. perl5 
> # achieves a factor of 30 speedup by noting that abc is a 
> # required substring and searching in an efficient fashion for 
> # that string.
> 
> I always hated that optimization--it made it really hard to benchmark
> against something realistic.
>

Many regexes in beginner/intermediate perl code are simple
strings. Detecting the longest floating substring, or that you could
do a fast BM match is a big win in those cases. I think it is in
general hard to determine what a realistic set of regexes is, but
simple string matching should surely be part of them.
 
> # Caveats:
> # 
> #  1. I've tried to write short, efficient code for the reg and rx
> #     backends. But I am new to Parrot assembly and could have done
> #     something seriously suboptimal. I welcome folks looking at and
> #     playing with the reg and rx code to see if there are faster idioms
> #     that can be machine generated.
> 
> More likely, I'm new to regex engines and have never studied finite
> automata, so I screwed up something in the design of rx.
>

Despite some slow ops, I am really impressed with the quality of the
rx ops. I haven't hit a bug yet!

> #  2. These are early days for parrot optimization, and it may be parrot
> #     bytecode based regex engines can narrow the gap against
> #     specialized regex engines.  
> # 
> # In my opinion, however, factors of 3-10 or more in execution 
> # time will be hard to overcome. Both the code size and 
> # execution time data strongly favor scheme (C), building an 
> # independent specialized bytecode engine into parrot.
> 
> Have you tried the rx and reg versions with exotic settings for Parrot,
> like JIT or prederef?  It would also be interesting (although quite
> difficult I'm sure) to modify Spencer's package to use Parrot strings.
>

I've not played with any exotic settings, it might be worth a
try. Rewriting the spencer regexec code to use Parrot strings would be
difficult, because I'm not familiar with the innards of Parrot
strings. I suspect the result would be a slower engine, especially for
general encodings.  If Parrot strings are to be sacrosanct, there will
be a speed hit.

> --Brent Dax <[EMAIL PROTECTED]>
> @roles=map {"Parrot $_"} qw(embedding regexen Configure)
> 

Thanks for all the comments!

   -Mark

RE: Regex bytecode bakeoff

Reply via email to