Warning: this is a long message, a small paper really. The synopsis is that I have created a simple regex compiler with multiple backends to test proposed regex coding schemes. Details and test results are below.
Hi all,

One of the upcoming decisions that needs to be made is on the design of the Parrot regex engine. More specifically, (1) how will compiled regexes be represented, and (2) how will strings be matched against compiled regexes? I am relatively new to p6i, but gather there are at least three approaches:

A. Compile regexes down to Parrot bytecode that forms an NFA
   representation of the regex. Calling the code with a string parameter
   matches the regex against that string. Compile using only core ops.

B. Same scheme as (A), but use specialized rx ops as well as core ops.

C. Create a specialized compiler for regexes that produces a specialized
   abstract bytecode representing the regex. To match, feed the bytecode
   and the string to a specialized engine. Thus regexes would be an
   independent subsystem within Parrot. This is the approach perl5
   takes.

Three metrics one might use to evaluate the possibilities are the size of the bytecode representing the regex, execution speed, and ease of design/maintenance of the regex compilation code.

Regarding ease of design/maintenance, schemes (A) and (B) have the advantage of making regex/perl code integration easy compared to (C). Some people have claimed that the perl5 regex code is complex, a mess, and quite difficult to understand, so we should not make the same mistake of implementing a scheme like (C) again. No doubt, after 10 years of hacking, the perl5 regex code could use a major house cleaning. But it is a mistake to think that a rewrite in Parrot bytecode will make the complexity magically disappear. Coding even the simple implementation below is tricky, and debugging the flow of a recursive or stack-based iterative finite-state automaton will make your head hurt. Adding code optimizations and metadata to match faster just redoubles that complexity. So creating (A) or (B) will be a lot of hard work, and without good design and documentation, the resulting code will be just as difficult to understand as the perl5 code is.
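To give a flavor of the control flow involved, here is a minimal recursive backtracking matcher in C, after the well-known Kernighan/Pike sketch. It handles only literals, '^', '$', '.' and '*', and it is purely illustrative; it is not code from any of the backends discussed below.

```c
static int matchhere(const char *re, const char *text);

/* matchstar: search for c*re at beginning of text, backtracking by
 * retrying matchhere at each successive position */
static int matchstar(int c, const char *re, const char *text)
{
    do {    /* a '*' matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: search for re at beginning of text */
static int matchhere(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1);
    return 0;
}

/* match: search for re anywhere in text */
int match(const char *re, const char *text)
{
    if (re[0] == '^')
        return matchhere(re + 1, text);
    do {    /* must look even if string is empty */
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

Even in this toy, the backtracking happens implicitly in the recursion of matchstar; an iterative engine has to manage the equivalent state on an explicit stack, which is exactly where the hard-to-debug complexity creeps in.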
But the results below are about bytecode size and execution speed. To compare schemes (A), (B), and (C), I implemented each of them. I created a common parser for the three schemes that recognizes the basics: ^, $, ., a, \a, [a], [^a], ab, a|b, a*, a+, a? and (a). The parsed regex can be converted to one of three types of bytecode:

A. 'reg'  - Parrot bytecode, core ops only
B. 'rx'   - Parrot bytecode, core and rx ops
C. 'spen' - the specialized bytecode used by the execution engine in
            Henry Spencer's original regex package. His package formed
            the basis for the perl5 regex engine.

For the reg backend, I made use of assembly idioms from Steve Fink's Regex package. For the rx backend, I used idioms suggested by the rx pod. For the spen backend, I generated bytecode that exactly matched Spencer's, so that I could use his regexec routines directly for execution. Although not entirely bug-free, all three backends pass the match/no-match portion of the Spencer test suite, so they have a basic level of functionality. The code is available at

    http://keck.ucsf.edu/~kvale/simple_rx.tar.gz

and is meant to be unpacked in the languages/ directory.

To level the playing field for the three backends, I opted to include no optimizations whatsoever: no regex rewriting, no 'must match' strings, etc. Each backend is a straight translation of the regex. I did try to optimize the emitted reg and rx bytecode for execution speed and concision. Both parrot and the Spencer code were compiled with -O. We will compare with perl5 as well, which is most definitely not fair, but it shows the goals we need to shoot for.

Here are some bytecode sizes (in bytes) for various regexes:

    regex                              reg    rx   spen
    ''                                 236   128     10
    'a'                                300   172     12
    '(bc+d$|ef*g.|h?i(j|k[^lmnop]))'  1860  1168    114

Overall, rx code is about 10-15 times as large as spen code, and reg code is 1.5-2 times as large as rx code.
The reason that reg and rx code is so much larger than spen code is that (1) reg and rx code are complete parrot subroutines, whereas spen code is just a set of instructions to a separate engine, and (2) parrot bytecode is general purpose and thus more verbose than a specialized coding scheme like spen's. rx cuts down on code size relative to reg thanks to specialized instructions that combine several tasks into one.

Execution time is perhaps the most crucial metric. perl5 excels at searching large amounts of text quickly; most users would not want to give that up. We'll model the execution time of a regex test as follows:

    t[i] = co + n * (lo + r[i])

with

    t[i] = execution time for the whole test of regex i
    co   = a constant overhead, approximately independent of the regex
    n    = number of repetitions of the match
    lo   = an overhead per match, approximately independent of the regex
    r[i] = time taken by just the regex code for a single match

co is dependent on the test harness used and is uninteresting. lo takes into account the allocation, initialization and cleanup time needed for any single regex match and is relevant for us; each regex takes at the very least lo time. r[i] is the time taken to do the actual match against a string and is dependent on both the string and the regex.

We can solve for co by testing against a fixed regex and string and varying the number of repetitions. Given co, we can solve for lo by keeping the number of repetitions fixed and varying the size of a repetitive regex and string. For instance, in each of the backends, matching regex 'aaa' against string 'aaa' will take about three times as long as matching regex 'a' against string 'a'. For a regex-string combination that can be made repetitive, the model becomes

    t[i] = co + n * (lo + m * ro[i])

with m the number of repetitive elements in the regex and ro[i] the per-unit matching time.
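To make the fitting procedure concrete, here is a small helper of my own (not code from the tarball) that recovers co, lo and ro from three measurements, assuming the model above holds exactly: t1 at (n1, m=1), t2 at (n2, m=1), and t_m2 at (n2, m=m2).

```c
/* Fit t = co + n * (lo + m * ro) from three measurements.
 * Step 1: with m fixed at 1, varying n eliminates co:
 *     per_rep = (t2 - t1) / (n2 - n1) = lo + ro
 * Step 2: with n fixed at n2, varying m separates lo from ro. */
typedef struct { double co, lo, ro; } fit_t;

fit_t fit_model(double t1, long n1, double t2, long n2,
                double t_m2, long m2)
{
    fit_t f;
    /* lo + ro, from the two m = 1 measurements */
    double per_rep = (t2 - t1) / (double)(n2 - n1);
    f.co = t1 - (double)n1 * per_rep;
    /* (t_m2 - co)/n2 = lo + m2*ro; subtracting per_rep leaves (m2-1)*ro */
    f.ro = ((t_m2 - f.co) / (double)n2 - per_rep) / (double)(m2 - 1);
    f.lo = per_rep - f.ro;
    return f;
}
```

For example, synthetic timings generated from co = 2, lo = 0.5, ro = 0.25 (t1 = 77 at n1 = 100, t2 = 752 at n2 = 1000, t_m2 = 1252 at m2 = 3) are recovered exactly; real measurements would of course need averaging over many runs.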
Matching regex 'a' against string 'a', we get the following times on my Lintel notebook (all times are in seconds):

    "a" =~ /a/
              lo         ro       lo/ro   ro/spen
    reg:   4.518e-06  1.935e-07   23.3     12.8
    rx:    1.621e-06  5.387e-07    3.0     35.6
    spen:  5.477e-07  1.509e-08   36.2      1.0
    perl:  7.897e-07  1.267e-08   62.3      0.84

The fixed per-match overhead lo is 3 times larger for rx than for spen and 8 times larger for reg than for spen. The per-character match time ro is 36 times larger for rx than for spen and 13 times larger for reg than for spen. That the per-character time for rx is so large seems counterintuitive to me. On the other hand, the rx_literal op

    rx_literal(in pmc, in str, inconst int) {
        UINTVAL i;
        RX_dUNPACK($1);
        if (string_length(rx->string) < rx->index + string_length($2)) {
            goto OFFSET($3);
        }
        /* This is faster than using substr--it avoids memory
           allocations. */
        for (i = 0; i < string_length($2); i++) {
            if (string_ord(rx->string, rx->index + (INTVAL)i)
                != string_ord($2, (INTVAL)i)) {
                goto OFFSET($3);
            }
        }
        RxAdvanceX(rx, string_length($2));
        goto NEXT();
    }

computes several quantities dynamically, such as the length of the regex match string, that are precomputed in the reg code...

Note that the overhead/match ratio lo/ro is large in the reg, spen and perl routines. In the reg code, I think this is due to the allocation of two PerlArrays for subexpression indices. In the spen and perl code, it is due to data structure initialization and optimization checks.

Here are the times for two other atoms:

    "a" =~ /./
              lo         ro       ro/spen
    reg:   4.423e-06  2.18e-07     5.4
    rx:    1.522e-06  1.907e-07    4.7
    spen:  4.839e-07  4.063e-08    1.0
    perl:  8.945e-07  2.81e-08     0.7

The disparities between spen and reg/rx are more modest here.

    "a" =~ /[A-Za-z0-9_]/
              lo         ro       ro/spen
    reg:   4.389e-06  9.487e-07    3.1
    rx:    1.599e-06  1.184e-05   39.2
    spen:  3.758e-07  3.014e-07    1.0
    perl:  9.922e-07  3.75e-08     0.12

Here spen slows down by calling strchr, so reg is only three times slower. Clearly, rx_oneof could be optimized for multi-character classes.
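As one illustration of what "optimized for multi-character classes" could mean (my own sketch, not code from rx_oneof or any backend): precompute a 256-bit membership bitmap at regex-compile time, so the per-character test becomes two shifts and a mask instead of a linear strchr-style scan over the class.

```c
/* Hypothetical precomputed character-class bitmap: one bit per byte
 * value, built once at compile time, tested in constant time per
 * character at match time. */
typedef struct { unsigned char bits[32]; } char_class;  /* 256 bits */

/* compile time: mark a single byte as a member of the class */
void class_add(char_class *cc, unsigned char c)
{
    cc->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* compile time: mark an inclusive range, e.g. 'A'-'Z' */
void class_add_range(char_class *cc, unsigned char lo, unsigned char hi)
{
    unsigned c;
    for (c = lo; c <= hi; c++)
        class_add(cc, (unsigned char)c);
}

/* match time: constant-time membership test */
int class_has(const char_class *cc, unsigned char c)
{
    return (cc->bits[c >> 3] >> (c & 7)) & 1;
}
```

For [A-Za-z0-9_], the scan-based check touches up to 63 characters per input byte; the bitmap touches one. The cost is 32 bytes per class in the compiled regex, which is cheap even by spen bytecode standards.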
For the '+' quantifier, we get

    "a"x1000 =~ /a+/
            lo + r
    reg:   8.07e-04
    rx:    8.46e-04
    spen:  3.92e-06
    perl:  7.56e-06

Above we found lo/ro ratios of 3-60, so for a string of length 1000, most of the time is spent in the match, not the overhead. Both perl and spen do a greedy match using pointer arithmetic in a tight C loop. rx and reg both do a branch and a save to stack for each iteration, which slows them down by an amazing factor of 200 relative to spen. This extreme factor is partly due to a deficit in my code: spen differentiates between simple and complex atoms in the presence of * or + and optimizes the execution accordingly. If I did the same, I could eliminate a stack save per iteration.

The following match induces a modest amount of backtracking:

    ("ab"x10000)."c" =~ /abc/
               r       r/spen
    reg:   1.62e-02     2.2
    rx:    2.52e-02     3.4
    spen:  7.35e-03     1.0
    perl:  2.47e-04     0.034

Recursion (as spen uses) is typically slower than iteration with an explicit stack, so here reg and rx pull within a factor of 2-3 of spen. perl5 achieves a factor of 30 speedup by noting that abc is a required substring and searching for that string in an efficient fashion.

Let's now look at performance under heavy backtracking. We'll use a 'naughty' regex with nested indeterminate quantifiers, which generate an exponential number of partitions of the string in their search for a match.

    "bbbbXcXaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/
              r
    reg:   6.64
    rx:    4.99
    spen:  3.89
    perl:  1.13e-04

In this test, rx is only 28% slower than spen. reg and rx have the same flow control, so the advantage of rx must lie in its private integer stack, versus the general user stack that reg uses. perl5 uses optimization to bypass the exponential blowup and trounce the competition.

From the data, we can draw several conclusions:

1. parrot bytecode is 10-25 times larger than the more specialized
   Spencer bytecode.

2. Both the reg and rx backends produce code that is generally 3-35
   times slower than spen code fed into a specialized engine.

3.
rx ops can help reduce bytecode size, but they often run slower than
   the equivalent core code.

4. One area in which reg and rx are relatively competitive is
   backtracking. Iteration, coupled with a stack, provides efficient
   backtracking relative to a recursive approach.

5. We already knew this, but all those optimizations in perl5 are a big
   win performance-wise over the Spencer engine.

Caveats:

1. I've tried to write short, efficient code for the reg and rx
   backends. But I am new to Parrot assembly and could have done
   something seriously suboptimal. I welcome folks looking at and
   playing with the reg and rx code to see if there are faster idioms
   that can be machine generated.

2. These are early days for parrot optimization, and it may be that
   parrot-bytecode-based regex engines can narrow the gap against
   specialized regex engines. In my opinion, however, factors of 3-10
   or more in execution time will be hard to overcome.

Both the code size and execution time data strongly favor scheme (C): building an independent, specialized bytecode engine into parrot.

-Mark