Re: Teaching Rakudo the tricks of Perl 5's regex optimiser

Brad Gilbert Tue, 13 Aug 2019 07:33:03 -0700

I would like to point out that regexes in Perl6 are treated as code.
(They can even have parameters and lexical variables.)


It uses the same compiler intrinsics as the rest of Perl6.
It uses the same VM opcodes as regular Perl6 code.

Regexes, string literals, and signatures are just domain specific
sub-languages.

Perl6 regexes are also missing features that are common to other regexes,
instead you just embed regular Perl6 code to handle it.

For example instead of `(?=…)` Perl6 uses a method call to `before` with a
regex argument.

    <.before …>

    use v5;
    'abcd' =~ / . (?=c) . /x # "bc"

    use v6;
    'abcd' ~~ / . <.before( /c/ )> .  / # "bc"
    'abcd' ~~ / . <.before   c   > .  / # "bc" # (exactly identical)

A person could change the code in the `before` method to have it do
something different

This means that the biggest wins in the Perl6 regex system are probably
going to look nothing like the ones in other regex engines.
At least in the short term.
Inlining method calls for instance.
Perhaps the biggest one may be the one about passing around “fates”. (I
barely understand the basics of this.)

So while information on how other projects do regex optimizations can help,
they are generally going to be small wins in comparison.
(That doesn't mean they aren't worth doing.)

Doing some of those optimizations may also make other bigger optimizations
harder to do.
(I'm fairly sure the “fates” optimization would be harder, but I'm not
sure.)

On Tue, Aug 13, 2019 at 4:53 AM Nicholas Clark <n...@ccl4.org> wrote:

> I'm cheating here - I'm using an e-mail message as a way to publish some
> notes
> for Jonathan (or *anyone else interested*):
>
> Jonathan did a talk in Riga, Perl 6 performance update,
> https://perlcon.eu/talk/80
> The (re-uploaded) live stream is at
> https://www.youtube.com/watch?v=S5iVBlk7pdg#t=4h39m
>
> One thing that this talk revealed is that (currently) Perl 5 beats
> Rakudo on various regular expressions. What *I* know is that this is
> because
> Perl 5 cheats - it has an "optimiser", which happens to automatically do
> what jnthn then showed manually implemented in some of his benchmarks.
>
> You can see what the Perl 5 regex optimiser is doing by using the re
> pragma.
> I'm no expert on this thing, but playing with it, I can see that it can do
> various things
>
> 0) Know that it can't do anything useful (it is skipped)
> 1) Know that a fixed string must be present
>    Look for it (which is fast), not find it, and immediately return "no
> match"
> 2) Know that a fixed string must be present
>    Look for it find it, carry on into the real engine
>    (which arguably is slower than no optimiser)
> 3) Know that a fixed string must be present, *and* is the entire match
>    Look for it (which is fast), find it, and immediately return "match".
>
> Clearly cases (1) and (3) are the big wins here. Case (3) was the one I had
> forgotten about - the cheating is so good that often the engine is never
> entered.
>
>
> So, if you run this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =~ /^Perl/ ? "Y" : "n"'
> Compiling REx "^Perl"
> Final program:
>    1: SBOL /^/ (2)
>    2: EXACT <Perl> (4)
>    4: END (0)
> anchored "Perl" at 0 (checking anchored noscan) anchored(SBOL) minlen 4
> Matching REx "^Perl" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   Looking for check substr at fixed offset 0...
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "^Perl"
>
>
> "Final program" describes how the regex ended up being compiled. This isn't
> really of interest here.
>
> What is of interest is the parts between the two lines "Intuit". That's the
> optimiser. For the case I specifically chose here, what we have is that
>
> 1) the optimiser knows that the fixed string "Perl" must be in target
> string
> 2) which it searches for and finds
> 3) at which point it calls the main engine (these lines):
>
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
>
> (that part isn't actually of interest here. I'm just noting it as "this is
> the main engine, and comparable to what Rakudo does).
>
>
> The interesting cases are things like:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =~ /er./ ? "Y" : "n"'
> Compiling REx "er."
> Final program:
>    1: EXACT <er> (3)
>    3: REG_ANY (4)
>    4: END (0)
> anchored "er" at 0 (checking anchored) minlen 3
> Matching REx "er." against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
>    1 <P> <erl rules>         |   0| 1:EXACT <er>(3)
>    3 <Per> <l rules>         |   0| 3:REG_ANY(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "er."
>
>
> You can see that
> 1) the compiler records that the longest fixed string is 'er'
> 2) the optimiser looks for this before even hitting the real engine
> 3) the optimiser tells the real engine that it doesn't even need to
> consider
>    trying to match at string offset 0
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =~ /er/ ? "Y" : "n"'
> Compiling REx "er"
> Final program:
>    1: EXACT <er> (3)
>    3: END (0)
> anchored "er" at 0 (checking anchored isall) minlen 2
> Matching REx "er" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..10] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
> Y
> Freeing REx: "er"
>
>
> The optimiser matches, and knows that if its crib[1] matches, the entire
> regex must match, so it doesn't even call into the engine.
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =~ /Good *, #MoarVM/ ? "Y"
> : "n"'
> Compiling REx "Good *, #MoarVM"
> Final program:
>    1: EXACT <Good> (3)
>    3: STAR (6)
>    4:   EXACT < > (0)
>    6: EXACT <, #MoarVM> (10)
>   10: END (0)
> anchored "Good" at 0 floating ", #MoarVM" at 4..9223372036854775807
> (checking floating) minlen 13
> n
> Freeing REx: "Good *, #MoarVM"
>
> The optimiser knows that the pattern can't match less than 13 characters,
> but
> the offered string is too short.
>
> The optimiser also knows at what offsets the within the target string the
> fixed string must lie, if there is also some variable length stuff, but
> this
> is getting beyond what I know the plan for:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =~ /Perl.*s/ ? "Y" : "n"'
> Compiling REx "Perl.*s"
> Final program:
>    1: EXACT <Perl> (3)
>    3: STAR (5)
>    4:   REG_ANY (0)
>    5: EXACT <s> (7)
>    7: END (0)
> anchored "Perl" at 0 floating "s" at 4..9223372036854775807 (checking
> anchored) minlen 5
> Matching REx "Perl.*s" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 0
>   Found anchored substr "Perl" at offset 0 (rx_origin now 0)...
>   doing 'other' fbm scan, [4..10] gave 9
>   Found floating substr "s" at offset 9 (rx_origin now 0)...
>   (multiline anchor test skipped)
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:EXACT <Perl>(3)
>    4 <Perl> < rules>         |   0| 3:STAR(5)
>                              |   0| REG_ANY can match 6 times out of
> 2147483647...
>    9 <Perl rule> <s>         |   1|  5:EXACT <s>(7)
>   10 <Perl rules> <>         |   1|  7:END(0)
> Match successful!
> Y
> Freeing REx: "Perl.*s"
>
>
> If you're wearing Peril Sensitive Sunglasses, then the code that does this
> is
> the self-recursive Perl_re_intuit_start() in
> https://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c
>
> As a guide to reading that, for things like
>
>         DEBUG_EXECUTE_r(Perl_re_printf( aTHX_
>                               "  String too short...\n"));
>
>
> DEBUG_EXECUTE_r() only generates that debugging output if the command line
> flag -Dr is set, *and* the `use re "debug";` de facto sets -Dr, even on a
> perl binary compiled without all the other -D flags enabled.
>
>
> I know that PCRE also exposes some information it derives from the pattern
> which can be used to pre-match -- see PCRE_INFO_FIRSTCHARACTER and
> PCRE_INFO_REQUIREDCHAR in
> https://www.pcre.org/original/doc/html/pcre_fullinfo.html
>
> I *thought* that Google's re2 library (which came out of code search) could
> return the longest fixed string (if any) as part of its API, but I can't
> find
> that now. In https://swtch.com/~rsc/regexp/regexp4.html it mentions
> building
> trigrams (directly) from parsing the regex (with ANDs and ORs) which is
> what
> some $ork code is doing (but then using PCRE, not re2).
>
> Anyway, I hope that this helps someone, and helps make Rakudo's parser
> faster
>
> I assume that the useful strategy is roughly
>
> 0) Do the most simple thing - no alternations, anchored starts. Reject
> match
>    if not present
> 1) Then 1+ character fixed substrings - Reject match if not present
> 2) Then the big cheat of "string is actually fixed" - compile to a sub that
>    build a match result immediately, instead of the more general engine
> 3) Then the even more general "fixed string as part of larger regex", where
>    failure is a fast-path, but otherwise the regular regex engine follows
> 4) Minimum match length
> 5) The stuff where one starts to tell the main engine how
>
>
> but of course no plan survives contact with the enemy.
>
> Nicholas Clark
>
> 1: https://en.wiktionary.org/wiki/crib -- definitions 14 and 23:
>    (usually in the plural) A collection of quotes or references for
>    use in speaking, for assembling a written document, or as an aid to
>    a project of some sort; a crib sheet.
>    [and]
>    A cheat sheet or past test used by students; crib sheet.
>

Re: Teaching Rakudo the tricks of Perl 5's regex optimiser

Reply via email to