> --- "Adam D. Lopresto" <[EMAIL PROTECTED]> wrote:
> > I propose that since the empty pattern is no longer legal (and
> > about time), we use "|" in patterns to indicate alternation without
> > preference, and "||" to indicate "try the first, then the second,
> > etc".
> 
> Hmm.... 
> A neat idea, but can you elaborate on when it would matter?

It would matter a lot for optimization.  If each alternative is a
subrule, each of which calls a lot of other subrules, then you could
gain a lot by not needing an ordered search through the possibilities.

I like Adam's symbols, with the junction/short-circuit
parallels.  Literally. :)

I'm currently working on Parse::OutsideIn, an experiment for larger
(non-peephole) optimizations on P6 regexen.  The idea is that it does
a recursive-descent match until it finds a top-down-independent
section of the grammar (very common), and then uses a compiled LR
bottom-up method on that section.

And of course, the "|"/"||" distinction could make that easier.

> > So
> >   "cat dog fox" ~~ /( fox | dog | cat )/;
> > might match any of the three, and in fact is quite likely to match
> > cat (since scanning the string could be a very long process), but
> >   "cat dog fox" ~~ /( fox || dog || cat )/;
> > would be guaranteed to first try fox, then if that fails backtrack
> > and try dog, and if that fails backtrack to try cat.  The choice of
> > symbols parallels the junction and short-circuiting or operators,
> > and in most cases when an alternation is specified, the programmer
> > has no preference among the alternatives (indeed, in most cases only
> > one could match anyway), so it seems silly to force the engine to
> > prefer the first one.
> 
> I thought the requirement to prefer the first was largely an
> optimization, so that it could bail when it had a match without
> continuing to search. ("Wouldn't you know, it was in the *last* place I
> looked for it!")
> 
> Are you proposing a more DFA-ish behavior, maybe?
> Some programs actually embed a DFA for a quick "does it match?" as
> *well* as an NFA for pulling out backreferences once the DFA finds a
> hit, but I doubt anybody is planning to embed a pre-matching DFA in
> Perl (if so, somebody please tell me now).

I am, in some guise.  Here's another, related optimization I'm
considering.  I've noticed that Parse::RecDescent (through $RD_TRACE)
spends a lot of time trying trees that would _never_ match... none of
their subrules/grandsubrules could possibly.

So, at grammar compile-time, I compute the closure (not the Perl kind)
of each rule's initial state, to find all the terminals that could
come up first.  Then, at a high level, I try to match one of the
terminals.  If it did match, I cache the result and progress down the
tree (perhaps guided by which one matched).  If it didn't match, I
fail the rule and don't even bother.

For instance:

    statement: expression | declaration
    declaration: sub_declaration
    # ... lots of rules rooted at declaration
    sub_decl_keyword: /sub/
    method_decl_keyword: /method/
    # ... rules rooted at expression

So at compile time, it would find the closure of C<declaration>, which
would turn out to be /sub/|/method/.  It would try to match one of
those before trying C<sub_declaration>; if neither matched, it would
move on to matching C<expression>.
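That compile-time closure is essentially a FIRST-set computation.
Here's a minimal sketch in Python; the grammar encoding and rule names
are invented for illustration, not Parse::OutsideIn's actual
representation:

```python
# Sketch of the compile-time "closure": for each rule, collect the
# terminals that could begin a match (a FIRST-set computation).
# Assumes no empty productions, so only the leftmost symbol matters.

def first_terminals(grammar, rule, _path=frozenset()):
    """Return the set of terminals that can start a match of `rule`."""
    if rule in _path:                 # guard against left recursion
        return set()
    _path = _path | {rule}
    firsts = set()
    for alternative in grammar[rule]:
        head = alternative[0]         # leftmost symbol decides what comes first
        if head in grammar:           # nonterminal: recurse
            firsts |= first_terminals(grammar, head, _path)
        else:                         # terminal: it's a possible first token
            firsts.add(head)
    return firsts

grammar = {
    'statement':          [['expression'], ['declaration']],
    'declaration':        [['sub_declaration'], ['method_declaration']],
    'sub_declaration':    [['sub', 'name', 'block']],
    'method_declaration': [['method', 'name', 'block']],
    'expression':         [['number'], ['name']],
}

print(sorted(first_terminals(grammar, 'declaration')))  # ['method', 'sub']
```

If neither terminal matches at the current position, the whole
C<declaration> subtree can be skipped without trying any of its rules.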

My theory is that, if P6 is going to be used for grammars, little,
local optimizations aren't going to help as much as these kinds
will.

> > I'm imagining that in the first example, the implementation would
> > probably build an FSA and process each letter as it comes, while the
> > second would rely on backtracking.
> 
> Sounds like DFA/NFA to me....

Or an FSA.  You can't do full grammars in a DFA, though; recursion
needs a stack.

> > What think you all?

I rather like the idea.  Finally, I'll talk about the language
implications of the idea....

So you're proposing that "|" be a "match order not guaranteed" match.
There are some constructs that don't sit well with that:

    m:w/ if <expression> then <statement> else <statement>
       | if <expression> then <statement>/

One would argue that this would be a case where you would use "||".
The cases where match order matters in a recursive-descent match
correspond to shift/reduce conflicts in shift/reduce parsers.
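Ordered alternation is exactly what Perl 5 (and Python) regexes do, so
the if/then/else case can be mocked up with a plain regex; the
patterns here are toy reductions of the rule above:

```python
import re

# Dangling-else in miniature: the long form must be tried first,
# or the short form shadows the else clause.  Python's '|' is
# ordered, i.e. it behaves like the proposed '||'.
LONG  = r'if (\w+) then (\w+) else (\w+)'
SHORT = r'if (\w+) then (\w+)'

text = "if x then a else b"

ordered = re.match(f'(?:{LONG})|(?:{SHORT})', text)
print(ordered.group(0))   # 'if x then a else b': the long form wins

# An unordered '|' could legally pick SHORT instead, matching only
# "if x then a" and stranding " else b" at the current position:
print(re.match(SHORT, text).group(0))   # 'if x then a'
```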

So what I'm looking for is some kind of guarantee that "|" could
provide, instead of being entirely random.  Junctions don't guarantee
that they evaluate all their junctends, or that they do it in any
particular order.  But they do guarantee that the result would be the
same (ignoring side-effects) as if they had.

Maybe that's where our guarantee lies---failure side-effects.

    grammar X {
        rule A { B || C }
        rule B { (\w+) {
            if $1 eq $naughty_word {
                print "No bad words!";
                fail;
            }
        }}
        rule C {...}
    }
    grammar Y is X {
        rule A { B | C }
    }

The difference between these two grammars is that X would be required
to print "No bad words!" if C was the one that finally matched, while
Y would not be required to.
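The X/Y contrast can be mocked up outside Perl entirely: a toy "||"
that tries rules strictly left to right, with a log standing in for
the print side effect (all names here are invented):

```python
import re

NAUGHTY = "darn"
log = []   # stands in for the "No bad words!" print side effect

def rule_B(text):
    """Like rule B: matches a word, but fails noisily on the naughty one."""
    m = re.match(r'\w+', text)
    if m and m.group(0) == NAUGHTY:
        log.append("No bad words!")   # side effect on the failure path
        return None                   # the fail
    return m

def rule_C(text):
    """A permissive fallback rule that also matches a word."""
    return re.match(r'\w+', text)

def ordered_alt(text, *rules):
    """'||' semantics: try each rule strictly left to right."""
    for rule in rules:
        m = rule(text)
        if m:
            return m
    return None

# Grammar X (B || C): B runs first, so its failure side effect is guaranteed.
log.clear()
ordered_alt(NAUGHTY, rule_B, rule_C)
print(log)   # ['No bad words!']

# Grammar Y (B | C): an optimizer may try C first; C matches, B never
# runs, and the side effect silently disappears.
log.clear()
ordered_alt(NAUGHTY, rule_C, rule_B)   # one legal evaluation order for '|'
print(log)   # []
```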

Maybe the "|"/"||" distinction isn't needed, and we just need a
declarator on rules that says they are side-effect-free, and can thus
be optimized.

    grammar X {
        rule A is pure { B | C }  # Can be optimized to try C first, etc.
        rule B is pure { \w+ }
        rule C is pure { \d+ }
    }

    grammar Y is X {
        rule B { (\w+) {
            if $1 eq $naughty_word {
                print "No bad words!";
                fail;
            }
        }}
    }

Since B is no longer "pure", the A rule Y inherits from X can't
reorder the alternatives.
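Here's a sketch of what an optimizer might do with that purity
information; the Rule class, the shortest-pattern cost model, and all
the names are invented for illustration:

```python
import re
from dataclasses import dataclass

tried = []   # records which rules the engine attempted, in order

@dataclass
class Rule:
    name: str
    pattern: str
    pure: bool = True   # "is pure": no side effects, safe to reorder

    def match(self, text):
        tried.append(self.name)
        return re.match(self.pattern, text)

def alternation(text, rules):
    """'|' under the purity scheme: free to reorder (here, shortest
    pattern first as a toy cost model) only when every alternative is
    pure; otherwise fall back to strict '||' order."""
    order = rules
    if all(r.pure for r in rules):
        order = sorted(rules, key=lambda r: len(r.pattern))
    for r in order:
        m = r.match(text)
        if m:
            return m
    return None

# Grammar X: both rules pure, so the engine picks its own order.
B = Rule('B', r'\w+\s*=', pure=True)
C = Rule('C', r'\d+', pure=True)
tried.clear()
alternation("123", [B, C])
print(tried)   # ['C']: the cheaper rule was tried first

# Grammar Y: B overridden by an impure version; declared order is forced.
B_impure = Rule('B', r'\w+\s*=', pure=False)
tried.clear()
alternation("123", [B_impure, C])
print(tried)   # ['B', 'C']
```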

I like this solution better than making a new operator.  In Perl
tradition, it puts the optimization burden on the compiler, not the
programmer.

Luke
