Re: Regular expressions and closures (was: Perl 6 Summary for 2004-10-01 through 2004-10-17)

Larry Wall Tue, 26 Oct 2004 17:16:55 -0700

On Tue, Oct 26, 2004 at 01:42:02PM -0400, Aaron Sherman wrote:
: Larry, while you're feeling chatty, I have a question about Perl 6
: regular expressions for you. You answered a question of mine, long ago
: with a correction. I had said something like:
: 
:       /ab(c|b){$1 eq 'c'}/
: 
: If I recall correctly you had said something like, "there is no plan
: (yet) to allow embedded closures to affect matching directly, other than
: failing outright." This implied (to me) that you were thinking along
: those lines, but were not sure if the regex engine would support it.


I think you might have misunderstood me slightly.  I was primarily
quibbling about your syntax, insofar as the above won't work because
the syntax of a bare closure indicates that the return value is to
be ignored, so it's normally used *only* for its side effects (in the
absence of a failure exception).  The current syntax for what you're
trying to write is:

        /ab(c|b) <($1 eq 'c')>/

which is equivalent to

        /ab(c|b) {fail unless $1 eq 'c'}/

The current match state is also magically communicated to submatches
within the closure so that progress on an embedded match is progress
on the outer match.  The $_ within the closure is therefore proxying
for the $0 of the outer match.

: Given as how I'm working on a prototype regex engine for Parrot, into
: which I intend to plug a Perl 6 rx parser I wanted to ask what your
: thoughts on the language are there.
: 
: My work so far, and discussions with friends suggest that my scheme for
: building an FSA with attached continuations will work as long as I place
: some heavy restrictions on the behavior and side-effects of the closure.

I suppose the usefulness of that will depend on what you mean by
"heavy restrictions".  But we have to have general side effects or
we can't build up our syntax tree.

: I have some code now, and should have a functional prototype in a couple
: of weeks.
: 
: The way I see this working is that you build an FSA like this:
: 
:  State 1:
:               match(token=>'a', target => 2),
:               match(token=>!'a', target => trap)

Does this notation imply that you're matching against 'a' twice?

:  State 2:
:               match(token=>'b', target => 3),
:               match(token=>!'b', target => trap)
:         NOTE: Up this point might be optimized by a back-end to a match
:         against the multi-character string, "ab", but that's not the
:         FSA's concern.
:  State 3 (start subexp mark, end subexp mark):
:               match(token => 'c', target => 4),
:               match(token => 'b', target => 4),
:               match(token => !['c','b'], target => trap)

I find this notation confusing.  In Perl 5 regex, the start and end are
considered separate states, and the end state would follow the matches,
or you wouldn't capture the "motion", and $1 would end up null.  But
then, I think in NFA, not DFA...

:  State 4 (closure {$1 eq 'c'}):
:               match(token => true, target => 5)
:               match(token => false, target => trap)

It's not the regex that would be evaluating for truth here.  It only needs
to make sure it can trap a failure.

:  State 5 (terminal)
: 
: This representation is, of course, after minimization. Comparison of
: states normally only takes into account if they are terminal or
: non-terminal. In this case, there are 3 additional attributes of a
: state:
: 
:       start subexp mark: boolean: start saving with this token
:       end subexp mark: boolean: stop saving after this token
:       closure: continuation: sub(RXContext $cx) returns(RXToken)

We also have the <{ $subregex }> form that has to be able to evaluate
a rule returned by a closure.

: For a state to be equivalent to another state, it must satisfy:
: 
:       state1.terminal == state2.terminal
:       state1.start_mark == state2.start_mark
:       state1.end_mark == state2.end_mark
:       state1.closure == state2.closure
: 
: and, of course, the normal recursive comparison of all transitions and
: the states they target.

I think you also have to consider *which* start and end you're saving here.
Don't want to mix up your paren counts...

Also have to allow saves to named variables.

: This allows for transformation and minimization of the FSA as well as
: arbitrary code execution. The obvious drawback is something that you've
: pointed out several times: the code may be executed many times, not at
: all or may be executed and then backtracked over (and then re-executed
: or not). This makes side-effects dangerous and unpredictable, but I
: think that's OK.

What I have said is that the ordinary side-effect form (bare closure)
must run at the canonical time for its location in the regex, even
if that disables certain optimizations.  The conditional form and
the regex returning form may be optimized away.

: I do think that in Perl, we'll want to add a backtrack API though so
: that you can do this:
: 
:       rule newuser {
:         user:(\w+) ssid:(\w+)
:         {
:               db_exec("begin transaction");
:               let $uid = db_exec("create_user($1,$2)");
:               BACKTRACK {
:                       db_exec("rollback transaction")
:               }
:         }
:         department:(\S+)
:         {
:               my $dpid = get_department($1) and
:                       db_exec("assign_dept($uid, $3)") and
:                       db_exec("commit transaction");
:         }
:       }

Um, that's what UNDO already is supposed to do.  Closures embedded
in rules logically return when backtracked over, and UNDO is already
set up to roll back on failure.

: These examples only use a true/false return from the closure, which may
: be all Perl allows for,

That's one of the things Perl doesn't allow for.  :-)

Perl allows you to return only

    success (void context)
    failure exception
    subrule to match

(I'm assuming <(cond)> is translated by the parser to {(cond) or fail}.)

: but the FSA itself will allow a closure to
: "match" any token (as if it had been in the input stream) by returning
: it (the token is consumed, but is translated to the empty string for
: purposes of sub-expressions and other "memory").

Matching one token seems relatively useless to me.  A closure typically
wants to match either 0 tokens or a sequence of tokens.

: Its important to keep
: this separate in concept from actually causing the FSA to transition to
: a target state (which the closure cannot do).

The closure must be able to change the current "pos" at least.  I don't
think it necessarily has to be able to pick the next state.  But it does
have to have access to the current $0 match state object.

: I can't think of a syntax
: off the top of my head for taking advantage of that, but its certainly
: something to bear in mind.
: 
: Larry et al., please let me know if this is in line with The Plan, as
: I'd like the first pass to be sufficient to implement Perl 6 rxes, so
: that even if the whole thing has to be re-written, at least we have a
: framework and API.

Be sure to keep in touch with Patrick and Luke, who are coming at this
from the direction of the Perl 6 parser.

Larry

Re: Regular expressions and closures (was: Perl 6 Summary for 2004-10-01 through 2004-10-17)

Reply via email to