RFC 72 (v1) The regexp engine should go backward as well as forward.

Perl6 RFC Librarian Mon, 28 Aug 2000 10:21:38 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

The regexp engine should go backward as well as forward. 

=head1 VERSION

  Maintainer: Peter Heslin <[EMAIL PROTECTED]>
  Date: 9 Aug 2000
  Last Modified: 28 Aug 2000
  Version: 1
  Mailing List: [EMAIL PROTECTED]
  Number: 72

=head1 ABSTRACT

It is proposed that the regular expression engine should be designed so that,
when it is in an accepting state, it will either consume the character after
the end of the currently matching part of the target string, or the character
just before its beginning, depending on instructions embedded in the
regexp itself.

=head1 CHANGES

Version 2 of this RFC redirects discussion of this topic to 
[EMAIL PROTECTED]

=head1 DESCRIPTION

The way humans scan text for a pattern is often to look for a distinctive
feature of the pattern sought, and then to look at the surrounding context
before and after.  The Perl regular expression engine is optimized to look such
fixed string landmarks in this way, but if you want to write a hand-optimized
regexp that will explicitly look backward from a certain match in this
fashion, your lookbehind is limited to fixed-length expressions (at least in
my Perl 5.005, and I assume this true for 5.6, too).

A more general approach than lookbehind might be useful.  I propose two new
regexp extensions which mean "jump and reverse" and "jump and go forward".
These might be (?r) and (?f) or (?[) and (?]) respectively, or some other pair.
It is important to note that this proposal does not suggest that the same
characters in a string should be matched twice or more in a single hit by
different parts of the same regexp going in different directions; that way
madness lies.  

The "jump" part of these commands means to jump immediately to the beginning
or the end of the currently matched part of the string and continue from there.
In this way, the match continues to grow one character at a time, but at both
ends.

=head2 Usefulness

The Perl regexp engine is excellent at finding fixed strings embedded in your
regexp and starting from there, but, as Friedl says in his book, it is
generally more efficient if you can guide a regular expression engine
yourself based upon your knowledge and assumptions about the potential input.
Imagine a very long input string containing data such as this:

    ... GCAAGAATTGAACTGTAG ...

If you want to match text that includes the string GAAC, but only when it
follows GAATT or any one of a large number of other different possibilities,
you could (a) code the whole match, including all the possibile prefixes,
into the regexp or (b) use GAAC as the match string and then select the
particular hits you want by positive and negative lookbehind.  For some large
amounts of data and very complex matching requirements that come before any
fixed string, Perl's optimizer can't do as well as a human and option (b) can
(in my experience, anyway) be an order of magnitude faster.  The problem is
that with only fixed-length lookbehind available this is often not a practical
option.

With the proposed extension, you could write:

    m/GAAC(?r)(TTAAG| .... )/

and the regexp engine doesn't have to go looking deep into your regexp to
know where it should start potential matches.

As a frivolous illustration, the string 

        ABCDEFGHIJKLM

would be matched by:

    m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/

=head2 Related Features

It will be important to know the offset where the match begins, as well as
where it ends (indeed it would be nice to have that info in Perl5 without
having to pay the C<length $&> performance penalty), so in addition to C<pos>,
there might be a function C<prepos> to give the start of the match -- or C<pos>
might return both end and start offsets in a list context.

It would be very nice if C<(?r)> would do something usefully optimized if it
appeared at the start of a regexp: C</$(?r).*CBA/> would then start matching at
the end of the string and proceed rapidly towards the front, until it hit
"ABC".  This could be particularly useful in conjunction with C<Backwards.pm>.
It would be extra work, however, as not only would the regexp engine itself
have to be modified to switch direction, but the code that "cheats" by
looking for likely places to start the engine (which presumably uses a
Boyer-Moore or similar algorithm) would have to run backward, too.

I have no idea whether this feature will help people parsing right-to-left
languages; it seems likely to help with bi-directional texts (see RFC 50).

=head1 IMPLEMENTATION

It may already be clear that the author of this RFC has no idea of how the
Perl regexp engine is implemented.  I have never even looked at the code of
this or any other regexp engine, nor am I competent to do so.  It may well be,
therefore, that what I propose is completely infeasible.  If so, and those who
are actually going to code this stuff say so, I will promptly withdraw this
RFC, which in the cirumstances may turn out to have been a Request For
Crucifixion.

Here are some potential implementation issues:

=head2 Backtracking

Presumably, the regexp engine maintains something like a stack which is
popped when a prospective match fails and it has to fall back to an earlier 
possibility.  If so, when we jump and switch direction, we will have to push
an indication of this onto the stack, so that when it is popped again we
know to jump and switch the other way when backtracking off a failed match.

=head2 Backreferences

The text captured in parentheses while (?r) is operative will be reversed
with respect to its order in the target string.

=head2 Multiple matching

The C</g> modifier would behave just as it does currently: the next attempt
begins the character after the end of the previous match.  If the previous
example of C</$(?r).*CBA/> were extended so that one might write C</(?r).*CBA/>
(without the anchor at the start), then we would want in this case (i.e. when the
regexp begins with (?r)) to make an exception to normal behavior and have the
next match begin the character before the beginning of the previous one.

=head1 REFERENCES

_Mastering Regular Expressions_, Jeffrey Friedl (O'Reilly)

RFC 50 (BiDirectional Support in PERL)
RFC 72 (v1) The regexp engine should go backward as well as forward.

Reply via email to