Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Peter Heslin Wed, 13 Sep 2000 18:30:59 -0700
On Tue, Sep 12, 2000 at 05:16:17AM +0100, Hugo wrote:
> :Simply put, I want variable-length lookbehind.
> 
> The difficulty with variable-length lookbehind (let's call it
> VLLB) is this: suppose that we want to match "abcdef...xyz" =~
> /(?<=x+)y/. In theory, to check the possible /x+/ matches in
> the right order [0] we need to check whether there we can match
> 0 characters at offset 0 (no), then check whether we can match
> 1 character at offset 0 or 0 characters from offset 1 (no), then
> check whether we can match 2 characters from offset 0, or 1 from
> 1, or 0 from 2 ... and so it goes on. So we'd be trying a complete
> subpattern, of arbitrary complexity, against O(n^2) strings.
> 
> Now in practice it isn't quite so bad - in the above example we
> could expect the optimiser to decide that there is a fixed
> substring 'y' at the start of the match, so we'd jump straight
> to an offset where we found that, then do only O(n) matches of
> the possible substrings preceding it. But that only kicks in
> where the optimiser finds something good to work with.

This is precisely why I would prefer the syntax I proposed (to which
Bart Lateur quite reasonably objected in an earlier post):  for me, an
important part of this proposal (and not merely an incidental one) is to
let the programmer (and not the optimiser) determine at what point to
make the lookbehind happen.

Bart pointed out that it would be more consistent to use the type of
syntax you have also (unconsciously?) used:

"abcdef...xyz" =~ /(?<=x+)y/

rather than the syntax in the RFC, which might be:

"abcdef...xyz" =~ /y(?`=x+)/

It is true that my syntax is weirder, in that the lookbehind comes in
the middle, or even at the end of the regexp, but it has the advantage
of letting the programmer decide when the lookbehind should happen.
This is an advantage rather than a drawback because:

1) it means that the regexp engine *must*, by definition, match the
`y' in this example before it attempts the lookbehind -- thus the
O(n^2) behavior is avoided (or at least its avoidance is in the hands of
the programmer).

2) It frees the regexp code from the responsibility of doing any special
optimizations -- just match the part of the regexp that comes before the
lookbehind as usual, with normal optimizations, and only then is it
allowed to try matching backwards into the target string, starting
from the offset of the current prospective match.

3) The idea here is not to push work away from the optimizer and onto
the programmer (although that would be a nice side-effect for
implementability) -- rather, the point is to assume that the programmer
is using VLLB for a reason, and that that reason is because she has some
insight into the nature of the target data and *wants* to tell Perl to
jump back *here* rather than *there*, and then to continue forwards
with the match ...

I realize that the syntax is weird, but my feeling was that this
approach would be easier to implement and more useful, since on the one
hand the programmer would be doing the optimization, rather than Perl,
and on the other this added power to determine some part of how the
match is found would be desireable as an advanced feature (acknowledging
that it would also give the programmer the power write a really, really
inefficient regexp, if she so chose).

Incidentally, this is why I proposed new syntax, rather than extending
the old (?<= ... ) syntax, as Mark Dominus suggested:  this feature
would work in a somewhat different way to the old lookbehind, and I
thought it might be good to make it look different.

If in any case the regexp code is unlikely to be rewritten from ground
up, as Mark says in another post, then there may be little chance of
this feature being implemented; nor does there seem to be the
tremendous groundswell of enthusiasm here for this functionality that I
had eagerly anticipated.  I'll make a pitch for it anyway at the end of
my talk at YAPC::Europe, and then I'll freeze the RFC.  

Thanks for the input,

Peter
Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Reply via email to