Re: Semantics for regexes

Larry Wall Wed, 01 Sep 2004 13:08:11 -0700

On Wed, Sep 01, 2004 at 01:57:32PM -0400, Dan Sugalski wrote:
: I promised Patrick this a while back but never got it, so here it is.
: 
: This is a list of the semantics that I see as needed for a regex 
: engine. When we have 'em, we'll map them to string ops, and may well 
: add in some special-case code for faster access.
: 
: *) extract substring


Mostly you don't want to go to the trouble of extracting substrings
until you're forced to, because you're continually creating and
destroying backrefs into your string, and you don't want to be
copying characters around for that.  As long as there's some kind
of COWish semantics to keep around an original copy of the string
being searched, that's probably all the regex engine itself wants.
It's generally the outer code that wants to make a copy of $1 et al.

: *) exact string compare
: *) find string in string

And maybe case insensitive variants of these, unless it turns out to
be better to combine it with other required composition/decomposition.
But then matching against the contents of a variable repeatedly is
likely to induce repeated canonicalizations.

: *) find first character of class X in string
: *) find first character not of class X in string

We're gonna run into the "what's a character?" issue here, especially
at higher Unicode levels where what the user things of as a single
character is really a sequence of codepoints.  From perspective of a
positive match, the character lengths can be largely self-defining.
>From the perspective of a negative match, the engine has to know what
"." means so you can skip one when the class doesn't match.  (And the
length of "." doesn't necessarily map to the length of all the entries
in the class...)

In particular, \n in Perl 6 has to match a set of weird sequences.
Which arguably could be matched by a routine if character classes
aren't up to it.

(Also, the Perl 6 parser may be based on the notion of a set of hash keys
being treated a bit like a lex grammar, and if we can use character
class lookup for that, it might need to have some "longest token
first" semantics.  Which you might need for ordinary classes anyway
as soon as you admit sequences.)

(Also, minor nit, I'm not sure "find" is the right verb here and
elsewhere.  Mostly the regex engine just wants to check the "current"
location.  The rest is control flow.)

: *) find boundary between X and not-X
: *) Find boundary defined by arbitrary code (mainly for word breaks)

We might have to use arbitrary code to match arrays and hashes as well,
if the opcodes support only scalar string matches.

: *) create new class X
: *) add or subtract character to class X
: *) create union|intersection|difference of two classes

Not sure you really need opcodes for those if character classes are
just real objects with a particular interface.

: I think this about does it, and we do some of this already. Are there 
: semantics people see as missing, or need more explanation? If so, 
: pipe up, we'll nail them down, then get the op mapping (with 
: implementation) and go from there.

I think that most of the other issues revolve around control flow
and remembering your current state, and being able to backtrack out
of that state, all of which Parrot can presumably handle with existing
ops, though perhaps not as efficiently as we might like.

I see one other potential gotcha with respect to backtracking and
closures.  In P6, a closure can declare a hypothetical variable
that is restored only if the closure exits "unsuccessfully".  Within
a rule, an embedded closure is unsuccessful if it is backtracked over.
But that implies that you can't know whether you have a successful
return until the entire regex is matched, all the way down, and all the
way back out the top, or at least out far enough that you know you
can't backtrack into this closure.  Abstractly, the closure doesn't
return until the entire rest of the match is decided.  Internally,
of course, the closure probably returns as soon as you run into the
end of it.  So we may have to jimmy the meaning of hypotheticality in
that context to defer undoing such variables until we hit a failure
continuation of some sort.  That's *probably* doable with the current
opcodes, but maybe not optimally.  In any event, we have to do all that
anyway for $1, $2, et al. whether they're inside or outside of closures.

Larry

Re: Semantics for regexes

Reply via email to