On Wed, Sep 01, 2004 at 01:57:32PM -0400, Dan Sugalski wrote: : I promised Patrick this a while back but never got it, so here it is. : : This is a list of the semantics that I see as needed for a regex : engine. When we have 'em, we'll map them to string ops, and may well : add in some special-case code for faster access. : : *) extract substring
Mostly you don't want to go to the trouble of extracting substrings until you're forced to, because you're continually creating and destroying backrefs into your string, and you don't want to be copying characters around for that. As long as there's some kind of COWish semantics to keep around an original copy of the string being searched, that's probably all the regex engine itself wants. It's generally the outer code that wants to make a copy of $1 et al. : *) exact string compare : *) find string in string And maybe case insensitive variants of these, unless it turns out to be better to combine it with other required composition/decomposition. But then matching against the contents of a variable repeatedly is likely to induce repeated canonicalizations. : *) find first character of class X in string : *) find first character not of class X in string We're gonna run into the "what's a character?" issue here, especially at higher Unicode levels where what the user things of as a single character is really a sequence of codepoints. From perspective of a positive match, the character lengths can be largely self-defining. >From the perspective of a negative match, the engine has to know what "." means so you can skip one when the class doesn't match. (And the length of "." doesn't necessarily map to the length of all the entries in the class...) In particular, \n in Perl 6 has to match a set of weird sequences. Which arguably could be matched by a routine if character classes aren't up to it. (Also, the Perl 6 parser may be based on the notion of a set of hash keys being treated a bit like a lex grammar, and if we can use character class lookup for that, it might need to have some "longest token first" semantics. Which you might need for ordinary classes anyway as soon as you admit sequences.) (Also, minor nit, I'm not sure "find" is the right verb here and elsewhere. Mostly the regex engine just wants to check the "current" location. The rest is control flow.) : *) find boundary between X and not-X : *) Find boundary defined by arbitrary code (mainly for word breaks) We might have to use arbitrary code to match arrays and hashes as well, if the opcodes support only scalar string matches. : *) create new class X : *) add or subtract character to class X : *) create union|intersection|difference of two classes Not sure you really need opcodes for those if character classes are just real objects with a particular interface. : I think this about does it, and we do some of this already. Are there : semantics people see as missing, or need more explanation? If so, : pipe up, we'll nail them down, then get the op mapping (with : implementation) and go from there. I think that most of the other issues revolve around control flow and remembering your current state, and being able to backtrack out of that state, all of which Parrot can presumably handle with existing ops, though perhaps not as efficiently as we might like. I see one other potential gotcha with respect to backtracking and closures. In P6, a closure can declare a hypothetical variable that is restored only if the closure exits "unsuccessfully". Within a rule, an embedded closure is unsuccessful if it is backtracked over. But that implies that you can't know whether you have a successful return until the entire regex is matched, all the way down, and all the way back out the top, or at least out far enough that you know you can't backtrack into this closure. Abstractly, the closure doesn't return until the entire rest of the match is decided. Internally, of course, the closure probably returns as soon as you run into the end of it. So we may have to jimmy the meaning of hypotheticality in that context to defer undoing such variables until we hit a failure continuation of some sort. That's *probably* doable with the current opcodes, but maybe not optimally. In any event, we have to do all that anyway for $1, $2, et al. whether they're inside or outside of closures. Larry