On Wed, 2004-10-13 at 10:44, Matt Fowles wrote: > I am of the opinion that we should treat regular expression as simply > another language. Thus one can register different compilers for > different regex syntaxes and we do not need to add more opcodes for > them.
That is essentially what I've proposed, however it is important to realize 2 things: * Just because parrot exports a set of regex syntaxes does not mean that those are the syntaxes that the user of a language will see. Language designers might pre-process down to one of those, or they might simply avoid all of them. * I'm shifting the interface from rx_compile (which is still there, though possibly as a PMC, given Leo's comments) to new_fsa, and letting language designers write their own rx compiler. Because states in the FSA can be continuations (which, yes, means it's not a true FSA, because it's not finite) you should be able to implement arbitrarily complex regexes, and even Perl 6 rules (which are not regular expressions or true FSAs at all because they maintain state as they recurse infinitely) without stepping outside of this framework. I am certain that I will make many mistakes while implementing this, as I'm learning how to create and manage FSAs, but putting an implementation in place seems to me to be the right way to start. > This also has the advantage of placing their internals in a > black box off to the side. So, the regex compiler can choose to > agressively compile and optimize from the start or do it lazy at its > whim while hiding behind the interface that the compile opcode already > presents. Again, this is all true, though it is also possible for the compiler writer to "take control" by directly building and managing the FSA (e.g. directing when/if minimization takes place, which is an optimization that MUST take place before Parrot bytecode is generated). The key here is that languages will be able to use diverse regular expression syntaxes AND directly generated FSAs (e.g. for matching high-level objects rather than characters), but use them all in the same way. This is a fundamentally OO approach to FSA management, which was inspired for me by the good work done by some of the excellent FSA toolkits out there (which, woefully, are mostly in C++, and not truly open source, so we cannot use them directly). Several things are standing in my way right now, but I'm confident that I'll find solutions: 1. The concept of an alphabet is, as yet, vague. Obviously ASCII characters, ISO-Latin-1 characters and Unicode code points are all possible input ranges, but so too are any finite range of integers or enumerated values. More research will be required to find out how to best special-case the common cases and make them efficient, using the already discussed low-level parrot opcodes for string matching. 2. It's not clear how a "continuation state" behaves when it invokes its return continuation... how it directs the FSA to its next state is probably my trickiest problem, as it must be a mechanism that is compatible with the concept of an FSA (e.g. you may have to declare what states any given continuation can trigger upon return, or simply pass multiple return continuations, one for each possible next state after the continuation). 3. I don't yet know if my extended translation matrix is sufficient for modern patter-matching (ala Perl 5). Certainly it's not sufficient for look-behind assertions, but that particular case might have to be a continuation state, rather than a regular node in the FSA. Other than that sort of thing, I do want all of Perl 5's regular expression syntax to be representable in my Parrot FSAs (which, in case anyone was wondering, I'm thinking strongly of translating directly from the NDFA, rather than translating to the corresponding DFA first... that is a harder problem, but probably one more suited to this application, which incidentally opens the door to possible auto-parallelization later on). I'll be working on these over the weekend at my parents' place (nothing like the ocean to make thinking easier). -- â 781-324-3772 â [EMAIL PROTECTED] â http://www.ajs.com/~ajs