Re: rx_compile and FSAs

Aaron Sherman Wed, 13 Oct 2004 08:25:57 -0700

On Wed, 2004-10-13 at 10:44, Matt Fowles wrote:

> I am of the opinion that we should treat regular expression as simply
> another language.  Thus one can register different compilers for
> different regex syntaxes and we do not need to add more opcodes for
> them.


That is essentially what I've proposed, however it is important to
realize 2 things:

      * Just because parrot exports a set of regex syntaxes does not
        mean that those are the syntaxes that the user of a language
        will see. Language designers might pre-process down to one of
        those, or they might simply avoid all of them.
      * I'm shifting the interface from rx_compile (which is still
        there, though possibly as a PMC, given Leo's comments) to
        new_fsa, and letting language designers write their own rx
        compiler. Because states in the FSA can be continuations (which,
        yes, means it's not a true FSA, because it's not finite) you
        should be able to implement arbitrarily complex regexes, and
        even Perl 6 rules (which are not regular expressions or true
        FSAs at all because they maintain state as they recurse
        infinitely) without stepping outside of this framework.

I am certain that I will make many mistakes while implementing this, as
I'm learning how to create and manage FSAs, but putting an
implementation in place seems to me to be the right way to start.

>   This also has the advantage of placing their internals in a
> black box off to the side.  So, the regex compiler can choose to
> agressively compile and optimize from the start or do it lazy at its
> whim while hiding behind the interface that the compile opcode already
> presents.

Again, this is all true, though it is also possible for the compiler
writer to "take control" by directly building and managing the FSA (e.g.
directing when/if minimization takes place, which is an optimization
that MUST take place before Parrot bytecode is generated).

The key here is that languages will be able to use diverse regular
expression syntaxes AND directly generated FSAs (e.g. for matching
high-level objects rather than characters), but use them all in the same
way. This is a fundamentally OO approach to FSA management, which was
inspired for me by the good work done by some of the excellent FSA
toolkits out there (which, woefully, are mostly in C++, and not truly
open source, so we cannot use them directly).

Several things are standing in my way right now, but I'm confident that
I'll find solutions:

     1. The concept of an alphabet is, as yet, vague. Obviously ASCII
        characters, ISO-Latin-1 characters and Unicode code points are
        all possible input ranges, but so too are any finite range of
        integers or enumerated values. More research will be required to
        find out how to best special-case the common cases and make them
        efficient, using the already discussed low-level parrot opcodes
        for string matching.
     2. It's not clear how a "continuation state" behaves when it
        invokes its return continuation... how it directs the FSA to its
        next state is probably my trickiest problem, as it must be a
        mechanism that is compatible with the concept of an FSA (e.g.
        you may have to declare what states any given continuation can
        trigger upon return, or simply pass multiple return
        continuations, one for each possible next state after the
        continuation).
     3. I don't yet know if my extended translation matrix is sufficient
        for modern patter-matching (ala Perl 5). Certainly it's not
        sufficient for look-behind assertions, but that particular case
        might have to be a continuation state, rather than a regular
        node in the FSA. Other than that sort of thing, I do want all of
        Perl 5's regular expression syntax to be representable in my
        Parrot FSAs (which, in case anyone was wondering, I'm thinking
        strongly of translating directly from the NDFA, rather than
        translating to the corresponding DFA first... that is a harder
        problem, but probably one more suited to this application, which
        incidentally opens the door to possible auto-parallelization
        later on).

I'll be working on these over the weekend at my parents' place (nothing
like the ocean to make thinking easier).

-- 
â 781-324-3772
â [EMAIL PROTECTED]
â http://www.ajs.com/~ajs

Re: rx_compile and FSAs

Reply via email to