On Wed, 2004-10-13 at 10:44, Matt Fowles wrote:
> I am of the opinion that we should treat regular expression as simply
> another language. Thus one can register different compilers for
> different regex syntaxes and we do not need to add more opcodes for
> them.
That is essentially what I've proposed, however it is important to
realize 2 things:
* Just because parrot exports a set of regex syntaxes does not
mean that those are the syntaxes that the user of a language
will see. Language designers might pre-process down to one of
those, or they might simply avoid all of them.
* I'm shifting the interface from rx_compile (which is still
there, though possibly as a PMC, given Leo's comments) to
new_fsa, and letting language designers write their own rx
compiler. Because states in the FSA can be continuations (which,
yes, means it's not a true FSA, because it's not finite) you
should be able to implement arbitrarily complex regexes, and
even Perl 6 rules (which are not regular expressions or true
FSAs at all because they maintain state as they recurse
infinitely) without stepping outside of this framework.
I am certain that I will make many mistakes while implementing this, as
I'm learning how to create and manage FSAs, but putting an
implementation in place seems to me to be the right way to start.
> This also has the advantage of placing their internals in a
> black box off to the side. So, the regex compiler can choose to
> agressively compile and optimize from the start or do it lazy at its
> whim while hiding behind the interface that the compile opcode already
> presents.
Again, this is all true, though it is also possible for the compiler
writer to "take control" by directly building and managing the FSA (e.g.
directing when/if minimization takes place, which is an optimization
that MUST take place before Parrot bytecode is generated).
The key here is that languages will be able to use diverse regular
expression syntaxes AND directly generated FSAs (e.g. for matching
high-level objects rather than characters), but use them all in the same
way. This is a fundamentally OO approach to FSA management, which was
inspired for me by the good work done by some of the excellent FSA
toolkits out there (which, woefully, are mostly in C++, and not truly
open source, so we cannot use them directly).
Several things are standing in my way right now, but I'm confident that
I'll find solutions:
1. The concept of an alphabet is, as yet, vague. Obviously ASCII
characters, ISO-Latin-1 characters and Unicode code points are
all possible input ranges, but so too are any finite range of
integers or enumerated values. More research will be required to
find out how to best special-case the common cases and make them
efficient, using the already discussed low-level parrot opcodes
for string matching.
2. It's not clear how a "continuation state" behaves when it
invokes its return continuation... how it directs the FSA to its
next state is probably my trickiest problem, as it must be a
mechanism that is compatible with the concept of an FSA (e.g.
you may have to declare what states any given continuation can
trigger upon return, or simply pass multiple return
continuations, one for each possible next state after the
continuation).
3. I don't yet know if my extended translation matrix is sufficient
for modern patter-matching (ala Perl 5). Certainly it's not
sufficient for look-behind assertions, but that particular case
might have to be a continuation state, rather than a regular
node in the FSA. Other than that sort of thing, I do want all of
Perl 5's regular expression syntax to be representable in my
Parrot FSAs (which, in case anyone was wondering, I'm thinking
strongly of translating directly from the NDFA, rather than
translating to the corresponding DFA first... that is a harder
problem, but probably one more suited to this application, which
incidentally opens the door to possible auto-parallelization
later on).
I'll be working on these over the weekend at my parents' place (nothing
like the ocean to make thinking easier).
--
â 781-324-3772
â [EMAIL PROTECTED]
â http://www.ajs.com/~ajs