>I was thinking the same thing.
>How would we go about that?

Ok, so maybe I called it wrong saying that now wasn't a good time to
start talking about it.  Well, 0330 EDT probably isn't the right time
for me, but I'm up, so here goes.

>I kept having ideas for improving regexp but they were already in oro so there
> seemed no point.

In no way should the oro stuff neutralize regexp development.  jakarta-oro
is an umbrella for Java text processing tools, which at the moment
is heavily oriented toward regular expressions.  Hopefully we'll
eventually start working on new generic classes, such as tokenizers
and regular expression factories, that can use any of the regular
expression classes that implement the right interfaces.  There's an awk
package, glob expression classes, a high level Perl package, and a low
level Perl package (yes, the Perl5 classes in .text.regex should probably
be moved into the .perl package, and I'd like to discuss making that
code-breaking change in a future release).  What ties it all together
are the interfaces in the .text.regex package.

The most straightforward approach to combining the projects would be to
integrate the regexp code as another oro package and have the classes
implement the interfaces in .text.regex.  Even if the classes have
to be refactored, a backwards compatible set of classes can be constructed
around the new ones.  Some methods unique to regexp (e.g., grep) might be
generalized and moved into the .text.regex.Util class (I know, terrible name
for the class, but it's another historical artifact).  I haven't studied
the performance issues with CharacterIterator.  Having a generic interface
for input is the right thing to do from a design perspective.  The oro stuff
never did it because in the JDK 1.0.2 days, invoking a method to get at
each character was death in terms of performance.  If we can show
that it's no longer a problem because today's JITs inline the accessor
methods, making them as cheap as an array index, we ought to apply the
CharacterIterator interface across the board (but I think we'd
need to add some methods to it).
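
One rough way to check the accessor-versus-array question is to run the
same scan both ways and compare.  The sketch below is only that, a
sketch: CharInput and ArrayInput are hypothetical stand-ins invented
for this example (they are not the regexp CharacterIterator or any oro
class), and the timing loop is naive, so a proper benchmark harness
such as JMH would be needed before drawing real conclusions.

```java
/** Hypothetical generic input interface; a stand-in, not the regexp CharacterIterator. */
interface CharInput {
    char charAt(int index);
    int length();
}

/** Hypothetical array-backed implementation of the stand-in interface. */
final class ArrayInput implements CharInput {
    private final char[] data;

    ArrayInput(char[] data) { this.data = data; }

    public char charAt(int index) { return data[index]; }

    public int length() { return data.length; }
}

final class InputAccessSketch {
    /** Counts occurrences of target by indexing the array directly. */
    static int countDirect(char[] data, char target) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) count++;
        }
        return count;
    }

    /** Counts occurrences of target through the accessor method. */
    static int countViaInterface(CharInput in, char target) {
        int count = 0;
        for (int i = 0, n = in.length(); i < n; i++) {
            if (in.charAt(i) == target) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] data = new char[1 << 20];
        java.util.Arrays.fill(data, 'a');
        for (int i = 0; i < data.length; i += 7) data[i] = 'x';
        CharInput input = new ArrayInput(data);

        // Warm-up rounds so the JIT has a chance to compile and inline both loops.
        int sink = 0;
        for (int round = 0; round < 50; round++) {
            sink += countDirect(data, 'x') + countViaInterface(input, 'x');
        }

        long t0 = System.nanoTime();
        sink += countDirect(data, 'x');
        long t1 = System.nanoTime();
        sink += countViaInterface(input, 'x');
        long t2 = System.nanoTime();

        System.out.println("direct array:   " + (t1 - t0) + " ns");
        System.out.println("via interface:  " + (t2 - t1) + " ns");
        System.out.println("(checksum " + sink + ")");
    }
}
```

If the two timings come out close on current VMs, that would support
putting a generic input interface in front of the matchers; if not, the
array-based entry points are still worth keeping.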

My impression was that regexp originally set out to implement POSIX
regular expressions.  If that is still a goal, I think the fit is
very good because it gives programmers yet another set of options.
By having a lot of choices, programmers can not only choose the
regular expression syntax they are most familiar with, but they
can also choose a regular expression syntax based on the performance
characteristics of the implementation for the type of matching
they're going to be doing.

Some things can never be implemented in the Perl5 classes in
.text.regex because of their implementation approach.
For example, even though the interfaces are object-oriented, the
implementation is monolithic (for performance reasons), so it can't be
dynamically customized by creating custom matching objects.  Based
on a post to regexp, it seemed like regexp might better support
customization, which would further distinguish the package.  I like
the idea of being able to pick the right tool for the job, but not
have to change the code.  Add the right factory abstraction, and you
can write most of your code using just the Pattern, PatternCompiler,
and PatternMatcher interfaces and just change a runtime property
to switch the regular expression syntax (although some of the universal
pattern compilation constants should be moved out of PatternCompiler
implementations and into PatternCompiler to facilitate this).
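
To make the factory idea concrete, here is a minimal sketch of picking
a syntax at run time from a property.  Everything in it is assumed for
illustration: SyntaxCompiler is a simplified stand-in for the role
PatternCompiler plays (it is not the oro interface), the property name
sketch.regex.syntax is made up, and the two "syntaxes" are just
java.util.regex and a toy glob translation so that the example is
self-contained.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical, simplified stand-in for a syntax-neutral compiler interface. */
interface SyntaxCompiler {
    /** Compiles an expression into a whole-string matcher. */
    Predicate<String> compile(String expression);
}

final class CompilerFactorySketch {
    /** Made-up property name, used only in this sketch. */
    static final String SYNTAX_PROPERTY = "sketch.regex.syntax";

    /** Returns a compiler for the named syntax. */
    static SyntaxCompiler forSyntax(String syntax) {
        switch (syntax) {
            case "regex":
                return expr -> {
                    Pattern p = Pattern.compile(expr);
                    return s -> p.matcher(s).matches();
                };
            case "glob":
                return expr -> {
                    Pattern p = Pattern.compile(globToRegex(expr));
                    return s -> p.matcher(s).matches();
                };
            default:
                throw new IllegalArgumentException("unknown syntax: " + syntax);
        }
    }

    /** Picks the syntax from the runtime property, defaulting to "regex". */
    static SyntaxCompiler fromSystemProperty() {
        return forSyntax(System.getProperty(SYNTAX_PROPERTY, "regex"));
    }

    /** Toy glob translation: '*' matches any run, '?' one character, the rest is literal. */
    static String globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The calling code never names a concrete syntax; only the property does.
        SyntaxCompiler compiler = fromSystemProperty();
        Predicate<String> matcher = compiler.compile(args.length > 0 ? args[0] : "a+b");
        System.out.println(matcher.test(args.length > 1 ? args[1] : "aaab"));
    }
}
```

Running it with -Dsketch.regex.syntax=glob switches the syntax without
touching the calling code, which is the point: application code depends
only on the interface, and the choice of implementation is deferred to
configuration.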

It's 0400 EDT and I'm rambling.  Basically, I think there's
a fit and it will involve some cross-pollination, but it will
probably be more regexp fitting into the oro interfaces than
the other way around.

daniel

