What would the interest level be in an expanded regular expression
parser/matcher?
I've been integrating Scintilla into a project of mine that has its own
regular expression engine. Since my RE engine is visible to my users in
other contexts, I felt it would be best to use it as the RE engine for my
Scintilla integration as well - otherwise my users would have to remember
two RE syntaxes, and when to use which. So, I did a little massaging of my
engine's interface, and now it can be used as a drop-in replacement for
Scintilla's current RE engine.
If there's interest, I'd be happy to contribute my engine to Scintilla.
It's entirely my own work, so there are no copyright entanglements, and it
doesn't depend on any third-party code. It's fairly full-featured - beyond
the basics that are already in the Scintilla matcher, it handles alternation
(|), character classes (<alpha>, <^digit>, <alpha|digit>, etc.), intervals
(x{3,5}), group recall, non-capturing groups, shortest- and longest-match
closures, and look-ahead assertions (positive and negative). Documentation
is at http://www.tads.org/t3doc/doc/sysman/regex.htm.
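
For anyone who wants a quick feel for what those features do, here is a rough sketch. Note that it uses std::regex with its default ECMAScript syntax purely as a widely available stand-in; it is not my engine and not my engine's syntax (see the documentation link above for that), and the helper names FullMatch, FirstMatch and FeatureTour are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Illustrative helper (hypothetical name): true if all of `text` matches
// `pattern`, using std::regex's default ECMAScript syntax.
bool FullMatch(const std::string &text, const std::string &pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Illustrative helper (hypothetical name): the first substring of `text`
// that matches `pattern`, or an empty string if nothing matches.
std::string FirstMatch(const std::string &text, const std::string &pattern) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        return m.str(0);
    return std::string();
}

// One check per feature named in the list above; returns true if all hold.
bool FeatureTour() {
    return
        // alternation
        FullMatch("dog", "cat|dog") &&
        // named character classes
        FullMatch("a1", "[[:alpha:]][[:digit:]]") &&
        // intervals
        FullMatch("xxxx", "x{3,5}") && !FullMatch("xx", "x{3,5}") &&
        // group recall
        FullMatch("abab", "(ab)\\1") &&
        // non-capturing group
        FullMatch("abab", "(?:ab)+") &&
        // longest-match vs. shortest-match closure
        FirstMatch("<a><b>", "<.*>") == "<a><b>" &&
        FirstMatch("<a><b>", "<.*?>") == "<a>" &&
        // positive and negative look-ahead
        FirstMatch("foobar", "foo(?=bar)") == "foo" &&
        FirstMatch("foobar", "foo(?!bar)").empty();
}
```

The longest/shortest pair is probably the most useful one in an editor: the first form runs to the last ">" on the line, while the second stops at the first.
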
There are four drawbacks. The first, and probably biggest, is that it
follows my own coding style, which is rather different from Scintilla's.
Either we'd have to live with the inconsistency, or someone would have to go
through and reformat the code. The latter would be ideal, obviously, but
it's a fairly big job (~3500 lines) that I'm afraid I can't volunteer for.
The second snag is related: my code hasn't been run through the Borland
compiler, so there could potentially be a raft of warnings to fix. It's
unlikely to be all that bad, as this code has been ported to numerous
systems over several years, including Unix/gcc (where I believe it compiles
warning-free); but it's pretty much a foregone conclusion that a new
compiler will find something to complain about.
The third drawback is relatively minor: my RE syntax has a few small
differences from the canonical Unix-style RE syntax - e.g., the quoting
character is "%" rather than "\". It would probably be desirable to fix
that; this isn't a big job.
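
As a stopgap short of changing the parser itself, the quoting difference could be papered over by rewriting patterns before they reach the engine. The following is only a sketch under stated assumptions: the function name BackslashToPercent is made up, it assumes "%%" denotes a literal percent sign and that a bare backslash is an ordinary character in the "%" syntax, and it deals with the quoting character only, not with any of the other syntax differences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrite a backslash-quoted pattern so that it uses
// '%' as the quoting character instead.
// Assumptions (not verified against the engine): "%%" is a literal percent,
// and an unquoted backslash is an ordinary character in the '%' syntax.
std::string BackslashToPercent(const std::string &pattern) {
    std::string out;
    out.reserve(pattern.size());
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char ch = pattern[i];
        if (ch == '\\' && i + 1 < pattern.size()) {
            const char next = pattern[++i];
            if (next == '\\') {
                out += '\\';        // "\\" becomes a plain backslash
            } else {
                out += '%';         // "\x" becomes "%x"
                out += next;
            }
        } else if (ch == '%') {
            out += "%%";            // bare '%' must now be quoted
        } else {
            out += ch;              // everything else passes through
        }
    }
    return out;
}
```

With this, a search for a literal dot typed as "a\.b" would reach the engine as "a%.b", and "100%" would reach it as "100%%". A lone trailing backslash is passed through unchanged.
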
The fourth problem is that the code only handles SBCS. My original version
actually does all its work in UTF-8 (a multibyte Unicode encoding), so the
infrastructure is there for MBCS handling - but for the Scintilla conversion
I only accounted for single-byte characters. For proper MBCS support, the
code would have to be retrofitted to use Scintilla's standard mechanism,
whatever that is.
This wouldn't be too hard, as all string access is already encapsulated as a
class; but it's obviously work, and as with the reformatting I probably
wouldn't be able to volunteer.
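
To illustrate why encapsulating string access keeps this manageable, here is a minimal sketch of the general idea. Every name in it (CharSource, SingleByteSource, Utf8Source, CountChars) is hypothetical; it is neither the engine's actual class nor Scintilla's API, and the UTF-8 decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical interface: the matcher would see text only through this.
class CharSource {
public:
    virtual ~CharSource() = default;
    virtual std::size_t Length() const = 0;                   // length in bytes
    virtual unsigned long CharAt(std::size_t pos) const = 0;  // character at byte offset
    virtual std::size_t NextPos(std::size_t pos) const = 0;   // offset of the next character
};

// Single-byte text: every byte is one character.
class SingleByteSource : public CharSource {
    std::string text;
public:
    explicit SingleByteSource(std::string s) : text(std::move(s)) {}
    std::size_t Length() const override { return text.size(); }
    unsigned long CharAt(std::size_t pos) const override {
        return static_cast<unsigned char>(text[pos]);
    }
    std::size_t NextPos(std::size_t pos) const override { return pos + 1; }
};

// UTF-8 text: a character occupies one to four bytes.
class Utf8Source : public CharSource {
    std::string text;
    static std::size_t SeqLen(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1;  // stray byte: step over it as a single unit
    }
public:
    explicit Utf8Source(std::string s) : text(std::move(s)) {}
    std::size_t Length() const override { return text.size(); }
    unsigned long CharAt(std::size_t pos) const override {
        const unsigned char lead = static_cast<unsigned char>(text[pos]);
        const std::size_t len = SeqLen(lead);
        if (len == 1) return lead;
        unsigned long cp = lead & (0xFFu >> (len + 1));  // payload bits of the lead byte
        for (std::size_t k = 1; k < len && pos + k < text.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(text[pos + k]) & 0x3Fu);
        return cp;
    }
    std::size_t NextPos(std::size_t pos) const override {
        const std::size_t next = pos + SeqLen(static_cast<unsigned char>(text[pos]));
        return next < text.size() ? next : text.size();
    }
};

// Code written against the interface does not care which encoding is in use.
std::size_t CountChars(const CharSource &src) {
    std::size_t n = 0;
    for (std::size_t pos = 0; pos < src.Length(); pos = src.NextPos(pos)) ++n;
    return n;
}
```

The point of the design is that a matcher loop written like CountChars never touches raw bytes, so switching encodings means swapping the source object rather than editing the matcher.
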
My acquaintance with Scintilla is a relatively recent development, so for
all I know this might be an unpopular suggestion. If so, I'll happily
withdraw it - I don't mean it as a complaint about the current RE engine.
But if this seems like a good idea, I'd be happy to contribute this code,
even if it makes a rather dubious gift given the to-do list attached to it.
--Mike
_______________________________________________
Scintilla-interest mailing list
[email protected]
http://mailman.lyra.org/mailman/listinfo/scintilla-interest