What would the interest level be in an expanded regular expression parser/matcher?

I've been integrating Scintilla into a project of mine that has its own regular expression engine. Since my RE engine is visible to my users in other contexts, I felt it would be best to use it as the RE engine for my Scintilla integration as well - otherwise my users would have to remember two RE syntaxes, and when to use which. So, I did a little massaging of my engine's interface, and now it can be used as a drop-in replacement for Scintilla's current RE engine.

If there's interest, I'd be happy to contribute my engine to Scintilla. It's entirely my own work, so there are no copyright entanglements, and it doesn't depend on any third-party code. It's fairly full-featured - beyond the basics that are already in the Scintilla matcher, it handles alternation (|), character classes (<alpha>, <^digit>, <alpha|digit>, etc.), intervals (x{3,5}), group recall, non-capturing groups, shortest- and longest-match closures, and look-ahead assertions (positive and negative). Documentation is at http://www.tads.org/t3doc/doc/sysman/regex.htm.
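
For anyone who wants to see what those features look like in practice, here is a minimal sketch. Note that it uses the C++ standard library's std::regex (ECMAScript syntax), not my engine, so the spelling of each construct differs from what the documentation above describes. The helper names full_match and first_match are made up for this illustration.

```cpp
#include <regex>
#include <string>

// NOTE: illustration only. This uses std::regex with ECMAScript syntax,
// not the engine discussed in this message. Helper names are hypothetical.

// Returns true if the whole of `text` matches `pattern`.
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Returns the first substring of `text` matched by `pattern`, or "" if none.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    return std::regex_search(text, m, std::regex(pattern)) ? m.str(0)
                                                           : std::string();
}

// Example patterns for each feature named above (ECMAScript spelling):
//   alternation            cat|dog
//   character classes      [[:alpha:][:digit:]]+
//   intervals              a{3,5}
//   group recall           (ab)\1
//   non-capturing group    (?:ab){2}
//   longest-match closure  <.*>
//   shortest-match closure <.*?>
//   positive look-ahead    \d+(?= EUR)
//   negative look-ahead    \b\d+\b(?! USD)
```

With these helpers, first_match("<a><b>", "<.*>") returns the whole string, while first_match("<a><b>", "<.*?>") stops at "<a>", which is the difference between the longest- and shortest-match closures.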

There are four drawbacks. The first, and probably biggest, is that it follows my own coding style, which is rather different from Scintilla's. Either we'd have to live with the inconsistency, or someone would have to go through and reformat the code. The latter would be ideal, obviously, but it's a fairly big job (~3500 lines) that I'm afraid I can't volunteer for.

The second snag is related: my code hasn't been run through the Borland compiler, so there could potentially be a raft of warnings to fix. It's unlikely to be all that bad, as this code has been ported to numerous systems over several years, including Unix/gcc (where I believe it compiles warning-free); but it's pretty much a foregone conclusion that a new compiler will find something to complain about.

The third drawback is relatively minor: my RE syntax has a few small differences from the canonical Unix-style RE syntax - e.g., the quoting character is "%" rather than "\". It would probably be desirable to change that, and it isn't a big job.
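
To make the quoting difference concrete, here is a rough sketch of a pre-pass that rewrites "%"-quoted patterns into "\"-quoted ones. This is only an illustration, not how I'd actually make the change (that would be done in the parser itself). The function name percent_to_backslash is hypothetical, and the sketch assumes that every "%x" sequence has the same meaning as "\x" and that a bare backslash is an ordinary character in the "%" syntax; neither assumption necessarily holds for every sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites a pattern whose quoting character is '%'
// into one whose quoting character is '\'.
// Assumptions (not verified against any real engine):
//   - "%x" means the same thing as "\x" for every character x except '%'
//   - "%%" stands for a literal percent sign
//   - a bare '\' is an ordinary character in the '%' syntax
std::string percent_to_backslash(const std::string& pattern) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '%' && i + 1 < pattern.size()) {
            const char next = pattern[++i];   // consume the quoted character
            if (next == '%') {
                out += '%';                   // '%' needs no quoting in '\' syntax
            } else {
                out += '\\';
                out += next;
            }
        } else if (c == '\\') {
            out += "\\\\";                    // literal backslash must now be quoted
        } else {
            out += c;                         // includes a lone trailing '%'
        }
    }
    return out;
}
```

For example, the pattern a%.b would come out as a\.b, and 100%% would come out as 100%.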

The fourth problem is that the code only handles SBCS. My original version actually does all its work in UTF-8 (a multibyte Unicode encoding), so the infrastructure is there for MBCS handling - but for the Scintilla conversion I only accounted for single-byte characters. For proper MBCS support, it would be necessary to retrofit the code to use whatever Scintilla's standard mechanism is. This wouldn't be too hard, as all string access is already encapsulated as a class; but it's obviously work, and as with the reformatting I probably wouldn't be able to volunteer.
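
As a rough picture of why the encapsulation helps, here is a sketch of the general idea: the matcher only ever asks a reader object for "the character at this position and how many bytes it takes", so switching encodings means switching the reader. This is not my actual class and not Scintilla's mechanism; the names CharReader, SbcsReader, Utf8Reader and count_chars are invented for the example, and the UTF-8 decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical interface: the matcher reads text only through this.
class CharReader {
public:
    virtual ~CharReader() = default;
    // Decodes the character starting at byte offset `pos` into `ch` and
    // returns the number of bytes it occupies (0 at end of text).
    virtual std::size_t read(std::size_t pos, char32_t& ch) const = 0;
};

// Single-byte character set: one byte is one character.
class SbcsReader : public CharReader {
public:
    explicit SbcsReader(std::string text) : text_(std::move(text)) {}
    std::size_t read(std::size_t pos, char32_t& ch) const override {
        if (pos >= text_.size()) return 0;
        ch = static_cast<unsigned char>(text_[pos]);
        return 1;
    }
private:
    std::string text_;
};

// UTF-8: one character is 1 to 4 bytes. Assumes well-formed input; a stray
// or truncated lead byte is simply returned as a one-byte character.
class Utf8Reader : public CharReader {
public:
    explicit Utf8Reader(std::string text) : text_(std::move(text)) {}
    std::size_t read(std::size_t pos, char32_t& ch) const override {
        if (pos >= text_.size()) return 0;
        const unsigned char b = static_cast<unsigned char>(text_[pos]);
        std::size_t len = 1;
        if ((b >> 5) == 0x06) len = 2;        // 110xxxxx
        else if ((b >> 4) == 0x0E) len = 3;   // 1110xxxx
        else if ((b >> 3) == 0x1E) len = 4;   // 11110xxx
        if (pos + len > text_.size()) len = 1;
        if (len == 1) { ch = b; return 1; }
        ch = b & (0xFF >> (len + 1));         // payload bits of the lead byte
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(text_[pos + k]);
            ch = (ch << 6) | (cont & 0x3F);   // 6 payload bits per trailing byte
        }
        return len;
    }
private:
    std::string text_;
};

// Example client: counts characters without knowing the encoding.
std::size_t count_chars(const CharReader& reader) {
    std::size_t count = 0, pos = 0;
    char32_t ch = 0;
    while (std::size_t len = reader.read(pos, ch)) {
        pos += len;
        ++count;
    }
    return count;
}
```

Given the three bytes 'a', 0xC3, 0xA9, count_chars sees three characters through SbcsReader but two through Utf8Reader (the second being U+00E9), with no change to the client code.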

My acquaintance with Scintilla is a relatively recent development, so for all I know this might be an unpopular suggestion. If so, I'll happily withdraw it - I don't mean it as a complaint about the current RE engine. But if this seems like a good idea, I'd be happy to contribute this code, even if it makes a rather dubious gift given the to-do list attached to it.

--Mike

_______________________________________________
Scintilla-interest mailing list
[email protected]
http://mailman.lyra.org/mailman/listinfo/scintilla-interest