[scintilla] expanded regular expression matcher

Mike Roberts Wed, 21 Feb 2007 19:14:24 -0800

What would the interest level be in an expanded regular expression parser/matcher?

I've been integrating Scintilla into a project of mine that has its own regular expression engine. Since my RE engine is visible to my users in other contexts, I felt it would be best to use it as the RE engine for my Scintilla integration as well - otherwise my users would have to remember two RE syntaxes, and when to use which. So, I did a little massaging of my engine's interface, and now it can be used as a drop-in replacement for Scintilla's current RE engine.

If there's interest, I'd be happy to contribute my engine to Scintilla. It's entirely my own work, so there are no copyright entanglements, and it doesn't depend on any third-party code. It's fairly full-featured - beyond the basics that are already in the Scintilla matcher, it handles alternation (|), character classes (<alpha>, <^digit>, <alpha|digit>, etc.), intervals (x{3,5}), group recall, non-capturing groups, shortest- and longest-match closures, and look-ahead assertions (positive and negative). Documentation is at http://www.tads.org/t3doc/doc/sysman/regex.htm.

There are four drawbacks. The first, and probably biggest, is that it follows my own coding style, which is rather different from Scintilla's. Either we'd have to live with the inconsistency, or someone would have to go through and reformat the code. The latter would be ideal, obviously, but it's a fairly big job (~3500 lines) that I'm afraid I can't volunteer for.

The second snag is related: my code hasn't been run through the Borland compiler, so there could potentially be a raft of warnings to fix. It's unlikely it'll be all that bad, as this code has been ported for several years to numerous systems, including Unix/gcc (where I believe it compiles warning-free); but it's pretty much a foregone conclusion that a new compiler will find something to complain about.

The third drawback is relatively minor: my RE syntax has a few small differences from the canonical Unix-style RE syntax - e.g., the quoting character is "%" rather than "\". It would probably be desirable to fix that; this isn't a big job.
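To give a rough idea of the size of the quoting-character change alone, here is a minimal sketch of a pattern rewriter. The function name is hypothetical and is not part of Scintilla or of the engine being offered. The sketch also assumes that "%%" stands for a literal percent sign, and it does not touch any of the other syntax differences.

```cpp
#include <string>

// Hypothetical helper (illustration only): rewrites a pattern that uses '%'
// as its quoting character into one that uses '\' instead.
// Assumption: "%%" means a literal percent sign in the '%'-quoted syntax.
std::string PercentToBackslashQuoting(const std::string &pattern) {
    std::string out;
    out.reserve(pattern.size() + 8);
    for (std::string::size_type i = 0; i < pattern.size(); ++i) {
        const char ch = pattern[i];
        if (ch == '%' && i + 1 < pattern.size()) {
            const char next = pattern[++i];
            if (next == '%') {
                // '%' is an ordinary character once '\' does the quoting.
                out += '%';
            } else {
                // "%x" becomes "\x".
                out += '\\';
                out += next;
            }
        } else if (ch == '\\') {
            // A backslash was ordinary before, so it must now be quoted.
            out += "\\\\";
        } else {
            // Everything else, including a lone trailing '%', is copied as-is.
            out += ch;
        }
    }
    return out;
}
```

With this sketch, a pattern written as a%.b would come out as a\.b, and a literal backslash in the input would come out doubled.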

The fourth problem is that the code only handles SBCS. My original version actually does all its work in UTF-8 (a multibyte Unicode encoding), so the infrastructure is there for MBCS handling - but for the Scintilla conversion I only accounted for single-byte characters. For proper MBCS support, it would be necessary to retrofit whatever Scintilla's standard mechanism is. This wouldn't be too hard, as all string access is already encapsulated as a class; but it's obviously work, and as with the reformatting I probably wouldn't be able to volunteer.
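As an illustration of why encapsulated string access makes the MBCS retrofit tractable, here is a minimal sketch of the general idea. All class and function names are hypothetical; this is neither the engine's actual class nor Scintilla's mechanism. The UTF-8 reader does no validation of malformed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical interface: a matcher would read its text only through this.
class CharReader {
public:
    virtual ~CharReader() {}
    // Decodes the character starting at byte offset pos into ch and returns
    // the number of bytes it occupies, or 0 at the end of the text.
    virtual std::size_t Read(std::size_t pos, unsigned long &ch) const = 0;
};

// Single-byte character sets: one byte is one character.
class SingleByteReader : public CharReader {
public:
    explicit SingleByteReader(const std::string &text) : text_(text) {}
    std::size_t Read(std::size_t pos, unsigned long &ch) const {
        if (pos >= text_.size()) return 0;
        ch = static_cast<unsigned char>(text_[pos]);
        return 1;
    }
private:
    std::string text_;
};

// UTF-8: a character occupies one to four bytes (no validation here).
class Utf8Reader : public CharReader {
public:
    explicit Utf8Reader(const std::string &text) : text_(text) {}
    std::size_t Read(std::size_t pos, unsigned long &ch) const {
        if (pos >= text_.size()) return 0;
        const unsigned char lead = static_cast<unsigned char>(text_[pos]);
        std::size_t len = 1;
        unsigned long value = lead;
        if (lead >= 0xF0) { len = 4; value = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; value = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; value = lead & 0x1F; }
        if (pos + len > text_.size()) {
            // Truncated sequence: fall back to treating the byte on its own.
            ch = lead;
            return 1;
        }
        for (std::size_t i = 1; i < len; ++i)
            value = (value << 6) | (static_cast<unsigned char>(text_[pos + i]) & 0x3F);
        ch = value;
        return len;
    }
private:
    std::string text_;
};

// Code written against CharReader works unchanged for either encoding.
std::size_t CountChars(const CharReader &reader) {
    std::size_t pos = 0;
    std::size_t count = 0;
    unsigned long ch = 0;
    while (std::size_t n = reader.Read(pos, ch)) {
        pos += n;
        ++count;
    }
    return count;
}
```

The point of the design is that only the reader class knows how many bytes a character occupies, so supporting another encoding means adding a reader rather than editing the matcher.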

My acquaintance with Scintilla is a relatively recent development, so for all I know this might be an unpopular suggestion. If so, I'll happily withdraw it - I don't mean it as a complaint about the current RE engine. But if this seems like a good idea, I'd be happy to contribute this code, even if it makes a rather dubious gift given the to-do list attached to it.


--Mike



_______________________________________________
Scintilla-interest mailing list
[email protected]
http://mailman.lyra.org/mailman/listinfo/scintilla-interest
