Re: Improving std.regex(p)

Andrei Alexandrescu Fri, 18 Jun 2010 08:45:13 -0700

Ellery Newcomer wrote:

On 06/18/2010 06:36 AM, Ben Hanson wrote:
Reading your original post again and thinking about this a bit more...
If someone can help me get up to speed with TMP in D, I could probably put together a proof of concept pretty quickly. Aside from the D syntax, it is all in the design (which is why I would like to discuss it in more detail). For anyone who is interested, here is how I go about DFA production currently:
- Bog standard tokenisation of regex tokens
- Normalise each regex charset and enumerate in a map
- Build a regex syntax tree with leaf nodes storing the enumerated charset id
- Intersect all the charsets and build a vector of equivalent charsets
- Perform the followset to DFA transform as described in the Dragon book
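
To make the charset step above concrete, here is a minimal Python sketch of one way to turn overlapping charsets into disjoint equivalence classes. It is an illustration under assumptions, not the lexertl implementation: the function name `partition_charsets` and the sample charsets are hypothetical, and "equivalent" is taken to mean "belongs to exactly the same set of input charsets".

```python
from collections import defaultdict


def partition_charsets(charsets):
    """Split overlapping charsets into disjoint equivalence classes.

    Two characters land in the same class when they belong to exactly
    the same input charsets, so no regex in the set can tell them apart.
    """
    # Map each character to the ids of the charsets that contain it.
    membership = defaultdict(set)
    for cs_id, cs in enumerate(charsets):
        for ch in cs:
            membership[ch].add(cs_id)

    # Group characters that share an identical membership signature.
    classes = defaultdict(set)
    for ch, ids in membership.items():
        classes[frozenset(ids)].add(ch)

    # Sort by smallest member so the class numbering is deterministic.
    return sorted((frozenset(chars) for chars in classes.values()), key=min)


# Hypothetical sample input: three overlapping charsets.
charsets = [set("abcdef"), set("def012"), set("0123456789")]
for class_id, cls in enumerate(partition_charsets(charsets)):
    print(class_id, "".join(sorted(cls)))
```

Each resulting class can then stand in for all of its members when the DFA transitions are built, which keeps the transition table narrow.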

For my lexer generator, I build a lookup table for each character to an
equivalence class. For my wild card matcher, I leave the representation as
characters and do a linear lookup of chars to 'next' state. The lexer
equivalence class method (a la flex) is more efficient as you just have two
lookups per character with no linear searches.
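
The two-lookup scheme can be sketched roughly as follows. This is a hand-built Python illustration, not generated output from any tool: the pattern (one or more lowercase letters followed by optional digits), the table names, and the assumption of 8-bit input are all made up for the example.

```python
DEAD = -1  # sentinel for "no valid transition"

# Lookup 1: character code -> equivalence class.
# Classes (hypothetical): 0 = anything else, 1 = lowercase letter, 2 = digit.
char_class = [0] * 256
for code in range(ord("a"), ord("z") + 1):
    char_class[code] = 1
for code in range(ord("0"), ord("9") + 1):
    char_class[code] = 2

# Lookup 2: transitions[state][class] -> next state.
transitions = [
    [DEAD, 1, DEAD],  # state 0: start, needs a letter
    [DEAD, 1, 2],     # state 1: inside the letter run
    [DEAD, DEAD, 2],  # state 2: inside the digit run
]
accepting = {1, 2}


def matches(text):
    """Run the DFA over text: two table lookups per character, no searching."""
    state = 0
    for ch in text:
        code = ord(ch)
        cls = char_class[code] if code < 256 else 0  # out-of-range -> "other"
        state = transitions[state][cls]
        if state == DEAD:
            return False
    return state in accepting


print(matches("abc123"), matches("ab1c"))
```

The per-state rows have one column per equivalence class rather than one per character, which is where the space saving over a raw 256-column table comes from.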
I'm thinking I could look at getting as far as the regex syntax tree and then we could discuss DFA representation (it sounds like you want to generate code directly - how do you do that?!). This is why I mentioned the .NET bytecode stuff - I imagine there is some overlap there, but I don't know enough to be sure. Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP?

And I forgot to mention - boost.spirit uses my lexertl library.

Regards,

Ben
I've always understood that dfas have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a dfa generator (or written one, sadly enough). What has your experience been on dfa compile time?

I think only backtracking engines (not DFAs) have the exponential run time problem. Regular expression grammar is itself context free (so it's parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, and the generated DFA makes one transition per input character.
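
For reference, the NFA->DFA step can be sketched with the textbook subset construction. The Python below is a minimal illustration, not code from std.regex or lexertl: the NFA encoding (a dict of state -> symbol -> set of states, with `None` standing for an epsilon move) and the sample NFA for `(a|b)*abb` are assumptions chosen for the example.

```python
def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` through epsilon (None) moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(None, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = epsilon_closure(nfa, {start})
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            moved = {t for s in current for t in nfa.get(s, {}).get(sym, ())}
            if moved:
                target = epsilon_closure(nfa, moved)
                dfa[current][sym] = target
                todo.append(target)
    return start_set, dfa


def dfa_accepts(dfa, start_set, nfa_accepting, text):
    """One transition per input character, no backtracking."""
    state = start_set
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return bool(state & nfa_accepting)


# Hypothetical NFA for (a|b)*abb; state 3 is accepting.
nfa = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}
start_set, dfa = nfa_to_dfa(nfa, 0, "ab")
print(len(dfa), dfa_accepts(dfa, start_set, {3}, "aabb"))
```

One caveat worth keeping in mind: the number of subset states is not bounded polynomially in general (an n-state NFA can need up to 2^n DFA states in the worst case), so the build step is the part to watch. Matching stays at one transition per character regardless.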



Andrei
