Matt Oliveri's question prompted a useful thought while I was sitting in a waiting room this evening. Given that I need a lexical analyzer, why do I need to build yet another RE engine? Why isn't RE2, or PCRE, or some such, sufficient?
One reason, which is incidental to the compiler, is that I want to know how this stuff works. Eventually we're going to want a native implementation. Also: this is a type of coding that challenges safe programming languages, and it's useful to get my head around it. So that's a reason that comes under the heading of "shap needs to learn this example." But that's something I could defer; the existing implementations obviously work, and modulo some cleanup of the specifications (e.g. I found a bug in pcregrep earlier today) there isn't anything new to do here.

The main reason is that I think I really need a token ID. PCRE/RE2 get close, in the sense that you can store named sub-patterns, but then you have to iterate through the sub-patterns to see which one matched in order to determine what token you actually have. Perhaps I'm missing something, and there is a way to achieve what I want within the feature set of PCRE/RE2. If so, *please* enlighten me so I can set this problem aside.

Failing that, I *think* what I want is a mechanism that says "set the token ID to *n* when the following sub-pattern matches". In the context of PCRE this could be done in "first visited" form. In the context of classical REs it could be done in "leftmost wins" form. As I say, I'm not sure whether PCRE syntax needs to be extended for this, but if it does, it seems to me that a backwards-compatible extension syntax is feasible.

Before I go haring off into implementation: is there already a way to do this within the constraints of PCRE regular expressions, or is an extension needed?

shap
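For concreteness, here is a sketch of the named-subpattern lexing idiom I'm describing, written against Python's re module rather than PCRE/RE2 (Python happens to report the name of the winning alternative directly via match.lastgroup, which is roughly the "token ID" behavior I want; whether PCRE or RE2 expose an equivalent is exactly the open question). The token names and grammar here are made-up illustrations:

```python
import re

# One combined pattern: each token rule is a named alternative.
# Alternation order gives "first visited" / "leftmost wins" priority.
TOKEN_RE = re.compile(r"""
      (?P<NUMBER>\d+)
    | (?P<IDENT>[A-Za-z_]\w*)
    | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("no token at position %d" % pos)
        # m.lastgroup names the alternative that matched, so there is
        # no need to iterate over the sub-patterns to identify the token.
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("foo 42"))  # [('IDENT', 'foo'), ('NUMBER', '42')]
```

In PCRE/RE2 terms, the question is whether anything plays the role of lastgroup, or whether one is stuck walking the capture vector to find the sub-pattern that matched.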
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
