Matt Oliveri's question prompted a useful thought while I was sitting in a
waiting room this evening. Given that I need a lexical analyzer, why do I
need to build yet another RE engine? Why isn't RE2, or PCRE, or some such,
sufficient?

One reason, which is incidental to the compiler, is that I want to know how
this stuff works. Eventually we're going to want a native implementation.
Also: this is a type of coding that challenges safe programming languages,
and it's useful to get my head around it. So that's a reason that comes
under the heading of "shap needs to learn this example." But that's
something I could defer; the existing implementations obviously work, and,
modulo some cleanup of the specifications (e.g., I found a bug in pcregrep
earlier today), there isn't anything new to do here.

The main reason is that I think I really need a token ID. PCRE/RE2 get
close, in the sense that you can define named sub-patterns, but then you
have to iterate through them and see which sub-pattern matched in order to
determine what token you actually have. Perhaps I'm missing something, and there is a way
to achieve what I want within the feature set of PCRE/RE2. If so,
*please* enlighten
me so I can set this problem aside.
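
To make concrete the workaround I'm describing, here is roughly the shape
of it, sketched in Python only because it is compact. The token names and
patterns are placeholders, not our actual token set:

    import re

    # Placeholder token rules, purely for illustration.
    TOKEN_PATTERNS = [
        ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
        ("NUMBER", r"[0-9]+"),
        ("LPAREN", r"\("),
        ("RPAREN", r"\)"),
        ("SPACE",  r"[ \t\n]+"),
    ]

    # One big alternation with a named sub-pattern per token kind.
    MASTER = re.compile("|".join(
        "(?P<%s>%s)" % (name, pat) for name, pat in TOKEN_PATTERNS))

    def next_token(text, pos):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError("no token at offset %d" % pos)
        # This is the step I want the engine to do for me: nothing in the
        # match result is a token ID, so we walk the named groups to find
        # out which alternative actually fired.
        for name, value in m.groupdict().items():
            if value is not None:
                return name, value, m.end()

(Python's re module happens to expose m.lastgroup, which hides that loop,
but the engine still hands back a group name to be mapped, not a numeric
token ID.)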

Failing that, I *think* what I want is a mechanism that says "set token ID
to *n* when the following subpattern matches". In the context of PCRE this
could be done in "first visited" form. In the context of classical REs this
could be done in "leftmost wins" form. As I say, I'm not sure if there is a
need to extend PCRE syntax for this, but if there is, it seems to me that a
backwards-compatible extension syntax is feasible.
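
And here is the flavor of "leftmost wins" I have in mind for the classical
case, assuming longest-match lexing with rule order as the tie-break (again
Python, again placeholder rules; a rule's position in the list is its token
ID):

    import re

    # Placeholder rules in priority order; a rule's position is its token ID.
    RULES = [
        (0, "IF",     r"if"),
        (1, "IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
        (2, "NUMBER", r"[0-9]+"),
        (3, "SPACE",  r"[ \t\n]+"),
    ]
    COMPILED = [(tid, name, re.compile(pat)) for tid, name, pat in RULES]

    def next_token(text, pos):
        # Longest match wins; ties go to the leftmost (lowest-ID) rule.
        best = None
        for tid, name, rx in COMPILED:
            m = rx.match(text, pos)
            if m is None:
                continue
            rank = (len(m.group()), -tid)
            if best is None or rank > best[0]:
                best = (rank, tid, name, m.group())
        if best is None:
            raise SyntaxError("no token at offset %d" % pos)
        _, tid, name, lexeme = best
        return tid, name, lexeme, pos + len(lexeme)

So lexing "if" yields token ID 0 rather than IDENT, because rule 0 is the
leftmost of the subpatterns that match. A real implementation would fold
the rules into a single DFA whose accepting states carry the minimum token
ID of the subpatterns they accept; the loop above is only there to pin down
the tie-breaking rule.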

Before I go haring off into implementation, is there already a way to do
this within the constraints of PCRE regular expressions, or is an extension
needed?


shap
