It must seem by now as if I dropped this thread. I've been coding, but I've also been doing some renovation on my house (new electrical circuits), so working on this has been stop-and-start. I also got hung up on some premature optimization, which I'll describe in a moment. Two responses to Florian first.
On Mon, Jan 20, 2014 at 1:45 PM, Florian Weimer <[email protected]> wrote:

> If you have captures, you can tell which part matched.
>
> <http://lists.alioth.debian.org/pipermail/secure-testing-commits/2010-May/015549.html>

Yes. But if you have 100 keywords to match, that doesn't work out so well.

Also, it's not as simple as it looks. In an RE like:

    ((do)|(while))

the "do" will end up being capture #2, but depending on the engine, the "while" may *also* end up being capture #2. It depends on whether the numbering scheme for captures is based on the order of appearance in the RE or the order of match. What is actually needed here is named captures. But in that case you end up running the RE and then iterating over the named captures to find out which one is non-empty, which is something that the final state of the RE should have already been able to tell you. Captures aren't the right way to do this.

> Whether this is deterministic enough for your purposes depends on the
> implementation.

The only deterministic RE engine I know of that implements captures is the RE2 engine. The perl RE engine (and friends) also do back-references. Those implementations are non-deterministic, but more importantly they are exponential in the size of the input in the worst case. Google "RE2 engine" and read the sequence of four excellent notes that Russ Cox wrote on the implementation of RE2. They are a good read on algorithms.

> On the other hand, using regular expressions for keyword matching is
> rather expensive because it increases the size of the representation
> of the regular expression.

That's a consequence of a particular implementation; it isn't an inherent requirement. There are a variety of ways to encode a DFA in tables. Historically, they have been done as an array of pairs of the form (match-char, new-state), in which a positive new-state indicates a valid transition. The *absence* of a pair indicates the absence of a transition.
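As a minimal sketch (not from any particular lexer generator; the names and the toy "ab" machine are illustrative only), the pair-based encoding looks something like this, with each state owning a slice of one shared array of (match-char, new-state) pairs:

```c
#include <stddef.h>

/* One (match-char, new-state) pair. Absence of a pair for a given
 * input byte means there is no transition from that state. */
struct transition {
    unsigned char match;   /* input byte to match */
    int           next;    /* next state; >= 0 means a valid transition */
};

/* Each state indexes a slice of the shared transition array. */
struct state {
    size_t first;          /* index of this state's first pair */
    size_t count;          /* number of pairs owned by this state */
};

/* Toy DFA matching "ab": state 0 --a--> 1 --b--> 2 (accepting). */
static const struct transition trans[] = {
    { 'a', 1 },            /* state 0's only pair */
    { 'b', 2 },            /* state 1's only pair */
};
static const struct state states[] = {
    { 0, 1 },              /* state 0 */
    { 1, 1 },              /* state 1 */
    { 2, 0 },              /* state 2: no outgoing transitions */
};

/* Take one step; -1 signals "no transition" (i.e., the pair is absent). */
static int step(int s, unsigned char c) {
    const struct state *st = &states[s];
    for (size_t i = 0; i < st->count; i++)
        if (trans[st->first + i].match == c)
            return trans[st->first + i].next;
    return -1;
}
```

The space cost Florian alludes to is visible here: every keyword character needs its own pair, and alignment padding inflates each pair well past the two bytes it logically needs.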
There are several variations on how to encode these tables. The encoding that Mike Lesk used in the original implementation of Lex was fairly tight. Even so, as Florian notes, this style of encoding is pretty inefficient. It wasn't really tolerable even for single-byte character sets (mainly because alignment issues made that 'char' tend to expand). It's completely unthinkable for Unicode.

But this isn't the only possible encoding, and it certainly isn't the best possible encoding. A much better encoding is to simply emit a monster loop with a switch statement having one "case" per state, and then actually emit C code to implement the transition tests. This is better for several reasons:

1. The state gets encoded implicitly in the PC, so it requires no space.

2. An appropriate search algorithm can be chosen per state. For large transition sets you want to do binary search, but for small ones that isn't efficient.

3. Dense ranges can be checked efficiently (e.g. a-z in identifiers).

In real-world cases, the main difficulty with this approach is that keywords multiply the number of states significantly. In particular, needing to match both "[A-Za-z_][A-Za-z0-9_]*" and "while" adds five states for the "while" that are completely unnecessary. For a completely general-purpose RE engine you need the ability to do this, but for a lexer this isn't the implementation you want.

The RE match performed by a lexer is unusual. You basically have two cases that you are matching: regular expressions, and fixed strings followed by separators (the keywords). Depending on the language, the strings may be case-insensitive. In all cases, matching the string takes precedence over any RE. That is: "do" is a keyword rather than an identifier. But you know from the input specification which patterns are fixed strings. As you build the RE matcher, you'll discover that a lot of the possible "final" states have two matches: one is an RE and the other is a (higher-priority) string.
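To make the switch-based encoding concrete, here is a hand-written sketch of what generated code for "[A-Za-z_][A-Za-z0-9_]*" might look like (the function and token names are invented for illustration). The state number lives only in the program counter, and the dense a-z ranges become direct comparisons:

```c
/* Hypothetical generated recognizer for [A-Za-z_][A-Za-z0-9_]*.
 * On success, advances *pp past the identifier. */
enum token { TOK_IDENT, TOK_ERROR };

static enum token lex_ident(const char **pp) {
    const char *p = *pp;
    int state = 0;
    for (;;) {
        unsigned char c = (unsigned char)*p;
        switch (state) {
        case 0: /* start state: expect [A-Za-z_]; dense range checks */
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                c == '_') {
                p++; state = 1; continue;
            }
            return TOK_ERROR;
        case 1: /* accepting state: consume [A-Za-z0-9_]* */
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                (c >= '0' && c <= '9') || c == '_') {
                p++; continue;
            }
            *pp = p;   /* stop at the first non-identifier byte */
            return TOK_IDENT;
        }
    }
}
```

Notice that there is no table at all: each "case" encodes its own transitions, so a generator is free to emit range tests for dense transitions and a binary search for sparse, large ones, on a state-by-state basis.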
The thing to do in this case is to build the RE matcher twice. The first pass is for the purpose of identifying those overlaps. At that point you examine every final state that matches a string. If every one of those matches overlaps a non-string RE, you *remove* the string RE from the specification and add it to a (separate) string table. Continue this until you have removed every string RE that you can. Now re-generate the matching engine, **omitting** the overlapping string states, and emit code that implements a post-filter checking for explicit string matches using binary search.

The end result of this process, when emitted as code, certainly will not be very pretty, but it will be very nearly as efficient as a hand-coded lexer.

Oh. And if you are emitting for GCC, you can take advantage of case labels as jump targets and make it even tighter.

Jonathan
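The post-filter step above amounts to a binary search over the extracted string table. A minimal sketch, assuming the keywords were collected into a sorted array during the first pass (the table contents here are just a sample):

```c
#include <stdlib.h>
#include <string.h>

/* String table built during the first generation pass, after the
 * overlapping keyword REs were removed from the specification.
 * Must be kept sorted, since we binary-search it with bsearch(). */
static const char *const keywords[] = {
    "do", "else", "if", "while",
};

static int cmp_kw(const void *key, const void *elem) {
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Post-filter: called on every identifier the DFA accepts.
 * Returns nonzero if the lexeme is actually a keyword. */
static int is_keyword(const char *lexeme) {
    return bsearch(lexeme, keywords,
                   sizeof keywords / sizeof keywords[0],
                   sizeof keywords[0], cmp_kw) != NULL;
}
```

The emitted lexer would run the (now smaller) identifier DFA, then call something like is_keyword() on the accepted lexeme to decide whether to reclassify it, which is exactly the "string takes precedence over RE" rule. For a case-insensitive language, the comparison function would use a case-folding compare instead of strcmp.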
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
