It must seem by now as if I dropped this thread. I've been coding, but I've
also been doing some renovation on my house (new electrical circuits), so
working on this has been stop-and-start. I also got hung up in some
premature optimization, which I'll describe in a moment. Two responses to
Florian first.

On Mon, Jan 20, 2014 at 1:45 PM, Florian Weimer <[email protected]> wrote:

> If you have captures, you can tell which part matched.
>
> <http://lists.alioth.debian.org/pipermail/secure-testing-commits/2010-May/015549.html>
>

Yes. But if you have 100 keywords to match, that doesn't work out so well.
Also, it's not as simple as it looks. In an RE like:

  ((do)|(while))

the "do" will end up being capture #2, but depending on the engine, the
"while" may *also* end up being capture #2. It depends on whether the
numbering scheme for captures is based on the order of appearance in the RE
or the order of match. What is actually needed here is named captures. But
in that case you end up running the RE and then iterating over the named
captures to find out which one is non-empty, which is something the final
state of the RE should already have been able to tell you. Captures
aren't the right way to do this.
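
To make that concrete, here's a minimal POSIX regcomp/regexec sketch (my
illustration, not taken from any particular lexer). POSIX numbers groups by
the order in which their "(" appears, and you find out which alternative
matched by scanning pmatch[] for the slot whose offsets aren't -1:

  #include <regex.h>
  #include <stdio.h>

  int main(void) {
    regex_t re;
    regmatch_t pmatch[4];   /* slot 0: whole match; 1..3: the groups */

    if (regcomp(&re, "((do)|(while))", REG_EXTENDED) != 0)
      return 1;
    if (regexec(&re, "while", 4, pmatch, 0) == 0) {
      /* Group 2 is "do", group 3 is "while". Finding the winner
         means scanning for the slot that participated. */
      for (int i = 2; i <= 3; i++)
        if (pmatch[i].rm_so != -1)
          printf("group %d matched\n", i);
    }
    regfree(&re);
    return 0;
  }

Scale that pattern up to 100 keywords and both the pattern and the scan get
unwieldy.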


> Whether this is deterministic enough for your purposes depends on the
> implementation.
>

The only deterministic RE engine I know about that implements captures is
the RE2 engine. The Perl RE engine (and friends) also does back-references.
Those implementations are non-deterministic, but more importantly they are
worst-case exponential in the size of the input. Google "RE2" and read the
series of four excellent articles that Russ Cox wrote on the implementation
of RE2. They are a good read on algorithms.


> On the other hand, using regular expressions for keyword matching is
> rather expensive because it increases the size of the representation
> of the regular expression.


That's a consequence of a particular implementation. It isn't an inherent
requirement.

There are a variety of ways to encode a DFA in tables. Historically, they
have been done as an array of pairs of the form (match-char, new-state) in
which positive new-state indicates a valid transition. The *absence* of a
pair indicates the absence of a transition. There are several variations on
how to encode these tables. The encoding that Mike Lesk used in the
original implementation of Lex was fairly tight.
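
As a rough sketch (the names and states here are mine, not Lesk's actual
layout), the pair encoding and its lookup loop come down to:

  /* Each state owns a run of (match_char, next_state) pairs; the
     absence of a pair means the absence of a transition. */
  struct trans { unsigned char match_char; short next_state; };

  static const struct trans pairs[] = {
    /* state 0 (start):   */ { 'd', 1 }, { 'w', 3 },
    /* state 1 (saw "d"): */ { 'o', 2 },
    /* state 2 (saw "do") has no out-edges in this toy example;
       state 3 would continue with 'h' toward "while". */
    /* state 3 (saw "w"): */ { 'h', 4 },
  };

  /* first[s] .. first[s+1] delimit state s's pairs. */
  static const short first[] = { 0, 2, 3, 3, 4 };

  static short step(short state, unsigned char c) {
    for (short i = first[state]; i < first[state + 1]; i++)
      if (pairs[i].match_char == c)
        return pairs[i].next_state;
    return -1;   /* no transition */
  }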

As Florian notes, this encoding is pretty space-inefficient. It wasn't really
tolerable even for single-byte character sets (mainly because alignment
issues made that 'char' tend to expand). It's completely unthinkable for
Unicode.

But this isn't the only possible encoding, and it certainly isn't the best
possible encoding. A much better encoding is to simply emit a monster loop
with a switch statement having one "case" per state, and then actually emit
C code to implement the tests. This is better for several reasons (a sketch
follows the list):

   1. The state gets encoded implicitly in the PC, so it requires no space.
   2. An appropriate search algorithm can be chosen. For large transition
   sets you want to do binary search, but for small ones that's not efficient.
   3. Dense ranges can be checked efficiently (e.g. a-z in identifiers).
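
Here is a sketch of what the emitted code might look like for a toy DFA
matching "do" versus [a-z]+ identifiers (the names and token codes are
mine). Whether the generator spells the states as cases of a switch or, as
here, as labeled blocks with direct gotos, the effect is the same: the
current state is carried by the program counter alone, and a dense range
like a-z comes down to two compares:

  enum token { TOK_ERROR, TOK_IDENT, TOK_DO };

  enum token lex(const char *p) {
    int c;
  s0:                                  /* initial state */
    c = (unsigned char)*p++;
    if (c == 'd') goto s_d;
    if (c >= 'a' && c <= 'z') goto s_ident;
    return TOK_ERROR;
  s_d:                                 /* saw "d" */
    c = (unsigned char)*p++;
    if (c == 'o') goto s_do;
    if (c >= 'a' && c <= 'z') goto s_ident;
    return TOK_IDENT;                  /* "d" alone is an identifier */
  s_do:                                /* saw "do" */
    c = (unsigned char)*p++;
    if (c >= 'a' && c <= 'z') goto s_ident;   /* "dox..." is an identifier */
    return TOK_DO;
  s_ident:                             /* inside an identifier */
    c = (unsigned char)*p++;
    if (c >= 'a' && c <= 'z') goto s_ident;
    return TOK_IDENT;
  }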

In real-world cases, the main difficulty with this approach is that
keywords multiply the number of states significantly. In particular,
needing to match both "[A-Za-z_][A-Za-z0-9_]*" and "while" adds five
states for the "while" that are otherwise completely unnecessary. For a
completely general-purpose RE engine you need the ability to do this, but
for a lexer this isn't the implementation you want.

The RE match performed by a lexer is unusual. You basically have two cases
that you are matching: regular expressions and fixed strings followed by
separators (the keywords). Depending on the language, the strings may be
case-insensitive. In all cases, matching the string takes precedence over
any RE. That is: "do" is a keyword rather than an identifier. But you know
from the input specification which ones are strings. As you build the RE
matcher, you'll discover that a lot of the possible "final" states have two
matches: one is an RE and the other is a (higher priority) string.

The thing to do in this case is to build the RE matcher twice. The first
pass exists purely to identify those overlaps. Examine every final state
that matches a string: if every final state matching a given string also
matches a non-string RE, you *remove* that string from the RE
specification and add it to a (separate) string table. Continue this until
you have removed every string RE that you can. Now re-generate the matching
engine **omitting** the overlapping string states, and emit code that
implements a post-filter checking for explicit string matches using binary
search.
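
A minimal sketch of that post-filter, assuming the generator dumps the
removed keywords into a sorted C array (the table contents and names here
are mine):

  #include <stdlib.h>
  #include <string.h>

  /* Keywords pulled out of the RE spec, kept sorted so bsearch()
     works. */
  static const char *const keywords[] = { "do", "else", "if", "while" };

  static int cmp_keyword(const void *key, const void *elem) {
    return strcmp((const char *)key, *(const char *const *)elem);
  }

  /* Called after the identifier RE accepts, to decide whether the
     lexeme is really a keyword.  For a case-insensitive language,
     swap in strcasecmp() (POSIX) and store the keywords in one case. */
  int is_keyword(const char *lexeme) {
    size_t n = sizeof keywords / sizeof keywords[0];
    return bsearch(lexeme, keywords, n, sizeof keywords[0],
                   cmp_keyword) != NULL;
  }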

The end result of this process, when emitted as code, certainly will not be
very pretty, but it will be very nearly as efficient as a hand-coded lexer.

Oh. And if you are emitting for GCC you can take advantage of labels as
values (computed goto) to use label addresses as jump targets and make it
even tighter.
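
A tiny sketch of that extension (GNU C, not ISO C; this [a-z]+ scanner is
purely illustrative): &&label yields a label's address, and goto* jumps to
it, so state dispatch becomes one indexed jump instead of a switch.

  int match_word(const char *p) {
    int len = 0;
    /* Label addresses stored in a table, indexed by character class. */
    static void *const target[2] = { &&done, &&letter };
  step:
    goto *target[*p >= 'a' && *p <= 'z'];  /* computed jump */
  letter:
    len++;
    p++;
    goto step;
  done:
    return len;   /* length of the [a-z]* prefix */
  }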


Jonathan