On 29 January 2014 00:33, Jonathan S. Shapiro <[email protected]> wrote:
> There are a variety of ways to encode a DFA in tables. Historically, they
> have been done as an array of pairs of the form (match-char, new-state) in
> which positive new-state indicates a valid transition. The absence of a pair
> indicates the absence of a transition. There are several variations on how
> to encode these tables. The encoding that Mike Lesk used in the original
> implementation of Lex was fairly tight.
>
> As Florian notes, this encoding is pretty inefficient. It wasn't really
> tolerable even for single-byte character sets (mainly because alignment
> issues made that 'char' tend to expand). It's completely unthinkable for
> Unicode.
>
> But this isn't the only possible encoding, and it certainly isn't the best
> possible encoding. A much better encoding is to simply emit a monster loop
> with a switch statement having one "case" per state, and then actually emit
> C code to implement the test.

Firstly: I think it is useful to keep thinking in terms of single
bytes when implementing NFAs for tokenisers, because keywords tend to
fit within ASCII and so can be handled by a single-byte case
statement.  Handling non-ASCII input with a single function per state
is fine, because you're usually going to look up a category or
property of the character and dispatch to the next state on that.

With that in mind, a state can be represented abstractly by a
continuation, which is used to implement things that need to happen on
state transitions (e.g. we've finished matching identifier characters,
so emit the token and clear any match state) and which usually
delegates directly to another state to handle the current character.
Concretely, a state can be represented by a table of 128 transitions
for ASCII (usually as a switch), a function for handling the rest of
Unicode, and a function for handling end-of-stream.  The kernel then
looks like so:

handle_normal_state state =
  case peek_byte of
    Nothing => state.end_of_stream
    Just c | c & 0x80 => state.handle_unicode
    Just c  =>
      do consume_byte
         return state.ascii_transitions ! c
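As a concrete sketch, here is a hypothetical C rendering of that
kernel: a two-state matcher that counts identifiers.  The switch plays
the role of ascii_transitions, bytes with the high bit set go to a
per-state Unicode handler, and a NUL byte stands in for end-of-stream.
All names here are made up for illustration:

```c
enum state { S_START, S_IDENT };

/* Stand-in unicode handler: pretend every non-ASCII byte continues an
 * identifier.  A real one would decode UTF-8 and classify the char. */
static enum state handle_unicode(enum state s) {
    (void)s;
    return S_IDENT;
}

static int is_ident_ascii(unsigned char c, int first) {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
        return 1;
    return !first && c >= '0' && c <= '9';
}

/* Count identifier tokens in a NUL-terminated input. */
static int count_idents(const char *input) {
    enum state s = S_START;
    int idents = 0;
    const unsigned char *p = (const unsigned char *)input;
    for (;;) {
        unsigned char c = *p;
        if (c == '\0') {                     /* end_of_stream */
            if (s == S_IDENT) idents++;
            return idents;
        }
        if (c & 0x80) {                      /* delegate non-ASCII */
            s = handle_unicode(s);
            p++;
            continue;
        }
        switch (s) {                         /* ascii_transitions */
        case S_START:
            if (is_ident_ascii(c, 1)) s = S_IDENT;
            p++;                             /* consume_byte */
            break;
        case S_IDENT:
            if (is_ident_ascii(c, 0)) {
                p++;
            } else {
                idents++;                    /* emit token, clear state */
                s = S_START;                 /* re-dispatch this byte */
            }
            break;
        }
    }
}
```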

Typical handle_unicode functions probably just read a char (making
sure it's valid), look up a Unicode property (I think it is fine to
represent properties as tables?) and dispatch on that.  For most
tokenisers, Unicode identifiers are going to be the only
handle_unicode that doesn't simply read, emit, and return to the same
state.
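For the property lookup itself, a sorted table of codepoint ranges
searched with binary search is the usual shape.  A C sketch, with a
deliberately tiny, made-up table (a real one would be generated from
the Unicode database):

```c
/* Hypothetical property table: sorted, disjoint codepoint ranges that
 * count as identifier characters. */
struct cp_range { unsigned lo, hi; };

static const struct cp_range ident_ranges[] = {
    { 0x00C0, 0x00D6 },   /* some Latin-1 letters */
    { 0x00D8, 0x00F6 },
    { 0x0370, 0x03FF },   /* roughly Greek */
};

/* Binary search the range table for the codepoint. */
static int is_ident_codepoint(unsigned cp) {
    int lo = 0;
    int hi = (int)(sizeof ident_ranges / sizeof ident_ranges[0]) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (cp < ident_ranges[mid].lo)      hi = mid - 1;
        else if (cp > ident_ranges[mid].hi) lo = mid + 1;
        else return 1;
    }
    return 0;
}
```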

Inline the abstract continuations, and presto!  Lots of cases and
jumping around.

> In real-world cases, the main difficulty with this approach is that keywords
> multiply the number of states significantly. In particular, the need to
> match "[A-Za-z_][A-Za-z0-9_]*" and "while" adds five states because of the
> "while" that are completely unnecessary. For a completely general purpose RE
> engine, you need the ability to do this, but for a lexer this isn't the
> implementation you want.

Is that particularly expensive in terms of the amount of code generated?

> The RE match performed by a lexer is unusual. You basically have two cases
> that you are matching: regular expressions and fixed strings followed by
> separators (the keywords). Depending on the language, the strings may be
> case-insensitive. In all cases, matching the string takes precedence over
> any RE. That is: "do" is a keyword rather than an identifier. But you know
> from the input specification which ones are strings. As you build the RE
> matcher, you'll discover that a lot of the possible "final" states have two
> matches: one is an RE and the other is a (higher priority) string.

But this is the same problem as building a pattern-matching compiler,
for which there are already several good papers on efficient
approaches.  Well, it's actually slightly simpler than that, but the
standard algorithm in the SPJ book[0] works.

> Now re-generate the matching
> engine **omitting** the overlapping string states and emit code that
> implements a post-filter to check for explicit string matches using binary
> search.

Yes!
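A sketch of that post-filter in C, with a hypothetical keyword table:
the generated matcher recognises an identifier as usual, and only then
does a binary search decide whether it was actually a keyword, keeping
"while" and friends out of the state machine entirely:

```c
#include <string.h>

/* Hypothetical keyword table; entries must be kept sorted so the
 * binary search below is valid. */
static const char *const keywords[] = {
    "do", "else", "if", "return", "while",
};

/* Post-filter: binary search the keyword table for a matched token. */
static int is_keyword(const char *tok) {
    int lo = 0;
    int hi = (int)(sizeof keywords / sizeof keywords[0]) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(tok, keywords[mid]);
        if (cmp < 0)      hi = mid - 1;
        else if (cmp > 0) lo = mid + 1;
        else return 1;
    }
    return 0;
}
```

For case-insensitive languages, the same table works if the token is
folded to lower case before the search.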

[0] Simon L. Peyton Jones, The Implementation of Functional
Programming Languages, Prentice Hall, 1987.

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law.  You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in.  Any attempt to deny you those rights would be illegal without
prior contractual agreement.