On Tue, Jan 28, 2014 at 6:58 PM, William ML Leslie <
[email protected]> wrote:
> Firstly: I think it is useful to continue to think in terms of single
> bytes when implementing NFAs for tokenisers, because keywords tend to
> fit within ascii and so can be a single-byte case statement. Handling
> non-ascii with a single function is fine because you're usually going
> to deal with a category or access a property and dispatch to the next
> states on that.
>
That was my thought when I first started the hand-implemented lexer for
BitC. It turns out to be wrong, but not all of the reasons are obvious.
Your intuition is right: the input wants to be viewed simultaneously at the
byte stream layer and at the codepoint layer. This is true because:
1. Most tokens are fixed, and will ultimately be handled by their ID rather
than their stringwise representation.
2. Any sequence consisting of valid UTF-8 encoded Unicode code points
terminated by NUL is also a valid byte string consisting of non-zero bytes
terminated by a zero byte. This means that strcmp() works (see the sketch
after this list).
3. All that being said, errors need to be reported in terms of line number
and *code point* offset, not byte offset.
The problem with doing the recognizer at the byte level is that you aren't
building a push-down automaton, so what happens is that you get a significant
multiplication of states. In most programming languages this is limited to
the matching of IDENT, but it gets more complicated if you accept numbers
written in multiple scripts. A lot of those states disappear if you
cherry-pick the keywords.
The real problem is error reporting. To support that, you need to keep
track of your position in the input in terms of line number and *character*
(i.e. codepoint) offset.
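One way to track that (a sketch, not the BitC lexer; the names are mine) is
to bump the code-point column only on UTF-8 lead bytes, since continuation
bytes have the form 10xxxxxx:

    #include <stdint.h>

    struct srcpos { unsigned line; unsigned column; };

    static void advance_pos(struct srcpos *pos, uint8_t byte)
    {
      if (byte == '\n') {
        pos->line += 1;
        pos->column = 0;
      } else if ((byte & 0xC0) != 0x80) {  /* lead byte => new code point */
        pos->column += 1;
      }
      /* continuation bytes (10xxxxxx) do not advance the column */
    }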
What I have found works well is to scan codepoints out of a (self-sizing)
byte buffer. As scanning proceeds, both the byte length and the codepoint
length of the pending token are tracked. When you get to a match point, you
grab the byte string that has been cached by the byte buffer *as a
NUL-terminated byte sequence* (i.e. ignoring the fact that it is Unicode),
because once you have it you can use strcmp() thereafter. This works because
the input stream is encoded as UTF-8.
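A sketch of that shape (the names and layout are mine, not the BitC lexer;
error handling is elided):

    #include <stdlib.h>
    #include <string.h>

    /* Self-sizing byte buffer that tracks both the byte length and the
     * code-point length of the pending token. */
    struct tokbuf {
      char  *bytes;   /* accumulated UTF-8 bytes of the pending token */
      size_t nbytes;  /* byte length */
      size_t ncp;     /* code-point length, for error positions */
      size_t cap;
    };

    static void tokbuf_put(struct tokbuf *tb, unsigned char byte)
    {
      if (tb->nbytes + 2 > tb->cap) {            /* +1 for byte, +1 for NUL */
        tb->cap = tb->cap ? tb->cap * 2 : 64;
        tb->bytes = realloc(tb->bytes, tb->cap); /* allocation check elided */
      }
      tb->bytes[tb->nbytes++] = (char)byte;
      if ((byte & 0xC0) != 0x80)                 /* lead byte => new code point */
        tb->ncp += 1;
    }

    /* At a match point: NUL-terminate and hand back an ordinary C string,
     * ready for strcmp() against the keyword list. */
    static const char *tokbuf_cstr(struct tokbuf *tb)
    {
      tb->bytes[tb->nbytes] = '\0';
      return tb->bytes;
    }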
I'm open to revising this later, but for the moment I'm going to stick with
what works.
> With that in mind, a state can be represented abstractly by a
> continuation...
That's a functionally correct implementation, but it requires an aggressive
optimizer to make it efficient. I'm not planning to do anything quite that
aggressively functional.
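For reference, the continuation encoding being discussed looks roughly like
this in C (my sketch of the idea, not anything in the BitC tree); each state
is a function, and every transition is a call that the compiler would have to
turn into a jump to be efficient:

    typedef struct lexer lexer;
    typedef int (*state_fn)(lexer *lx);  /* a state, returning a token id */

    struct lexer { const char *cursor; };

    static int st_ident(lexer *lx);

    static int st_start(lexer *lx)
    {
      char c = *lx->cursor++;
      if ((c >= 'a' && c <= 'z') || c == '_')
        return st_ident(lx);             /* transition = (tail) call */
      return -1;                         /* error, for the sketch */
    }

    static int st_ident(lexer *lx)
    {
      char c = *lx->cursor;
      if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_') {
        lx->cursor++;
        return st_ident(lx);             /* self-transition = self call */
      }
      return 1;                          /* hypothetical TOK_IDENT */
    }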
> Typical handle_unicode functions probably just read a char (making
> sure it's valid), lookup a unicode property (I think it is fine to
> represent properties as tables?)
You can't implement properties as simple table lookups, because they can
appear inside character classes, where they can be combined and duplicated.
So, for example:
[a\p{Ll}]
is a class that matches 'a' in addition to all lowercase Unicode characters
(which already include 'a'). You also have to deal with things like [^\p{Ll}],
which is the negation of the lowercase Unicode characters. So at the end of
the day you can't reduce this to a table lookup on Unicode classes.
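To make that concrete, here is a sketch of what a class like [a\p{Ll}] or
[^\p{Ll}] forces the matcher to do; uc_is_lower() is a made-up stand-in for
whatever Unicode property query the lexer actually uses:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern bool uc_is_lower(uint32_t cp);  /* hypothetical \p{Ll} query */

    struct cclass {
      const uint32_t *literals;            /* explicit code points, e.g. { 'a' } */
      size_t          nliterals;
      bool          (*prop)(uint32_t cp);  /* optional property test, e.g. \p{Ll} */
      bool            negated;             /* [^...] */
    };

    static bool cclass_match(const struct cclass *cc, uint32_t cp)
    {
      bool hit = false;
      for (size_t i = 0; i < cc->nliterals && !hit; i++)
        hit = (cc->literals[i] == cp);
      if (!hit && cc->prop)
        hit = cc->prop(cp);
      return cc->negated ? !hit : hit;     /* negation happens last */
    }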
> > In real-world cases, the main difficulty with this approach is that
> keywords
> > multiply the number of states significantly. In particular, the need to
> > match "[A-Za-z_][A-Za-z0-9_]*" and "while" adds five states because of
> the
> > "while" that are completely unnecessary. For a completely general
> purpose RE
> > engine, you need the ability to do this, but for a lexer this isn't the
> > implementation you want.
>
> Is that particularly expensive in terms of the amount of code generated?
>
Yes. But more importantly, it's expensive in terms of compromising the
human ability to read the state transition diagram as represented in the
code. If you cherry-pick the keywords for disambiguation in a later strcmp()
pass, the state machine that's left actually makes sense to the human eye.
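In code, that pass looks roughly like this (a sketch; the token ids and
keyword list are invented, not BitC's): the DFA matches the generic
[A-Za-z_][A-Za-z0-9_]* rule, and the completed lexeme is resolved against the
keyword list with strcmp():

    #include <string.h>

    enum { TOK_IDENT = 1, TOK_WHILE, TOK_DO, TOK_IF };

    static const struct { const char *text; int tok; } keywords[] = {
      { "while", TOK_WHILE },
      { "do",    TOK_DO    },
      { "if",    TOK_IF    },
    };

    /* Called once the IDENT rule has matched; lexeme is NUL-terminated. */
    static int classify_ident(const char *lexeme)
    {
      for (size_t i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
          return keywords[i].tok;
      return TOK_IDENT;
    }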
> > The RE match performed by a lexer is unusual. You basically have two
> cases
> > that you are matching: regular expressions and fixed strings followed by
> > separators (the keywords). Depending on the language, the strings may be
> > case-insensitive. In all cases, matching the string takes precedence over
> > any RE. That is: "do" is a keyword rather than an identifier. But you
> know
> > from the input specification which ones are strings. As you build the RE
> > matcher, you'll discover that a lot of the possible "final" states have
> two
> > matches: one is an RE and the other is a (higher priority) string.
>
> But this is the same problem as building a pattern matching compiler,
> for which there are already several good papers on doing so
> efficiently. Well, it's actually slightly simpler, but the standard
> algorithm in the SPJ book[0] works.
>
I haven't looked at SPJ (I have it, but it's not handy), and I haven't seen
the trick of cherry-picking the fixed keywords this way in the literature,
though it's certainly well known to implementers.
There's *lots* of work on pattern matching compilers, most notably in the
area of network packet filtering.
shap