Re: Parsing indent-sensitive languages

Larry Wall Thu, 08 Sep 2005 14:05:10 -0700

On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote:
: If I want to parse a language that is sensitive to whitespace 
: indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?
: 
: The way I'd usually handle it is to have a lexer that examines leading 
: whitespace and converts it into "indent" and "unindent" tokens. The 
: grammer can then use these tokens in the same way that it would any 
: other block-delimiter.


This is the multi-pass approach, which even in Perl 6 is still
certainly one way to do it.  Or actually, two ways, one of which is
to use source filters to mung the input text, and the other way of
which is to make one lexer pass to transform into a list or tree of
tokens, and then do a list/tree transformation on that.  Given that
tree transformation is something that a number of us are currently
thinking about for various other reasons, I suspect that can be made
to work pretty well.  But we will have to describe how rule matching
can be extended to lists and trees, and what happens when you intermix
text elements with non-textual objects.  But essentially, just think
what it would take to match against the tree of match objects returned
by a prior match instead of against a string.

: This requires a stateful lexer, because to work out the number of 
: "unindent" tokens on a line, it needs to know what the indentation 
: positions are. How would I write a P6 rule that defines <indent> and 
: <unindent> tokens? Alternatively (if a different approach is needed) how 
: would I use P6 to parse such a language?

I can think of two other approaches as well.  One would be to allow
"pushback" on the queued input so that when we hit a line transition,
we can immediately analyze the leading whitespace and replace it
conceptually with the appropriate number of indent/unindent objects.
It's not yet clear whether that is a good approach.  It might be
rather inefficient to splice faked up objects into the middle of the
input stream.  On the other hand, we don't actually have to splice
the input, only pretend that we did.  And certainly, the Perl 5 lexer
makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing
(though I actually botched the Perl 5 implementation of it by making
it a stack instead of a queue).

My final idea is that you can treat it as a fancy kind of lookbehind
assertion, provided you have some landmark that will stop the
normal analysis from running over the newline boundary.  With the
other approaches, you have a precomputed positive unindent token
to function as a "stopper", but with this approach, the only thing
you can depend on is the newline itself.  So when you first hit a
newline, you don't progress beyond it, but instead look ahead at the
indentation of the following line and queue up the right number of
indent/unindent objects in your own data structure (as well as pushing
or popping your stack of current indents appropriately so that the
new indent is on top).  Then as long as there are queued up objects,
the newline doesn't report itself as a newline, but as the next object,
until you run out, and then the newline allows itself to be skipped,
along with the subsequent whitespace.  (The difference between this
and the previous approach is that you're not relying on the rules
engine to keep track of the queueing for you.)

The very fact that I had to use a phrase like "positive unindent token"
says a lot about why indentation as syntax is problematic in general.

Larry

Re: Parsing indent-sensitive languages

Reply via email to