On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote: : If I want to parse a language that is sensitive to whitespace : indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars? : : The way I'd usually handle it is to have a lexer that examines leading : whitespace and converts it into "indent" and "unindent" tokens. The : grammer can then use these tokens in the same way that it would any : other block-delimiter.
This is the multi-pass approach, which even in Perl 6 is still certainly one way to do it. Or actually, two ways, one of which is to use source filters to mung the input text, and the other way of which is to make one lexer pass to transform into a list or tree of tokens, and then do a list/tree transformation on that. Given that tree transformation is something that a number of us are currently thinking about for various other reasons, I suspect that can be made to work pretty well. But we will have to describe how rule matching can be extended to lists and trees, and what happens when you intermix text elements with non-textual objects. But essentially, just think what it would take to match against the tree of match objects returned by a prior match instead of against a string. : This requires a stateful lexer, because to work out the number of : "unindent" tokens on a line, it needs to know what the indentation : positions are. How would I write a P6 rule that defines <indent> and : <unindent> tokens? Alternatively (if a different approach is needed) how : would I use P6 to parse such a language? I can think of two other approaches as well. One would be to allow "pushback" on the queued input so that when we hit a line transition, we can immediately analyze the leading whitespace and replace it conceptually with the appropriate number of indent/unindent objects. It's not yet clear whether that is a good approach. It might be rather inefficient to splice faked up objects into the middle of the input stream. On the other hand, we don't actually have to splice the input, only pretend that we did. And certainly, the Perl 5 lexer makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing (though I actually botched the Perl 5 implementation of it by making it a stack instead of a queue). My final idea is that you can treat it as a fancy kind of lookbehind assertion, provided you have some landmark that will stop the normal analysis from running over the newline boundary. With the other approaches, you have a precomputed positive unindent token to function as a "stopper", but with this approach, the only thing you can depend on is the newline itself. So when you first hit a newline, you don't progress beyond it, but instead look ahead at the indentation of the following line and queue up the right number of indent/unindent objects in your own data structure (as well as pushing or popping your stack of current indents appropriately so that the new indent is on top). Then as long as there are queued up objects, the newline doesn't report itself as a newline, but as the next object, until you run out, and then the newline allows itself to be skipped, along with the subsequent whitespace. (The difference between this and the previous approach is that you're not relying on the rules engine to keep track of the queueing for you.) The very fact that I had to use a phrase like "positive unindent token" says a lot about why indentation as syntax is problematic in general. Larry