That's something I've been thinking about, too. There are a lot of "interesting" languages that cannot be described by context free grammars (such as {empty, 012, 001122, 000111222, ...} but very simple enhancements do make them easy to recognize. In the case of the "indentation grammar", then the (one) stack in a push-down automaton is basically used up keeping track of the indentation level. But you don't need a whole stack to keep track of indntation level, just a register that can be used to track the current level.
BTW, I'm new to this list. I haven't done that much with Perl recently, but of late, I've become a lot more interested in the language. I recently picked ot the Perl 6/Parrot book from O'Reilly but had really meant to finish reading the book before jumping in. It's just that this topic is too interesting! --- Larry Wall <[EMAIL PROTECTED]> wrote: > On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote: > : If I want to parse a language that is sensitive to whitespace > : indentation (e.g. Python, Haskell), how do I do it using P6 > rules/grammars? > : > : The way I'd usually handle it is to have a lexer that examines > leading > : whitespace and converts it into "indent" and "unindent" tokens. The > > : grammer can then use these tokens in the same way that it would any > > : other block-delimiter. > > This is the multi-pass approach, which even in Perl 6 is still > certainly one way to do it. Or actually, two ways, one of which is > to use source filters to mung the input text, and the other way of > which is to make one lexer pass to transform into a list or tree of > tokens, and then do a list/tree transformation on that. Given that > tree transformation is something that a number of us are currently > thinking about for various other reasons, I suspect that can be made > to work pretty well. But we will have to describe how rule matching > can be extended to lists and trees, and what happens when you > intermix > text elements with non-textual objects. But essentially, just think > what it would take to match against the tree of match objects > returned > by a prior match instead of against a string. > > : This requires a stateful lexer, because to work out the number of > : "unindent" tokens on a line, it needs to know what the indentation > : positions are. How would I write a P6 rule that defines <indent> > and > : <unindent> tokens? Alternatively (if a different approach is > needed) how > : would I use P6 to parse such a language? > > I can think of two other approaches as well. One would be to allow > "pushback" on the queued input so that when we hit a line transition, > we can immediately analyze the leading whitespace and replace it > conceptually with the appropriate number of indent/unindent objects. > It's not yet clear whether that is a good approach. It might be > rather inefficient to splice faked up objects into the middle of the > input stream. On the other hand, we don't actually have to splice > the input, only pretend that we did. And certainly, the Perl 5 lexer > makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing > (though I actually botched the Perl 5 implementation of it by making > it a stack instead of a queue). > > My final idea is that you can treat it as a fancy kind of lookbehind > assertion, provided you have some landmark that will stop the > normal analysis from running over the newline boundary. With the > other approaches, you have a precomputed positive unindent token > to function as a "stopper", but with this approach, the only thing > you can depend on is the newline itself. So when you first hit a > newline, you don't progress beyond it, but instead look ahead at the > indentation of the following line and queue up the right number of > indent/unindent objects in your own data structure (as well as > pushing > or popping your stack of current indents appropriately so that the > new indent is on top). Then as long as there are queued up objects, > the newline doesn't report itself as a newline, but as the next > object, > until you run out, and then the newline allows itself to be skipped, > along with the subsequent whitespace. (The difference between this > and the previous approach is that you're not relying on the rules > engine to keep track of the queueing for you.) > > The very fact that I had to use a phrase like "positive unindent > token" > says a lot about why indentation as syntax is problematic in general. > > Larry > === Gregory Woodhouse <[EMAIL PROTECTED]> "Without the requirement of mathematical aesthetics a great many discoveries would not have been made." -- Albert Einstein