Re: Parsing indent-sensitive languages

Greg Woodhouse Thu, 08 Sep 2005 14:17:04 -0700

That's something I've been thinking about, too. There are a lot of
"interesting" languages that cannot be described by context free
grammars (such as {empty, 012, 001122, 000111222, ...} but very simple
enhancements do make them easy to recognize. In the case of the
"indentation grammar", then the (one) stack in a push-down automaton is
basically used up keeping track of the indentation level. But you don't
need a whole stack to keep track of indntation level, just a register
that can be used to track the current level.


BTW, I'm new to this list. I haven't done that much with Perl recently,
but of late, I've become a lot more interested in the language. I
recently picked ot the Perl 6/Parrot book from O'Reilly but had really
meant to finish reading the book before jumping in. It's just that this
topic is too interesting!

--- Larry Wall <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote:
> : If I want to parse a language that is sensitive to whitespace 
> : indentation (e.g. Python, Haskell), how do I do it using P6
> rules/grammars?
> : 
> : The way I'd usually handle it is to have a lexer that examines
> leading 
> : whitespace and converts it into "indent" and "unindent" tokens. The
> 
> : grammer can then use these tokens in the same way that it would any
> 
> : other block-delimiter.
> 
> This is the multi-pass approach, which even in Perl 6 is still
> certainly one way to do it.  Or actually, two ways, one of which is
> to use source filters to mung the input text, and the other way of
> which is to make one lexer pass to transform into a list or tree of
> tokens, and then do a list/tree transformation on that.  Given that
> tree transformation is something that a number of us are currently
> thinking about for various other reasons, I suspect that can be made
> to work pretty well.  But we will have to describe how rule matching
> can be extended to lists and trees, and what happens when you
> intermix
> text elements with non-textual objects.  But essentially, just think
> what it would take to match against the tree of match objects
> returned
> by a prior match instead of against a string.
> 
> : This requires a stateful lexer, because to work out the number of 
> : "unindent" tokens on a line, it needs to know what the indentation 
> : positions are. How would I write a P6 rule that defines <indent>
> and 
> : <unindent> tokens? Alternatively (if a different approach is
> needed) how 
> : would I use P6 to parse such a language?
> 
> I can think of two other approaches as well.  One would be to allow
> "pushback" on the queued input so that when we hit a line transition,
> we can immediately analyze the leading whitespace and replace it
> conceptually with the appropriate number of indent/unindent objects.
> It's not yet clear whether that is a good approach.  It might be
> rather inefficient to splice faked up objects into the middle of the
> input stream.  On the other hand, we don't actually have to splice
> the input, only pretend that we did.  And certainly, the Perl 5 lexer
> makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing
> (though I actually botched the Perl 5 implementation of it by making
> it a stack instead of a queue).
> 
> My final idea is that you can treat it as a fancy kind of lookbehind
> assertion, provided you have some landmark that will stop the
> normal analysis from running over the newline boundary.  With the
> other approaches, you have a precomputed positive unindent token
> to function as a "stopper", but with this approach, the only thing
> you can depend on is the newline itself.  So when you first hit a
> newline, you don't progress beyond it, but instead look ahead at the
> indentation of the following line and queue up the right number of
> indent/unindent objects in your own data structure (as well as
> pushing
> or popping your stack of current indents appropriately so that the
> new indent is on top).  Then as long as there are queued up objects,
> the newline doesn't report itself as a newline, but as the next
> object,
> until you run out, and then the newline allows itself to be skipped,
> along with the subsequent whitespace.  (The difference between this
> and the previous approach is that you're not relying on the rules
> engine to keep track of the queueing for you.)
> 
> The very fact that I had to use a phrase like "positive unindent
> token"
> says a lot about why indentation as syntax is problematic in general.
> 
> Larry
> 



===
Gregory Woodhouse  <[EMAIL PROTECTED]>



"Without the requirement of mathematical aesthetics a great many discoveries 
would not have been made."

-- Albert Einstein

Re: Parsing indent-sensitive languages

Reply via email to