Hello,
this is a rather specific question about parsing an indented grammar.
I have a working implementation (presented below) that may be worth a critic
review -- if you like
it.
First issue is: is there a way to express an indented formatfing using a common
grammar
language such as BNF (or regex)? If yes, how to do that? If not, what kind of
formal grammar is
actually such a format? Is there any standard way to express it?
I have searched python's grammar for this: actually, for any compound
statement, a block (=
section content) is simply renamed 'suite' with no more format expression. The
lexical analysis
section on indentation reads:
'' The indentation levels of consecutive lines are used to generate INDENT and
DEDENT tokens, using
a stack, as follows. Before the first line of the file is read, a single zero
is pushed on the
stack; this will never be popped off again. The numbers pushed on the stack
will always be
strictly increasing from bottom to top. At the beginning of each logical line,
the line’s
indentation level is compared to the top of the stack. If it is equal, nothing
happens. If it is
larger, it is pushed on the stack, and one INDENT token is generated. If it is
smaller, it must be
one of the numbers occurring on the stack; all numbers on the stack that are
larger are popped
off, and for each number popped off a DEDENT token is generated. At the end of
the file, a DEDENT
token is generated for each number remaining on the stack that is larger than
zero. ''
I guess from this text that python parsers have a pre-parse phase that generate
the so-called
INDENT and DEDENT tokens -- which are then used during actual parsing. So that
a typical section
looks like this for the parser:
header
INDENT
line
line
line
DEDENT
Which is, in fact, recreating a common block start / stop format (e.g. C's
{block}). Is it? Do you
know whether python parsers actually do that during a kind of pre-parse phase,
or whether their
designers have found a way of matching indent/dedent together with the rest of
the source text?
Thank you,
Denis
==================
My trial:
I have have implemented an overall parser to parse an indented grammar in the
simplest (recursive)
case I could specify:
* no blank line, no comment
* a single format of inline pattern
* each indent level marked as a single (tab) char
* only step 1 indentation increase
To do this, I use a set of tricks. Below the grammar (using a pseudo BNF/regex
format -- actually it is implemented using a custom parser generator):
inline : [a..z0..9_]+
line : Indent inline NL
bloc : (section | line)+
section : line IndentMore bloc IndentLess
all : <= bloc>
Oviously, I use 3 names starting with Indent, that are not defined. They are
kinds of
pseudo-patterns and all rely on an external variable that holds the current
indent level and is
hosted as an attribute of Indent:
* Indent: matches if a proper indentation is found, meaning equal to current
indent level --
advances pointer
* IndentMore: matches if the actual indentation found is =level+1 -- does not
advance pointer
(=lookahead) -- increments current level
* IndentLess: matches if the actual indentation found is <level -- does not
advance pointer --
decrements current level of -1 whatever the actual indent level found
Last note causes that n IndentLess matches will be generated even when the
indent level is
brutally decremented of n steps at once -- so that in a valid text each
IndentMore match gets its
IndentLess counter-part.
I wonder about the indent level record stack introduced in CPython's
description: I need only a
single variable.
As a example, here is the actual code of IndentLess:
class IndentLess(Pattern):
def __init__(self):
Pattern.__init__(self)
def _check(self,text,pos):
''' check indent level decrease (of any number of step)
-- but decrement current level by 1 only in any case
~ no pointer advance <==> Right() '''
# match indentation
start_pos = pos
(result,pos) = ZeroOrMore(TAB)._check(text,pos)
# case proper indent decrement
indent_level = pos - start_pos
if indent_level < Indent.level:
Indent.level -= 1
result = Result(self, None, text,start_pos,start_pos)
return (result,start_pos) # (no pointer advance)
# case no indent decrement (by step 1 or more)
# (may also be end-of-text reached)
if pos >= len(text):
raise EndOfText(self, text, start_pos)
message = "No proper indentation decrement:\nindent level
%s --> %s"\
%(Indent.level, indent_level)
raise NoMatch(self, text, start_pos, message)
def _full_expr(self):
return "INDENT--"
Below a test text and the corresponding parse result shown in tree view:
=======
1
2
3
31
32
4
5
6
61
62
63
631
632
7
=======
=======
all
line:1
line:2
section
line:3
bloc
line:31
line:32
line:4
line:5
section
line:6
bloc
line:61
line:62
section
line:63
bloc
line:631
line:632
line:7
=======
I first want to use this structure to parse config files written according to
an recursive
indented format. So that the result would be a (record) object holding
properties which actually
each are either a parameter or a nested sub-config, e.g.:
config
name : val
name : val
connexion
name : val
name : val
transmission
name : val
name : val
------
la vida e estranya
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor