On 7/21/2011 2:53 PM, Xah Lee wrote:

> had hopes that parser expert would show some proper parser solutions…
> in particular i think such can be expressed in Parsing Expression
> Grammar in just a few lines… but so far no deity came forward to show
> the light. lol

I am not a parser expert, but 20 years ago I wrote a program in C to analyze C programs for proper fence matching. My motivation was the frequent obscurity of parser error messages caused by mismatched fences. I just found the printed copy, along with an article I wrote but did not get published.

Balance.c matches tokens, not characters (and hence can deal with /* and */). It properly takes into account the allowed nestings: for C, {[]} is legal, [{}] is not, and ditto for () instead of []. Nothing nests within '', "", and /* */. (I know some C compilers do nest /* */, but not the ones I used.)
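
Written down as data, in Python since that is where this is headed, those rules might look like the following; this table is my own sketch, not the one in balance.c:

    # Hedged sketch: for each fence we are inside, the fences that may
    # legally open there.  A block cannot open inside brackets or
    # parentheses, and quotes and comments are opaque.
    MAY_OPEN = {
        'code': {'{', '[', '(', "'", '"', '/*'},   # top level
        '{':    {'{', '[', '(', "'", '"', '/*'},   # {[]} and {()} are legal
        '[':    {'[', '(', "'", '"', '/*'},        # [{}] is not: no '{' here
        '(':    {'[', '(', "'", '"', '/*'},        # ditto ({})
        "'":    set(),                             # nothing nests within '',
        '"':    set(),                             # "", and /* */
        '/*':   set(),
    }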

I initially started with a recursive descent parser, but 1) it hard-coded the rules for one language, making changes difficult, and 2) it made the low-level parsing difficult. So I switched to a table-driven recursive state/action machine. The tables for your challenge would be much simpler, as you did not specify any nesting rules, although such rules would be needed for html checking.
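
To give the flavor of the table-driven version, here is a minimal Python sketch of the idea, with my own names; the real tables were per-language and richer:

    # The state is the innermost unmatched fence; the row for that state
    # maps a token to an action.  Anything unlisted is ordinary text and
    # is skipped, which is what makes fences inside quotes harmless.
    PUSH, POP, SKIP = 'push', 'pop', 'skip'
    TABLE = {
        'code': {'{': PUSH, '[': PUSH, '(': PUSH, "'": PUSH, '"': PUSH,
                 '/*': PUSH, '}': POP, ']': POP, ')': POP},
        "'":    {"'": POP},    # inside quotes or comments, only the
        '"':    {'"': POP},    # matching closer means anything
        '/*':   {'*/': POP},
    }

    def action(state, token):
        # Bracket states ('{', '[', '(') behave like 'code' here; a real
        # table would give each its own row to enforce the nesting rules.
        return TABLE.get(state, TABLE['code']).get(token, SKIP)

Changing languages then means editing tables, not code.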

A key point that simplifies things a bit is that every file is surrounded by an unwritten BOF-EOF pair. The machine therefore starts having 'seen' BOF and 'looking' for EOF, so it is always looking to match *something*.
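
In Python the whole arrangement might look like the following hedged sketch; it handles only the bracket fences, with quotes and comments left to the state table above:

    CLOSER = {'BOF': 'EOF', '{': '}', '[': ']', '(': ')'}

    def check(name, tokens):
        """tokens yields (token, row, col) and ends with an EOF token."""
        stack = [('BOF', 0, 0)]            # the machine has 'seen' BOF
        for tok, row, col in tokens:
            if tok in CLOSER:              # an opener: look for its closer
                stack.append((tok, row, col))
            elif stack:                    # a closer: must match the top
                opener, orow, ocol = stack.pop()
                if CLOSER[opener] != tok:
                    print(name, orow, ocol, row, col, opener, tok)

Because the stack starts non-empty, the closing EOF pops whatever is left, and anything left open shows up as a mismatch rather than a silent fall off the end.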

The total program is nearly four pages, but one page is mostly declarations and command-line processing, and another two pages hold typedefs, #defines, and tables. The actual main loop is about 25 lines, and 10 lines of that is error reporting. The output is lines giving the file name, the rows and columns of the two tokens, and what the two tokens are, for each mismatched pair (and, optionally, each matched pair).
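
With the sketch above, a report line would come out something like this (the exact layout is my guess at the original's):

    >>> check('demo.c', [('{', 3, 5), (')', 7, 1), ('EOF', 20, 1)])
    demo.c 3 5 7 1 { )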

Since this program would be a useful example for my book, both didactically and practically, I will try to brush up a bit on C and translate it to Python. I will use the re module for some of the low-level token parsing, like C multibyte characters. I will then change the tables to ones for Python, and perhaps for your challenge.
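
For instance, the candidate fence tokens could be picked out with one pattern. A hedged sketch, ignoring escapes like \" and the multibyte cases that the real low-level parsing must handle:

    import re

    # The two comment fences first, then the single-character fences.
    # A lone '/' or '*' matches nothing.
    FENCE = re.compile(r"""/\*|\*/|[{}\[\]()'"]""")

    def fence_tokens(text):
        # Yield (token, row, col), 1-based, for feeding the state machine.
        for row, line in enumerate(text.splitlines(), 1):
            for m in FENCE.finditer(line):
                yield m.group(), row, m.start() + 1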

The current program assumes ASCII byte input, as it uses an array of length 128 to classify ASCII chars into 14 classes: 13 special classes for the matching and 1 'normal' class for everything else. In Python this could be replaced with a dict 'special' that maps only the special characters to their token class, used as "special.get(char, NORMAL)", so that the thousands of normal characters map by default to NORMAL without a humongous array.
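
Something like this, where exactly which 13 characters count as special is my guess rather than the original's list:

    NORMAL = 0   # class 14: everything else

    # 13 special characters -> token classes 1..13
    special = {c: i for i, c in enumerate("{}[]()'\"/*\\\n\t", start=1)}

    def char_class(char):
        # Any other character, ASCII or not, falls through to NORMAL,
        # so no 128-entry (or larger) array is needed.
        return special.get(char, NORMAL)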

--
Terry Jan Reedy

