On 7/21/2011 2:53 PM, Xah Lee wrote:

> had hopes that parser expert would show some proper parser solutions…
> in particular i think such can be expressed in Parsing Expression
> Grammar in just a few lines… but so far no deity came forward to show
> the light. lol

I am not a parser expert, but 20 years ago I wrote a program in C to analyze C programs for proper fence matching. My motivation was the frequent obscurity of parser error messages caused by mismatched fences. I just found the printed copy, along with an article I wrote but did not get published.

Balance.c matches tokens, not characters (and hence can deal with /* and */). It properly takes into account the allowed nestings: for C, {[]} is legal, [{}] is not, and ditto for () instead of []. Nothing nests within '', "", and /* */. (I know some C compilers do nest /* */, but not the ones I used.)
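
Written down as data, in Python since that is where this is headed, those rules might look like the following; this table is my own sketch, not the one in balance.c:

    # Hedged sketch: for each fence we are inside, the fences that may
    # legally open there.  A block cannot open inside brackets or
    # parentheses, and quotes and comments are opaque.
    MAY_OPEN = {
        'code': {'{', '[', '(', "'", '"', '/*'},   # top level
        '{':    {'{', '[', '(', "'", '"', '/*'},   # {[]} and {()} are legal
        '[':    {'[', '(', "'", '"', '/*'},        # [{}] is not: no '{' here
        '(':    {'[', '(', "'", '"', '/*'},        # ditto ({})
        "'":    set(),                             # nothing nests within '',
        '"':    set(),                             # "", and /* */
        '/*':   set(),
    }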

I initially started with a recursive descent parser, but 1) it hard-coded the rules for one language, making changes difficult, and 2) it made the low-level parsing difficult. So I switched to a table-driven recursive state/action machine. The tables for your challenge would be much simpler, as you did not specify any nesting rules, although such rules would be needed for html checking.
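
To give the flavor of the table-driven version, here is a minimal Python sketch of the idea, with my own names; the real tables were per-language and richer:

    # The state is the innermost unmatched fence; the row for that state
    # maps a token to an action.  Anything unlisted is ordinary text and
    # is skipped, which is what makes fences inside quotes harmless.
    PUSH, POP, SKIP = 'push', 'pop', 'skip'
    TABLE = {
        'code': {'{': PUSH, '[': PUSH, '(': PUSH, "'": PUSH, '"': PUSH,
                 '/*': PUSH, '}': POP, ']': POP, ')': POP},
        "'":    {"'": POP},    # inside quotes or comments, only the
        '"':    {'"': POP},    # matching closer means anything
        '/*':   {'*/': POP},
    }

    def action(state, token):
        # Bracket states ('{', '[', '(') behave like 'code' here; a real
        # table would give each its own row to enforce the nesting rules.
        return TABLE.get(state, TABLE['code']).get(token, SKIP)

Changing languages then means editing tables, not code.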

A key point that simplifies things a bit is that every file is surrounded by an unwritten BOF-EOF pair. The machine therefore starts having 'seen' BOF and 'looking' for EOF, so it is always looking to match *something*.
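
In Python the whole arrangement might look like the following hedged sketch; it handles only the bracket fences, with quotes and comments left to the state table above:

    CLOSER = {'BOF': 'EOF', '{': '}', '[': ']', '(': ')'}

    def check(name, tokens):
        """tokens yields (token, row, col) and ends with an EOF token."""
        stack = [('BOF', 0, 0)]            # the machine has 'seen' BOF
        for tok, row, col in tokens:
            if tok in CLOSER:              # an opener: look for its closer
                stack.append((tok, row, col))
            elif stack:                    # a closer: must match the top
                opener, orow, ocol = stack.pop()
                if CLOSER[opener] != tok:
                    print(name, orow, ocol, row, col, opener, tok)

Because the stack starts non-empty, the closing EOF pops whatever is left, and anything left open shows up as a mismatch rather than a silent fall off the end.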

The total program is nearly four pages, but one page is mostly declarations and command-line processing, and another two pages hold typedefs, #defines, and tables. The actual main loop is about 25 lines, and 10 lines of that is error reporting. The output is lines giving the file name, the rows and columns of the two tokens, and what the two tokens are, for each mismatched pair (and, optionally, each matched pair).
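
With the sketch above, a report line would come out something like this (the exact layout is my guess at the original's):

    >>> check('demo.c', [('{', 3, 5), (')', 7, 1), ('EOF', 20, 1)])
    demo.c 3 5 7 1 { )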

Since this program would be a useful example for my book, both didactically and practically, I will try to brush up a bit on C and translate it to Python. I will use the re module for some of the low-level token parsing, like C multibyte characters. I will then change the tables to ones for Python, and perhaps for your challenge.
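
For instance, the candidate fence tokens could be picked out with one pattern. A hedged sketch, ignoring escapes like \" and the multibyte cases that the real low-level parsing must handle:

    import re

    # The two comment fences first, then the single-character fences.
    # A lone '/' or '*' matches nothing.
    FENCE = re.compile(r"""/\*|\*/|[{}\[\]()'"]""")

    def fence_tokens(text):
        # Yield (token, row, col), 1-based, for feeding the state machine.
        for row, line in enumerate(text.splitlines(), 1):
            for m in FENCE.finditer(line):
                yield m.group(), row, m.start() + 1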

The current program assumes ASCII byte input, as it uses an array of length 128 to classify ASCII chars into 14 classes: 13 special classes for the matching and 1 'normal' class for everything else. In Python this could be replaced with a dict 'special' that maps only the special characters to their token class, used as "special.get(char, NORMAL)", so that the thousands of normal characters map by default to NORMAL without a humongous array.
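
Something like this, where exactly which 13 characters count as special is my guess rather than the original's list:

    NORMAL = 0   # class 14: everything else

    # 13 special characters -> token classes 1..13
    special = {c: i for i, c in enumerate("{}[]()'\"/*\\\n\t", start=1)}

    def char_class(char):
        # Any other character, ASCII or not, falls through to NORMAL,
        # so no 128-entry (or larger) array is needed.
        return special.get(char, NORMAL)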

--
Terry Jan Reedy

