EKR: character-oriented parsing

Edward K. Ream Tue, 20 Sep 2011 06:08:07 -0700

In another thread I said:

“Yesterday's work suggests a new direction, one that could  only be
called 'eccentric', and it might be eccentric enough that nobody has
seriously considered it since the very earliest days of programming:
character-oriented parsing. It's a crazy idea, but it might be crazy
enough to work.”


Here, I'll explain how I am being led in this direction, and what the
future might bring.  Perhaps as self-defense, I'll also contrast this
idea with more standard techniques.  To anticipate, the main idea is
that the standard techniques aren't really that good for the problem
at hand.

Yesterday's work got bogged down in character-oriented bugs.  It
wasn't fun at time. However, the results are pretty impressive: c-to-
python can (reliably??) munge complex C code into much more readable
pseudo-python code.  I plan further improvements.

There are two trends apparent at the code level of c-to-python:  one
towards clever designs that *decrease* complexity, and the other
towards ever-more-complex hacks that *increase* complexity.

Simplicity
=======

Let's start with the good news.  The “new” tokenizer pattern continues
to resonate with me::

    def tokenize (self,s):

        i,result = 0,[]
        while i < len(s):
            # Loop invariant: at end j > i and s[i:j] is the new
token.
            j = i
            ch = s[i]
            if <ch starts token A>:
                j += len(token A)
            ...
            else:
                j += 1

            assert j > i
            result.append(''.join(s[i:j]))
            i = j

        return result

You could say this is C code :-)  And it is also simple and
unforgettable.  It also minimally stresses the GC.

Complexity
========

As yesterday's work shows, character-oriented code can get extremely
hairy very fast.  You can see this in the horrible code in
handlePossibleFunctionHeader and dedentBlocks, which I wrote late last
night.

The feeling when I finished last night was relief mixed with
revulsion :-)  Clearly, one wants a way to simplify the code.  In
particular, a way must be found to eliminate all the extra index
variables.  The tokenize code is good because it uses just two
variables, i and j.  Pretty much the entire story of the code is j =
skip_past_token_x(s,i)  But once there are more than two index
variables, it becomes a chore to figure out what is going on.

Tokens don't help
=============

Last night I investigated whether tokenizing the input to
convertCodeList would help.  The somewhat surprising result is that it
doesn't: it makes the code *harder* to write.  One can immediately see
this in the match_word method.  In the character-oriented approach,
one easily deals with the “next” character.  But in the token-oriented
approach everything becomes more complicated.  It's counter-intuitive.

Happily, the character-oriented approach *usually* works just fine.
However, it needs some support for parsing.

Parsers don't help (much)
===================

The following remarks may be controversial, but they are not ignorant.
In particular, I have studied the code for 2to3, my “new lint” program
uses Python's AST trees and I've studied (and written) parsers and
compilers.

One thing is clear from my study:  there is *no* easy way to munge
source code.  Having an AST tree really doesn't help much.  In fact,
it could be called a step *away* from simplicity, because at the
moment one must actually change the code one doesn't have the
character pointers (i and j, so to speak) that would allow one to make
the change easily.

Furthermore, at *no* time will a parser tree be a complete solution to
*any* problem in c-to-python.  The reason is simple: pointers
(indices) in the parse tree will become invalid as soon as the code
list changes.  We can have “dangling indices” in Python which will be
just as difficult to handle as dangling pointers are in C.  It's an
irony.

The situation is similar for programs like swig or pylint.  The parser
*itself* is non-trivial, and it produces results (parse trees) that
are only somewhat related to what is actually wanted.  I often think
of Bjarne Stroustrup's lament in his book, the C++ Programming
Language, that his biggest regret about the C++ compiler's code is
that he didn't use a recursive-descent parsers.  Like me, he has never
felt comfortable with bottom-up designs.

Furthermore, parsers introduce an entire new level of complexity: for
example, swig builds in support for bison, a compiler-compiler.
Presumably the actual parser gets built from some kind of language
description.  The *theory* is that language descriptions are somehow
easier to write than code.  In practice, they must be debugged just
like code.  Furthermore, the result is then *not* code but parse
trees.

I'm starting to wonder whether this indirect (parse-tree centered)
approach might not actually be best.  At minimum, *another* kind of
indirect approach, namely adding support *at the code level* for
character-oriented parsing might actually produce better results where
it actually matters.

There is no doubt that this is an eccentric point of view.  But it
seems natural to me. I love recursive-descent, and I for some strange
reason, I am even happier with character-oriented scanners.  They
always just “feel” right.  Maybe it's my 30+ years of C
programming :-)

I'm probably the only person in the world who might follow up on this
idea.  I'm the only one with the interest and the lack of pressing
work commitments to explore it...

The challenge
===========

By analogy with 2to3, I would like to create a set of tools that would
allow me to write parser-like tools at the character level.  I mention
2to3 because the language used to describe changes in 2to3 is, in
fact, a programming tool.  I doubt if an explicitly language-based
approach will work for me, and I am *certainly* not going to use
regular expressions, but the general idea isn't so dissimilar to 2to3:
we want to create a set of tools for manipulating source code.

Of course, we can't use any of 2to3's sources, because we are dealing
with C, not Python.  In particular, we don't have an easy way of
getting a C parser tree.  Of course, I don't care much, since I don't
plan to use a parse tree.  But I *do* want to do character-oriented
parsing.  The challenge is to create these admittedly-strange tools.

Aside about swig
=============

In the book, Swig Master Class, David Beazley laments that C
declarations are complex.  In some sense he is correct, but perhaps
not so correct in other senses.

Indeed C declarations (and definitions, for that matter), can be
parsed relatively easily in a recursive-descent manner.  I've done
it.  In fact, C declarations consist of a “head” and a “tail”.  The
head specifies a type, and the tail specifies an identifier.  There
are complications--the “head” actually wraps around the tail.  But
again, a recursive-descent parser can handle all this "easily".

The challenge in more detail
======================

1. At minimum, I want a way to reduce the horrors of mungeAllFunctions
and its helpers, including dedentBlocks.

2. Instead of focusing on “permanent” solutions that create dangling
indices, we need *temporary*, *context-dependent* solutions.

Of course, the present mess of indices could be called one such
“solution”, but it would be good to try for something more.

The essential point is that character-oriented tools will be “close”
to the problem.  In contrast, parse trees are surprisingly *far* from
the problem.

A glimmer of the solution
==================

My sense is that the *existing* code in c-to-python is almost good
enough.  I simply need to use the code more cleverly, or find
relatively simple ways of composing patterns.  Perhaps even better
naming conventions will help.

Enough for now.  I'm going to revisit the ugliest parsing code and see
if I can create some simpler patterns.  The tokenize code will be
inspiration and encouragement.  I'm looking for a big collapse in
complexity.  I've done it before, and I can probably do it here.

Edward






-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/leo-editor?hl=en.

EKR: character-oriented parsing

Reply via email to