[[email protected]: Python guts]

Simon Cozens Wed, 08 Aug 2001 16:27:36 -0700
Some notes I made while investigating Python internals. May
be interesting for other languages. I'll also write up a bit
of a comparison of Perl and Python's Unicode database handling;
I'm getting to the idea that I like Python's version better. :)

Simon

----- Forwarded message from Simon Cozens <[EMAIL PROTECTED]> -----

From: Simon Cozens <[EMAIL PROTECTED]>
Subject: Python guts
X-Operating-System: Linux riot-act 2.4.4
X-Addresses: The [EMAIL PROTECTED] address is deprecated due to being broken. 
[EMAIL PROTECTED] still works, but simon-cozens.org or netthink.co.uk are preferred.

OK. I started my trek into the sources by trying to find a way of doing
essentially the same as the B module - getting access to the op tree. 

However, Python Does Things Differently; there isn't a tree as such, but the
code is compiled down to a relatively low level language, in the style of the
Dragon Book. For instance, they use jumps and backpatching to do conditionals. 

Python compiles directly from a parse tree - which is very literally that,
how the code is parsed - straight into this low level machine code. For
instance, here's what a simple documentation string, a literal string constant
looks like:

DOCSTRING_STMT_PATTERN = (
    symbol.stmt,
    (symbol.simple_stmt,
     (symbol.small_stmt,
      (symbol.expr_stmt,
       (symbol.testlist,
        (symbol.test,
         (symbol.and_test,
          (symbol.not_test,
           (symbol.comparison,
            (symbol.expr,
             (symbol.xor_expr,
              (symbol.and_expr,
               (symbol.shift_expr,
                (symbol.arith_expr,
                 (symbol.term,
                  (symbol.factor,
                   (symbol.power,
                    (symbol.atom,
                     (token.STRING, ['docstring'])
                     )))))))))))))))),
     (token.NEWLINE, '')
     ))

You can find this out from the "parser" and "symbol" modules. The parser
module gives you access to the parse tree of a program. 

A note on the parser: the parser is generated from a grammar in
Grammar/Grammar. This is an EBNF of the Python language. When Python is built,
it first compiles a program called pgen in the Parser directory. This is a
parser generator, which reads in the EBNF and outputs a parser to
Grammar/graminit.c. How does pgen read the EBNF? It parses it with a
metaparser, by using the metagrammar in Parser/metagrammar.c! This
theoretically means it's easy to extend Python's grammar without getting your
hands all that dirty.

The main loop of the parser is PyParser_AddToken in Parser/parser.c; this 
routine accepts a token and decides whether or not to shift it or accept it,
based on the rules in the grammar. Tokens are produced by PyTokenizer_Get in 
Parser/tokenizer.c. (the equivalent of our toke.c) Python's tokenization is
simpler than ours as a lot of things are handled by the parser, and
everything's a NAME. (There's a python-based tokeniser in tokenise.py; I wish
there were a python-based parser and compiler, but there doesn't seem to be.) 
Parser/parsetok.c handles the parser/tokeniser linkage.

Once Python's parsed a file, it sends the nodes of the parse tree to com_node
in compile.c; this is essentially the same as the actions in perly.y, and
produces the bytecode stream by calling out to a bunch of different functions
which emit bytecode. com_push and com_pop took me a while to work out but
they're simply marker functions which tell you how many things are on the
stack at the current time. The bytecode stream is really, really not
a tree.

Optimization? Hm. Nothing much is done. optimize() in Python/compile.c only
takes care of replacing local variable names with indexes into, essentially,
registers. And that's all.

Now we've got some code, which is represented by a PyCode object. These are
CVs, they really are - they represent subroutines, packages and the main code.

Then comes execution. This is handled by Python/ceval.c, and 
specifically eval_code2. It first loads in any arguments specified by the
PyCode object, and then essentially goes into a big loop as you'd
expect. Operations are read sequentially from the bytecode and dispatched
by a switch statement. The bytecode can also encode arguments to the 
operation, constants to be used, and so on. Here's the operation which
does a binary multiply: it should be relatively obvious:

        case BINARY_MULTIPLY:
            w = POP();
            v = POP();
            x = PyNumber_Multiply(v, w);
            Py_DECREF(v);
            Py_DECREF(w);
            PUSH(x);
            if (x != NULL) continue;
            break;

There are a surprising number of low-level stack manipulation functions,
which are mainly used for doing clever things with slices and assignment.
Python only has 143 ops, but some of them are heavily overloaded - there's
a single comparison op, for instance, and a single BINARY_ADD which does
strings and numbers. (Although I can't see how - how strange...)

On the whole, the implementation is very straightforward, as you can
probably tell from the disassembler I just sent you - there's also a
disassembly module in pure Python that you ought to play with called
dis.py

More as I find it out.

Simon

----- End forwarded message -----
[[email protected]: Python guts]

Reply via email to