Some notes I made while investigating Python internals. May
be interesting for other languages. I'll also write up a bit
of a comparison of Perl and Python's Unicode database handling;
I'm getting to the idea that I like Python's version better. :)
Simon
----- Forwarded message from Simon Cozens <[EMAIL PROTECTED]> -----
From: Simon Cozens <[EMAIL PROTECTED]>
Subject: Python guts
X-Operating-System: Linux riot-act 2.4.4
X-Addresses: The [EMAIL PROTECTED] address is deprecated due to being broken.
[EMAIL PROTECTED] still works, but simon-cozens.org or netthink.co.uk are preferred.
OK. I started my trek into the sources by trying to find a way of doing
essentially the same as the B module - getting access to the op tree.
However, Python Does Things Differently; there isn't a tree as such, but the
code is compiled down to a relatively low level language, in the style of the
Dragon Book. For instance, they use jumps and backpatching to do conditionals.
Python compiles directly from a parse tree - which is very literally that,
how the code is parsed - straight into this low level machine code. For
instance, here's what a simple documentation string, a literal string constant
looks like:
DOCSTRING_STMT_PATTERN = (
symbol.stmt,
(symbol.simple_stmt,
(symbol.small_stmt,
(symbol.expr_stmt,
(symbol.testlist,
(symbol.test,
(symbol.and_test,
(symbol.not_test,
(symbol.comparison,
(symbol.expr,
(symbol.xor_expr,
(symbol.and_expr,
(symbol.shift_expr,
(symbol.arith_expr,
(symbol.term,
(symbol.factor,
(symbol.power,
(symbol.atom,
(token.STRING, ['docstring'])
)))))))))))))))),
(token.NEWLINE, '')
))
You can find this out from the "parser" and "symbol" modules. The parser
module gives you access to the parse tree of a program.
A note on the parser: the parser is generated from a grammar in
Grammar/Grammar. This is an EBNF of the Python language. When Python is built,
it first compiles a program called pgen in the Parser directory. This is a
parser generator, which reads in the EBNF and outputs a parser to
Grammar/graminit.c. How does pgen read the EBNF? It parses it with a
metaparser, by using the metagrammar in Parser/metagrammar.c! This
theoretically means it's easy to extend Python's grammar without getting your
hands all that dirty.
The main loop of the parser is PyParser_AddToken in Parser/parser.c; this
routine accepts a token and decides whether or not to shift it or accept it,
based on the rules in the grammar. Tokens are produced by PyTokenizer_Get in
Parser/tokenizer.c. (the equivalent of our toke.c) Python's tokenization is
simpler than ours as a lot of things are handled by the parser, and
everything's a NAME. (There's a python-based tokeniser in tokenise.py; I wish
there were a python-based parser and compiler, but there doesn't seem to be.)
Parser/parsetok.c handles the parser/tokeniser linkage.
Once Python's parsed a file, it sends the nodes of the parse tree to com_node
in compile.c; this is essentially the same as the actions in perly.y, and
produces the bytecode stream by calling out to a bunch of different functions
which emit bytecode. com_push and com_pop took me a while to work out but
they're simply marker functions which tell you how many things are on the
stack at the current time. The bytecode stream is really, really not
a tree.
Optimization? Hm. Nothing much is done. optimize() in Python/compile.c only
takes care of replacing local variable names with indexes into, essentially,
registers. And that's all.
Now we've got some code, which is represented by a PyCode object. These are
CVs, they really are - they represent subroutines, packages and the main code.
Then comes execution. This is handled by Python/ceval.c, and
specifically eval_code2. It first loads in any arguments specified by the
PyCode object, and then essentially goes into a big loop as you'd
expect. Operations are read sequentially from the bytecode and dispatched
by a switch statement. The bytecode can also encode arguments to the
operation, constants to be used, and so on. Here's the operation which
does a binary multiply: it should be relatively obvious:
case BINARY_MULTIPLY:
w = POP();
v = POP();
x = PyNumber_Multiply(v, w);
Py_DECREF(v);
Py_DECREF(w);
PUSH(x);
if (x != NULL) continue;
break;
There are a surprising number of low-level stack manipulation functions,
which are mainly used for doing clever things with slices and assignment.
Python only has 143 ops, but some of them are heavily overloaded - there's
a single comparison op, for instance, and a single BINARY_ADD which does
strings and numbers. (Although I can't see how - how strange...)
On the whole, the implementation is very straightforward, as you can
probably tell from the disassembler I just sent you - there's also a
disassembly module in pure Python that you ought to play with called
dis.py
More as I find it out.
Simon
----- End forwarded message -----