[Cython] A high-level discussion of the Cython internals

Dag Sverre Seljebotn Thu, 08 May 2008 10:15:12 -0700

This is in response to this statement by Robert:


-- 

I think this is good to talk about now. Basically, you want to change  
Cython to use the "visitor pattern" rather than the recursive pattern  
that it currently uses. This has pros and cons, but I think this  
could be a good thing, and you are certainly convinced that its  
needed to get started on the GSoC stuff.

--

So, leaving the patch and implementation strategies aside for now I'm 
going to attempt discussing this from the mountain top. (If a lot of 
this seems trivial, all the better -- it can then start to serve as 
documentation for a direction, rather than my "private" efforts :-). If 
not, it will be good to discuss it.)

I honestly believe that any time spent discussing and implementing this 
now can repay itself many times later on in saved developer hours and a 
lower learning curve for new developers. If people agree with me in how 
important this is, I think it would be good to perhaps spend some time 
at the beginning of dev1 talking and discussing this (with some slides 
etc.).

Here we go:

Assertion 1: Development of Cython has a potential for becoming stagnant 
(if it hasn't already) quickly.

My (subjective) felling is that the codebase has a somewhat high 
learning curve. If one wants to add a fundamentally new features (as 
opposed to quick fixes, which do currently happen very quickly -- but 
think type inference!) it looks like one has to be very careful to keep 
many different effects of one's changes in the head at the same time.


Assertion 2: The reason for this state can at least in some part be 
attributed to some confusion as to what the current code tree *is*.

I believe that the reason that the codebase can be practical in the 
current state is tightly connected with how similar C and CPython is 
with Python as a language. The design, so to speak, arose in an 
historical happy accident, because the input and output are almost 1:1 
related. A Python function and a C function loosely correspond, and so 
FuncDefNode can serve the role as both the parsed element and the 
element due for C serialization. However there is no reason to assume 
that the 1:1 correspondance holds everywhere; and I think there is a 
tendency that "ugly hacks" already tend to arise in situations where the 
correspondance in program structure diverges more. (To consider how this 
might become worse, think about a "method template node".)

If this was only a theoretically founded critisism one might simply 
dismiss it, but I believe the problems in assertion 1 comes from this 
very fact, and that they can be solved by dealing with this conceptual 
problem.


The proposed solution: Work towards "The ideal solution" below in tiny, 
incremental steps. All *new* major features are implemented with the 
ideal solution in mind, and old code refactored as far as is needed for 
that to happen, but one doesn't start rewriting working and bugfixed 
code just for the sake of abstract purity.


The ideal solution: A strict pipeline-based approach where the "data" 
(one or more trees, graphs or similar) is in different stages, with 
different phases transforming one into the other. I.e, always try to fit 
what one does into the following example scheme:

data stage:string of code -> phase:parsing ->
data stage:syntax tree -> phase:expand with statements ->
data stage:tree without with statements -> phase:create scopes ->
data stage:tree with scopes -> phase:analyse types ->
data stage:tree with ".type" attributes -> ...
...
data stage:tree closely related to *structure* of C source -> phase:output

Whenever something doesn't quit fit in the scheme of naive input/output, 
split it up into more phases until it does :-)

(Digression: Closely related to C structure above means, for instance, 
no inner functions. No with statements. No untyped variables. That kind 
of thing. Classes may be included though, since one can create a 1:1 
correspondance between C code using the CPython API and a class; so it 
is not about language "features" as such but some deeper structure.)

Note that all phases does not need to be visitors! For instance, the 
output-to-C phase will stay a recursive process for the foreseeable 
future. This is a way of thinking, not a recipe for implementation. (And 
indeed, this way of thinking is already followed to a degree -- but not 
everywhere, and we get problems at the spots where it is not.)

A very important side-effect of this is that it lends itself to 
"implementing complex Cython statements as more simple Cython 
statements". (Though ideally "more simple Cython statements" can also be 
"intermediate simpler instructions" that bridges the gap between Cython 
and C in a more fine-grained way in order to give more code reuse than 
there is today. But I digress.)

(Implementation notes: This does not necesarrily mean that the classes 
of the nodes changes between phases; one does whatever is most 
convenient. For instance, if one has T2 = expand_with_nodes(T1), then T1 
is allowed to keep WithStatementNode, T2 is however not, but otherwise 
the trees are the same. In the same way, if T1 = type_analysis(T0), then 
ExprNodes in T0 is not allowed to have the type attribute, while 
ExprNodes in T1 are required to have them. This can be a documentation 
matter or enforced (in a number of ways), that issue is better left for 
later.).


Final words: This might seem trivial. It is not. The reason is that 
currently, almost wherever you try to implement two simple phases f and 
g one ends up having the result of f depend on what is done in g for 
some nodes, and the result of g depend on f for other nodes. And so 
neither f(g(T)) nor g(f(T)) can be made to work.

The point is, this way of changing how one thinks mostly affects the 
compiler design as such, more than it is a specific style of coding. 
Visitors happen to be the most natural way to implement this, but are 
not the main point. I'd be happy (or, happier than I am today) with a 
cleanly defined phase-based recursive process.

-- 
Dag Sverre

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev

[Cython] A high-level discussion of the Cython internals

Reply via email to