Author: Armin Rigo <[email protected]>
Branch: extradoc
Changeset: r5625:eb069da8950f
Date: 2016-04-02 19:18 +0200
http://bitbucket.org/pypy/extradoc/changeset/eb069da8950f/

Log:    Add the current status of the piratepad

diff --git a/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt 
b/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt
new file mode 100644
--- /dev/null
+++ b/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt
@@ -0,0 +1,119 @@
+PyPy's assembler backend
+
+Input: a linear sequence of instructions, called a "trace".
+
+A trace is a sequence of instructions in SSA form.  Most instructions correspond to one or a few CPU-level instructions.  There are a few meta-instructions like `label`, plus some instructions used only for debugging.  All branching is done with guards: instructions that check that a condition is true and exit the trace if it is not.  A failing guard can later have a new trace attached to it, called a "bridge".  The patched guard then becomes a direct `Jcond` instruction going to the bridge, with no indirection and no register spilling.
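+
+A schematic example (the syntax here is made up, but close in spirit to PyPy's trace logs): a small counting loop, where `[i1]` lists the values that are still alive if the guard fails:
+
+    label(i0)
+    i1 = int_add(i0, 1)
+    i2 = int_lt(i1, 1000)
+    guard_true(i2) [i1]      # exit if i2 is false; a bridge can attach here
+    jump(i1)                 # back to the label above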
+
+A trace ends with either a `return` or a `jump to label`.  The target label is either inside the same trace, or in some older one.  For historical reasons we call a "loop" a trace that is not a bridge.  The machine code that we generate is organized as a forest of trees: the trunk of each tree is a "loop", and the branches are all bridges (branching off the trunk or off another branch).
+
+* every trunk or branch that ends in a `jump to label` can target a label from 
a different tree, too.
+
+* the whole process of assembling a loop or a branch is essentially single-threaded, so there are no synchronization issues (including when patching previously generated instructions).
+
+* the generated assembler keeps a "frame" pointer in %rbp, which actually points not to the stack at all but to a GC object (called a "jitframe").  Spilled values go there.
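+
+A minimal sketch of the idea in Python (illustrative field names, not PyPy's real jitframe layout):
+
+    class JitFrame:                      # a GC object; %rbp points to it
+        def __init__(self, depth):
+            self.jf_descr = None         # filled in on guard failure
+            self.jf_slots = [None] * depth   # spill slots, traced by the GC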
+
+* the guards are `Jcond` instructions to a very small piece of generated code, which basically pushes a couple of constants on the stack and then jumps to the general guard-recovery code.  That code saves the registers into the jitframe and then exits the whole generated function.  The caller of that generated function checks how it finished: if it finished by hitting a guard, then the caller is responsible for calling the "blackhole interpreter", the part of the front-end that recovers from failing guards and finishes running the frame (including, possibly, by jumping again into generated assembler).
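+
+Schematically, a guard and its stub look like this (pseudo-assembler; the labels and constants are made up):
+
+        CMP RAX, RBX
+        JNE guard_stub_5          ; the guard itself: a single Jcond
+        ...
+    guard_stub_5:
+        PUSH <constant identifying guard 5>
+        JMP  guard_recovery_code  ; saves registers into the jitframe, exits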
+
+
+Details about the JITting process:
+
+* front-end and optimization pass
+* rewrite (includes GC-related transformations as well as simplifications)
+* assembler generation
+
+
+# Front-end and optimization pass
+
+Not discussed here in detail.  This produces loops and bridges using an instruction set that is "high-level" in some sense: it contains instructions like "new"/"new_array", and "setfield"/"setarrayitem"/"setinteriorfield", which describe the action of storing a value in a precise field of a structure or array.  For example, the "setfield" action might implicitly require a GC write barrier.  This is the high-level trace that we send to the following step.
+
+
+# Rewrite
+
+A mostly (but not completely) CPU-independent phase that lowers some instructions.  For example, the variants of "new" are lowered to a "malloc" followed by a few "gc_store" operations: the malloc bumps the GC's allocation pointer, and the stores explicitly initialize a few fields of the newly allocated structure.  A "setfield" is replaced with a "cond_gc_wb_call" (a conditional call to the write barrier) if needed, followed by a "gc_store".
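+
+Schematically (the exact operand layout is made up here):
+
+    # before the rewrite step:
+    p0 = new(descr=SomeStruct)
+    setfield_gc(p1, p0, descr=some_field)
+
+    # after the rewrite step:
+    p0 = malloc(24)                  # bump the GC's allocation pointer
+    gc_store(p0, 0, <type id>, 8)    # initialize the header explicitly
+    cond_gc_wb_call(p1)              # write barrier on p1, only if needed
+    gc_store(p1, 16, p0, 8)          # store p0 at offset 16, 8 bytes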
+
+The "gc_store" instruction can be encoded in a single MOV assembler instruction, but it is not as flexible as a MOV: the address is always specified as "some GC pointer + an offset".  We don't have the notion of interior pointers for GC objects.
+
+A different instruction, "gc_store_indexed", offers additional operands, which 
can be mapped to a single MOV instruction using forms like `[rax+8*rcx+24]`.
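+
+For example (schematic; assuming p0 ends up in RAX, the index i1 in RCX and the value i2 in RDX):
+
+    gc_store_indexed(p0, i1, i2, scale=8, base_ofs=24, size=8)
+    =>  MOV [RAX + 8*RCX + 24], RDX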
+
+Some other complex instructions pass through to the backend, which must deal 
with them: for example, "card marking" in the GC.  (Writing an object pointer 
inside an array would require walking the whole array later to find "young" 
references. Instead of that, we flip a bit for every range of 128 entries.  
This is a common GC optimization.)  Setting the card bit of a GC object 
requires a sequence of assembler instructions that depends too much on the 
target CPU to be expressed explicitly here (moreover, it contains a few 
branches, which are hard to express at this level).
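+
+A minimal sketch of the card-marking arithmetic in Python, assuming one card bit per range of 128 array entries (the names are illustrative, not PyPy's):
+
+    CARD_SIZE = 128
+
+    def mark_card(card_bits, array_index):
+        # card_bits is a bytearray with one bit per 128-entry range
+        card = array_index // CARD_SIZE
+        card_bits[card >> 3] |= 1 << (card & 7)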
+
+
+# Assembly
+
+No fancy code generation techniques: a greedy forward pass that tries to avoid some pitfalls.
+
+
+## Handling instructions
+
+* One by one (forward direction).  Each instruction asks the register allocator to ensure that some arguments are in registers (not in the jitframe); asks for a register to put its result into; and asks for additional scratch registers that will be freed at the end of the instruction.  There is a special case for boolean variables: they are stored in the condition code flags instead of being materialized as a 0/1 value.  (They are materialized later, except in the common case where they are only used by the next `guard_false` or `guard_true` and then forgotten.)
+
+* Instruction arguments are loaded into a register on demand.  This makes the backend quite easy to write, but leads to some bad decisions.  (See the sketch after this list.)
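+
+A rough sketch of the per-instruction protocol (illustrative method names, not the backend's real API):
+
+    def assemble_trace(trace, regalloc, assembler):
+        for op in trace:
+            # load arguments into registers on demand
+            args = [regalloc.make_sure_var_in_reg(a) for a in op.args]
+            res = regalloc.force_allocate_reg(op.result)
+            tmp = regalloc.get_scratch_reg()
+            assembler.emit(op, args, res, tmp)
+            regalloc.free_scratch_regs()   # scratch registers die here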
+
+
+## Linear scan register allocation
+
+Although it's always a linear trace that we consider, we don't use advanced techniques for register allocation: we do forward, on-demand allocation as the backend produces the assembler.  When the backend asks for a register to put some value into, we give it any free register, without considering what will be done with it later.  We compute the longevity of all variables, but only use it when choosing which register to spill (we spill the variable with the longest remaining longevity).
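+
+The spill decision itself is simple (a sketch; assume `longevity[v]` holds the position of the last use of v):
+
+    def choose_spill_var(vars_in_regs, longevity):
+        # spill the variable whose last use is farthest away
+        return max(vars_in_regs, key=lambda v: longevity[v])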
+
+This works to some extent because it is well integrated with the earlier optimization pass.  Loops are unrolled once by the optimization pass to allow more powerful optimizations---the optimization pass itself is the place that benefits the most, but it also has benefits here in the assembly pass.  These are:
+
+* The first peeled iteration initializes the register bindings on first use.
+* The trace loop proper therefore starts with its registers already allocated.
+* The same holds for the registers when exiting through bridges.
+
+[Try to better allocate registers to match the ABI (minor to no benefit in the current state)]
+
+
+## More complex mappings
+
+Some instructions generate more complex code.  These fall into one or both of the following categories:
+
+* complex instructions generating some local control flow, like "cond_gc_wb_call" (for write barriers) or "call_assembler" (a call followed by a few checks).
+
+* instructions that invoke custom assembler helpers, like the slow-path of 
write barriers or the slow-path of allocations.  These slow-paths are typically 
generated too, so that we are not constrained by the usual calling conventions.
+
+
+## GC pointers
+
+Around most CALL instructions, we need to record a description of where the GC pointers are (in which registers and stack frame locations).  This is needed in case the CALL invokes a garbage collection.  The GC pointers can move; the positions in the registers and stack frame are then fixed up by the GC.  That is one reason why we don't have explicit interior pointers.
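+
+A sketch of the idea (illustrative; the real encoding is a compact bitmap stored next to the CALL):
+
+    def build_gcmap(live_vars, location_of):
+        gcmap = 0
+        for v in live_vars:
+            if v.is_gc_pointer:
+                gcmap |= 1 << location_of[v]   # one bit per reg/frame slot
+        return gcmap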
+
+GC pointers can appear as constants in the trace.  We are busy changing that 
to use a constant table and MOV REG, (%RIP+offset).  The "constant" table can 
actually change if the GC objects move.
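+
+Schematically (the slot name and offset are made up):
+
+    MOV RAX, [RIP + constant_table_slot_42]
+    ; the GC updates the table slot in place if the object moves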
+
+
+## Vectorization
+
+An optimization developed to use SIMD instructions for trace loops.  The primary idea was to use it to optimize micro numpy.  It runs several passes on the already-optimized trace.
+
+In short: it builds the dependencies for an unrolled trace loop, gathers pairs/packs of operations that could be executed in parallel, and finally schedules the operations.
+
+What did it add to the code base:
+
+* Dependencies can be constructed
+* Code motion of guards to relax dependencies
+* Scheduler to reorder trace
+* Array bound check removal (especially for unrolled traces)
+
+What can it do:
+
+* Transform vector loops (element-wise operations); see the sketch below
+* Accumulation (reduce([...], operator, 0)); requires the operation to be associative and commutative
+* SSE 4.1 as the "vector backend"
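+
+For example, the kind of Python-level loops it targets (after tracing and unrolling, pairs of `float_add` trace operations can become single SSE instructions):
+
+    def add_arrays(a, b, out, n):
+        for i in range(n):
+            out[i] = a[i] + b[i]   # element-wise: vectorizable
+
+    def total(a, n):
+        acc = 0.0
+        for i in range(n):
+            acc += a[i]            # accumulation: treats + as associative
+        return acc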
+
+## We do not
+
+* Keep tracing data around to reoptimize the trace tree. (Once a trace is 
compiled, minimal data is kept.)  This is one reason (there are others in the 
front-end) for the following result: JIT-compiling a small loop with two common 
paths ends up as one "loop" and one bridge assembled, and the bridge-following 
path is slightly less efficient.  This is notably because this bridge is 
assembled with two constraints: the input registers are fixed (from the guard), 
and the output registers are fixed (from the jump target); usually these two 
sets of fixed registers are different, and copying around is needed.
+
+* Join trace tails: we only assemble *trees*.
+
+* Do any reordering (neither of trace instructions nor of individual assembler instructions).
+
+* Do any cross-instruction optimization that makes sense only for the backend and can't easily be expressed at a higher level.  There are surely tons of examples; e.g., keeping a large constant in a register across several instructions, or moving *parts* of an instruction (like the address calculation) out of loops.
+
+* Other optimization opportunities I can think of: look at the function prologue/epilogue; look at the overhead (small but not zero) at the start of a bridge.  Also check whether the way guards are implemented makes sense.  Also, we generate large-ish sequences of assembler instructions with tons of `Jcond` that are almost never followed; is there any optimization opportunity there?  (They all go forward, if that changes anything.)  In theory we could also replace some of these with a signal handler on segfault (e.g. `guard_nonnull_class`).
+
+
+# a GCC or LLVM backend?
+
+At least for comparison, we'd like a JIT backend that emits its code using GCC or LLVM (irrespective of the time it would take).  But it's hard to map the guards reasonably well to the C language or to LLVM IR.  The problems are: (1) we have many guards, and we would like to avoid having many paths that each do a full save of all local variables that are still alive; (2) it's hard to patch a guard when a bridge is compiled from it; (3) instructions like a CALL need to expose the local variables that are GC pointers, and CALL_MAY_FORCE needs to expose *all* local variables for optional off-line reconstruction of the interpreter state.
+