Nice analysis. I indeed found with phc that shadow stack references
absolutely killed performance, and I aggressively cached stack
locations in locals, spilling to stack only when GC information
needed to be accurate. [There was a giant infrastructure to save
only live data to stack, but we won't go into that now as it was the
source of almost all the codegen bugs...]
On Oct 26, 2005, at 5:43 AM, John Meacham wrote:
here is the C code that jhc generates. (As an aside, I am very
proud of how
readable and how much structure the jhc generated C code preserves
of the
original haskell. it's a small thing, perhaps only implementors
appreciate it,
but I am glad I spent the time needed to do so.)
This makes a big difference. The phc compiler even put comments in
the code so that I could figure out what came from where.
v99 = fWXAXDfMainXDfac(v97, v98);
return v99;
...
notice that besides being a bit verbose and using a tailcall,
I'm impressed that gcc found this. It's definitely living a bit
dangerously, and your suggestions below for self tail call handling
are the ones I found most effective. (They also allowed me to bypass
some prologue garbage, since phc used a one-C-function-per-Haskell-
function model with internal resumption points.) Non-self tail calls
I was careful to compile to:
return f(...);
I expect from the above that gcc does better at spotting tail calls now.
furthermore gotos and labels are very problematic for gcc to
optimize around.
for various tiresome reasons gcc cannot perform (most) code motion
optimizations across explicit labels and gotos, especially when
they deal with
the global register variables and memory stores and loads. ...
there are a couple of things we can do to mitigate these problems:
get rid of indirect jumps whenever possible.
use C control constructs rather than gotos.
"for" loop introduction would be especially nice, but a bit tricky in
practice I fear (requiring a game of "spot the induction variable").
A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself
into a goto
that gotos a label at the beginning of the function.
Or better yet, wrap the whole function in
do {
} while (1);
and replace "JMP_" by "continue". This avoids the troubles with goto
which John mentioned above. It made a difference for phc, at least.
Of course, if you can introduce loops elsewhere you might get
yourself into trouble with this solution.
* do simple pattern matthing on the basic blocks to recognize where
C control
constructs can be placed.
for instance turn
if (x) { goto y; }
blah..
baz..
JMP_(foo)
into
if (x) { goto y; } else {
blah..
baz..
JMP_(foo)
}
extending the else to after the next jump or goto.
I'm surprised this actually helps, I must admit.
* getting stack dereferences out of your loops.
I recommend caching stack references in C locals where possible, but
it's tricky to get this right if loop bodies include embedded
function calls. Last I checked this wasn't an issue for GHC, since
function calls were CPS-converted and only tight call-free loops
ended up in a single function anyway.
in order to get rid of the unessesary memory accesess, we need to
either
1. fully convert it to use C control constructs, so gcc will do it
for us.
(code motion and loop invarient being inhibited again by gotos)
As I recall, the "right" solution here is to compute dominator trees,
and coalesce functions which are only tail called from their
dominator into a single function. Alas, I've forgotten where I saw
this written up, but there are indeed papers on it. Because it takes
a bunch of effort on the part of the implementor, it'd be nice to see
if its benefits are quantified.
These should be straightforward to implement in the C code
generator. it also
suggests we might want to try to use the native C calling
convention on leaf
nodes that deal with unboxed values (so we get register passing and
return
values for free) or taking groups of mutually recursive functions
and turning
them all into one function with explicit jumps between them.
Making sure things are marked "static" and occur in an appropriate
dependency order helps a bit here. It might even be worth marking
some stuff "inline" in the code generator, though that's shaky ground.
I actually considered making everything static and putting outwardly-
visible functionality in an extern wrapper---effectively carrying
worker-wrapper down to the C level.
some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is
much much
greater for reasons that are still unclear.
Does x86-64 use a register-based calling convention by default? If
you compiled the i386 code using __regparm(2), would you see the same
speed difference?
-Jan-Willem Maessen
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users