Nice analysis. I indeed found with phc that shadow stack references absolutely killed performance, and I aggressively cached stack locations in locals, spilling to stack only when GC information needed to be accurate. [There was a giant infrastructure to save only live data to stack, but we won't go into that now as it was the source of almost all the codegen bugs...]

On Oct 26, 2005, at 5:43 AM, John Meacham wrote:

here is the C code that jhc generates. (As an aside, I am very proud of how readable and how much structure the jhc generated C code preserves of the original haskell. it's a small thing, perhaps only implementors appreciate it,
but I am glad I spent the time needed to do so.)


This makes a big difference. The phc compiler even put comments in the code so that I could figure out what came from where.


            v99 = fWXAXDfMainXDfac(v97, v98);
            return v99;
...
notice that besides being a bit verbose and using a tailcall,


I'm impressed that gcc found this. It's definitely living a bit dangerously, and your suggestions below for self tail call handling are the ones I found most effective. (They also allowed me to bypass some prologue garbage, since phc used a one-C-function-per-Haskell- function model with internal resumption points.) Non-self tail calls I was careful to compile to:
  return f(...);
I expect from the above that gcc does better at spotting tail calls now.


furthermore gotos and labels are very problematic for gcc to optimize around.
for various tiresome reasons gcc cannot perform (most) code motion
optimizations across explicit labels and gotos, especially when they deal with
the global register variables and memory stores and loads. ...

there are a couple of things we can do to mitigate these problems:

get rid of indirect jumps whenever possible.

use C control constructs rather than gotos.


"for" loop introduction would be especially nice, but a bit tricky in practice I fear (requiring a game of "spot the induction variable").


A couple simple rules seem to help greatly.

* turn anything of the form JMP_((W_)&self) where self is oneself into a goto
that gotos a label at the beginning of the function.


Or better yet, wrap the whole function in

do {
} while (1);

and replace "JMP_" by "continue". This avoids the troubles with goto which John mentioned above. It made a difference for phc, at least. Of course, if you can introduce loops elsewhere you might get yourself into trouble with this solution.


* do simple pattern matthing on the basic blocks to recognize where C control
constructs can be placed.

for instance turn

if (x) { goto  y; }
blah..
baz..
JMP_(foo)

into

if (x) { goto  y; } else {
blah..
baz..
JMP_(foo)
}

extending the else to after the next jump or goto.


I'm surprised this actually helps, I must admit.


* getting stack dereferences out of your loops.


I recommend caching stack references in C locals where possible, but it's tricky to get this right if loop bodies include embedded function calls. Last I checked this wasn't an issue for GHC, since function calls were CPS-converted and only tight call-free loops ended up in a single function anyway.


in order to get rid of the unessesary memory accesess, we need to either

1. fully convert it to use C control constructs, so gcc will do it for us.
(code motion and loop invarient being inhibited again by gotos)


As I recall, the "right" solution here is to compute dominator trees, and coalesce functions which are only tail called from their dominator into a single function. Alas, I've forgotten where I saw this written up, but there are indeed papers on it. Because it takes a bunch of effort on the part of the implementor, it'd be nice to see if its benefits are quantified.


These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning
them all into one function with explicit jumps between them.


Making sure things are marked "static" and occur in an appropriate dependency order helps a bit here. It might even be worth marking some stuff "inline" in the code generator, though that's shaky ground.

I actually considered making everything static and putting outwardly- visible functionality in an extern wrapper---effectively carrying worker-wrapper down to the C level.


some random notes:

the 3x-7x factor was tested on an i386, on x86_64 the margin is much much
greater for reasons that are still unclear.


Does x86-64 use a register-based calling convention by default? If you compiled the i386 code using __regparm(2), would you see the same speed difference?

-Jan-Willem Maessen


_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Reply via email to