Re: jhc vs ghc and the surprising result involving ghc generated assembly.

Jan-Willem Maessen Wed, 26 Oct 2005 09:24:18 -0700

Nice analysis. I indeed found with phc that shadow stack references absolutely killed performance, and I aggressively cached stack locations in locals, spilling to stack only when GC information needed to be accurate. [There was a giant infrastructure to save only live data to stack, but we won't go into that now as it was the source of almost all the codegen bugs...]


On Oct 26, 2005, at 5:43 AM, John Meacham wrote:

here is the C code that jhc generates. (As an aside, I am very proud of how readable the jhc generated C code is and how much of the structure of the original Haskell it preserves. it's a small thing, perhaps only implementors appreciate it, but I am glad I spent the time needed to do so.)

This makes a big difference. The phc compiler even put comments in the code so that I could figure out what came from where.

            v99 = fWXAXDfMainXDfac(v97, v98);
            return v99;
...
notice that besides being a bit verbose and using a tailcall,

I'm impressed that gcc found this. It's definitely living a bit dangerously, and your suggestions below for self tail call handling are the ones I found most effective. (They also allowed me to bypass some prologue garbage, since phc used a one-C-function-per-Haskell-function model with internal resumption points.) Non-self tail calls I was careful to compile to:

  return f(...);

I expect from the above that gcc does better at spotting tail calls now.

furthermore gotos and labels are very problematic for gcc to optimize around.
for various tiresome reasons gcc cannot perform (most) code motion
optimizations across explicit labels and gotos, especially when they deal with
the global register variables and memory stores and loads. ...

there are a couple of things we can do to mitigate these problems:

get rid of indirect jumps whenever possible.

use C control constructs rather than gotos.

"for" loop introduction would be especially nice, but a bit tricky in practice I fear (requiring a game of "spot the induction variable").

A couple simple rules seem to help greatly.
* turn anything of the form JMP_((W_)&self) where self is oneself into a goto
that gotos a label at the beginning of the function.


Or better yet, wrap the whole function in

do {
} while (1);

and replace "JMP_" by "continue". This avoids the troubles with goto which John mentioned above. It made a difference for phc, at least. Of course, if you can introduce loops elsewhere you might get yourself into trouble with this solution.
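A minimal sketch of the loop-wrapping idea in plain C, not actual GHC or phc output. The function name fac_loop and its accumulator argument are hypothetical, chosen only to mirror the factorial example under discussion:

```c
#include <assert.h>   /* only needed for the checks shown after the block */
#include <stdint.h>

/* Hypothetical accumulator-style factorial: fac n acc.
 * The self tail call "fac (n-1) (n*acc)" becomes a reassignment of the
 * parameters followed by `continue`, so the C compiler sees an ordinary
 * loop with no labels or gotos. */
uint64_t fac_loop(uint64_t n, uint64_t acc)
{
    do {
        if (n <= 1) {
            return acc;              /* base case leaves the loop */
        }
        /* Self tail call: compute the new arguments, then restart the body. */
        uint64_t next_acc = n * acc;
        n = n - 1;
        acc = next_acc;
        continue;                    /* stands in for JMP_((W_)&self) */
    } while (1);
}
```

For example, assert(fac_loop(5, 1) == 120) holds. The caveat about other loops is visible here too: if the body contained an inner for or while, a "continue" inside it would restart that inner loop rather than the function.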

* do simple pattern matching on the basic blocks to recognize where C control constructs can be placed.

for instance turn

if (x) { goto  y; }
blah..
baz..
JMP_(foo)

into

if (x) { goto  y; } else {
blah..
baz..
JMP_(foo)
}

extending the else to after the next jump or goto.


I'm surprised this actually helps, I must admit.

* getting stack dereferences out of your loops.

I recommend caching stack references in C locals where possible, but it's tricky to get this right if loop bodies include embedded function calls. Last I checked this wasn't an issue for GHC, since function calls were CPS-converted and only tight call-free loops ended up in a single function anyway.
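To make the caching idea concrete, here is a small sketch under stated assumptions: shadow_stack, sp, and both function names are hypothetical, and the slot layout is invented for illustration. The first version touches the stack slots on every iteration; the second keeps them in C locals and spills once after the call-free loop.

```c
#include <assert.h>   /* only needed for the checks shown after the block */

/* Hypothetical shadow stack; in real generated code the stack pointer
 * would typically be a global (possibly register) variable owned by
 * the runtime. */
long shadow_stack[16];
long *sp = shadow_stack;

/* Sum 1..n with the counter and the accumulator living in stack slots.
 * Every iteration loads and stores through sp. */
long sum_via_stack(long n)
{
    sp[0] = n;
    sp[1] = 0;
    while (sp[0] > 0) {
        sp[1] += sp[0];
        sp[0] -= 1;
    }
    return sp[1];
}

/* Same computation with both slots cached in locals. The loop body
 * contains no calls, so nothing can observe the stack while it runs;
 * the values are written back once, at the point where the stack
 * contents would need to be accurate again. */
long sum_cached(long n)
{
    long counter = n;
    long acc = 0;
    while (counter > 0) {
        acc += counter;
        counter -= 1;
    }
    sp[0] = counter;   /* spill after the loop */
    sp[1] = acc;
    return sp[1];
}
```

Both functions return 55 for an argument of 10. If the loop body contained a call that could trigger a collection, the locals would have to be spilled before that call and reloaded after it, which is the tricky part.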

in order to get rid of the unnecessary memory accesses, we need to either
1. fully convert it to use C control constructs, so gcc will do it for us.
(code motion and loop-invariant hoisting being inhibited again by gotos)

As I recall, the "right" solution here is to compute dominator trees, and coalesce functions which are only tail called from their dominator into a single function. Alas, I've forgotten where I saw this written up, but there are indeed papers on it. Because it takes a bunch of effort on the part of the implementor, it'd be nice to see if its benefits are quantified.
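A rough sketch of the dominator idea, not taken from any of the papers alluded to above. It uses the naive iterative dataflow formulation over a tail-call graph of at most 32 functions, assumes every function is reachable from the entry, and reads "only tail called from their dominator" as "every caller dominates the callee". All names (CallGraph, compute_dominators, and so on) are hypothetical.

```c
#include <assert.h>   /* only needed for the check shown after the block */
#include <stdint.h>

#define MAX_FUNCS 32

/* edge[i] is a bitmask of the functions that function i tail-calls. */
typedef struct {
    int n;
    uint32_t edge[MAX_FUNCS];
} CallGraph;

/* dom[v] becomes the bitmask of functions that dominate v.
 * Rule: dom[v] = {v} plus the intersection of dom[p] over all callers p. */
void compute_dominators(const CallGraph *g, int entry, uint32_t dom[MAX_FUNCS])
{
    uint32_t all = (g->n == 32) ? 0xFFFFFFFFu : ((1u << g->n) - 1u);
    for (int v = 0; v < g->n; v++) dom[v] = all;
    dom[entry] = 1u << entry;

    int changed = 1;
    while (changed) {                 /* iterate to a fixed point */
        changed = 0;
        for (int v = 0; v < g->n; v++) {
            if (v == entry) continue;
            uint32_t meet = all;
            for (int p = 0; p < g->n; p++)
                if (g->edge[p] & (1u << v)) meet &= dom[p];
            uint32_t next = meet | (1u << v);
            if (next != dom[v]) { dom[v] = next; changed = 1; }
        }
    }
}

/* 1 if f has at least one caller and every caller dominates f. */
int only_called_from_dominators(const CallGraph *g,
                                const uint32_t dom[MAX_FUNCS], int f)
{
    int has_caller = 0;
    for (int p = 0; p < g->n; p++) {
        if (!(g->edge[p] & (1u << f))) continue;
        has_caller = 1;
        if (!(dom[f] & (1u << p))) return 0;
    }
    return has_caller;
}

/* Example graph: 0 -> 1, 0 -> 2, 1 -> 1 (self call), 1 -> 3, 2 -> 3.
 * Returns a bitmask of the functions that are coalescing candidates. */
uint32_t demo_coalescable_mask(void)
{
    CallGraph g = { .n = 4 };
    g.edge[0] = (1u << 1) | (1u << 2);
    g.edge[1] = (1u << 1) | (1u << 3);
    g.edge[2] = (1u << 3);

    uint32_t dom[MAX_FUNCS];
    compute_dominators(&g, 0, dom);

    uint32_t mask = 0;
    for (int f = 0; f < g.n; f++)
        if (only_called_from_dominators(&g, dom, f)) mask |= 1u << f;
    return mask;
}
```

In the example graph, functions 1 and 2 are only entered from function 0 (or from themselves), so they could be folded into it; function 3 is reached from both 1 and 2, neither of which dominates it, so it stays separate. Accordingly demo_coalescable_mask() returns 0x6.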

These should be straightforward to implement in the C code generator. it also suggests we might want to try to use the native C calling convention on leaf nodes that deal with unboxed values (so we get register passing and return values for free) or taking groups of mutually recursive functions and turning them all into one function with explicit jumps between them.

Making sure things are marked "static" and occur in an appropriate dependency order helps a bit here. It might even be worth marking some stuff "inline" in the code generator, though that's shaky ground.

I actually considered making everything static and putting outwardly-visible functionality in an extern wrapper, effectively carrying worker-wrapper down to the C level.
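A small sketch of what worker-wrapper at the C level might look like. The BoxedInt type and both function names are hypothetical; a real code generator would use the runtime's own heap object layout.

```c
#include <assert.h>   /* only needed for the checks shown after the block */

/* Hypothetical boxed integer, standing in for a heap-allocated value. */
typedef struct {
    long value;
} BoxedInt;

/* Worker: static, so the C compiler knows every call site and is free
 * to inline it or pick its own calling convention. It works purely on
 * unboxed values passed and returned the native C way. */
static long fac_worker(long n, long acc)
{
    while (n > 1) {
        acc *= n;
        n -= 1;
    }
    return acc;
}

/* Wrapper: the only externally visible entry point. It unboxes the
 * argument, calls the worker, and reboxes the result. */
BoxedInt fac_entry(BoxedInt arg)
{
    BoxedInt result = { fac_worker(arg.value, 1) };
    return result;
}
```

For example, fac_entry((BoxedInt){5}).value evaluates to 120. Defining the worker before the wrapper also follows the dependency-order advice above, since the compiler has already seen the callee when it reaches the call.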

some random notes:
the 3x-7x factor was tested on an i386, on x86_64 the margin is much much
greater for reasons that are still unclear.

Does x86-64 use a register-based calling convention by default? If you compiled the i386 code using __regparm(2), would you see the same speed difference?


-Jan-Willem Maessen


_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
