RE: Optimizing the placement of heap checks (trac #1498)

Simon Peyton-Jones Wed, 17 Mar 2010 06:33:09 -0700

David

It would indeed be good to work on this.  Moreover, I know exactly how! Indeed 
a Cambridge undergrad, Krzysztof Wos, may do this very thing in his undergrad 
project.  If he has time, which is not certain.  I'm ccing him.


I've just dumped in some notes about the new back end status here
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/NewCodeGen

You'll see that things are in a tremendous state of flux at the moment, which 
is not very helpful for you.   It should settle down in a few months, esp after 
David Terei's internship.

Basically, you want to modify the new code generator, which converts STG to 
CmmZ/CmmGADT (see the above link for what that is).   The new pipeline makes it 
easy to add safe points wherever you want, which the old one does not.


Hmm.  I hate to respond like this, but it might make most sense to give us 2-3 
months to get the new codegen framework settled, and then pick up the cudgels.

Simon

From: [email protected] [mailto:[email protected]] On 
Behalf Of David Peixotto
Sent: 15 March 2010 18:36
To: [email protected]
Subject: Optimizing the placement of heap checks (trac #1498)

I have some questions after starting to look at implementing an optimization in 
the GHC back end related to Trac ticket #1498[1,2]. The goal of the 
optimization is to eliminate unnecessary heap checks in recursive functions.

For example, if we have a recursive function such as

f :: Int -> Int -> Maybe Int
f = \x y -> case x of
             0 ->   (Just y)
             _ -> f (x-1) y

we see that it only does an allocation in the base case of the recursion. GHC 
HEAD generates great looking core for this function (-ddump-stg):

Main.$wf =
    \r [ww_sSX w_sT1]
        case ww_sSX of ds_sSZ {
          __DEFAULT ->
              case -# [ds_sSZ 1] of sat_sT8 {
                __DEFAULT -> Main.$wf sat_sT8 w_sT1;
              };
          0 -> Data.Maybe.Just [w_sT1];
        };

With the corresponding Cmm code (-ddump-cmm):

Main_zdwf_entry()
<...info table elided...>
cT6:
    Hp = Hp + 16;
    if (Hp > HpLim) goto cT9;
    _sSS::I64 = R2;
    if (_sSS::I64 != 0) goto cTc;
    I64[Hp - 8] = base_DataziMaybe_Just_con_info;
    I64[Hp + 0] = R3;
    R1 = Hp - 6;
    jump (I64[Sp + 0]) ();
cT9:
    HpAlloc = 16;
    R1 = Main_zdwf_closure;
    jump stg_gc_fun ();
cTc:
    _sT1::I64 = _sSS::I64 - 1;
    R2 = _sT1::I64;
    Hp = Hp - 16;
    jump Main_zdwf_info ();
}

Now we have a nice looking loop, but the straightforward code generation done 
in the native back end causes a heap check on every iteration giving these 
extra overheads on each iteration:

  1) incrementing the heap pointer
  2) loading the heap limit
  3) testing the heap pointer against the limit
  4) decrementing the heap pointer

It was suggested in an earlier discussion[2] that we could get rid of this 
overhead by pushing the heap check down into the branch that actually does the 
allocation. It is noted that this is not a safe transformation in general 
because the heap check can only be done at a "safe point" for the GC.

Interestingly, the LLVM backend gets good code by moving the heap check in the 
opposite direction. It pulls the heap check up out of the loop that it 
generates for the tail-recursive call. It ends up with the three instruction 
loop that you would get if you were to write the loop by hand (decrement, test, 
branch) and does the allocation in the fall through case of the loop.

The LLVM generated code shows the advantage of having powerful low-level 
transformations in GHC. If the LLVM is invoked with -O instead of -O2, it still 
transforms the recursive call to a loop, but does not hoist the heap check out 
of the loop. The assembly codes from the two back ends are pasted at the end of 
the message for the interested.

Looking into this issue brings me to ask a few questions. Any feedback is 
welcomed.

1) Does it still make sense to pursue these optimizations in the native code 
generator?

Given the strong optimizations already in LLVM is it worth investing time in 
the GHC native backend? One real advantage I see with the native back end (and 
an area I would like to explore) is that it is easier to pass information from 
the high-level optimizer to the low-level optimizer when it all resides in the 
same compiler. This may allow certain optimizations to fire that would be very 
difficult to prove safe in the LLVM back end.

2) What is the definition of a safe point for the GC?

My understanding is that currently all heap checks are done on entry to a 
closure, but this could be relaxed as long as we can provide a safe 
continuation point where we can resume execution after GC. One idea that could 
help in the example above would be to outline the allocation points into 
separate functions at the cmm level so that f no longer allocates any data and 
the check is only performed in the new outlined function in the base case. It 
seems to me that there could be some interesting optimization opportunities for 
code motion of heap checks.

3) What is the status of the new code generator?

I've seen a number of wiki pages describing the new code generation 
pipeline[3,4], but it is a little difficult to tell the exact status of the 
implementation. Are people still actively working on the new code generator or 
has most of the work been completed?


Thanks!

-Dave

[1] http://hackage.haskell.org/trac/ghc/ticket/1498
[2] http://www.haskell.org/pipermail/cvs-ghc/2007-July/036496.html
[3] http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/NewCodeGen
[4] 
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/NewCodeGenPipeline

#
# LLVM ASSEMBLY (x86_64)
#
Main_zdwf_entry:                        # @Main_zdwf_entry
# BB#0:                                 # %cVo
          subq     $8, %rsp
          leaq      16(%r12), %rax
          cmpq   144(%r13), %rax
          ja         .LBB1_3
          jmp      .LBB1_1
          .align   16
.LBB1_4:                                # %cVG
                                        #   in Loop: Header=BB1_1 Depth=1
          decq    %r14
.LBB1_1:                                # %nVA
                                        # =>This Inner Loop Header: Depth=1
          testq    %r14, %r14
          jne       .LBB1_4
# BB#2:                                 # %nVH
          movq   $base_DataziMaybe_Just_con_info, 8(%r12)
          movq   %rsi, 16(%r12)
          movq   %r12, %rbx
          movq   %rax, %r12
          movq   (%rbp), %rcx
          addq    $10, %rbx
          movq   (%rcx), %r11
          addq    $8, %rsp
          jmpq    *%r11  # TAILCALL
.LBB1_3:                                # %cVz
          movq   $16, 184(%r13)
          movl    $Main_zdwf_closure, %ebx
          movq   %rax, %r12
          movq   -8(%r13), %r11
          addq    $8, %rsp
          jmpq    *%r11  # TAILCALL

#
# NATIVE BACK-END ASSEMBLY (x86_64)
#
Main_zdwf_info:
.LcTh:
          addq $16,%r12
          cmpq 144(%r13),%r12
          ja .LcTk
          movq %r14,%rax
          testq %rax,%rax
          jne .LcTn
          movq $base_DataziMaybe_Just_con_info,-8(%r12)
          movq %rsi,(%r12)
          leaq -6(%r12),%rbx
          jmp *(%rbp)
.LcTk:
          movq $16,184(%r13)
          movl $Main_zdwf_closure,%ebx
          jmp *-8(%r13)
.LcTn:
          leaq -1(%rax),%r14
          addq $-16,%r12
          jmp Main_zdwf_info

_______________________________________________
Cvs-ghc mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/cvs-ghc

RE: Optimizing the placement of heap checks (trac #1498)

Reply via email to