Re: Optimize away immediately-called delegate literals?

H. S. Teoh Mon, 12 Mar 2012 17:26:47 -0700

On Tue, Mar 13, 2012 at 12:15:02AM +0100, Peter Alexander wrote:
> On Sunday, 11 March 2012 at 06:49:27 UTC, H. S. Teoh wrote:
> >On Sun, Mar 11, 2012 at 01:29:01AM -0500, Nick Sabalausky wrote:
> >>Suppose you have a delegate literal and immediately call it:
> >>
> >>auto a = x + (){ doStuff(); return y; }() + z;
> >>
> >>Does DMD ever (or always?) optimize away a delegate if it's executed
> >>immediately and never stored into a variable? If not, can it, and
> >>would it be a simple change? Is something like this already on the
> >>table?
> >[...]
> >
> >I've always wondered about whether delegates passed to opApply ever
> >get inlined.
> 
> Don't wonder. Find out!
> 
> import std.stdio;
> void doStuff() { writeln("Howdy!"); }
> void main() {
>     int x = 1, y = 2, z = 3;
>     auto a = x + (){ doStuff(); return y; }() + z;
>     writeln(a);
> }
> 
> $ dmd test.d -O -release -inline
> 
> __Dmain:
> 000000010000106c      pushq   %rbp
> 000000010000106d      movq    %rsp,%rbp
> 0000000100001070      pushq   %rax
> 0000000100001071      pushq   %rbx
> 0000000100001072      movq    $0x000000000000000c,%rdi
> 000000010000107c      callq   0x1000237f0     ; symbol stub for: __d_allocmemory
> 0000000100001081      movq    %rax,%rbx
> 0000000100001084      movq    $0x00000000,(%rbx)
> 000000010000108b      movl    $0x00000002,0x08(%rbx)
> 0000000100001092      movq    %rbx,%rdi
> 0000000100001095      call    *0x0002318c(%rip)
> 000000010000109c      leal    0x04(%rax),%edx
> 000000010000109f      movl    $0x0000000a,%esi
> 00000001000010a4      leaq    0x00033eed(%rip),%rdi
> 00000001000010ab      callq   0x10002319c     ; symbol stub for: _D3std5stdio4File14__T5writeTiTaZ5writeMFiaZv
> 00000001000010b0      xorl    %eax,%eax
> 00000001000010b2      popq    %rbx
> 00000001000010b3      movq    %rbp,%rsp
> 00000001000010b6      popq    %rbp
> 00000001000010b7      ret
> 
> In short: no! It doesn't currently inline in this case.
> 
> Even if the lambda just returns a constant, it doesn't get inlined.


Hmph.

I tried this code:

        import std.stdio;
        struct A {
                int[] data;
                int opApply(int delegate(ref int) dg) {
                        foreach (d; data) {
                                if (dg(d)) return 1;
                        }
                        return 0;
                }
        }
        void main() {
                A a;
                int n = 0;

                a.data = [1,2,3,4,5];
                foreach (d; a) {
                        n++;
                }
        }

With both dmd and gdc, the delegate is never inlined. :-(  Compiling
with gdc -O3 causes opApply to get inlined and loop-unrolled, but the
call to the delegate is still there. With dmd -O, even opApply is not
inlined, and the code is generally much longer per loop iteration than
gdc -O3.
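
For context on why there is a delegate call at all: a foreach over a type
that defines opApply is rewritten into a call to opApply, with the loop
body wrapped in a delegate. Below is a minimal sketch of that rewrite,
not the compiler's literal output. The helper names countViaForeach and
countViaDelegate are hypothetical, introduced only for this illustration.

```d
struct A {
    int[] data;
    // Same opApply as in the test program above.
    int opApply(int delegate(ref int) dg) {
        foreach (d; data) {
            if (dg(d)) return 1;
        }
        return 0;
    }
}

// The foreach form, as written in the test program.
int countViaForeach(int[] data) {
    A a;
    a.data = data;
    int n = 0;
    foreach (d; a) {
        n++;
    }
    return n;
}

// Roughly what the foreach is rewritten to: the loop body becomes a
// delegate literal that closes over n. Returning 0 means "keep going";
// a nonzero return would signal a break or return out of the loop.
int countViaDelegate(int[] data) {
    A a;
    a.data = data;
    int n = 0;
    a.opApply((ref int d) { n++; return 0; });
    return n;
}

void main() {
    // Both forms visit every element once.
    assert(countViaForeach([1, 2, 3, 4, 5]) == 5);
    assert(countViaDelegate([1, 2, 3, 4, 5]) == 5);
}
```

So for the n++ to be inlined, the optimizer has to see through both the
opApply call and the indirect call through dg.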

Here's the code generated by gdc for the foreach() loop in main():

  404839:       48 89 c3                mov    %rax,%rbx
  40483c:       48 8b 04 24             mov    (%rsp),%rax
  404840:       48 8d 74 24 3c          lea    0x3c(%rsp),%rsi
  404845:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
  40484a:       48 89 03                mov    %rax,(%rbx)
  40484d:       48 8b 44 24 08          mov    0x8(%rsp),%rax
  404852:       48 89 43 08             mov    %rax,0x8(%rbx)
  404856:       8b 44 24 10             mov    0x10(%rsp),%eax
  40485a:       89 43 10                mov    %eax,0x10(%rbx)
  40485d:       8b 03                   mov    (%rbx),%eax
  40485f:       89 44 24 3c             mov    %eax,0x3c(%rsp)
  404863:       ff d5                   callq  *%rbp
  404865:       85 c0                   test   %eax,%eax
  404867:       75 58                   jne    4048c1 <_Dmain+0xd1>
  404869:       8b 43 04                mov    0x4(%rbx),%eax
  40486c:       48 8d 74 24 3c          lea    0x3c(%rsp),%rsi
  404871:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
  404876:       89 44 24 3c             mov    %eax,0x3c(%rsp)
  40487a:       ff d5                   callq  *%rbp
  40487c:       85 c0                   test   %eax,%eax
  40487e:       75 41                   jne    4048c1 <_Dmain+0xd1>
  404880:       8b 43 08                mov    0x8(%rbx),%eax
  404883:       48 8d 74 24 3c          lea    0x3c(%rsp),%rsi
  404888:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
  40488d:       89 44 24 3c             mov    %eax,0x3c(%rsp)
  404891:       ff d5                   callq  *%rbp
  404893:       85 c0                   test   %eax,%eax
  404895:       75 2a                   jne    4048c1 <_Dmain+0xd1>
  404897:       8b 43 0c                mov    0xc(%rbx),%eax
  40489a:       48 8d 74 24 3c          lea    0x3c(%rsp),%rsi
  40489f:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
  4048a4:       89 44 24 3c             mov    %eax,0x3c(%rsp)
  4048a8:       ff d5                   callq  *%rbp
  4048aa:       85 c0                   test   %eax,%eax
  4048ac:       75 13                   jne    4048c1 <_Dmain+0xd1>
  4048ae:       8b 43 10                mov    0x10(%rbx),%eax
  4048b1:       48 8d 74 24 3c          lea    0x3c(%rsp),%rsi
  4048b6:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
  4048bb:       89 44 24 3c             mov    %eax,0x3c(%rsp)
  4048bf:       ff d5                   callq  *%rbp
  4048c1:       48 83 c4 48             add    $0x48,%rsp
  4048c5:       31 c0                   xor    %eax,%eax
  4048c7:       5b                      pop    %rbx
  4048c8:       5d                      pop    %rbp
  4048c9:       c3                      retq   

Notice that each loop iteration is only 7 instructions, with array
elements loaded directly via an offset.  The loop in opApply generated
by dmd looks like this:

08049314 <_D4test1A7opApplyMFDFKiZiZi>:
 8049314:       55                      push   %ebp
 8049315:       8b ec                   mov    %esp,%ebp
 8049317:       83 ec 18                sub    $0x18,%esp
 804931a:       53                      push   %ebx
 804931b:       56                      push   %esi
 804931c:       89 45 f8                mov    %eax,-0x8(%ebp)
 804931f:       83 7d f8 00             cmpl   $0x0,-0x8(%ebp)
 8049323:       75 1f                   jne    8049344 <_D4test1A7opApplyMFDFKiZiZi+0x30>
 8049325:       6a 06                   push   $0x6
 8049327:       ff 35 d4 a3 05 08       pushl  0x805a3d4
 804932d:       ff 35 d0 a3 05 08       pushl  0x805a3d0
 8049333:       ff 35 ec a3 05 08       pushl  0x805a3ec
 8049339:       ff 35 e8 a3 05 08       pushl  0x805a3e8
 804933f:       e8 8c 04 00 00          call   80497d0 <_d_assert_msg>
 8049344:       8b 45 f8                mov    -0x8(%ebp),%eax
 8049347:       8b 50 04                mov    0x4(%eax),%edx
 804934a:       8b 00                   mov    (%eax),%eax
 804934c:       89 45 e8                mov    %eax,-0x18(%ebp)
 804934f:       89 55 ec                mov    %edx,-0x14(%ebp)
 8049352:       c7 45 f0 00 00 00 00    movl   $0x0,-0x10(%ebp)
 8049359:       8b 4d f0                mov    -0x10(%ebp),%ecx
 804935c:       3b 4d e8                cmp    -0x18(%ebp),%ecx
 804935f:       73 41                   jae    80493a2 <_D4test1A7opApplyMFDFKiZiZi+0x8e>
 8049361:       8b 5d f0                mov    -0x10(%ebp),%ebx
 8049364:       3b 5d e8                cmp    -0x18(%ebp),%ebx
 8049367:       72 0a                   jb     8049373 <_D4test1A7opApplyMFDFKiZiZi+0x5f>
 8049369:       b8 07 00 00 00          mov    $0x7,%eax
 804936e:       e8 b9 00 00 00          call   804942c <_D4test7__arrayZ>
 8049373:       8b 55 ec                mov    -0x14(%ebp),%edx
 8049376:       8b 45 e8                mov    -0x18(%ebp),%eax
 8049379:       8b 34 9a                mov    (%edx,%ebx,4),%esi
 804937c:       89 75 f4                mov    %esi,-0xc(%ebp)
 804937f:       8d 4d f4                lea    -0xc(%ebp),%ecx
 8049382:       51                      push   %ecx
 8049383:       8b 45 08                mov    0x8(%ebp),%eax
 8049386:       8b 55 0c                mov    0xc(%ebp),%edx
 8049389:       8b 5d 08                mov    0x8(%ebp),%ebx
 804938c:       ff d2                   call   *%edx
 804938e:       85 c0                   test   %eax,%eax
 8049390:       74 0b                   je     804939d <_D4test1A7opApplyMFDFKiZiZi+0x89>
 8049392:       b8 01 00 00 00          mov    $0x1,%eax
 8049397:       5e                      pop    %esi
 8049398:       5b                      pop    %ebx
 8049399:       c9                      leave  
 804939a:       c2 08 00                ret    $0x8
 804939d:       ff 45 f0                incl   -0x10(%ebp)
 80493a0:       eb b7                   jmp    8049359 <_D4test1A7opApplyMFDFKiZiZi+0x45>
 80493a2:       31 c0                   xor    %eax,%eax
 80493a4:       5e                      pop    %esi
 80493a5:       5b                      pop    %ebx
 80493a6:       c9                      leave  
 80493a7:       c2 08 00                ret    $0x8
 80493aa:       90                      nop
 80493ab:       90                      nop

The loop body is significantly longer, about 22 instructions when the
exit branch isn't taken, and includes an array bounds check on every
iteration (the call to the _D4test7__arrayZ routine is only taken if
the check fails).
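
To make the per-iteration work easier to follow, here is a rough
source-level reading of the dmd loop above. It is a hand-written sketch
of what I assume the listing corresponds to, not compiler output, and
applyManually is a hypothetical name.

```d
// Hand-written approximation of the loop in the dmd listing above.
int applyManually(int[] data, int delegate(ref int) dg) {
    // Index compared against the length at the top of the loop
    // (the cmp/jae pair at 804935c).
    for (size_t i = 0; i < data.length; i++) {
        // Indexing carries its own bounds check. In the listing the
        // same comparison is repeated (cmp/jb at 8049364) even though
        // the loop condition has already established i < data.length.
        int d = data[i];
        // The element is copied to a local and passed by reference,
        // followed by the indirect call through the delegate.
        if (dg(d)) return 1;
    }
    return 0;
}

void main() {
    int sum = 0;
    // Delegate never asks to stop, so every element is visited.
    assert(applyManually([1, 2, 3], (ref int d) { sum += d; return 0; }) == 0);
    assert(sum == 6);

    int seen = 0;
    // Delegate returns nonzero on the first element, ending the loop early.
    assert(applyManually([1, 2, 3], (ref int d) { seen++; return 1; }) == 1);
    assert(seen == 1);
}
```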

I suppose the gdc case has been heavily optimized by gcc's optimizing
backend. :-) Though I'm kinda disappointed that it didn't inline the
trivial delegate. Or is that because of the way the front-end generates
the AST?


T

-- 
Nothing in the world is more distasteful to a man than to take the path
that leads to himself. -- Hermann Hesse
