On Tue, Mar 13, 2012 at 12:15:02AM +0100, Peter Alexander wrote: > On Sunday, 11 March 2012 at 06:49:27 UTC, H. S. Teoh wrote: > >On Sun, Mar 11, 2012 at 01:29:01AM -0500, Nick Sabalausky wrote: > >>Suppose you have a delegate literal and immediately call it: > >> > >>auto a = x + (){ doStuff(); return y; }() + z; > >> > >>Does DMD ever (or always?) optimize away a delegate if it's executed > >>immediately and never stored into a variable? If not, can it, and > >>would it be a simple change? Is something like this already on the > >>table? > >[...] > > > >I've always wondered about whether delegates passed to opApply > >ever get > >inlined. > > Don't wonder. Find out! > > import std.stdio; > void doStuff() { writeln("Howdy!"); } > void main() { > int x = 1, y = 2, z = 3; > auto a = x + (){ doStuff(); return y; }() + z; > writeln(a); > } > > $ dmd test.d -O -release -inline > > __Dmain: > 000000010000106c pushq %rbp > 000000010000106d movq %rsp,%rbp > 0000000100001070 pushq %rax > 0000000100001071 pushq %rbx > 0000000100001072 movq $0x000000000000000c,%rdi > 000000010000107c callq 0x1000237f0 ; symbol stub for: > __d_allocmemory > 0000000100001081 movq %rax,%rbx > 0000000100001084 movq $0x00000000,(%rbx) > 000000010000108b movl $0x00000002,0x08(%rbx) > 0000000100001092 movq %rbx,%rdi > 0000000100001095 call *0x0002318c(%rip) > 000000010000109c leal 0x04(%rax),%edx > 000000010000109f movl $0x0000000a,%esi > 00000001000010a4 leaq 0x00033eed(%rip),%rdi > 00000001000010ab callq 0x10002319c ; symbol stub for: > _D3std5stdio4File14__T5writeTiTaZ5writeMFiaZv > 00000001000010b0 xorl %eax,%eax > 00000001000010b2 popq %rbx > 00000001000010b3 movq %rbp,%rsp > 00000001000010b6 popq %rbp > 00000001000010b7 ret > > In short. No! It doesn't currently inline in this case. > > Even if the lambda just returns a constant, it doesn't get inlined.
Hmph. I tried this code: import std.stdio; struct A { int[] data; int opApply(int delegate(ref int) dg) { foreach (d; data) { if (dg(d)) return 1; } return 0; } } void main() { A a; int n = 0; a.data = [1,2,3,4,5]; foreach (d; a) { n++; } } With both dmd and gdc, the delegate is never inlined. :-( Compiling with gdc -O3 causes opApply to get inlined and loop-unrolled, but the call to the delegate is still there. With dmd -O, even opApply is not inlined, and the code is generally much longer per loop iteration than gdc -O3. Here's the code generated by gdc for the foreach() loop in main(): 404839: 48 89 c3 mov %rax,%rbx 40483c: 48 8b 04 24 mov (%rsp),%rax 404840: 48 8d 74 24 3c lea 0x3c(%rsp),%rsi 404845: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi 40484a: 48 89 03 mov %rax,(%rbx) 40484d: 48 8b 44 24 08 mov 0x8(%rsp),%rax 404852: 48 89 43 08 mov %rax,0x8(%rbx) 404856: 8b 44 24 10 mov 0x10(%rsp),%eax 40485a: 89 43 10 mov %eax,0x10(%rbx) 40485d: 8b 03 mov (%rbx),%eax 40485f: 89 44 24 3c mov %eax,0x3c(%rsp) 404863: ff d5 callq *%rbp 404865: 85 c0 test %eax,%eax 404867: 75 58 jne 4048c1 <_Dmain+0xd1> 404869: 8b 43 04 mov 0x4(%rbx),%eax 40486c: 48 8d 74 24 3c lea 0x3c(%rsp),%rsi 404871: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi 404876: 89 44 24 3c mov %eax,0x3c(%rsp) 40487a: ff d5 callq *%rbp 40487c: 85 c0 test %eax,%eax 40487e: 75 41 jne 4048c1 <_Dmain+0xd1> 404880: 8b 43 08 mov 0x8(%rbx),%eax 404883: 48 8d 74 24 3c lea 0x3c(%rsp),%rsi 404888: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi 40488d: 89 44 24 3c mov %eax,0x3c(%rsp) 404891: ff d5 callq *%rbp 404893: 85 c0 test %eax,%eax 404895: 75 2a jne 4048c1 <_Dmain+0xd1> 404897: 8b 43 0c mov 0xc(%rbx),%eax 40489a: 48 8d 74 24 3c lea 0x3c(%rsp),%rsi 40489f: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi 4048a4: 89 44 24 3c mov %eax,0x3c(%rsp) 4048a8: ff d5 callq *%rbp 4048aa: 85 c0 test %eax,%eax 4048ac: 75 13 jne 4048c1 <_Dmain+0xd1> 4048ae: 8b 43 10 mov 0x10(%rbx),%eax 4048b1: 48 8d 74 24 3c lea 0x3c(%rsp),%rsi 4048b6: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi 4048bb: 89 44 24 3c mov %eax,0x3c(%rsp) 4048bf: ff d5 callq *%rbp 4048c1: 48 83 c4 48 add $0x48,%rsp 4048c5: 31 c0 xor %eax,%eax 4048c7: 5b pop %rbx 4048c8: 5d pop %rbp 4048c9: c3 retq Notice that each loop iteration is only 7 instructions, with array elements loaded directly via an offset. The loop in opApply generated by dmd looks like this: 08049314 <_D4test1A7opApplyMFDFKiZiZi>: 8049314: 55 push %ebp 8049315: 8b ec mov %esp,%ebp 8049317: 83 ec 18 sub $0x18,%esp 804931a: 53 push %ebx 804931b: 56 push %esi 804931c: 89 45 f8 mov %eax,-0x8(%ebp) 804931f: 83 7d f8 00 cmpl $0x0,-0x8(%ebp) 8049323: 75 1f jne 8049344 <_D4test1A7opApplyMFDFKiZiZi+0x30> 8049325: 6a 06 push $0x6 8049327: ff 35 d4 a3 05 08 pushl 0x805a3d4 804932d: ff 35 d0 a3 05 08 pushl 0x805a3d0 8049333: ff 35 ec a3 05 08 pushl 0x805a3ec 8049339: ff 35 e8 a3 05 08 pushl 0x805a3e8 804933f: e8 8c 04 00 00 call 80497d0 <_d_assert_msg> 8049344: 8b 45 f8 mov -0x8(%ebp),%eax 8049347: 8b 50 04 mov 0x4(%eax),%edx 804934a: 8b 00 mov (%eax),%eax 804934c: 89 45 e8 mov %eax,-0x18(%ebp) 804934f: 89 55 ec mov %edx,-0x14(%ebp) 8049352: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%ebp) 8049359: 8b 4d f0 mov -0x10(%ebp),%ecx 804935c: 3b 4d e8 cmp -0x18(%ebp),%ecx 804935f: 73 41 jae 80493a2 <_D4test1A7opApplyMFDFKiZiZi+0x8e> 8049361: 8b 5d f0 mov -0x10(%ebp),%ebx 8049364: 3b 5d e8 cmp -0x18(%ebp),%ebx 8049367: 72 0a jb 8049373 <_D4test1A7opApplyMFDFKiZiZi+0x5f> 8049369: b8 07 00 00 00 mov $0x7,%eax 804936e: e8 b9 00 00 00 call 804942c <_D4test7__arrayZ> 8049373: 8b 55 ec mov -0x14(%ebp),%edx 8049376: 8b 45 e8 mov -0x18(%ebp),%eax 8049379: 8b 34 9a mov (%edx,%ebx,4),%esi 804937c: 89 75 f4 mov %esi,-0xc(%ebp) 804937f: 8d 4d f4 lea -0xc(%ebp),%ecx 8049382: 51 push %ecx 8049383: 8b 45 08 mov 0x8(%ebp),%eax 8049386: 8b 55 0c mov 0xc(%ebp),%edx 8049389: 8b 5d 08 mov 0x8(%ebp),%ebx 804938c: ff d2 call *%edx 804938e: 85 c0 test %eax,%eax 8049390: 74 0b je 804939d <_D4test1A7opApplyMFDFKiZiZi+0x89> 8049392: b8 01 00 00 00 mov $0x1,%eax 8049397: 5e pop %esi 8049398: 5b pop %ebx 8049399: c9 leave 804939a: c2 08 00 ret $0x8 804939d: ff 45 f0 incl -0x10(%ebp) 80493a0: eb b7 jmp 8049359 <_D4test1A7opApplyMFDFKiZiZi+0x45> 80493a2: 31 c0 xor %eax,%eax 80493a4: 5e pop %esi 80493a5: 5b pop %ebx 80493a6: c9 leave 80493a7: c2 08 00 ret $0x8 80493aa: 90 nop 80493ab: 90 nop The loop body is significantly longer, about 22 instructions when the exit branch isn't taken, and includes a call to a bounds check routine per iteration. I suppose the gdc case has been heavily optimized by gcc's optimizing backend. :-) Though I'm kinda disappointed that it didn't inline the trivial delegate. Or is that because of the way the front-end generates the AST? T -- Nothing in the world is more distasteful to a man than to take the path that leads to himself. -- Herman Hesse