Re: Forcing inline functions (again) - groan
On Wednesday, 15 July 2020 at 13:38:34 UTC, Cecil Ward wrote:
> I recently noticed pragma(inline, true), which looks extremely useful. A couple of questions: 1. Is this cross-compiler compatible? 2. Can I declare a function in one module and have it _inlined_ in another module at the call site?

1. It works for LDC and DMD; not sure about GDC, but if it doesn't support it, it's definitely on Iain's list.

2. For LDC, this works in all cases (i.e., also when compiling multiple object files in a single command line) since v1.22. While you cannot force LLVM to actually inline, I haven't come across a case yet where it doesn't.
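A minimal sketch of the cross-module case asked about here (module and symbol names are made up for illustration):

```d
// vec.d -- the module that defines the function
module vec;

// Request that calls to this function always be expanded in place,
// even at call sites in other modules.
pragma(inline, true)
double twice(double x)
{
    return x + x;
}
```

```d
// app.d -- a separate module/object file, e.g. built with
//   ldc2 -O2 app.d vec.d
module app;

import vec;

void main()
{
    // With LDC >= 1.22 this call site gets the inlined body even
    // though `twice` lives in another module.
    auto y = twice(21.0);
    assert(y == 42.0);
}
```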
Re: Forcing inline functions (again) - groan
On Wednesday, 15 July 2020 at 13:38:34 UTC, Cecil Ward wrote:
> I recently noticed pragma(inline, true), which looks extremely useful. A couple of questions: 1. Is this cross-compiler compatible? 2. Can I declare a function in one module and have it _inlined_ in another module at the call site? I'm looking to write functions that expand to approximately one or even zero machine instructions, and having the overhead of a function call would be disastrous; in some cases it would make having the function pointless due to the slowdown.

pragma(inline, true) will work for DMD. It used to fail outright if the compiler couldn't inline; now it just generates a warning, so with -w it will still fail. AFAIK other compilers cannot warn when inlining fails, but I might be wrong. And LDC/GDC should be able to inline most code which it makes sense to inline.
Forcing inline functions (again) - groan
I recently noticed pragma(inline, true), which looks extremely useful. A couple of questions: 1. Is this cross-compiler compatible? 2. Can I declare a function in one module and have it _inlined_ in another module at the call site? I'm looking to write functions that expand to approximately one or even zero machine instructions; the overhead of a function call would be disastrous, and in some cases would make having the function pointless due to the slowdown.
Re: inline functions
On Fri, 25 Mar 2011 22:04:20 -0400, Caligo iteronve...@gmail.com wrote:

> T[3] data;
>
> T dot(const ref Vector o) {
>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
> }
>
> T LengthSquared_Fast() {
>     return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
> }
>
> T LengthSquared_Slow() {
>     return dot(this);
> }
>
> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD. Is it because the compilers don't inline-expand the dot() function call? I need the performance, but the faster version is too verbose.

ref parameters used to make functions not be inlined, but apparently that was fixed: http://d.puremagic.com/issues/show_bug.cgi?id=2008

The best thing to do is check the disassembly to see if the call is being inlined. Also, if you want more help besides guessing, a complete working program is good to have.

-Steve
Re: inline functions
On Sat, Mar 26, 2011 at 3:47 AM, Jonathan M Davis jmdavisp...@gmx.com wrote:
> On 2011-03-26 01:06, Caligo wrote:
>> On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis jmdavisp...@gmx.com wrote:
>>> On 2011-03-25 21:21, Caligo wrote:
>>>> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com wrote:
>>>>> [snip: the LengthSquared code from the first post]
>>>>> It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't, then it definitely won't inline it. - Jonathan M Davis
>>>> I didn't know I had to supply GDC with -inline, so I did, and it did not help. In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls. In any case, code compiled with DMD is always behind GDC when it comes to performance.
>>> I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely, though. So, if using -inline is harming performance, then something weird is definitely going on. - Jonathan M Davis
>> The only time that -inline has no effect is when I turn on -O3. This is also when the code performs the best. I've never used -O3 in my C++ code, but I guess things are different in D, even with the same back-end. I really don't know what gdc does.
> With dmd, inlining is not turned on unless -inline is used. Also, -inline with dmd does not force inlining; it merely turns on the optimization. The compiler still chooses where and when it's best to inline. With gcc, I believe that inlining is normally turned on at a fairly low optimization level (probably -O), and like dmd, it chooses where and when it's best to inline; but unlike dmd, it uses the inline keyword in C++ as a hint as to what it should do. However, -O3 forces inlining on all functions marked with inline. How gdc deals with that, given that D doesn't have an inline keyword, I don't know. Regardless, given what inlining does, I have a _very_ hard time believing that it would ever degrade performance unless it's buggy. - Jonathan M Davis

I was going to post my code, but I take back what I said. What is happening is that there is a lot of fluctuation in performance. The low performance always occurred when I had -inline enabled, which made me think -inline degrades performance. The performance should be consistent, but for some reason it's not. The important thing is that -inline doesn't make any difference with GDC. -O3 does make a big difference.
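Newer DMD/LDC/GDC releases also accept an explicit per-function hint, which sidesteps the -inline heuristic discussed here; a sketch applied to the vector type from this thread (note that pragma(inline, true) postdates this 2011 exchange):

```d
struct Vector(T)
{
    T[3] data;

    // Ask the compiler to inline this at every call site; DMD
    // warns (or, with -w, errors) if it cannot.
    pragma(inline, true)
    T dot(const ref Vector o) const
    {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    // With dot() forced inline, the short form should no longer
    // pay a function-call overhead.
    T lengthSquared() const { return dot(this); }
}
```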
Re: inline functions
Answer for Jonathan M Davis and Caligo: as far as I remember, you need to use -finline-functions on GDC to perform inlining. -O3 implies inlining on GCC, and I presume on GDC too. Inlining is a complex art: the compilers compute a score for each function and each function call and decide whether to perform the inlining. There are many situations where inlining harms performance, and it's not just a matter of code-cache pressure (this list of problems is not complete: http://en.wikipedia.org/wiki/Inlining#Problems ). DMD inlining is in many ways weak compared to the GCC/LLVM (GDC/LDC) ones. 32-bit GDC/DMD are also able to use SSE+ registers, which sometimes gives performance gains. To discuss the dot product performance a bit (it's present in Phobos too), I suggest pulling out and showing the assembly code; timings alone don't suffice. I may produce some assembly later, if I create a little test program or if Caligo posts one here.

Bye, bearophile
Re: inline functions
This little test program:

struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    T lengthSquaredSlow() {
        return dot(this);
    }

    T lengthSquaredFast() {
        return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
    }
}

Vector!double v;
void main() {}

The assembly, DMD 2.052, -O -release -inline:

dot (T=double):
        push EBX
        mov EDX,EAX
        mov EBX,8[ESP]
        fld qword ptr 010h[EDX]
        fld qword ptr [EDX]
        fxch ST1
        fmul qword ptr 010h[EBX]
        fxch ST1
        fld qword ptr 8[EDX]
        fxch ST1
        fmul qword ptr [EBX]
        fxch ST1
        fmul qword ptr 8[EBX]
        faddp ST(1),ST
        faddp ST(1),ST
        pop EBX
        ret 4

lengthSquaredSlow (T=double):
        mov ECX,EAX
        fld qword ptr 010h[EAX]
        fld qword ptr [ECX]
        fxch ST1
        fmul qword ptr 010h[ECX]
        fxch ST1
        fld qword ptr 8[ECX]
        fxch ST1
        fmul qword ptr [ECX]
        fxch ST1
        fmul qword ptr 8[ECX]
        faddp ST(1),ST
        faddp ST(1),ST
        ret

lengthSquaredFast (T=double):
        mov ECX,EAX
        fld qword ptr 010h[EAX]
        fld qword ptr [ECX]
        fxch ST1
        fmul qword ptr 010h[ECX]
        fxch ST1
        fld qword ptr 8[ECX]
        fxch ST1
        fmul qword ptr [ECX]
        fxch ST1
        fmul qword ptr 8[ECX]
        faddp ST(1),ST
        faddp ST(1),ST
        ret

The fast and slow versions seem to be compiled to the same code. So please show a D code example where there is some difference.

Bye, bearophile
Re: inline functions
I've changed my code since I posted this, so here is something different that shows the performance difference:

module t1;

struct Vector {
private:
    double x = void;
    double y = void;
    double z = void;

public:
    this(in double x, in double y, in double z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    Vector opBinary(string op)(const double rhs) const if (op == "*") {
        return mixin("Vector(x " ~ op ~ " rhs, y " ~ op ~ " rhs, z " ~ op ~ " rhs)");
    }

    Vector opBinaryRight(string op)(const double lhs) const if (op == "*") {
        return opBinary!op(lhs);
    }
}

void main() {
    auto v1 = Vector(4, 5, 6);
    for (int i = 0; i < 60_000_000; i++) {
        v1 = v1 * 1.0012;
        //v1 = 1.0012 * v1;
    }
}

Calling opBinaryRight:

/* gdc -O3 -o t1 t1.d
real    0m0.394s
user    0m0.390s
sys     0m0.000s */

Calling opBinary:

/* gdc -O3 -o t1 t1.d
real    0m0.321s
user    0m0.310s
sys     0m0.000s */

Those results are best of 10. There shouldn't be a performance difference between the two, but there is.
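For timings like these, a self-contained harness avoids shelling out to `time`; a sketch using std.datetime.stopwatch (the constant and iteration count are taken from the post; printing a component of the result keeps the loop from being optimized away):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

struct Vector
{
    double x, y, z;

    Vector opBinary(string op : "*")(const double rhs) const
    {
        return Vector(x * rhs, y * rhs, z * rhs);
    }

    Vector opBinaryRight(string op : "*")(const double lhs) const
    {
        return opBinary!op(lhs);
    }
}

void main()
{
    auto v1 = Vector(4, 5, 6);
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. 60_000_000)
        v1 = v1 * 1.0012;   // swap in `1.0012 * v1` to compare
    sw.stop();
    writeln(sw.peek, " x=", v1.x);
}
```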
Re: inline functions
Caligo:
> There shouldn't be a performance difference between the two, but there is.

It seems the compiler isn't removing some useless code (the first has 3 groups of movsd, the second has 4 of them):

v = v * 1.0012;

main:
L45:    mov ESI,offset FLAT:_D4test6Vector6__initZ
        lea EDI,068h[ESP]
        movsd movsd movsd movsd movsd movsd
        fld qword ptr 010h[ESP]
        fld qword ptr 018h[ESP]
        fxch ST1
        fmul qword ptr FLAT:_DATA[018h]
        lea ESI,068h[ESP]
        lea EDI,048h[ESP]
        fxch ST1
        fmul qword ptr FLAT:_DATA[018h]
        fld qword ptr 8[ESP]
        fmul qword ptr FLAT:_DATA[018h]
        fxch ST2
        fstp qword ptr 080h[ESP]
        fxch ST1
        fld qword ptr 080h[ESP]
        fxch ST2
        fstp qword ptr 088h[ESP]
        fxch ST1
        fld qword ptr 088h[ESP]
        fxch ST2
        fstp qword ptr 068h[ESP]
        fstp qword ptr 070h[ESP]
        fstp qword ptr 078h[ESP]
        movsd movsd movsd movsd movsd movsd
        lea ESI,048h[ESP]
        lea EDI,8[ESP]
        movsd movsd movsd movsd movsd movsd
        inc EAX
        cmp EAX,03938700h
        jb L45

v = 1.0012 * v;

main:
L45:    mov ESI,offset FLAT:_D4test6Vector6__initZ
        lea EDI,088h[ESP]
        movsd movsd movsd movsd movsd movsd
        fld qword ptr 010h[ESP]
        fld qword ptr 018h[ESP]
        fxch ST1
        fmul qword ptr FLAT:_DATA[018h]
        lea ESI,088h[ESP]
        fxch ST1
        fmul qword ptr FLAT:_DATA[018h]
        fld qword ptr 8[ESP]
        fxch ST2
        lea EDI,068h[ESP]
        fxch ST2
        fmul qword ptr FLAT:_DATA[018h]
        fxch ST2
        fstp qword ptr 0A0h[ESP]
        fxch ST1
        fld qword ptr 0A0h[ESP]
        fxch ST2
        fstp qword ptr 0A8h[ESP]
        fxch ST1
        fld qword ptr 0A8h[ESP]
        fxch ST2
        fstp qword ptr 088h[ESP]
        fstp qword ptr 090h[ESP]
        fstp qword ptr 098h[ESP]
        movsd movsd movsd movsd movsd movsd
        lea ESI,068h[ESP]
        lea EDI,048h[ESP]
        movsd movsd movsd movsd movsd movsd
        lea ESI,048h[ESP]
        lea EDI,8[ESP]
        movsd movsd movsd movsd movsd movsd
        inc EAX
        cmp EAX,03938700h
        jb L45

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L42:    fld qword ptr FLAT:_DATA[018h]
        inc EAX
        cmp EAX,03938700h
        fmul qword ptr 8[ESP]
        fstp qword ptr 8[ESP]
        fld qword ptr FLAT:_DATA[018h]
        fmul qword ptr 010h[ESP]
        fstp qword ptr 010h[ESP]
        fld qword ptr FLAT:_DATA[018h]
        fmul qword ptr 018h[ESP]
        fstp qword ptr 018h[ESP]
        jb L42

C GCC uses only 5 instructions/loop, to improve on this:

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L2:     fmul %st, %st(3)
        subl $1, %eax
        fmul %st, %st(2)
        fmul %st, %st(1)
        jne L2

C GCC, -mfpmath=sse -msse3:

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L2:     subl $1, %eax
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        jne L2

C GCC, -mfpmath=sse -msse3 -funroll-loops:

L2:     subl $8, %eax
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        mulsd %xmm0, %xmm1
        mulsd %xmm0, %xmm2
        mulsd %xmm0, %xmm3
        jne L2

I have not found a quick way to let GCC vectorize this code, using two multiplications with one SSE instruction; I am not sure GCC is able to do this automatically.

Bye, bearophile
Re: inline functions
bearophile wrote:
> I have not found a quick way to let GCC vectorize this code, using two multiplications with one SSE instruction; I am not sure GCC is able to do this automatically.

Even with -ftree-vectorize? AFAIK it is considered experimental and needs to be turned on explicitly. Don't know how good it is, though...

Jerome -- mailto:jeber...@free.fr http://jeberger.free.fr Jabber: jeber...@jabber.fr
Re: inline functions
Jérôme M. Berger:
> Even with -ftree-vectorize?

Right.

> AFAIK it is considered experimental and needs to be turned on explicitly. Don't know how good it is, though...

It's a very long-lasting and complex experiment then :-) There is a lot of work behind that little switch. Modern compilers still have a long way to go; they need to compile little kernel loops better or much better.

Bye, bearophile
Re: inline functions
On 2011-03-25 19:04, Caligo wrote:

> T[3] data;
>
> T dot(const ref Vector o) {
>     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
> }
>
> T LengthSquared_Fast() {
>     return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
> }
>
> T LengthSquared_Slow() {
>     return dot(this);
> }
>
> The faster LengthSquared() is twice as fast, and I've tested with GDC and DMD. Is it because the compilers don't inline-expand the dot() function call? I need the performance, but the faster version is too verbose.

It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't, then it definitely won't inline it.

- Jonathan M Davis
Re: inline functions
On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com wrote:
> On 2011-03-25 19:04, Caligo wrote:
>> [snip: the LengthSquared code from the first post]
> It sure sounds like it didn't inline it. Did you compile with -inline? If you didn't, then it definitely won't inline it. - Jonathan M Davis

I didn't know I had to supply GDC with -inline, so I did, and it did not help. In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls. In any case, code compiled with DMD is always behind GDC when it comes to performance.
Re: inline functions
On 2011-03-25 21:21, Caligo wrote:
> On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com wrote:
>> [snip: the LengthSquared code and my previous reply, quoted above]
> I didn't know I had to supply GDC with -inline, so I did, and it did not help. In fact, with the -inline option the performance gets worse (for DMD and GDC), even for code that doesn't contain any function calls. In any case, code compiled with DMD is always behind GDC when it comes to performance.

I don't know what gdc does, but you have to use -inline with dmd if you want it to inline anything. It also really doesn't make any sense at all that inlining would harm performance. If that's the case, something weird is going on. I don't see how inlining could _ever_ harm performance unless it just makes the program's binary so big that _that_ harms performance. That isn't very likely, though. So, if using -inline is harming performance, then something weird is definitely going on.

- Jonathan M Davis