Re: Forcing inline functions (again) - groan

2020-07-15 Thread kinke via Digitalmars-d-learn

On Wednesday, 15 July 2020 at 13:38:34 UTC, Cecil Ward wrote:

> I recently noticed
> pragma(inline, true)
> which looks extremely useful. A couple of questions:
>
> 1. Is this cross-compiler compatible?


It works for LDC and DMD. I'm not sure about GDC, but if GDC doesn't
support it yet, it's definitely on Iain's list.


> 2. Can I declare a function in one module and have it _inlined_
> in another module at the call site?


For LDC, this works in all cases (i.e., also if compiling 
multiple object files in a single cmdline) since v1.22.


While you cannot force LLVM to actually inline, I haven't come 
across a case yet where it doesn't.
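
A minimal sketch of the pragma discussed above, for illustration only. It is not code from the thread, and the names `inline_demo` and `scale3` are hypothetical. It shows the request to inline; whether the call really disappears still has to be confirmed in the generated code for the compiler at hand.

```d
// Sketch only: `inline_demo` and `scale3` are illustrative names,
// not taken from the original posts.
module inline_demo;

// Ask the compiler to inline this function at its call sites.
pragma(inline, true)
int scale3(int x)
{
    return x * 3;
}

void main()
{
    // If the request is honoured, no call instruction should remain here.
    assert(scale3(14) == 42);
}
```

For the cross-module case, the same function would live in its own module and be imported by the caller; the behaviour then depends on the compiler and version, as described in the reply above.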


Re: Forcing inline functions (again) - groan

2020-07-15 Thread Stefan Koch via Digitalmars-d-learn

On Wednesday, 15 July 2020 at 13:38:34 UTC, Cecil Ward wrote:

> I recently noticed
> pragma(inline, true)
> which looks extremely useful. A couple of questions:
>
> 1. Is this cross-compiler compatible?
>
> 2. Can I declare a function in one module and have it _inlined_
> in another module at the call site?
>
> I’m looking to write functions that expand to approx one or
> even zero machine instructions and having the overhead of a
> function call would be disastrous; in some cases would make it
> pointless having the function due to the slowdown.


pragma(inline, true) works with dmd. It used to be a compile error
if the function couldn't be inlined; now it just generates a warning.
So with -w (warnings as errors) compilation will still fail.

AFAIK the other compilers cannot warn if the inlining fails, but I
might be wrong. LDC and GDC should be able to inline most code that
it makes sense to inline.
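
As a hedged illustration of where the pragma can be placed (a sketch based on the D language specification, not on code from this thread; all function names are made up), it can appear as a statement inside a function body, as a block attribute, or with `false` to opt a function out:

```d
// Sketch only: all names below are hypothetical.
module inline_forms;

// Form 1: a statement inside the body applies to the enclosing function.
int twice(int x)
{
    pragma(inline, true);
    return x + x;
}

// Form 2: a block attribute applies to every declaration inside the braces.
pragma(inline, true)
{
    int inc(int x) { return x + 1; }
    int dec(int x) { return x - 1; }
}

// Opting out: ask the compiler never to inline this function.
pragma(inline, false)
int square(int x) { return x * x; }

void main()
{
    assert(twice(21) == 42);
    assert(inc(dec(7)) == 7);
    assert(square(5) == 25);
}
```

How a failed inline request is reported (error, warning, or silence) varies by compiler and version, so it is worth checking with the toolchain actually in use.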


Forcing inline functions (again) - groan

2020-07-15 Thread Cecil Ward via Digitalmars-d-learn

I recently noticed
pragma(inline, true)
which looks extremely useful. A couple of questions :

1. Is this cross-compiler compatible?

2. Can I declare a function in one module and have it _inlined_ 
in another module at the call site?


I’m looking to write functions that expand to approx one or even 
zero machine instructions and having the overhead of a function 
call would be disastrous; in some cases would make it pointless 
having the function due to the slowdown.


Re: inline functions

2011-03-28 Thread Steven Schveighoffer

On Fri, 25 Mar 2011 22:04:20 -0400, Caligo iteronve...@gmail.com wrote:


T[3] data;

T dot(const ref Vector o){
    return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
}

T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
data[2] * data[2]; }
T LengthSquared_Slow(){ return dot(this); }


The faster LengthSquared() is twice as fast, and I've tested with GDC
and DMD.  Is it because the compilers don't inline-expand the dot()
function call?  I need the performance, but the faster version is too
verbose.


ref parameters used to prevent functions from being inlined, but apparently
that was fixed: http://d.puremagic.com/issues/show_bug.cgi?id=2008


The best thing to do is check the disassembly to see whether the call is
being inlined.


Also, if you want help that goes beyond guessing, it is good to post a
complete working program.


-Steve
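
A sketch of what such a complete program could look like, under stated assumptions: it uses `std.datetime.stopwatch.benchmark` from current Phobos (not the 2011 library), and the helpers `runFast`, `runSlow`, `v` and `sink` are hypothetical names added for the measurement. Timings from it are only a first hint; the disassembly remains the real evidence.

```d
// Sketch only: not the original poster's program.
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

struct Vector(T)
{
    T[3] data;

    T dot(const ref Vector o) const
    {
        return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
    }

    T lengthSquaredFast() const
    {
        return data[0] * data[0] + data[1] * data[1] + data[2] * data[2];
    }

    T lengthSquaredSlow() const
    {
        return dot(this);
    }
}

// Global state so the optimizer cannot simply discard the work.
__gshared Vector!double v = Vector!double([1.0, 2.0, 3.0]);
__gshared double sink = 0;

void runFast() { sink += v.lengthSquaredFast(); }
void runSlow() { sink += v.lengthSquaredSlow(); }

void main()
{
    // Both versions must agree before timing them makes sense.
    assert(v.lengthSquaredFast() == v.lengthSquaredSlow());

    // Run each function 10 million times and report the total durations.
    auto times = benchmark!(runFast, runSlow)(10_000_000);
    writefln("fast: %s", times[0]);
    writefln("slow: %s", times[1]);
}
```

Building it once with and once without the optimization and inlining flags under discussion gives a reproducible basis for comparison.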


Re: inline functions

2011-03-26 Thread Caligo
On Sat, Mar 26, 2011 at 3:47 AM, Jonathan M Davis jmdavisp...@gmx.com wrote:
 On 2011-03-26 01:06, Caligo wrote:
 On Fri, Mar 25, 2011 at 11:56 PM, Jonathan M Davis jmdavisp...@gmx.com
 wrote:
  On 2011-03-25 21:21, Caligo wrote:
  On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com
 
  wrote:
   On 2011-03-25 19:04, Caligo wrote:
   T[3] data;
  
   T dot(const ref Vector o){
       return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
   o.data[2]; }
  
   T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1]
   + data[2] * data[2]; }
   T LengthSquared_Slow(){ return dot(this); }
  
  
   The faster LengthSquared() is twice as fast, and I've tested with GDC
   and DMD.  Is it because the compilers don't inline-expand the dot()
   function call?  I need the performance, but the faster version is too
   verbose.
  
   It sure sounds like it didn't inline it. Did you compile with -inline?
   If you didn't then it definitely won't inline it.
  
   - Jonathan M Davis
 
  I didn't know I had to supply GDC with -inline, so I did, and it did
  not help.  In fact, with the -inline option the performance gets worse
  (for DMD and GDC), even for code that doesn't contain any function
  calls.  In any case, code compiled with DMD is always behind GDC when
  it comes to performance.
 
  I don't know what gdc does, but you have to use -inline with dmd if you
  want it to inline anything. It also really doesn't make any sense at all
  that inlining would harm performance. If that's the case, something
  weird is going on. I don't see how inlining could _ever_ harm
  performance unless it just makes the program's binary so big that _that_
  harms performance. That isn't very likely though. So, if using -inline
  is harming performance, then something weird is definitely going on.
 
  - Jonathan M Davis

 The only time that -inline has no effect is when I turn on -O3.  This
 is also when the code performs the best.  I've never used -O3 in my
 C++ code, but I guess things are different in D even with the same
 back-end.

 I really don't know what gdc does. With dmd, inlining is not turned on unless
 -inline is used. Also, -inline with dmd does not force inlining, it merely
 turns on the optimization. The compiler still chooses where and when it's best
 to inline.

 With gcc, I believe that inlining is normally turned on at a pretty low
 optimization level (probably -O), and like dmd, it chooses where and when it's
 best to inline, but unlike dmd, it uses the inline keyword in C++ as a hint as
 to what it should do. However, -O3 forces inlining on all functions marked
 with inline. How gdc deals with that given that D doesn't have an inline
 keyword, I don't know.

 Regardless, given what inlining does, I have a _very_ hard time believing that
 it would ever degrade performance unless it's buggy.

 - Jonathan M Davis



I was going to post my code, but I take back what I said.  What is
happening is that there is a lot of fluctuation in performance.  The
low performance always occurred when I had -inline enabled, which made
me think -inline degrades performance.  The performance should be
consistent, but for some reason it's not.

The important thing is that -inline doesn't make any difference with
GDC.  The -O3 does make a big difference.


Re: inline functions

2011-03-26 Thread bearophile
Answer for Jonathan M Davis and Caligo:

As far as I remember, you need to use -finline-functions with GDC to enable
inlining.

On GCC, -O3 implies inlining, and I presume the same holds for GDC.

Inlining is a complex art: compilers compute a score for each function and
each function call and then decide whether to perform the inlining. There are
many situations where inlining harms performance, and it's not just a matter
of code cache pressure (this list of problems is not complete:
http://en.wikipedia.org/wiki/Inlining#Problems ). DMD's inlining is in many
ways weak compared to that of GCC/LLVM (GDC/LDC). 32 bit GDC/DMD are also
able to use SSE+ registers, which sometimes gives performance gains.

To discuss the dot product performance (a dot product is present in Phobos
too), I suggest extracting and showing the assembly code. Timings alone don't
suffice. I may produce some assembly later, if I write a little test program
or if Caligo posts one here.

Bye,
bearophile


Re: inline functions

2011-03-26 Thread bearophile
This little test program:


struct Vector(T) {
    T[3] data;

    T dot(const ref Vector o) {
        return data[0] * o.data[0] +
               data[1] * o.data[1] +
               data[2] * o.data[2];
    }

    T lengthSquaredSlow() {
        return dot(this);
    }

    T lengthSquaredFast() {
        return data[0] * data[0] +
               data[1] * data[1] +
               data[2] * data[2];
    }
}

Vector!double v;
void main() {}

The assembly, DMD 2.052, -O -release -inline:


dot (T=double):
push EBX
mov EDX,EAX
mov EBX,8[ESP]
fld qword ptr 010h[EDX]
fld qword ptr [EDX]
fxch ST1
fmul qword ptr 010h[EBX]
fxch ST1
fld qword ptr 8[EDX]
fxch ST1
fmul qword ptr [EBX]
fxch ST1
fmul qword ptr 8[EBX]
faddp   ST(1),ST
faddp   ST(1),ST
pop EBX
ret 4

lengthSquaredSlow (T=double):
mov ECX,EAX
fld qword ptr 010h[EAX]
fld qword ptr [ECX]
fxch ST1
fmul qword ptr 010h[ECX]
fxch ST1
fld qword ptr 8[ECX]
fxch ST1
fmul qword ptr [ECX]
fxch ST1
fmul qword ptr 8[ECX]
faddp   ST(1),ST
faddp   ST(1),ST
ret

lengthSquaredFast (T=double):
mov ECX,EAX
fld qword ptr 010h[EAX]
fld qword ptr [ECX]
fxch ST1
fmul qword ptr 010h[ECX]
fxch ST1
fld qword ptr 8[ECX]
fxch ST1
fmul qword ptr [ECX]
fxch ST1
fmul qword ptr 8[ECX]
faddp   ST(1),ST
faddp   ST(1),ST
ret

The fast and slow versions seem to be compiled to the same code. So please, 
show a D code example where there is some difference.

Bye,
bearophile


Re: inline functions

2011-03-26 Thread Caligo
I've changed my code since I posted this, so here is something
different that shows a performance difference:

module t1;

struct Vector{

private:
  double x = void;
  double y = void;
  double z = void;

public:
  this(in double x, in double y, in double z){
    this.x = x;
    this.y = y;
    this.z = z;
  }

  Vector opBinary(string op)(const double rhs) const if(op == "*"){
    return mixin("Vector(x"~op~"rhs, y"~op~"rhs, z"~op~"rhs)");
  }

  Vector opBinaryRight(string op)(const double lhs) const if(op == "*"){
    return opBinary!op(lhs);
  }
}

void main(){

  auto v1 = Vector(4, 5, 6);
  for(int i = 0; i < 60_000_000; i++){
    v1 = v1 * 1.0012;
    //v1 = 1.0012 * v1;
  }
}


Calling opBinaryRight:
/*  gdc -O3 -o t1 t1.d

real    0m0.394s
user    0m0.390s
sys     0m0.000s
*/

Calling opBinary:
/* gdc -O3 -o t1 t1.d

real    0m0.321s
user    0m0.310s
sys     0m0.000s
*/

Those results are best of 10.

There shouldn't be a performance difference between the two, but there is.


Re: inline functions

2011-03-26 Thread bearophile
Caligo:

 There shouldn't be a performance difference between the two, but there is.

It seems the compiler isn't removing some useless code (the first has 3 groups 
of movsd, the second has 4 of them):



v = v * 1.0012;
main:

L45: mov ESI,offset FLAT:_D4test6Vector6__initZ
lea EDI,068h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
fld qword ptr 010h[ESP]
fld qword ptr 018h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
lea ESI,068h[ESP]
lea EDI,048h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
fld qword ptr 8[ESP]
fmul qword ptr FLAT:_DATA[018h]
fxch ST2
fstp qword ptr 080h[ESP]
fxch ST1
fld qword ptr 080h[ESP]
fxch ST2
fstp qword ptr 088h[ESP]
fxch ST1
fld qword ptr 088h[ESP]
fxch ST2
fstp qword ptr 068h[ESP]
fstp qword ptr 070h[ESP]
fstp qword ptr 078h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,048h[ESP]
lea EDI,8[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
inc EAX
cmp EAX,03938700h
jb  L45

-

v = 1.0012 * v;
main:

L45: mov ESI,offset FLAT:_D4test6Vector6__initZ
lea EDI,088h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
fld qword ptr 010h[ESP]
fld qword ptr 018h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
lea ESI,088h[ESP]
fxch ST1
fmul qword ptr FLAT:_DATA[018h]
fld qword ptr 8[ESP]
fxch ST2
lea EDI,068h[ESP]
fxch ST2
fmul qword ptr FLAT:_DATA[018h]
fxch ST2
fstp qword ptr 0A0h[ESP]
fxch ST1
fld qword ptr 0A0h[ESP]
fxch ST2
fstp qword ptr 0A8h[ESP]
fxch ST1
fld qword ptr 0A8h[ESP]
fxch ST2
fstp qword ptr 088h[ESP]
fstp qword ptr 090h[ESP]
fstp qword ptr 098h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,068h[ESP]
lea EDI,048h[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
lea ESI,048h[ESP]
lea EDI,8[ESP]
movsd
movsd
movsd
movsd
movsd
movsd
inc EAX
cmp EAX,03938700h
jb  L45

-

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L42: fld qword ptr FLAT:_DATA[018h]
inc EAX
cmp EAX,03938700h
fmul qword ptr 8[ESP]
fstp qword ptr 8[ESP]
fld qword ptr FLAT:_DATA[018h]
fmul qword ptr 010h[ESP]
fstp qword ptr 010h[ESP]
fld qword ptr FLAT:_DATA[018h]
fmul qword ptr 018h[ESP]
fstp qword ptr 018h[ESP]
jb  L42

-

For comparison, GCC compiling the equivalent C code uses only 5 instructions per loop iteration:

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L2:
fmul %st, %st(3)
subl $1, %eax
fmul %st, %st(2)
fmul %st, %st(1)
jne L2

-

C GCC, -mfpmath=sse -msse3

v.x *= 1.0012; v.y *= 1.0012; v.z *= 1.0012;

L2:
subl $1, %eax
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
jne L2

-

C GCC, -mfpmath=sse -msse3 -funroll-loops

L2:
subl $8, %eax
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
mulsd   %xmm0, %xmm1
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm3
jne L2

I have not found a quick way to make GCC vectorize this code, performing two
multiplications with one SSE instruction. I am not sure GCC is able to do
this automatically.

Bye,
bearophile
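
The in-place variant measured above (`v.x *= 1.0012; ...`) can be wrapped in an operator so the call site stays short. This is only a sketch, not code from the thread: it assumes a simplified `Vector` with public fields and adds an `opOpAssign` overload that the original struct does not have.

```d
// Sketch only: a simplified Vector with an added, hypothetical opOpAssign.
struct Vector
{
    double x, y, z;

    // `v *= s` scales the components in place, without building a temporary.
    ref Vector opOpAssign(string op)(const double rhs) if (op == "*")
    {
        x *= rhs;
        y *= rhs;
        z *= rhs;
        return this;
    }
}

void main()
{
    auto v = Vector(4, 5, 6);
    foreach (i; 0 .. 3)
        v *= 2.0;   // three doublings: each component is multiplied by 8
    assert(v.x == 32 && v.y == 40 && v.z == 48);
}
```

Whether this produces the tight loop shown for the hand-written version still depends on the compiler inlining `opOpAssign`, so checking the assembly is again advisable.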


Re: inline functions

2011-03-26 Thread Jérôme M. Berger
bearophile wrote:
 I have not found a quick way to let GCC vectorize this code, using two 
 multiplications with one SSE instructions, I am not sure GCC is able to do 
 this automatically.
 
Even with -ftree-vectorize? AFAIK it is considered experimental and
needs to be turned on explicitly. Don't know how good it is though...

Jerome
-- 
mailto:jeber...@free.fr
http://jeberger.free.fr
Jabber: jeber...@jabber.fr




Re: inline functions

2011-03-26 Thread bearophile
Jérôme M. Berger:

   Even with -ftree-vectorize?

Right.


 AFAIK it is considered experimental and
 needs to be turned on explicitly. Don't know how good it is though...

It's a very long-lasting and complex experiment then :-) There is a lot of
work behind that little switch.
Modern compilers still have a long way to go: they need to compile little
kernel loops better, or much better.

Bye,
bearophile
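
On the D side, one way to express such a kernel without spelling out each component is an array-wise operation on a static array. The sketch below is an illustration, not something proposed in the thread; `scaleInPlace` is a hypothetical helper, and whether any given compiler turns the operation into packed SSE instructions is not guaranteed.

```d
// Sketch only: `scaleInPlace` is a hypothetical helper name.
void scaleInPlace(ref double[3] a, const double s)
{
    // Array-wise operation: multiplies every element of `a` by `s`.
    a[] *= s;
}

void main()
{
    double[3] v = [4.0, 5.0, 6.0];
    scaleInPlace(v, 2.0);
    assert(v == [8.0, 10.0, 12.0]);
}
```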


Re: inline functions

2011-03-25 Thread Jonathan M Davis
On 2011-03-25 19:04, Caligo wrote:
 T[3] data;
 
 T dot(const ref Vector o){
 return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
 }
 
 T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
 data[2] * data[2]; }
 T LengthSquared_Slow(){ return dot(this); }
 
 
 The faster LengthSquared() is twice as fast, and I've tested with GDC
 and DMD.  Is it because the compilers don't inline-expand the dot()
 function call?  I need the performance, but the faster version is too
 verbose.

It sure sounds like it didn't inline it. Did you compile with -inline? If you 
didn't then it definitely won't inline it.

- Jonathan M Davis


Re: inline functions

2011-03-25 Thread Caligo
On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com wrote:
 On 2011-03-25 19:04, Caligo wrote:
 T[3] data;

 T dot(const ref Vector o){
     return data[0] * o.data[0] + data[1] * o.data[1] + data[2] * o.data[2];
 }

 T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
 data[2] * data[2]; }
 T LengthSquared_Slow(){ return dot(this); }


 The faster LengthSquared() is twice as fast, and I've tested with GDC
 and DMD.  Is it because the compilers don't inline-expand the dot()
 function call?  I need the performance, but the faster version is too
 verbose.

 It sure sounds like it didn't inline it. Did you compile with -inline? If you
 didn't then it definitely won't inline it.

 - Jonathan M Davis


I didn't know I had to supply GDC with -inline, so I did, and it did
not help.  In fact, with the -inline option the performance gets worse
(for DMD and GDC), even for code that doesn't contain any function
calls.  In any case, code compiled with DMD is always behind GDC when
it comes to performance.


Re: inline functions

2011-03-25 Thread Jonathan M Davis
On 2011-03-25 21:21, Caligo wrote:
 On Fri, Mar 25, 2011 at 10:49 PM, Jonathan M Davis jmdavisp...@gmx.com 
wrote:
  On 2011-03-25 19:04, Caligo wrote:
  T[3] data;
  
  T dot(const ref Vector o){
  return data[0] * o.data[0] + data[1] * o.data[1] + data[2] *
  o.data[2]; }
  
  T LengthSquared_Fast(){ return data[0] * data[0] + data[1] * data[1] +
  data[2] * data[2]; }
  T LengthSquared_Slow(){ return dot(this); }
  
  
  The faster LengthSquared() is twice as fast, and I've tested with GDC
  and DMD.  Is it because the compilers don't inline-expand the dot()
  function call?  I need the performance, but the faster version is too
  verbose.
  
  It sure sounds like it didn't inline it. Did you compile with -inline? If
  you didn't then it definitely won't inline it.
  
  - Jonathan M Davis
 
 I didn't know I had to supply GDC with -inline, so I did, and it did
 not help.  In fact, with the -inline option the performance gets worse
 (for DMD and GDC), even for code that doesn't contain any function
 calls.  In any case, code compiled with DMD is always behind GDC when
 it comes to performance.

I don't know what gdc does, but you have to use -inline with dmd if you want 
it to inline anything. It also really doesn't make any sense at all that 
inlining would harm performance. If that's the case, something weird is going 
on. I don't see how inlining could _ever_ harm performance unless it just 
makes the program's binary so big that _that_ harms performance. That isn't 
very likely though. So, if using -inline is harming performance, then 
something weird is definitely going on.

- Jonathan M Davis