On Friday, 14 February 2020 at 22:36:20 UTC, PatateVerte wrote:
Hello
I noticed a strange behaviour of the DMD compiler when it has
to call a function with float arguments.
I build with the flags "-mcpu=avx2 -O -m64" on 64-bit Windows
using "DMD32 D Compiler v2.090.1-dirty".
I have the following function :
float mul_add(float a, float b, float c); //Return a * b + c
When I try to call it :
float f = mul_add(1.0, 2.0, 3.0);
the following instructions are generated (I tested other functions
with float parameters and they show the same problem):
//Loads the values, as it can be expected
vmovss xmm2,dword [rel 0x64830]
vmovss xmm1,dword [rel 0x64834]
vmovss xmm0,dword [rel 0x64838]
//Why ?
movq r8,xmm2
movq rdx,xmm1
movq rcx,xmm0
//
call 0x400 //0x400 is where the mul_add function is located
My questions are :
- Is there a reason why the registers xmm0/1/2 are saved in
rcx/rdx/r8 before calling ? The calling convention specifies
that the floating point parameters have to be put in xmm
registers, and not GPR, unless you are using your own calling
convention.
- Why is it done using non-AVX instructions ? Mixing AVX and
non-AVX instructions may impact the speed greatly.
Any idea ? Thank you in advance.
It's simply bad codegen (or rather a missed optimization
opportunity) from DMD: its backend doesn't see that the parameters
are already in the right order and in the right registers, so it
copies them and puts them in the regs for the inner func call.
I had observed this in the past too, i.e. unexplained round
tripping between GP and SSE regs. For good FP codegen use LDC2 or
GDC, or write iasm (but lose inlining).
For other people who'd like to observe the problem:
https://godbolt.org/z/gvqEqz.
By the way, I had to deactivate AVX2 targeting because otherwise
the result is even weirder (https://godbolt.org/z/T9NwMc).
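
For anyone who prefers to reproduce this locally, here is a minimal sketch of a test case. It is my own guess at a reduced example, not necessarily the source behind the godbolt links; the file name and the use of pragma(inline, false) are assumptions, the latter added only so the call (and the argument setup before it) is not inlined away.

```d
// repro.d (hypothetical file name, not taken from the thread)
import std.stdio : writeln;

// Assumption: keep the function out of line so the call site,
// including the argument moves before it, remains in the output.
pragma(inline, false)
float mul_add(float a, float b, float c)
{
    return a * b + c; // same contract as in the original post
}

void main()
{
    // Float literals, so no double-to-float conversion is involved.
    float f = mul_add(1.0f, 2.0f, 3.0f);
    writeln(f);
}
```

Building with the flags from the original post (dmd -mcpu=avx2 -O -m64 repro.d) and disassembling the call in main should show whether the xmm-to-GPR copies appear with your DMD version.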