It seems I had misused volatile. Removing ‘volatile’ from the function argument
of test_0 eliminated the spill through the stack.
I added volatile because I was trying to stop the compiler optimising away the
call to test_0 (as it has no side effects), but it turns out that volatile was
unnecessary and was a misuse of the qualifier (which is intended to indicate
that storage may change outside the control of the compiler). However, it is an
interesting case, as register arguments don’t have storage.
GCC and Clang folk, any ideas on why there is a stack spill for a volatile
register argument passed in esi? Does volatile force the argument to have
storage allocated on the stack? Is this a corner case in the C standard? In the
x86_64 calling convention this argument only has a register, so technically it
can’t change outside the control of the C “virtual machine”, and volatile has a
vague meaning here. This seems to be a case of interpreting the C standard in
such a way as to make sure that a volatile argument “can be changed” outside
the control of the C “virtual machine” by explicitly giving it a storage
location on the stack. I think volatile scalar arguments are a special case and
that the volatile qualifier shouldn’t widen the scope beyond the register
unless the argument actually *needs* storage to spill. This is not a volatile
stack-scoped variable unless the C standard interprets ABI register parameters
as actually having ‘storage’, so this is questionable. Maybe I should have
gotten a warning, or the volatile qualifier on a scalar register argument
should have been ignored.
volatile for scalar function arguments seems to mean: “make this volatile and
subject to change outside of the compiler” rather than being a qualifier for
its storage (which is a register).
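For the original goal (keeping the control call from being optimised away), one
alternative is to leave the parameter unqualified and make the use of the
result observable instead. The following is only a minimal sketch and is not
taken from the test program: the names mod_plain, sink and run_control are
hypothetical, and it assumes the same GCC/Clang attribute syntax as the test
program. The store to the volatile object keeps the result live; whether the
compiler still emits a real call rather than a precomputed value may depend on
the compiler and version.

```c
/* Hypothetical names: mod_plain, sink and run_control are illustrative only. */

/* volatile applied to an object with real storage: every store to it is an
   observable side effect that the compiler must perform. */
static volatile int sink;

/* Same body as test_0, but with a plain (non-volatile) parameter, so the
   divisor has no reason to be forced through a stack slot. */
__attribute__((noinline)) int mod_plain(unsigned int k, int p)
{
    return k % p;
}

/* The result is written to the volatile sink, so it cannot simply be
   discarded as dead code. */
void run_control(void)
{
    sink = mod_plain(1, 8191);
}
```

The GCC documentation for the noinline attribute also suggests placing
asm (""); in the called function when a call to a function without side
effects has to be kept; that variant is not shown here.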
# gcc
test_0:
mov DWORD PTR [rsp-4], esi
mov ecx, DWORD PTR [rsp-4]
mov eax, edi
cdq
idiv ecx
mov eax, edx
ret
# clang
test_0:
mov dword ptr [rsp - 4], esi
xor edx, edx
mov eax, edi
div dword ptr [rsp - 4]
mov eax, edx
ret
/* Test program compiled on x86_64 with: cc -O3 -fomit-frame-pointer
-masm=intel -S test.c -o test.S */
#include <stdio.h>
#include <limits.h>
static const int p = 8191;
static const int s = 13;
int __attribute__ ((noinline)) test_0(unsigned int k, volatile int p)
{
return k % p;
}
int __attribute__ ((noinline)) test_1(unsigned int k)
{
return k % p;
}
int __attribute__ ((noinline)) test_2(unsigned int k)
{
int i = (k&p) + (k>>s);
i = (i&p) + (i>>s);
if (i>=p) i -= p;
return i;
}
int main()
{
test_0(1, 8191); /* control */
for (int i = INT_MIN; i < INT_MAX; i++) {
int r1 = test_1(i), r2 = test_2(i);
if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
}
}
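The strength reduction being compared against k % p works because
2^13 = 8191 + 1, so 2^13 is congruent to 1 modulo 8191 and the high bits of k
can be folded onto the low 13 bits. Below is a hedged, stand-alone sketch of
that folding with the bounds written out. The names mod_m13, M13 and M13_BITS
are hypothetical, and the sketch assumes a 32-bit unsigned int; it is not the
code either compiler generates.

```c
/* Hypothetical helper, not part of the original test program.
   Assumes unsigned int is 32 bits wide. */
#define M13_BITS 13u
#define M13 ((1u << M13_BITS) - 1u)   /* 8191 = 2^13 - 1 */

unsigned int mod_m13(unsigned int k)
{
    /* Write k = hi * 2^13 + lo. Since 2^13 == 1 (mod 8191),
       k is congruent to hi + lo. */
    unsigned int i = (k & M13) + (k >> M13_BITS);   /* at most 8191 + 524287 */

    /* Fold once more to bring the value close to the modulus. */
    i = (i & M13) + (i >> M13_BITS);                /* at most 8191 + 64 */

    /* i is now below 2 * 8191, so one conditional subtract finishes. */
    if (i >= M13)
        i -= M13;
    return i;
}
```

Two folds are enough for 32-bit inputs because the second fold leaves a value
below 2 * 8191; a wider input type would need the bounds rechecked.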
> On 27 Mar 2016, at 2:32 PM, Andrew Waterman wrote:
>
> It would be good to figure out how to get rid of the spurious register spills.
>
> The strength reduction optimization isn't always profitable on Rocket,
> as it increases instruction count and code size. The divider has an
> early out and for small numbers is quite fast.
>
> On Fri, Mar 25, 2016 at 5:43 PM, Michael Clark wrote:
>> Now, I have no idea how many cycles an integer divide takes on the
>> Rocket, so the optimisation may not be a win.
>>
>> Trying to read MuDiv in multiplier.scala, and will at some point run some
>> timings in the cycle-accurate simulator.
>>
>> In either case, the spurious stack moves emitted by GCC are curious...
>>
>>> On 26 Mar 2016, at 9:42 AM, Michael Clark wrote:
>>>
>>> Hi All,
>>>
>>> I have found an interesting case where an optimisation is not being applied
>>> by GCC on RISC-V. And also some strange assembly output from GCC on RISC-V.
>>>
>>> Both GCC and Clang appear to optimise division by a constant Mersenne prime
>>> on x86_64 however GCC on RISC-V is not applying this optimisation.
>>>
>>> See test program and assembly output for these platforms:
>>>
>>> * GCC -O3 on RISC-V
>>> * GCC -O3 on x86_64
>>> * LLVM/Clang -O3 on x86_64
>>>
>>> Another strange observation is GCC on RISC-V is moving a1 to a5 via a stack
>>> store followed by a stack load. Odd? GCC 5 also seems to be doing odd stuff
>>> with stack ‘moves’ on x86_64, moving esi to ecx via the stack (I think
>>> recent x86 micro-architecture treats tip of the stack like an extended
>>> register file so this may only have a small penalty on x86).
>>>
>>> See GCC on RISC-V is emitting this:
>>>
>>> test_0:
>>> add sp,sp,-16
>>> sw a1,12(sp)
>>> lw a5,12(sp)
>>> add sp,sp,16
>>> remuw a0,a0,a5
>>> jr ra
>>>
>>> instead of this:
>>>
>>> test_0:
>>> remuw a0,a0,a1
>>> jr ra
>>>
>>> Compiler devs, please read Test program and assembly output. I have not yet
>>> tested