https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117487
Bug ID: 117487
Summary: Power8 optimizations for math library aren't done in
power9 or power10 (PR target/71977)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: meissner at gcc dot gnu.org
Target Milestone: ---
I was answering an email about something else, and I wanted to look up code
that I added on January 4th, 2017 (PR target/71977, PR target/70568, PR
target/78823). I noticed that while this code is optimized on power8, it is not
optimized on power9 or power10.
The code (gcc.target/pr71977-1.c) is:
#include <stdint.h>

typedef union
{
  float value;
  uint32_t word;
} ieee_float_shape_type;

float
mask_and_float_var (float f, uint32_t mask)
{
  ieee_float_shape_type u;
  u.value = f;
  u.word &= mask;
  return u.value;
}
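For context, here is a minimal sketch of the kind of caller this pattern
presumably serves in a math library: masking off the IEEE 754 sign bit gives
the absolute value. The helper abs_via_mask is hypothetical and is not part of
the test case; the test function is repeated only so the sketch stands alone.

```c
#include <stdint.h>

typedef union
{
  float value;
  uint32_t word;
} ieee_float_shape_type;

/* Same function as in the test case.  */
float
mask_and_float_var (float f, uint32_t mask)
{
  ieee_float_shape_type u;
  u.value = f;
  u.word &= mask;
  return u.value;
}

/* Hypothetical caller (an assumption, not from the PR): clear bit 31,
   the sign bit of a single-precision value.  */
float
abs_via_mask (float f)
{
  return mask_and_float_var (f, UINT32_C (0x7fffffff));
}
```

With this, abs_via_mask (-1.5f) returns 1.5f, and a mask of 0 turns any input
into +0.0f.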
The initial code generated before the January 4th, 2017 changes was:
xscvdpspn 0,1
mfvsrwz 9,0
and 9,9,4
sldi 9,9,32
mtvsrd 1,9
xscvspdpn 1,1
blr
Note that there is a direct move from the FPR/vector registers to the GPR
registers, the logical operation is done in the GPR registers, and then there
is a direct move back to the FPR/vector registers.
After the changes, the code for power8 is:
xscvdpspn 0,1
sldi 9,4,32
mtvsrd 32,9
xxland 1,0,32
xscvspdpn 1,1
blr
In this case, we avoid a direct register move from the FPR/vector registers to
the GPR registers, and we do the logical operation in the vector registers.
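The two sequences give the same result because AND commutes with the shift.
The following is a simplified C model of that, assuming the 32-bit float image
sits in the upper half of a 64-bit value and ignoring the lower half. The
function names are hypothetical and only loosely mirror the instructions.

```c
#include <stdint.h>

/* GPR path (old code): AND the 32-bit word with the mask in a GPR,
   then shift the result into the upper half (like "and" + "sldi").  */
uint64_t
model_and_in_gpr (uint32_t word, uint32_t mask)
{
  return (uint64_t) (word & mask) << 32;
}

/* Vector path (power8 code): shift only the mask into the upper half
   (like "sldi" + "mtvsrd"), then AND it with the value that is already
   in the upper half of the vector register (like "xxland").  */
uint64_t
model_and_in_vector (uint32_t word, uint32_t mask)
{
  uint64_t value_upper = (uint64_t) word << 32;
  uint64_t mask_upper = (uint64_t) mask << 32;
  return value_upper & mask_upper;
}
```

Both functions return the same 64-bit value for every word and mask, so only
the mask has to cross from the GPR side to the vector side.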
If we look at the power10/power9 code, it is:
xscvdpspn 0,1
mfvsrwz 2,0
and 2,2,4
mtvsrws 1,2
xscvspdpn 1,1
blr
I.e. we do 2 direct moves between the GPR registers and the FPR/vector
registers and do the logical operation in the GPR registers.
The reason for this is that we have the MTVSRWS instruction in power9/power10
(splat the bottom 32 bits of a GPR register into a FPR/vector register). In the
power8 case, we don't have MTVSRWS, so instead we need to do a shift left by
32 bits (SLDI) and then a direct move to the FPR/vector registers before we can
do XSCVSPDPN.
The XSCVSPDPN instruction wants the value in the upper 32 bits. We get it there
either by a left shift or by a splat operation.
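As a rough illustration, here is a C model of the two ways of getting the word
into the upper 32 bits. It only models one 64-bit doubleword (the real MTVSRWS
fills every word of the vector register), and the function names are
hypothetical.

```c
#include <stdint.h>

/* Model of the shift: the word lands in the upper half and the lower
   half is zero.  */
uint64_t
model_shift_left_32 (uint32_t word)
{
  return (uint64_t) word << 32;
}

/* Model of the splat, restricted to one doubleword: the word is
   replicated into both halves.  */
uint64_t
model_splat_word (uint32_t word)
{
  return ((uint64_t) word << 32) | word;
}

/* The upper 32 bits, which is where XSCVSPDPN wants its input.  */
uint32_t
upper_word (uint64_t doubleword)
{
  return (uint32_t) (doubleword >> 32);
}
```

The two models differ in the lower half, but upper_word gives back the
original word for both, which is all that matters for the conversion.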
To fix this, we would need a define_peephole2 similar to the one around line
6318 of vsx.md, but one that matches the splat operation instead of the shift
and 64-bit move.