https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117251
Bug ID: 117251
Summary: SHA3 code for PowerPC has a major slow down
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: meissner at gcc dot gnu.org
Target Milestone: ---
Created attachment 59405
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59405&action=edit
Multibuff.c test
The SHA3 functions are slower when compiled for power10 with GCC 15 and GCC 14
than with GCC 13 and GCC 12, due to excessive amounts of spilling.
The main function for multibuf.c has 3,747 lines, all of which use vector
unsigned long long. There are 696 vector shifts (all by constant amounts),
1,824 vector xor's and 600 vector andc's.
The timing for these runs is the following:
Trunk (sources checked out October 5th): 6.15 seconds
GCC 14 (sources checked out October 21st): 6.28 seconds
GCC 13 (sources checked out October 21st): 5.57 seconds
GCC 12 (sources checked out October 21st): 5.61 seconds
GCC 11 (sources checked out October 21st): 9.56 seconds
Looking at the generated code, the main thing that stands out as the cause of
the spilling and register moves is the support in gcc/config/rs6000/fusion.md
(generated by gcc/config/rs6000/genfusion.pl) that tries to fuse a vec_andc
feeding into a vec_xor, and a vec_xor feeding into another vec_xor.
On power10, a special fusion occurs when a VANDC or VXOR instruction is
adjacent to a VXOR instruction and the result of the VANDC/VXOR feeds into the
2nd VXOR instruction.
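For illustration, the two source shapes that these fusion patterns target can
be sketched as follows. This is a minimal sketch, not code taken from the
attachment: the type name v2du and the function names chi and xor3 are
hypothetical, and it uses the GCC/Clang vector extension rather than the
Altivec intrinsics so that it builds on any target. The andc-into-xor shape is
the chi step of Keccak, which is presumably where the 600 andc's come from.

```c
#include <stdint.h>

/* Two 64-bit lanes, the same layout as "vector unsigned long long".
   v2du is a hypothetical name used only in this sketch. */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* andc feeding xor: a ^ (~b & c), the Keccak chi step.
   This is the shape a VANDC -> VXOR fusion would apply to. */
static inline v2du chi(v2du a, v2du b, v2du c)
{
    v2du t = ~b & c;   /* candidate for VANDC (or XXLANDC) */
    return a ^ t;      /* candidate for VXOR consuming that result */
}

/* xor feeding xor: (a ^ b) ^ c.
   This is the shape a VXOR -> VXOR fusion would apply to. */
static inline v2du xor3(v2du a, v2du b, v2du c)
{
    v2du t = a ^ b;    /* first xor */
    return t ^ c;      /* second xor consuming the first */
}
```

In both functions the temporary t is only live between the two operations, so
the fused pair needs both of its instructions to use registers that the
V-prefix instructions can address.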
While power10 has 64 vector registers (reachable through the VSX logical
instructions, which use the XXL prefix), the fusion only works with the older
Altivec instructions (which use the V prefix). The Altivec instructions can
only address 32 vector registers (which are overlaid on VSX vector registers
32-63).
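The register overlay can be written down as a small model. This is only a
sketch of the numbering described above; the names altivec_to_vsx and
vsx_usable_by_altivec are hypothetical and do not correspond to anything in
the GCC sources.

```c
#include <stdbool.h>

enum {
    NUM_VSX_REGS     = 64,  /* VSX registers 0..63 */
    NUM_ALTIVEC_REGS = 32,  /* Altivec registers 0..31 */
    ALTIVEC_BASE     = 32   /* Altivec register 0 is VSX register 32 */
};

/* Map an Altivec register number (0..31) to the VSX register it overlays. */
static inline int altivec_to_vsx(int vr)
{
    return vr + ALTIVEC_BASE;
}

/* True if a VSX register number can be named by a V-prefix instruction. */
static inline bool vsx_usable_by_altivec(int vsr)
{
    return vsr >= ALTIVEC_BASE && vsr < NUM_VSX_REGS;
}
```

Any value that must pass through a V-prefix instruction is therefore limited
to the upper half of the VSX register file.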
Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
fusion, the register allocator sees more register pressure on the traditional
Altivec registers and cannot spread the values across the other VSX registers.
In addition, the vector shifts only work on the traditional Altivec registers,
which adds to the Altivec register pressure.
Finally, loading the vector constants for the shifts requires Altivec
registers (using XXSPLTIB and VEXTSB2D to form the constant). But this doesn't
add to the register pressure, since these constants are all used in the VRLD
vector shift instruction.
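As a rough model of that sequence, the following sketch emulates one 64-bit
lane in scalar C. It assumes the usual reading of the instructions: XXSPLTIB
copies an 8-bit immediate into every byte, VEXTSB2D sign-extends the low byte
of each doubleword to 64 bits, and VRLD rotates each doubleword left by the
low 6 bits of the matching count element. The function names are hypothetical.

```c
#include <stdint.h>

/* Model of XXSPLTIB for one doubleword: the immediate in all 8 bytes. */
static inline uint64_t splat_byte(uint8_t imm)
{
    return imm * 0x0101010101010101ULL;
}

/* Model of VEXTSB2D for one doubleword: sign-extend the low byte. */
static inline uint64_t extend_sign_byte(uint64_t dword)
{
    return (uint64_t)(int64_t)(int8_t)(dword & 0xFF);
}

/* Model of VRLD for one doubleword: rotate left by the low 6 bits of count. */
static inline uint64_t rotl64(uint64_t x, uint64_t count)
{
    unsigned n = (unsigned)(count & 63);
    return n ? (x << n) | (x >> (64 - n)) : x;
}
```

For example, rotl64(x, extend_sign_byte(splat_byte(44))) rotates x left by 44
bits, with the count built by the two-instruction constant sequence.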
Here are some summaries for the various compilers:
                                          Trunk  GCC14  GCC13  GCC12  GCC11
                                          -----  -----  -----  -----  -----
Fuse VANDC -> VXOR                          600    600    600    600    600
Fuse VXOR -> VXOR                           240    240    120    120    120
Spill vector to stack                       364    364    172    184    110
Load spilled vector from stack              962    962    713    723    166
Vector moves                                100    100     70     72  3,055
Vector shift right                          696    696    696    696    696
XXLANDC or VANDC                            600    600    600    600    600
XXLXOR or VXOR                            1,824  1,824  1,824  1,824  1,825
XXSPLTIB and VEXTSB2D to load constants      24     24     24     24     24
This means that current trunk and GCC 14 have more vector spills and loads than
GCC 13 and GCC 12. In addition, they have somewhat more vector moves.
Current trunk and GCC 12 through 14 have more vector spills than GCC 11, but
GCC 11 has many more vector moves than the other compilers. Thus even though it
has far fewer spills, the vector moves are why GCC 11 has the slowest results.
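To make the comparison concrete, the spill, reload and move counts above can
be added up per compiler. This is only a sketch: the struct and function names
are hypothetical, and weighting the three kinds of instruction equally is an
assumption. These are static instruction counts, not measured execution
counts.

```c
#include <stddef.h>

/* One row per compiler, using the counts reported above. */
struct counts {
    const char *compiler;
    int spills;    /* spill vector to stack */
    int reloads;   /* load spilled vector from stack */
    int moves;     /* vector moves */
};

static const struct counts table[] = {
    { "Trunk",  364, 962,  100 },
    { "GCC 14", 364, 962,  100 },
    { "GCC 13", 172, 713,   70 },
    { "GCC 12", 184, 723,   72 },
    { "GCC 11", 110, 166, 3055 },
};

/* Instructions that only shuffle data around and do no SHA3 work. */
static inline int overhead(const struct counts *c)
{
    return c->spills + c->reloads + c->moves;
}
```

The totals come to 1,426 for trunk and GCC 14, 955 for GCC 13, 979 for GCC 12
and 3,331 for GCC 11, which is the same ordering as the timings above.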