https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
Bug ID: 117007
Summary: Poor optimization for small vector constants needed for
vector shift/rotate/mask generation.
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: munroesj at gcc dot gnu.org
Target Milestone: ---
Created attachment 59291
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
compile with -m64 -O3 -mcpu=power8 or power9
Vector library codes frequently need to "splat" small integer constants
for vector shifts, rotates, and mask generation. The instructions for this
exist (e.g. vspltisw, xxspltib, xxspltiw) and are supported by intrinsics.
But when these intrinsics are used to provide constants in VRs for other
vector operations, the compiler goes out of its way to convert them into
vector loads from .rodata.
This is especially bad for power8/9, as .rodata accesses require 32-bit
offsets and always generate 3/4 instructions with a best-case (L1 cache hit)
latency of 9 cycles. The original splat-immediate/shift implementation runs
in 2-4 instructions (with a good chance for CSE) and 4-6 cycles of latency.
For example:
vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}
GCC V6 generates:
00000000000001c0 <mask_sig_v2>:
1c0: 8c 03 09 10 vspltisw v0,9
1c4: 8c 03 5f 10 vspltisw v2,-1
1c8: 84 02 42 10 vsrw v2,v2,v0
1cc: 20 00 80 4e blr
While GCC 13.2.1 generates:
00000000000001c0 <mask_sig_v2>:
1c0: 00 00 4c 3c addis r2,r12,0
1c0: R_PPC64_REL16_HA .TOC.
1c4: 00 00 42 38 addi r2,r2,0
1c4: R_PPC64_REL16_LO .TOC.+0x4
1c8: 00 00 22 3d addis r9,r2,0
1c8: R_PPC64_TOC16_HA .rodata.cst16+0x20
1cc: 00 00 29 39 addi r9,r9,0
1cc: R_PPC64_TOC16_LO .rodata.cst16+0x20
1d0: ce 48 40 7c lvx v2,0,r9
1d4: 20 00 80 4e blr
This is the same for -mcpu=power8/power9.
It gets worse for vector functions that require multiple shift/mask constants.
For example:
// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
                    (vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden-bit and the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}
GCC V6 generated:
0000000000000310 <test_extsig_v2>:
310: 8c 03 bf 11 vspltisw v13,-1
314: 8c 03 37 10 vspltisw v1,-9
318: 8c 03 60 11 vspltisw v11,0
31c: 06 0a 0d 10 vcmpgtub v0,v13,v1
320: 84 09 00 10 vslw v0,v0,v1
324: 8c 03 29 10 vspltisw v1,9
328: 17 14 80 f1 xxland vs44,vs32,vs34
32c: 84 0a 2d 10 vsrw v1,v13,v1
330: 86 00 0c 10 vcmpequw v0,v12,v0
334: 86 58 8c 11 vcmpequw v12,v12,v11
338: 80 6c a1 11 vsubuwm v13,v1,v13
33c: 17 14 41 f0 xxland vs34,vs33,vs34
340: 17 65 00 f0 xxlnor vs32,vs32,vs44
344: 7f 03 42 f0 xxsel vs34,vs34,vs32,vs45
348: 20 00 80 4e blr
While GCC 13.2.1 -mcpu=power8 generates:
0000000000000360 <test_extsig_v2>:
360: 00 00 4c 3c addis r2,r12,0
360: R_PPC64_REL16_HA .TOC.
364: 00 00 42 38 addi r2,r2,0
364: R_PPC64_REL16_LO .TOC.+0x4
368: 00 00 02 3d addis r8,r2,0
368: R_PPC64_TOC16_HA .rodata.cst16+0x30
36c: 00 00 42 3d addis r10,r2,0
36c: R_PPC64_TOC16_HA .rodata.cst16+0x20
370: 8c 03 a0 11 vspltisw v13,0
374: 00 00 08 39 addi r8,r8,0
374: R_PPC64_TOC16_LO .rodata.cst16+0x30
378: 00 00 4a 39 addi r10,r10,0
378: R_PPC64_TOC16_LO .rodata.cst16+0x20
37c: 00 00 22 3d addis r9,r2,0
37c: R_PPC64_TOC16_HA .rodata.cst16+0x40
380: e4 06 4a 79 rldicr r10,r10,0,59
384: ce 40 20 7c lvx v1,0,r8
388: 00 00 29 39 addi r9,r9,0
388: R_PPC64_TOC16_LO .rodata.cst16+0x40
38c: 8c 03 17 10 vspltisw v0,-9
390: 98 56 00 7c lxvd2x vs0,0,r10
394: e4 06 29 79 rldicr r9,r9,0,59
398: 98 4e 80 7d lxvd2x vs12,0,r9
39c: 84 01 21 10 vslw v1,v1,v0
3a0: 50 02 00 f0 xxswapd vs0,vs0
3a4: 17 14 01 f0 xxland vs32,vs33,vs34
3a8: 50 62 8c f1 xxswapd vs12,vs12
3ac: 12 14 00 f0 xxland vs0,vs0,vs34
3b0: 86 00 21 10 vcmpequw v1,v1,v0
3b4: 86 68 00 10 vcmpequw v0,v0,v13
3b8: 17 05 21 f0 xxlnor vs33,vs33,vs32
3bc: 33 0b 40 f0 xxsel vs34,vs0,vs33,vs12
3c0: 20 00 80 4e blr
And GCC 13.2.1 -mcpu=power9 generates:
0000000000000310 <test_extsig_v2>:
310: 00 00 4c 3c addis r2,r12,0
310: R_PPC64_REL16_HA .TOC.
314: 00 00 42 38 addi r2,r2,0
314: R_PPC64_REL16_LO .TOC.+0x4
318: 00 00 22 3d addis r9,r2,0
318: R_PPC64_TOC16_HA .rodata.cst16+0x10
31c: 8c 03 17 10 vspltisw v0,-9
320: 00 00 42 3d addis r10,r2,0
320: R_PPC64_TOC16_HA .rodata.cst16
324: d1 02 a0 f1 xxspltib vs45,0
328: 00 00 29 39 addi r9,r9,0
328: R_PPC64_TOC16_LO .rodata.cst16+0x10
32c: 00 00 4a 39 addi r10,r10,0
32c: R_PPC64_TOC16_LO .rodata.cst16
330: 09 00 29 f4 lxv vs33,0(r9)
334: 01 00 0a f4 lxv vs0,0(r10)
338: 00 00 22 3d addis r9,r2,0
338: R_PPC64_TOC16_HA .rodata.cst16+0x20
33c: 00 00 29 39 addi r9,r9,0
33c: R_PPC64_TOC16_LO .rodata.cst16+0x20
340: 01 00 89 f5 lxv vs12,0(r9)
344: 84 01 21 10 vslw v1,v1,v0
348: 12 14 00 f0 xxland vs0,vs0,vs34
34c: 17 14 01 f0 xxland vs32,vs33,vs34
350: 86 00 21 10 vcmpequw v1,v1,v0
354: 86 68 00 10 vcmpequw v0,v0,v13
358: 17 05 21 f0 xxlnor vs33,vs33,vs32
35c: 33 0b 40 f0 xxsel vs34,vs0,vs33,vs12
360: 20 00 80 4e blr
I have attached a reduced test case for vector unsigned int with more
examples. None of these examples should convert splat-immediate intrinsics
into vector loads from .rodata.