[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007 --- Comment #13 from Steven Munroe --- It seems like even for small values of signed char, vec_splats ((signed char)x) for target -mcpu=power9 will sometimes generate 2 instructions where it should only generate a single xxspltib.
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007 --- Comment #12 from Steven Munroe --- It seems like even for small values of signed char, vec_splats ((signed char)x) will sometimes generate 2 instructions where it should only generate a single xxspltib.
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #11 from Steven Munroe ---
Created attachment 59560
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59560&action=edit
Test cases for vec_splats((signed char)) on -mcpu=power9 for any valid char value

I would expect, for example, vec_splats ((signed char) 16) to generate a
single xxspltib vs34,16. But I am seeing:

0020 :
  20: d1 42 40 f0  xxspltib vs34,8
  24: 00 10 42 10  vaddubm  v2,v2,v2
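A sketch of the kind of test in the attachment (my function name; assumes
<altivec.h> and -mcpu=power9):

vector signed char
test_splatisb_16 (void)
{
  // Expected: a single xxspltib vs34,16.
  // Observed: xxspltib vs34,8 followed by vaddubm v2,v2,v2 (8 + 8 = 16).
  return vec_splats ((signed char) 16);
}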
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #10 from Steven Munroe ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.

Yes. More examples:

vui64_t
test_sld_52_v1 (vui64_t vra)
{
  vui32_t shft = vec_splat_u32(52-64);
  return vec_vsld (vra, (vui64_t) shft);
}

vui64_t
test_sld_52_v0 (vui64_t vra)
{
  return vra << 52;
}

The PowerISA is challenged to generate a vector doubleword constant, so it
seems easier to load such constants from .rodata. Again, a load from .rodata
is a minimum of 3 instructions with a latency of 9 cycles (L1 cache hit). But
there are many examples of vector doubleword operations that need small
constants. Also, the doubleword shift/rotate operations only require a 6-bit
shift count. Here changing the vector shift intrinsics to accept vector
unsigned char for the shift count would be helpful.

It is often faster to generate these constants from existing splat immediate
instructions and 1-2 other operations than to pay the full latency cost of a
(.rodata) vector load. For power8 the current GCC compilers take this option
away from the library developer. For example:

gcc-13 -O3 -mcpu=power8 -mtune=power8

01e0 :                                            #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha   #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l    #L 2/2
 1f0: ce 48 00 7c  lvx   v0,0,r9                  #L 5/5
 1f4: c4 05 42 10  vsld  v2,v2,v0                 #L 2/2
 1f8: 20 00 80 4e  blr

01b0 :                                            #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha   #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l    #L 2/2
 1c0: ce 48 00 7c  lvx   v0,0,r9                  #L 5/5
 1c4: c4 05 42 10  vsld  v2,v2,v0                 #L 2/2
 1c8: 20 00 80 4e  blr

While the original Power64LE support compilers would allow the library
developer to use intrinsics to generate smaller/faster sequences. Again, the
PowerISA vector shift/rotate doubleword operations only need the low-order
6 bits for the shift count. Here the original Altivec vec_splat_u32() can
generate shift counts for the ranges 0-15 and 48-63 easily. Or, if the vector
shift/rotate intrinsics would accept vector unsigned char for the shift
count, the library developer could use vec_splat_u8().

gcc-6 -O3 -mcpu=power8 -mtune=power8

0170 :                                            #TL 4/4
 170: 8c 03 14 10  vspltisw v0,-12                #L 2/2
 174: c4 05 42 10  vsld     v2,v2,v0              #L 2/2
 178: 20 00 80 4e  blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it
for the inline vector case. But it will always insert the extend signed byte
to doubleword. The current Power Intrinsic Reference does not provide a
direct mechanism to generate xxspltib. If vec_splat_u32() is used, the
current compiler (constant propagation?) will convert this into a load
vector (lxv this time) from .rodata. This is still 3 instructions and 9
cycles.

gcc-13 -O3 -mcpu=power9 -mtune=power9

01a0 :                                            #TL 7/7
 1a0: d1 a2 01 f0  xxspltib vs32,52               #L 3/3
 1a4: 02 06 18 10  vextsb2d v0,v0                 #L 2/2
 1a8: c4 05 42 10  vsld     v2,v2,v0              #L 2/2
 1ac: 20 00 80 4e  blr

0170 :                                            #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha   #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l    #L 2/2
 180: 09 00 09 f4  lxv   vs32,0(r9)               #L 5/5
 184: c4 05 42 10  vsld  v2,v2,v0                 #L 2/2
 188: 20 00 80 4e  blr

This is still larger and slower than if the compiler/intrinsic would allow
the direct use of xxspltib to generate the shift count for vsld.
gcc-fix -O3 -mcpu=power9 -mtune=power9

0170 :                                            #TL 5/5
 170: d1 a2 01 f0  xxspltib vs32,52               #L 3/3
 174: c4 05 42 10  vsld     v2,v2,v0              #L 2/2
 178: 20 00 80 4e  blr

Power10 also generates VSX Vector Splat Immediate Byte plus extend sign for
the inline vector doubleword shift. But it again converts the vec_splat_u32()
intrinsic into a load vector (plxv this time) from .rodata. This is smaller
and faster than the power9 sequence but seems a bit of overkill for the small
constant (52) involved.

gcc-13 -O3 -mcpu=power10 -mtune=power10

01d0 :                                            #TL 7/11
 1d0: d1 a2 01 f0  xxspltib vs32,52               #L 3/4
 1d4: 02 06 18 10  vextsb2d v0,v0                 #L 3/4
 1d8: c4 05 42 10  vsld     v2,v2,v0              #L 1/3
 1dc: 20 00 80 4e  blr

01b0 :
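As a sketch of what the vec_splat_u8() suggestion above would look like in
source (my function name; not current intrinsic behavior, since today's
vec_sl/vec_vsld reject a byte shift count without the cast): vsld only reads
the low-order 6 bits of each doubleword element, so a byte splat of -12
(0xF4) already supplies a shift count of 52 in those bits:

vui64_t
test_sld_52_v2 (vui64_t vra)
{
  // vspltisb/xxspltib of -12 splats 0xF4 into every byte; the low-order
  // 6 bits of each doubleword are then 0b110100 = 52.
  vui8_t shft = vec_splat_u8 (52 - 64);
  return vec_vsld (vra, (vui64_t) shft);
}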
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #9 from Steven Munroe ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.

Yes. Let's try some specific examples and examine the code generated for
power8/9/10.

vui32_t
test_slw_23_v0 (vui32_t vra)
{
  return vra << 23;
}

vui32_t
test_slw_23_v1 (__vector unsigned int vra)
{
  vui32_t shft = vec_splat_u32(23-32);
  return vec_sl (vra, shft);
}

gcc-13 -O3 -mcpu=power8 -mtune=power8

0100 :                                            #TL 11/11
 100: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 104: 00 00 42 38  addi  r2,r2,.TOC.@l
 108: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha   #L 2/2
 10c: 00 00 29 39  addi  r9,r9,.rodata.cst16@l    #L 2/2
 110: ce 48 00 7c  lvx   v0,0,r9                  #L 5/5
 114: 84 01 42 10  vslw  v2,v2,v0                 #L 2/2
 118: 20 00 80 4e  blr

00e0 :                                            #TL 4/4
  e0: 8c 03 17 10  vspltisw v0,-9                 #L 2/2
  e4: 84 01 42 10  vslw     v2,v2,v0              #L 2/2
  e8: 20 00 80 4e  blr

For inline vector code GCC tends to generate a load from .rodata. The
addis/addi/lvx (3 instruction) sequence is always generated for the medium
memory model. Only the linker will know the final offset, so there is no
optimization. This is a dependent sequence with a best case (L1 cache hit)
latency of 11 cycles.

Using the vector unsigned int type and the intrinsic vec_splat_u32()/vec_sl()
sequence generates two instructions (vspltisw/vslw) for this simple case.
Again a dependent sequence, for 4 cycles total. 4 cycles beats 11.

gcc-13 -O3 -mcpu=power9 -mtune=power9

0100 :                                            #TL 7/7
 100: d1 ba 00 f0  xxspltib vs32,23               #L 3/3
 104: 02 06 10 10  vextsb2w v0,v0                 #L 2/2
 108: 84 01 42 10  vslw     v2,v2,v0              #L 2/2
 10c: 20 00 80 4e  blr

00e0 :                                            #TL 5/5
  e0: 8c 03 17 10  vspltisw v0,-9                 #L 3/3
  e4: 84 01 42 10  vslw     v2,v2,v0              #L 2/2
  e8: 20 00 80 4e  blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it
for the inline vector case. The disadvantage is that it is a byte splat for a
word shift, so the compiler inserts the (pedantic) extend byte to word. This
adds 1 instruction and 2 cycles latency to the sequence. The ISA for vector
shift word only requires the low-order 5 bits of each element for the shift
count. So the extend is not required, and either vspltisw or xxspltib will
work here. This is an example where changing the vector shift intrinsics to
accept vector unsigned char for the shift count would be helpful. Again the
intrinsic implementation beats the compiler's inline vector code by 2 cycles
(5 vs 7 cycles) and one less instruction.

gcc-13 -O3 -mcpu=power10 -mtune=power10

0100 :                                            #TL 4/7
 100: 00 00 00 05  xxspltiw vs32,23               #L 3/4
 104: 17 00 07 80
 108: 84 01 42 10  vslw     v2,v2,v0              #L 1/3
 10c: 20 00 80 4e  blr

00e0 :                                            #TL 4/7
  e0: 8c 03 17 10  vspltisw v0,-9                 #L 3/4
  e4: 84 01 42 10  vslw     v2,v2,v0              #L 1/3
  e8: 20 00 80 4e  blr

Power10 has the advantage of the VSX Vector Splat Immediate Word instruction.
This is an 8-byte prefixed instruction and is overkill for a 5-bit shift
count. The good news is the cycle latency is the same, but it adds another
word to the code stream which is not required to generate such a small
(5-bit) constant. However, VSX Vector Splat Immediate Word will be excellent
for generating mb/me/sh masks for Vector Rotate Left Word then Mask Insert
and the like. So I will concede that for the shift/rotate word immediate case
for power10 the latencies are comparable.

The problem I see is: as the examples get complex (generating masks for
float) or for double/quadword shifts, the compiler (CSE or constant
propagation) will convert the splat immediate into a vector load from
.rodata.
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #6 from Steven Munroe ---
I am starting to see a pattern and wonder if the compiler is confused by
assuming the shift count must match the width/type of the shift/rotate
target. This is implied all the way back to the Altivec-PIM and the current
Intrinsic Reference and the GCC documentation.

The intrinsics vec_rl(), vec_sl(), vec_sr(), vec_sra() all require that the
shift count be the same (unsigned) type (element size) as the shifted/rotated
value. This might confuse the compiler into thinking it MUST properly
(zero/sign) extend any shift count. But that is wrong.

The PowerISA only requires the shift count in the low-order bits (3-7 bits,
depending on element size) of each element, and any high-order element bits
are don't-care. So the shift count (operand b) could easily be a vector
unsigned char (byte elements). In fact, vec_sll(), vec_slo(), vec_srl(), and
vec_sro() allow this.

So the compiler can correctly use vspltisb, vspltish, vspltisw, or xxspltib
for any vector shift/rotate where the shift count is a compile-time constant.
This is always less and faster code than loading vector constants from
.rodata.
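A minimal sketch of the workaround this observation implies (assuming the
pveclib-style typedefs vui8_t/vui32_t used elsewhere in these reports; the
function name is mine): splat the shift count as bytes and cast, since vslw
only reads the low-order 5 bits of each word element. Whether the sequence
survives GCC's constant propagation is exactly the issue of this PR.

vui32_t
test_slw_23_byte_splat (vui32_t vra)
{
  // vspltisb -9 splats 0xF7 into every byte; the low-order 5 bits of
  // each word element are then 0b10111 = 23, which is all vslw reads.
  vui8_t shft = vec_splat_u8 (23 - 32);
  return vec_sl (vra, (vui32_t) shft);
}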
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
Attachment #59323 is|0       |1
           obsolete|        |

--- Comment #5 from Steven Munroe ---
Created attachment 59446
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59446&action=edit
Examples of DW/QW shift immediate

Also found the compiler mishandling the quadword shift by a constant for
inline vector code. I think this is related to the fact that GCC does not
actually support quadword integer constants.
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007 --- Comment #4 from Steven Munroe --- Created attachment 59323 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59323&action=edit Examples for Vector DW int constants
[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #3 from Steven Munroe ---
I tested the attached example source with GCC 14.0.1 from Ubuntu on
powerpc64le and am seeing the same results. So add GCC 14.0.1 to the list.
Actually, the last GCC version that did not have this bug was GCC 7; it
looks like GCC 8-14 all do this.

I don't have the time or stamina to build GCC from source head right now,
but anyone can try it using the attached sample source:

gcc -O3 -mcpu=power8 -c vec-shift32-const.c

Then objdump and look for any lvx instructions. There should be none.
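For example, a sketch of that check (the object file name follows from the
compile command above):

gcc -O3 -mcpu=power8 -c vec-shift32-const.c
objdump -d vec-shift32-const.o | grep lvx    # expect no output if fixed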
[Bug target/117007] New: Poor optimization for small vector constants needed for vector shift/rotate/mask generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

            Bug ID: 117007
           Summary: Poor optimization for small vector constants needed
                    for vector shift/rotate/mask generation
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59291
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
Compile with -m64 -O3 -mcpu=power8 or power9

For vector library codes there is a frequent need to "splat" small integer
constants for vector shifts, rotates, and mask generation. The instructions
exist (i.e. vspltisw, xxspltib, xxspltiw) and are supported by intrinsics.
But when these are used to provide constants in VRs for other vector
operations, the compiler goes out of its way to convert them to vector loads
from .rodata.

This is especially bad for power8/9 as .rodata requires 32-bit offsets and
always generates 3/4 instructions with a best case (L1 cache hit) latency of
9 cycles. The original splat immediate / shift implementation will run 2-4
instructions (with a good chance for CSE) and 4-6 cycles latency. For
example:

vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

With GCC V6 this generates:

01c0 :
 1c0: 8c 03 09 10  vspltisw v0,9
 1c4: 8c 03 5f 10  vspltisw v2,-1
 1c8: 84 02 42 10  vsrw     v2,v2,v0
 1cc: 20 00 80 4e  blr

While GCC 13.2.1 generates:

01c0 :
 1c0: 00 00 4c 3c  addis r2,r12,0
       1c0: R_PPC64_REL16_HA .TOC.
 1c4: 00 00 42 38  addi  r2,r2,0
       1c4: R_PPC64_REL16_LO .TOC.+0x4
 1c8: 00 00 22 3d  addis r9,r2,0
       1c8: R_PPC64_TOC16_HA .rodata.cst16+0x20
 1cc: 00 00 29 39  addi  r9,r9,0
       1cc: R_PPC64_TOC16_LO .rodata.cst16+0x20
 1d0: ce 48 40 7c  lvx   v2,0,r9
 1d4: 20 00 80 4e  blr

This is the same for -mcpu=power8/power9.

It gets worse for vector functions that require multiple shift/mask
constants. For example:

// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
                    (vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden-bit into the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}

GCC V6 generated:

0310 :
 310: 8c 03 bf 11  vspltisw v13,-1
 314: 8c 03 37 10  vspltisw v1,-9
 318: 8c 03 60 11  vspltisw v11,0
 31c: 06 0a 0d 10  vcmpgtub v0,v13,v1
 320: 84 09 00 10  vslw     v0,v0,v1
 324: 8c 03 29 10  vspltisw v1,9
 328: 17 14 80 f1  xxland   vs44,vs32,vs34
 32c: 84 0a 2d 10  vsrw     v1,v13,v1
 330: 86 00 0c 10  vcmpequw v0,v12,v0
 334: 86 58 8c 11  vcmpequw v12,v12,v11
 338: 80 6c a1 11  vsubuwm  v13,v1,v13
 33c: 17 14 41 f0  xxland   vs34,vs33,vs34
 340: 17 65 00 f0  xxlnor   vs32,vs32,vs44
 344: 7f 03 42 f0  xxsel    vs34,vs34,vs32,vs45
 348: 20 00 80 4e  blr

While GCC 13.2.1 -mcpu=power8 generates:

0360 :
 360: 00 00 4c 3c  addis r2,r12,0
       360: R_PPC64_REL16_HA .TOC.
 364: 00 00 42 38  addi  r2,r2,0
       364: R_PPC64_REL16_LO .TOC.+0x4
 368: 00 00 02 3d  addis r8,r2,0
       368: R_PPC64_TOC16_HA .rodata.cst16+0x30
 36c: 00 00 42 3d  addis r10,r2,0
       36c: R_PPC64_TOC16_HA .rodata.cst16+0x20
 370: 8c 03 a0 11  vspltisw v13,0
 374: 00 00 08 39  addi  r8,r8,0
       374: R_PPC64_TOC16_LO .rodata.cst16+0x30
 378: 00 00 4a 39  addi  r10,r10,0
       378: R_PPC64_TOC16_LO .rodata.cst16+0x20
 37c: 00 00 22 3d  addis r9,r2,0
       37c: R_PPC64_TOC16_HA .rodata.cst16+0x40
 380: e4 06 4a 79  rldicr r10,r10,0,59
 384: ce 40 20 7c  lvx   v1,0,r8
 388: 00 00 29 39  addi  r9,r9,0
       388: R_PPC64_TOC16_LO .rodata.cst16+0x40
 38c: 8c 03 17 10  vspltisw v0,-9
 390: 98 56 00 7c  lxvd2x vs0,0,r10
 394: e4 06 29 79  rldicr r9,r9,0,59
 398: 98 4e 80 7d  lxvd2x vs12,0,r9
 39c: 84 01 21 10  vslw  v1,v1,v0
 3a0: 50 02 00 f0  xxswapd vs0,vs0
 3a4: 17 14 01 f0  xxland vs32,vs33,vs34
 3a8: 50 62 8c f1
[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #2 from Steven Munroe ---
Actually:

  abnez = (vui8_t) vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);
[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #1 from Steven Munroe ---
Compile test code examples:

int
test_intrn_first_mismatch_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  return vec_first_mismatch_or_eos_index (vra, vrb);
}

int
test_first_mismatch_byte_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  vui8_t abnez;
  int result;

  abnez = vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);
  return result;
}
[Bug c/116004] New: PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

            Bug ID: 116004
           Summary: PPC64 vector Intrinsic
                    vec_first_mismatch_or_eos_index generates poor code
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

GCC 13 generates the following code for the intrinsic
vec_first_mismatch_or_eos_index with -mcpu=power9 -O3:

00c0 :
  c0: d1 02 00 f0  xxspltib vs32,0
  c4: 07 00 22 10  vcmpneb  v1,v2,v0
  c8: 07 00 03 10  vcmpneb  v0,v3,v0
  cc: 07 19 42 10  vcmpnezb v2,v2,v3
  d0: 17 04 21 f0  xxland   vs33,vs33,vs32
  d4: 57 0d 42 f0  xxlorc   vs34,vs34,vs33
  d8: 02 16 61 10  vctzlsbb r3,v2
  dc: b4 07 63 7c  extsw    r3,r3
  e0: 20 00 80 4e  blr

The use of vcmpneb to compare for EOS is redundant with the vcmpnezb
instruction (which includes the EOS compares). The additional xxland/xxlorc
logic is only necessary because of the extra vcmpneb compares. All you need
is a single vcmpnezb, as it already handles the a/b mismatch and EOS tests
for both operands. For example:

0070 :
  70: 07 19 42 10  vcmpnezb v2,v2,v3
  74: 02 16 61 10  vctzlsbb r3,v2
  78: b4 07 63 7c  extsw    r3,r3
  7c: 20 00 80 4e  blr
[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #6 from Steven Munroe ---
(In reply to Carl Love from comment #5)
> There are a couple of issues with the test case in the attachment. For
> example one of the tests is:
>
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>   return vec_sldb (vra, vrb, shb);
> }
>
> When I tried to compile it, it seemed to compile. However if I take off
> the static inline, then I get an error about incompatible arguments. The
> built-in requires an explicit integer be passed in the third argument.
> The following worked for me:
>
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>   return vec_sldb (vra, vrb, 1);
> }
>
> The compiler/assembler needs an explicit value for the third argument as
> it has to generate the instruction with the immediate shift value as part
> of the instruction. Hence a variable for the third argument will not work.
>
> Agreed that the __int128 arguments can and should be supported. Patch to
> add that support is in progress but will require getting the LLVM/OpenXL
> team to agree to adding the __int128 variants as well.

Yes, I know. In the PVECLIB case these functions will always be static
inline, so this is not an issue for me.
[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #4 from Steven Munroe ---
Actually, the shift/rotate intrinsics vec_rl, vec_rlmi, vec_rlnm, vec_sl,
vec_sr, and vec_sra support vector __int128, as required for the PowerISA
3.1 vector shift/rotate quadword instructions. But vec_sld, vec_sldb,
vec_sldw, vec_sll, vec_slo, vec_srdb, vec_srl, and vec_sro do not.

There is no obvious reason for this inconsistency, as the target
instructions are effectively 128/256-bit operations returning a 128-bit
result. The type of the inputs is incidental to the operation. Any
restrictions imposed by the original altivec.h PIM were broken long ago by
VSX and PowerISA 2.07.

Net: the Power Vector Intrinsic Programming Reference and the compilers
should support the vector __int128 type for any instruction where it makes
sense as an input or result.
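A minimal sketch of the inconsistency (the typedef and function names are
mine; compile with -mcpu=power10):

typedef __vector unsigned __int128 vui128_t;

// Accepted: vec_rl supports vector __int128 (generates vrlq).
vui128_t
test_rlq (vui128_t vra, vui128_t vrb)
{
  return vec_rl (vra, vrb);
}

// Rejected: "invalid parameter combination for AltiVec intrinsic
// '__builtin_vec_sldb'", even though vsldbi itself is type-agnostic.
vui128_t
test_sldqi_4 (vui128_t vra, vui128_t vrb)
{
  return vec_sldb (vra, vrb, 4);
}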
[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #3 from Steven Munroe ---
(In reply to Peter Bergner from comment #1)
> I see that we have created built-in overloads for signed and unsigned
> vector char through vector long long. That said, the rs6000-builtins.def
> only seems to support the signed vector types though, which is why you're
> seeing an error. So confirmed.
>
> That said, I believe your 3rd argument needs to be a real constant
> integer, since the vsldbi instruction requires that. It doesn't allow for
> a const int variable. I notice some older (not trunk) gcc versions are
> ICEing with that, so another bug to look at.

The original code is static inline, so the const int parm should transfer
intact to the builtin const. It seems I over-simplified the reduced test
case.

> I do not see any documentation that says we support the vector __int128
> type. Where exactly did you see that? However, from the instruction
> description, it seems like the hw instruction could support that.

I stand corrected. The documentation only describes vector unsigned long
long. But the instruction is like vsldoi and does not really care what the
type is.
[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
Attachment #56018 is|0       |1
           obsolete|        |

--- Comment #2 from Steven Munroe ---
Created attachment 56019
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56019&action=edit
Updated test case with static inline functions
[Bug target/111645] New: Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

            Bug ID: 111645
           Summary: Intrinsics vec_sldb /vec_srdb fail with __vector
                    unsigned __int128
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 56018
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56018&action=edit
Example of the problem. Compile with gcc -m64 -O3 -mcpu=power10 -c sldbi.c

GCC 12 and 13 fail to compile the vector intrinsics vec_sldb / vec_srdb as
required by the Power Vector Intrinsic Programming Reference. Both the
Programming Reference and the GCC documentation state that vector
(unsigned/signed) __int128 are valid operands. But they fail with an error:

error: invalid parameter combination for AltiVec intrinsic
'__builtin_vec_sldb'

or

error: invalid parameter combination for AltiVec intrinsic
'__builtin_vec_srdb'
[Bug target/110795] Bad code gen for vector compare booleans
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795 --- Comment #5 from Steven Munroe --- Thanks, sorry I missed the obvious.
[Bug target/110795] Bad code gen for vector compare booleans
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795 --- Comment #2 from Steven Munroe --- Also fails with gcc11/12. Also fails with Advance Toolchain 10.0 GCC 6.4.1. It might fail for all versions between GCC 6 and 13.
[Bug target/110795] Bad code gen for vector compare booleans
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795 --- Comment #1 from Steven Munroe --- Created attachment 55627 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55627&action=edit Main and unit-test. When compiled and linked with vec_divide.c, it will verify whether the divide code is correct.
[Bug c/110795] New: Bad code gen for vector compare booleans
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

            Bug ID: 110795
           Summary: Bad code gen for vector compare booleans
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 55626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55626&action=edit
Test examples for vector code combining a vector compare with logical OR

Combining a vec_cmplt and vec_cmpge with vector logical OR miscompiles. For
example:

  // Capture the carry t as a bool using signed compare
  t = vec_cmplt ((vi32_t) x, zeros);
  ge = vec_cmpge (x, z);
  // Combine t with (x >= z) for 33-bit compare
  t = vec_or (ge, t);

This seems to work for the minimized example above but fails when used in
the more complex loop of the example vec_divduw_V1. At -O3 the compiler
elides any code generated for vec_cmplt. With this bug the function
vec_divduw_V1 (Vector Divide Double Unsigned Word) fails the unit test.
[Bug target/104124] Poor optimization for vector splat DW with small consts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124 --- Comment #5 from Steven Munroe --- Thanks
[Bug c/106755] New: Incorrect code gen for altivec intrinsics with constant inputs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106755

            Bug ID: 106755
           Summary: Incorrect code gen for altivec intrinsics with
                    constant inputs
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: blocker
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 53514
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53514&action=edit
Reduced test case for vec_muludq() with make

Compiling the PVECLIB project V1.0.4-4 fails its unit tests (make check)
when compiled with GCC 12 on Fedora 36/37. Two unit tests fail:

Vector Multiply Unsigned Double Quadword: vec_muludq()
Vector Multiply-Add Unsigned Quadword: vec_madduq()

The tests that fail are passing local vector constants to inlined instances
of these functions.

Current status: the PVECLIB package is blocked for Fedora 37 because it will
not compile with the default GCC 12 compiler.
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #23 from Steven Munroe --- Ok, but I strongly recommend a compiler test that verifies the compiler is generating the expected code (for this and other cases). We have a history of common code changes (accidental or deliberate) causing regressions for POWER targets. Best to find these early, before they impact customer performance.
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #21 from Steven Munroe ---
Yes, I was told by Peter Bergner that the fix from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085#c15 had been back-ported
to AT15.0-1. But when I ran this test with AT15.0-1 I saw:

0000 :
   0: 20 00 20 39  li      r9,32
   4: d0 ff 41 39  addi    r10,r1,-48
   8: 57 12 42 f0  xxswapd vs34,vs34
   c: 99 4f 4a 7c  stxvd2x vs34,r10,r9
  10: ce 48 4a 7c  lvx     v2,r10,r9
  14: 20 00 80 4e  blr

0030 :
  30: 20 00 20 39  li      r9,32
  34: d0 ff 41 39  addi    r10,r1,-48
  38: 57 12 42 f0  xxswapd vs34,vs34
  3c: 99 4f 4a 7c  stxvd2x vs34,r10,r9
  40: ce 48 4a 7c  lvx     v2,r10,r9
  44: 20 00 80 4e  blr

0060 :
  60: 20 00 20 39  li      r9,32
  64: d0 ff 41 39  addi    r10,r1,-48
  68: 57 12 42 f0  xxswapd vs34,vs34
  6c: 99 4f 4a 7c  stxvd2x vs34,r10,r9
  70: 99 4e 4a 7c  lxvd2x  vs34,r10,r9
  74: 57 12 42 f0  xxswapd vs34,vs34
  78: 20 00 80 4e  blr

0090 :
  90: 57 12 42 f0  xxswapd vs34,vs34
  94: 20 00 40 39  li      r10,32
  98: d0 ff 01 39  addi    r8,r1,-48
  9c: f0 ff 21 39  addi    r9,r1,-16
  a0: 99 57 48 7c  stxvd2x vs34,r8,r10
  a4: 00 00 69 e8  ld      r3,0(r9)
  a8: 08 00 89 e8  ld      r4,8(r9)
  ac: 20 00 80 4e  blr

So either the patch for AT15.0-1 is not applied correctly, or it is
non-functional because of some difference between GCC 11/GCC 12, or it
regressed because of some other change/patch.

In my experience this part of GCC is fragile (based on the long/sad history
of IBM long double). So this needs to be monitored with each new update.
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|RESOLVED|REOPENED
         Resolution|FIXED   |---

--- Comment #17 from Steven Munroe ---
I don't think this is fixed. The fix was supposed to be back-ported to
GCC 11 for Advance Toolchain 15. The updated test case shows that this is
clearly not working as advertised. Either the GCC 12 fix has regressed due
to subsequent updates, or the AT15 GCC 11 back-port fails due to some
missing/different code between GCC 11/12.
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #16 from Steven Munroe --- Created attachment 52510 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52510&action=edit Reduced tests for xfers from __float128 to vector or __int128. Covers more types, including __int128 and vector __int128.
[Bug target/104124] Poor optimization for vector splat DW with small consts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
Attachment #52236 is|0       |1
           obsolete|        |

--- Comment #2 from Steven Munroe ---
Created attachment 52307
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52307&action=edit
Enhanced test case that also shows the CSE failure

The original test case plus an example where CSE should common a splat
immediate (or even a .rodata load), but fails to do even that.
[Bug target/104124] Poor optimization for vector splat DW with small consts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe ---
Created attachment 52236
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52236&action=edit
Attempts to load small int consts to vector DW via splat

Multiple attempts to convince GCC to load small integer (-16 to 15)
constants via splat. Current GCC versions (9/10/11) convert vec_splats() and
explicit vec_splat_s32/vec_unpackl sequences into loads from .rodata. This
generates more instructions, takes more cycles, and causes register pressure
that results in unnecessary spill/reload and load-hit-store rejects.
[Bug target/104124] New: Poor optimization for vector splat DW with small consts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

            Bug ID: 104124
           Summary: Poor optimization for vector splat DW with small
                    consts
           Product: gcc
           Version: 11.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by
loading all the vector long long constants I need in my code. This is leaf
code of a size that it can run out of volatiles (no stack-frame). But this
puts more pressure on the volatile VRs, VSRs, and GPRs. Especially the
GPRs, because it is loading from .rodata when it could (and should) use a
vector immediate. For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

generate:

01a0 <__test_splatudi_0_V0>:
 1a0: 8c 03 40 10  vspltisw v2,0
 1a4: 20 00 80 4e  blr

01c0 <__test_splatudi_1_V0>:
 1c0: 8c 03 5f 10  vspltisw v2,-1
 1c4: 20 00 80 4e  blr
...
But other cases that could use immediates, like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 generates for power8:

0170 <__test_splatudi_12_V0>:
 170: 00 00 4c 3c  addis r2,r12,0
       170: R_PPC64_REL16_HA .TOC.
 174: 00 00 42 38  addi  r2,r2,0
       174: R_PPC64_REL16_LO .TOC.+0x4
 178: 00 00 22 3d  addis r9,r2,0
       178: R_PPC64_TOC16_HA .rodata.cst16+0x20
 17c: 00 00 29 39  addi  r9,r9,0
       17c: R_PPC64_TOC16_LO .rodata.cst16+0x20
 180: ce 48 40 7c  lvx   v2,0,r9
 184: 20 00 80 4e  blr

and for power9:

<__test_splatisd_12_PWR9>:
   0: d1 62 40 f0  xxspltib vs34,12
   4: 02 16 58 10  vextsb2d v2,v2
   8: 20 00 80 4e  blr

So why can't the power8 target generate:

00f0 <__test_splatudi_12_V1>:
  f0: 8c 03 4c 10  vspltisw v2,12
  f4: 4e 16 40 10  vupkhsw  v2,v2
  f8: 20 00 80 4e  blr

This is 4 cycles vs 9 (best case; and it is always 9 cycles because GCC does
not exploit immediate fusion). In fact GCC 8 (AT12) does this. So I tried
defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But
for GCC 9/10/11 it generates:

0110 <__test_splatudi_12_V1>:
 110: 00 00 4c 3c  addis r2,r12,0
       110: R_PPC64_REL16_HA .TOC.
 114: 00 00 42 38  addi  r2,r2,0
       114: R_PPC64_REL16_LO .TOC.+0x4
 118: 00 00 22 3d  addis r9,r2,0
       118: R_PPC64_TOC16_HA .rodata.cst16+0x20
 11c: 00 00 29 39  addi  r9,r9,0
       11c: R_PPC64_TOC16_LO .rodata.cst16+0x20
 120: ce 48 40 7c  lvx   v2,0,r9
 124: 20 00 80 4e  blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this, and the only way I have found to
prevent this (GCC 9/10/11) cleverness is to use inline __asm (which has
other bad side effects).
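For reference, a minimal sketch of the inline __asm workaround alluded to
above (my naming; the "v" constraint pins the result to an Altivec register,
which is part of why this has bad side effects on register allocation):

vi64_t
__test_splatudi_12_asm (void)
{
  vi64_t result;
  // Hide the constant from constant propagation by emitting the
  // vspltisw/vupkhsw pair directly.
  __asm__ ("vspltisw %0,12\n\tvupkhsw %0,%0" : "=v" (result));
  return result;
}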
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #13 from Steven Munroe ---
"We want to use plain TImode instead of V1TImode on newer cpus."

Actually, I disagree. We have vector __int128 in the ABI, and with POWER10 a
complete set of arithmetic operations for 128-bit in VRs. Also, this issue
is not restricted to TImode. It also affects _Float128 (KFmode), __ibm128
(TFmode), and Libmvec for vector float/double.

The proper and optimum handling of these "union transfers" has been broken
in GCC for years. And I have grave reservations about the vague plans of a
small/fringe minority to subset the PowerISA for their convenience.
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #5 from Steven Munroe --- Any progress on this?
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #4 from Steven Munroe ---
I am seeing a similar problem with union transfers from __float128 to
__int128.

static inline unsigned __int128
vec_xfer_bin128_2_int128t (__binary128 f128)
{
  __VF_128 vunion;
  vunion.vf1 = f128;
  return (vunion.ui1);
}

and

unsigned __int128
test_xfer_bin128_2_int128 (__binary128 f128)
{
  return vec_xfer_bin128_2_int128t (f128);
}

generate:

0030 :
  30: 57 12 42 f0  xxswapd vs34,vs34
  34: 20 00 20 39  li      r9,32
  38: d0 ff 41 39  addi    r10,r1,-48
  3c: 99 4f 4a 7c  stxvd2x vs34,r10,r9
  40: f0 ff 61 e8  ld      r3,-16(r1)
  44: f8 ff 81 e8  ld      r4,-8(r1)
  48: 20 00 80 4e  blr

For POWER8 this should use mfvsrd/xxpermdi/mfvsrd. This looks like the root
cause of the poor performance of __float128 soft-float on POWER8.

A simple benchmark using __float128 in C code, calling libgcc for
-mcpu=power8 and then hardware instructions for -mcpu=power9:

P8 target, P8 AT14, uses libgcc __addkf3_sw and __mulkf3_sw:
  test_time_f128 f128 CC tb delta = 52589, sec = 0.000102713
P9 target, P8 AT14, uses libgcc __addkf3_hw and __mulkf3_hw:
  test_time_f128 f128 CC tb delta = 18762, sec = 3.66445e-05
P9 target, P9 AT14, inline hardware binary128 float:
  test_time_f128 f128 CC tb delta = 3809, sec = 7.43945e-06

I used Valgrind itrace, sim-ppc, and perfstat analysis. Every call to libgcc
__add/sub/mul/divkf3 takes a load-hit-store flush. This explains why
__float128 is 13.8X slower on P8 than P9.
[Bug rtl-optimization/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe ---
Created attachment 50596
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50596&action=edit
Compile test case for the xfer operation

Compile for PowerPCle, both -mcpu=power8 -mfloat128 and -mcpu=power9
-mfloat128, and see the different asm generated.
[Bug rtl-optimization/100085] New: Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

            Bug ID: 100085
           Summary: Bad code for union transfer from __float128 to
                    vector types
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50595&action=edit
Reduced example of union and __float128 to vector transfer

GCC 10/9/8/7 will generate poor (-mcpu=power8) code when using a union to
transfer a __float128 scalar to any vector type. __float128 is a scalar type
and not typecast compatible with any vector type, despite both living in
vector registers. But for runtime code implementing __float128 operations
for -mcpu=power8 it is useful (and faster) to perform some operations
(data class, conversions, etc.) directly in vector registers.

The only solution for this is to use a union to transfer values between
__float128 and vector types. This should be a simple vector register
transfer and optimized as such. But for GCC on PowerPCle with -mcpu=power8,
we are consistently seeing store/reload sequences. For power8 this can cause
load-hit-store and pipeline rejects (33 cycles).

We don't see this when targeting -mcpu=power9, but power9 supports hardware
float128 instructions. Also, we don't see this when targeting BE.
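A minimal sketch of the union-transfer pattern at issue (the union name and
the vf1 field mirror the attached test; the vector field name here is
illustrative):

typedef __vector unsigned int vui32_t;

typedef union
{
  __float128 vf1;   // scalar __float128, passed/returned in a VR
  vui32_t    vx4;   // vector view of the same 128-bit value
} __VF_128;

vui32_t
xfer_bin128_2_vui32t (__float128 f128)
{
  __VF_128 vunion;
  vunion.vf1 = f128;
  // Should be a no-op VR-to-VR transfer; instead power8 code gets a
  // store/reload sequence (load-hit-store hazard).
  return vunion.vx4;
}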
[Bug middle-end/99293] Built-in vec_splat generates sub-optimal code for -mcpu=power10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293 --- Comment #1 from Steven Munroe --- Created attachment 50264 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50264&action=edit Compile test for the simplified test case. Download vec_dummy.c and vec_int128_ppc.h into a local directory and compile: gcc -O3 -mcpu=power10 -m64 -c vec_dummy.c
[Bug middle-end/99293] New: Built-in vec_splat generates sub-optimal code for -mcpu=power10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293

            Bug ID: 99293
           Summary: Built-in vec_splat generates sub-optimal code for
                    -mcpu=power10
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50263
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50263&action=edit
Simplified test case

While adding code to the Power Vector Library (PVECLIB) for the POWER10
target, I see strange code generation for the Altivec built-in vec_splat for
the vector long long type. I would expect an xxpermdi (xxspltd) based on the
"Power Vector Intrinsic Programming Reference". But I see the following
generated:

0300 :
 300: 67 02 69 7c  mfvsrld r9,vs35
 304: 67 4b 09 7c  mtvsrdd vs32,r9,r9
 308: 05 00 42 10  vrlq    v2,v2,v0
 30c: 20 00 80 4e  blr

While this seems to be functionally correct, the trip through the GPRs seems
unnecessary. It requires two serially dependent instructions where a single
xxspltd would do. I expected:

0300 :
 300: 57 1b 63 f0  xxspltd vs35,vs35,1
 304: 05 18 42 10  vrlq    v2,v2,v3
 308: 20 00 80 4e  blr

The compiler was: gcc version 10.2.1 20210104 (Advance-Toolchain 14.0-2)
[2093e873bb6c] (GCC)
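A sketch of the kind of code involved (names and types are mine, not the
attached vec_int128_ppc.h source; assumes pveclib-style typedefs and
-mcpu=power10): splat doubleword element 1 of the shift-count vector, then
rotate the quadword.

vui128_t
test_vrlq_splat_dw1 (vui128_t vra, vui64_t vrb)
{
  vui64_t shft = vec_splat (vrb, 1);     // expected to generate xxspltd
  return vec_rl (vra, (vui128_t) shft);  // vrlq
}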
[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519 --- Comment #7 from Steven Munroe --- Then you have a problem, as @pcrel is never valid for an instruction like lxsd%X1. Seems like you will need a new constraint or modifier specific to @pcrel.
[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519 --- Comment #5 from Steven Munroe --- I would think you need to look at the instruction and the "m" constraint. In this case lxsd%X1 would need to be converted to plxsd, and the "m" constraint would have to allow @pcrel. I would think a static variable would be valid, but a stack local or an explicit pointer with a (non-const) offset/index would not.
[Bug target/85830] vec_popcntd is improperly defined in altivec.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85830

--- Comment #3 from Steven Munroe ---
(In reply to Carl Love from comment #2)
> Hit the save button a little too fast, missed putting in everything I
> intended to put in. Let's try to get it all in.
>
> In altivec.h they are defined as:
>
>   #define vec_popcnt  __builtin_vec_vpopcntu
>   #define vec_popcntb __builtin_vec_vpopcntub
>   #define vec_popcnth __builtin_vec_vpopcntuh
>   #define vec_popcntw __builtin_vec_vpopcntuw
>   #define vec_popcntd __builtin_vec_vpopcntud
>
> It does appear they should be removed from altivec.h.
>
> The user should use the builtin vec_popcnt(a) where a is the unsigned long
> long or unsigned int as desired. These builtins are supported on at least
> gcc version 8.3.1 and later.

I assume you mean remove the defines for vec_popcntb, vec_popcnth,
vec_popcntw, and vec_popcntd, while retaining vec_popcnt??
[Bug target/96139] Vector element extract mistypes long long int down to long int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96139

--- Comment #3 from Steven Munroe ---
(In reply to Bill Schmidt from comment #2)
> Have you tried it for -m32, out of curiosity?

No.
[Bug c/96139] Vector element extract mistypes long long int down to long int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96139 --- Comment #1 from Steven Munroe --- Created attachment 48851 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48851&action=edit Test case for printf of vector long long int elements
[Bug c/96139] New: Vector element extract mistypes long long int down to long int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96139

            Bug ID: 96139
           Summary: Vector element extract mistypes long long int down
                    to long int
           Product: gcc
           Version: 9.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

When printing a vector element, for example:

  printf ("%s %016llx,%016llx\n", prefix, val[1], val[0]);

where val is a vector unsigned long long int, -Wall reports:

../src/printll.c: In function 'print_v2xint64':
../src/printll.c:20:21: warning: format '%llx' expects argument of type
'long long unsigned int', but argument 3 has type 'long unsigned int'
[-Wformat=]
  printf ("%s %016llx,%016llx\n", prefix, val[1], val[0]);
                     ^

Here gcc claims that val[1] is a 'long unsigned int' when it is actually
typed as:

  typedef __vector unsigned long long int vui64_t;

Somehow the vector element extract has dropped the long long int type down
to long int. This should not be an issue for PPC64, as long long int and
long int are both 64-bit, but it would matter for PPC32.
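A sketch of the obvious workaround, which is what the warning pushes you
toward: an explicit cast at each use site.

  printf ("%s %016llx,%016llx\n", prefix,
          (unsigned long long) val[1], (unsigned long long) val[0]);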
[Bug c/85830] New: vec_popcntd is improperly defined in altivec.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85830

            Bug ID: 85830
           Summary: vec_popcntd is improperly defined in altivec.h
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 44147
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44147&action=edit
Compile test case for vec_popcntd

Altivec.h should define either the generic vec_popcnt or the specific
vec_vpopcntd. In GCC 8.1, altivec.h defines the expected vec_popcnt (as
defined in the ABI) and the PIM-style specific vec_vpopcntd. These are OK.
However, it also defines vec_popcntd, which generates
__builtin_vec_vpopcntud. This gives compile errors:

vector unsigned long long
__test_popcntd_2 (vector unsigned long long a)
{
  return (vec_popcntd (a));
}

vector unsigned int
__test_popcntw_2 (vector unsigned int a)
{
  return (vec_popcntw (a));
}

vec_popcnt-2.c: In function '__test_popcntd_2':
vec_popcnt-2.c:31:3: error: invalid parameter combination for AltiVec
intrinsic '__builtin_vec_vpopcntud'
  return (vec_popcntd (a));
  ^~
vec_popcnt-2.c: In function '__test_popcntw_2':
vec_popcnt-2.c:37:3: error: invalid parameter combination for AltiVec
intrinsic '__builtin_vec_vpopcntuw'
  return (vec_popcntw (a));
  ^~

There are two problems here:

1) You would expect __builtin_vec_vpopcntud to accept an unsigned input, and
it does not. I am not sure what it does accept.

2) The vec_popcntd define fits neither the ABI generic style (vec_popcnt())
nor the old PIM specific style (vec_vpopcntd(), which follows the PowerISA
mnemonic).

This 3rd style should be removed from altivec.h (vec_popcntb, vec_popcnth,
vec_popcntw, vec_popcntd), while retaining the ABI generic and PIM specific
definitions.
[Bug target/83402] PPC64 implementation of ./rs6000/emmintrin.h gives out of range for _mm_slli_epi32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83402

--- Comment #9 from Steven Munroe ---
I suggested fixing the emmintrin.h source for both eventually ... If you
only fix AT11, then sometime later someone will discover the difference and
try to fix it. And likely break it again.

So fix AT immediately (with comments for why). Then, when gcc trunk opens
up, fix it there as a code cleanup and ask for a backport to gcc8 stable.
Because if you leave it alone in gcc 8, it might break again in gcc 9 :-/
[Bug target/83402] PPC64 implementation of ./rs6000/emmintrin.h gives out of range for _mm_slli_epi32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83402 --- Comment #7 from Steven Munroe --- Ok, it could be that compiler behavior changed. You were testing gcc-trunk? Please try the same test with AT11 gcc7. I know I hit this!
[Bug target/83402] PPC64 implementation of ./rs6000/emmintrin.h gives out of range for _mm_slli_epi32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83402

--- Comment #5 from Steven Munroe ---
You need to look at the generated asm code and see what the compiler is
doing. Basically it should be generating a vspltisw vr,si for vec_splat_s32.
But if the immediate signed int (si) is greater than 15, it should fail
with:

error: argument 1 must be a 5-bit signed literal

The vec_splats should work for any value, as it will load a const vector
from storage.

Perhaps the compiler is generating bad code and not reporting it. Or the
compiler is too smart and converting the vec_splat_s32 to the more general
vec_splats under the covers. Look at the source, Paul!
[Bug target/83964] [8 Regression] ICE in extract_insn, at recog.c:2304
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83964

--- Comment #18 from Steven Munroe ---
(In reply to jos...@codesourcery.com from comment #17)
> And, when long is 64-bit, there is no corresponding standard function to
> round to 32-bit integer with "invalid" raised for out-of-range results -
> but there is (undocumented) __builtin_irint. So if you document
> __builtin_irint, machine-specific built-in functions should only be needed
> here if the functionality is desired in the presence of math-errno (cf.
> the proposals to change the default there) - and even there,
> machine-independent no-errno built-in functions might be better.

A cross-platform solution is acceptable if it is generated in-line, fully
optimized for the target, and available soon. Looking at glibc for powerpc
there are 29 cases in current code. God knows what boost is doing. Need both
current rounding and truncate, signed and unsigned.
[Bug target/83964] [8 Regression] ICE in extract_insn, at recog.c:2304
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83964 --- Comment #13 from Steven Munroe --- WTF, which part of the requirement did you not understand? You should implement the direct moves (to GPRs) to complete the __builtin_fctid and __builtin_fctiw implementation. But to just remove them is missing the point.
[Bug target/83964] [8 Regression] ICE in extract_insn, at recog.c:2304
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83964 --- Comment #11 from Steven Munroe --- The requirement was to reduce the use of (inline) assembler in libraries. Asm is error prone in light of 32/64-bit ABI differences, and the compiler (usually) generates the correct code for the target. Float <-> int/long conversion is a common operation, and builtin instructions are preferred where the POSIX functions are unnecessarily heavy.
[Bug target/84266] mmintrin.h intrinsic headers for PowerPC code fails on power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84266 --- Comment #10 from Steven Munroe --- Change this to RESOLVED state now?
[Bug target/84266] mmintrin.h intrinsic headers for PowerPC code fails on power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84266

--- Comment #9 from Steven Munroe ---
Author: munroesj
Date: Sun Feb 11 21:45:39 2018
New Revision: 257571

URL: https://gcc.gnu.org/viewcvs?rev=257571&root=gcc&view=rev
Log:
Fix PR 84266

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/mmintrin.h
[Bug target/84266] mmintrin.h intrinsic headers for PowerPC code fails on power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84266

--- Comment #7 from Steven Munroe ---
Created attachment 43388
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43388&action=edit
Correct mmintrin.h for power9

2018-02-09  Steven Munroe

	* config/rs6000/mmintrin.h (_mm_cmpeq_pi32 [_ARCH_PWR9]):
	Cast vec_cmpeq result to correct type.
	* config/rs6000/mmintrin.h (_mm_cmpgt_pi32 [_ARCH_PWR9]):
	Cast vec_cmpgt result to correct type.
[Bug target/84266] mmintrin.h intrinsic headers for PowerPC code fails on power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84266

Steven Munroe changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED|ASSIGNED
   Last reconfirmed|           |2018-02-08
     Ever confirmed|0          |1

--- Comment #5 from Steven Munroe ---
I'll take this.
[Bug target/84266] mmintrin.h intrinsic headers for PowerPC code fails on power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84266

--- Comment #4 from Steven Munroe ---
Yup, this looks like a paste-o from the pi16 implementation which was not
caught, as P9 was rare at the time. The #if _ARCH_PWR9 clause is an
optimization based on better timing for P9 (vs P8) for GPR <-> VSR
transfers.

BTW, is there a P9 in the GCC compile farm yet?
[Bug target/83402] PPC64 implementation of ./rs6000/emmintrin.h gives out of range for _mm_slli_epi32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83402

--- Comment #1 from Steven Munroe ---
Similarly _mm_slli_epi64 fails for any const value > 15 and < 32. So:

  if (__builtin_constant_p(__B))
    {
      if (__B < 32)
	lshift = (__v2du) vec_splat_s32(__B);
      else
	lshift = (__v2du) vec_splats((unsigned long long)__B);
    }
  else
    lshift = (__v2du) vec_splats ((unsigned int) __B);

should be something like:

  if (__builtin_constant_p(__B) && (__B < 16))
    {
      lshift = (__v2du) vec_splat_s32(__B);
    }
  else
    lshift = (__v2du) vec_splats ((unsigned int) __B);

It is OK in this case to use a splat word form because the vector shift left
doubleword will only use the low-order 6 bits of each doubleword of the
shift vector.
[Bug c/83402] New: PPC64 implementation of ./rs6000/emmintrin.h gives out of range for _mm_slli_epi32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83402

            Bug ID: 83402
           Summary: PPC64 implementation of ./rs6000/emmintrin.h gives
                    out of range for _mm_slli_epi32
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

The rs6000/emmintrin.h implementation of _mm_slli_epi32 reports:

error: argument 1 must be a 5-bit signed literal

for constant shift values > 15. The implementation uses vec_splat_s32
(Vector Splat Immediate Signed Word) for const shift values to generate the
shift count for vec_vslw (Vector Shift Left Word). This is preferred to the
more expensive vec_splats. But the immediate field of vspltisw is 5 bits,
sign extended, while the shift range must be positive. This limits the
immediate range for vspltisw to 0-15 (not the required 0-31).

The current implementation uses:

  if (__builtin_constant_p(__B))
    lshift = (__v4su) vec_splat_s32(__B);
  else
    lshift = vec_splats ((unsigned int) __B);

so we need something like:

  if (__builtin_constant_p(__B) && (__B < 16))
[Bug testsuite/81539] Bad target in new test case gcc.target/powerpc/mmx-packuswb-1.c from r250432
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81539

Steven Munroe changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED|RESOLVED
         Resolution|---        |FIXED

--- Comment #1 from Steven Munroe ---
Fixed with:

2017-08-24  Steven Munroe

	* gcc.target/powerpc/mmx-packuswb-1.c [NO_WARN_X86_INTRINSICS]:
	Define.  Suppress warning during tests.
[Bug other/81831] New: -Wno-psabi is not documented
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81831 Bug ID: 81831 Summary: -Wno-psabi is not documented Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: munroesj at gcc dot gnu.org Target Milestone: --- The online GCC documentation mentions psABI though out the document, but section "3.8 Options to Request or Suppress Warning" does not describe or even mention -Wno-psabi. This may be an issues for some of the tests I am writing which are consistently generating the warning: In function 'check_union128': /home/sjmunroe/work/gcc-trunk/gcc/gcc/testsuite/gcc.target/powerpc/m128-check.h:89:1: note: the ABI of passing aggregates with 16-byte alignment has changed in GCC 5 I am not sure what this will do if someone asserts -Werror. So it may be appropriate to suppress this warning for ./testsuite/gcc.target/ but when I searched the documentation I could not find any obvious way to do that. Segher tells me that -Wno-psabi will suppress this warning, but this is not documented any where I can find.