[Bug target/116649] New: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116649

            Bug ID: 116649
           Summary: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long bcdadd(vector __int128 a, vector __int128 b, vector __int128 *c)
{
   return __builtin_bcdadd_ov(a, b, 0);
}

creates:

bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        .quad .L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*),.TOC.@tocbase,0
.L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        bcdadd. 2,2,3,0
        mfcr 3,2
        rlwinm 3,3,28,1
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while use of setbc is expected:

        bcdadd. 2,2,3,0
        setbc 3,27
        blr
[Bug target/115973] PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115973

--- Comment #2 from Jens Seifert ---
Assembly that better integrates:

unsigned long long addc_opt(unsigned long long a, unsigned long long b, unsigned long long *res)
{
   unsigned long long rc;
   __asm__("addc %0,%2,%3;\n\tsubfe %1,%1,%1":"=r"(*res),"=r"(rc):"r"(a),"r"(b):"xer");
   return rc + 1;
}

Output:

.L.addc_opt(unsigned long long, unsigned long long, unsigned long long*):
        addc 9,3,4; subfe 3,3,3
        std 9,0(5)
        addi 3,3,1
        blr

Power10 code for __builtin_uaddll_overflow is okay:

unsigned long long addc(unsigned long long a, unsigned long long b, unsigned long long *res)
{
   return __builtin_uaddll_overflow(a, b, res);
}

.L.addc(unsigned long long, unsigned long long, unsigned long long*):
        add 4,3,4
        cmpld 0,4,3
        std 4,0(5)
        setbc 3,0
        blr
[Bug target/115973] New: PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115973

            Bug ID: 115973
           Summary: PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long add(unsigned long long a, unsigned long long b, unsigned long long *ovf)
{
   return __builtin_addcll(a,b,0,ovf);
}

creates:

        mr 9,3
        add 3,3,4
        subfc 9,9,3
        subfe 9,9,9
        neg 9,9
        std 9,0(5)
        blr

Expected addc + addze:

unsigned long long add4(unsigned long long a, unsigned long long b, unsigned long long *ovf)
{
   unsigned long long t, res;
   __asm__("li %0, 0; addc %1,%2,%3; addze %0,%0":"=&r"(res),"=r"(t):"r"(a),"r"(b):"xer");
   *ovf = res;
   return t;
}

Expected assembly:

        li 9, 0
        addc 3,3,4
        addze 9,9
        std 9,0(5)
        blr
[Bug target/115355] [12/13/14/15 Regression] vectorization exposes wrong code on P9 LE starting from r12-4496
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355

--- Comment #10 from Jens Seifert ---
Does this affect both loop vectorization and SLP vectorization? -fno-tree-loop-vectorize prevents loop vectorization and works around this issue. Does the same problem also affect SLP vectorization, which does not take place in this sample? In other words, do I need -fno-tree-loop-vectorize or -fno-tree-vectorize to work around this bug?
[Bug target/115355] PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 --- Comment #1 from Jens Seifert --- Same issue with gcc 13.2.1
[Bug target/115355] New: PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355

            Bug ID: 115355
           Summary: PPCLE: Auto-vectorization creates wrong code for Power9
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input setToIdentity.C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void setToIdentityGOOD(unsigned long long *mVec, unsigned int mLen)
{
   for (unsigned long long i = 0; i < mLen; i++)
   {
      mVec[i] = i;
   }
}

void setToIdentityBAD(unsigned long long *mVec, unsigned int mLen)
{
   for (unsigned int i = 0; i < mLen; i++)
   {
      mVec[i] = i;
   }
}

unsigned long long vec1[100];
unsigned long long vec2[100];

int main(int argc, char *argv[])
{
   unsigned int l = argc > 1 ? atoi(argv[1]) : 29;
   setToIdentityGOOD(vec1, l);
   setToIdentityBAD(vec2, l);
   if (memcmp(vec1, vec2, l*sizeof(vec1[0])) != 0)
   {
      for (unsigned int i = 0; i < l; i++)
      {
         printf("%llu %llu\n", vec1[i], vec2[i]);
      }
   }
   else
   {
      printf("match\n");
   }
   return 0;
}

Fails:
gcc -O3 -mcpu=power9 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr9.exe -mno-isel

Good:
gcc -O3 -mcpu=power8 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr8.exe -mno-isel

"-mno-isel" is only specified to reduce the diff.

Failing output of pwr9.exe:

0 0
1 1
2 0
3 4294967296
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28

4th element contains wrong data.
[Bug target/114376] New: s390: Inefficient __builtin_bswap16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114376

            Bug ID: 114376
           Summary: s390: Inefficient __builtin_bswap16
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned short swap16(unsigned short in)
{
   return __builtin_bswap16(in);
}

generates with -O3 -march=z196:

swap16(unsigned short):
        lrvr %r2,%r2
        srl %r2,16
        llghr %r2,%r2
        br %r14

More efficient for 64-bit is:

unsigned short swap16_2(unsigned short in)
{
   return __builtin_bswap64(in) >> 48;
}

which generates:

swap16_2(unsigned short):
        lrvgr %r2,%r2
        srlg %r2,%r2,48
        br %r14

For 31-bit, lrvr should be used.
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

--- Comment #10 from Jens Seifert ---
Looks like no patch in this area got delivered. I did a small test for:

unsigned long long c() { return 0xULL; }

gcc 13.2.0:

        li 3,0
        ori 3,3,0x
        sldi 3,3,32

expected:

        li 3,-1
        rldic 3,3,32,16

All consecutive ones can be created with li + rldic. The rotate eliminates the bits on the right and the clear eliminates the bits on the left, as described below:

        li t,-1
        rldic d,T,MB,63-ME
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

--- Comment #7 from Jens Seifert ---
What happened? Still waiting for an improvement.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #6 from Jens Seifert ---
The left part of a VSX register overlaps with the floating point registers; that is why no xxpermdi is required and mfvsrd can directly access the (left) part of all VSX registers. The xxpermdi x,y,y,3 indicates to me that gcc prefers the right part of the register, which might also cause the xxpermdi at the beginning. In the end the mystery is why gcc adds 3 xxpermdi to the code.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #4 from Jens Seifert ---
PPCLE with no special option means -mcpu=power8 -maltivec (altivecle to be more precise).

vec_promote(, 1) should be a no-op on ppcle. But the value gets splatted to both the left and right part of the vector register. => 2 unnecessary xxpermdi

The rest of the operations are done on the left and right part.

vec_extract(, 1) should be a no-op on ppcle. But the value gets taken from the right part of the register, which requires a xxpermdi.

Overall 3 unnecessary xxpermdi. Don't know why the right part of the register gets "preferred".
[Bug c++/108560] New: builtin_va_arg_pack_len is documented to return size_t, but actually returns int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108560

            Bug ID: 108560
           Summary: builtin_va_arg_pack_len is documented to return size_t, but actually returns int
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <stddef.h>

bool test(const char *fmt, size_t numTokens, ...)
{
   return __builtin_va_arg_pack_len() != numTokens;
}

Compiled with -Wsign-compare results in:

<source>: In function 'bool test(const char*, size_t, ...)':
<source>:5:40: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
    5 |    return __builtin_va_arg_pack_len() != numTokens;
<source>:5:37: error: invalid use of '__builtin_va_arg_pack_len ()'
    5 |    return __builtin_va_arg_pack_len() != numTokens;
Compiler returned: 1

Documentation: https://gcc.gnu.org/onlinedocs/gcc/Constructing-Calls.html indicates a size_t return type:

Built-in Function: size_t __builtin_va_arg_pack_len ()
[Bug target/108396] New: PPCLE: vec_vsubcuq missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

            Bug ID: 108396
           Summary: PPCLE: vec_vsubcuq missing
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

#include <altivec.h>

vector unsigned __int128 vsubcuq(vector unsigned __int128 a, vector unsigned __int128 b)
{
   return vec_vsubcuq(a, b);
}

Command line: gcc -m64 -O2 -maltivec -mcpu=power8 text.C

Output:

<source>: In function '__vector unsigned __int128 vsubcuq(__vector unsigned __int128, __vector unsigned __int128)':
<source>:6:12: error: 'vec_vsubcuq' was not declared in this scope; did you mean 'vec_vsubcuqP'?
    6 |    return vec_vsubcuq(a, b);
Compiler returned: 1
[Bug target/108049] s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049 --- Comment #1 from Jens Seifert --- Sample above got compiled with -march=z196
[Bug target/108049] New: s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049

            Bug ID: 108049
           Summary: s390: Compiler adds extra zero extend after xoring 2 zero extended values
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Same issue for PPC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

hash(unsigned char const*):
        llgc %r4,1(%r2)      <= zero extends to 64-bit
        lgrl %r1,.LC0
        llgc %r3,0(%r2)      <= zero extends to 64-bit
        xilf %r3,1
        llgc %r3,0(%r3,%r1)
        xr %r3,%r4           <= should be 64-bit xor
        llgc %r4,2(%r2)      <= zero extends to 64-bit
        llgcr %r3,%r3        <= unnecessary
        llgc %r2,3(%r2)
        llgc %r3,0(%r3,%r1)
        xr %r3,%r4           <= should be 64-bit xor
        llgcr %r3,%r3        <= unnecessary
        llgc %r3,0(%r3,%r1)  <= zero extends to 64-bit
        xrk %r2,%r3,%r2      <= should be 64-bit xor
        llgcr %r2,%r2        <= unnecessary
        llgc %r2,0(%r2,%r1)
        br %r14

Smaller sample:

unsigned long long tiny2(const unsigned char *inp)
{
   unsigned long long a = inp[0];
   unsigned long long b = inp[1];
   return a ^ b;
}

tiny2(unsigned char const*):
        llgc %r1,0(%r2)
        llgc %r2,1(%r2)
        xrk %r2,%r1,%r2
        llgcr %r2,%r2
        br %r14
[Bug rtl-optimization/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 --- Comment #3 from Jens Seifert --- *** Bug 108048 has been marked as a duplicate of this bug. ***
[Bug target/108048] PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

Jens Seifert changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #1 from Jens Seifert ---
duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

*** This bug has been marked as a duplicate of bug 107949 ***
[Bug target/108048] New: PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

            Bug ID: 108048
           Summary: PPCLE: gcc does not recognize that lbzx does zero extend
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

Generates:

hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 3,10,9
        blr

All XOR operations are done in unsigned long long (64-bit). gcc adds unnecessary rlwinm. lbz and lbzx do zero extension (no cleanup of upper bits required).
[Bug target/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

--- Comment #1 from Jens Seifert ---
hash2 is only provided to show how the code should look (without rlwinm).
[Bug target/107949] New: PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

            Bug ID: 107949
           Summary: PPC: Unnecessary rlwinm after lbzx
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

#ifdef __powerpc__
#define lbzx(b,c) ({ unsigned long long r; __asm__("lbzx %0,%1,%2":"=r"(r):"b"(b),"r"(c)); r; })

unsigned int hash2(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = lbzx(magic1, inp[0] ^ h1);
   h1 = lbzx(magic1, inp[1] ^ h1);
   h1 = lbzx(magic1, inp[2] ^ h1);
   h1 = lbzx(magic1, inp[3] ^ h1);
   return h1;
}
#endif

Extra rlwinm instructions get added.
hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 3,10,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

hash2(unsigned char const*):
.LCF1:
        addi 2,2,.TOC.-.LCF1@l
        lbz 7,0(3)
        lbz 8,1(3)
        lbz 10,2(3)
        lbz 6,3(3)
        addis 9,2,.LC1@toc@ha
        ld 9,.LC1@toc@l(9)
        xori 7,7,0x1
        lbzx 7,9,7
        xor 8,8,7
        lbzx 8,9,8
        xor 10,10,8
        lbzx 10,9,10
        xor 10,6,10
        lbzx 3,9,10
        rldicl 3,3,0,32
        blr

Tiny sample:

unsigned long long tiny(const unsigned char *inp)
{
   return inp[0] ^ inp[1];
}

tiny(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

unsigned long long tiny2(const unsigned char *inp)
{
   unsigned long long a = inp[0];
   unsigned long long b = inp[1];
   return a ^ b;
}

tiny2(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

lbz/lbzx creates a value 0 <= x < 256. The xor of 2 such values does not change the value range.
[Bug target/107757] New: PPCLE: Inefficient vector constant creation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107757

            Bug ID: 107757
           Summary: PPCLE: Inefficient vector constant creation
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Due to the fact that vslw, vsld, vsrd, ... only use the shift amount modulo the bit width, the combination with an all-ones (0xFF..FF) vector can be used to create vector constants for:

vec_splats(-0.0) or vec_splats(1ULL << 63), and scalar -0.0
vec_splats(-0.0f) or vec_splats(1U << 31)
vec_splats((short)0x8000)

with only 2 2-cycle vector instructions.

Sample:

vector long long lsb64() { return vec_splats(1LL); }

creates:

lsb64():
.LCF5:
        addi 2,2,.TOC.-.LCF5@l
        addis 9,2,.LC12@toc@ha
        addi 9,9,.LC12@toc@l
        lvx 2,0,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while:

vector long long lsb64_opt()
{
   vector long long a = vec_splats(~0LL);
   __asm__("vsrd %0,%0,%0":"=v"(a):"v"(a),"v"(a));
   return a;
}

creates:

lsb64_opt():
        vspltisw 2,-1
        vsrd 2,2,2
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
[Bug target/86160] Implement isinf on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86160

--- Comment #4 from Jens Seifert ---
I am looking forward to getting the Power9 optimization using xststdcdp etc.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #2 from Jens Seifert ---
vec_extract(vr, 1) should extract the left element. But xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE and not a problem regarding unnecessary xxpermdi.

Using assembly for the subtract:

int cmp3(double a, double b)
{
   vector double va = vec_promote(a, 0);
   vector double vb = vec_promote(b, 0);
   vector long long vlt = (vector long long)vec_cmplt(va, vb);
   vector long long vgt = (vector long long)vec_cmplt(vb, va);
   vector signed long long vr;
   __asm__ volatile("vsubudm %0,%1,%2":"=v"(vr):"v"(vlt),"v"(vgt):);
   //vector signed long long vr = vec_sub(vlt, vgt);
   return vec_extract(vr, 1);
}

generates:

_Z4cmp3dd:
.LFB2:
        .cfi_startproc
        xxpermdi 1,1,1,0
        xxpermdi 2,2,2,0
        xvcmpgtdp 32,2,1
        xvcmpgtdp 33,1,2
#APP
 # 34 "cmpdouble.C" 1
        vsubudm 0,0,1
 # 0 "" 2
#NO_APP
        mfvsrd 3,32
        extsw 3,3

Looks like the compiler knows about vec_promote doing a splat, and at the end it extracts the non-preferred right element instead of the expected left element.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770 --- Comment #1 from Jens Seifert --- vec_extract(vr, 1) should extract the left element. But xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE and not a problem regarding unnecessary xxpermdi.
[Bug target/106770] New: PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

            Bug ID: 106770
           Summary: PPCLE: Unnecessary xxpermdi before mfvsrd
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <altivec.h>

int cmp2(double a, double b)
{
   vector double va = vec_promote(a, 1);
   vector double vb = vec_promote(b, 1);
   vector long long vlt = (vector long long)vec_cmplt(va, vb);
   vector long long vgt = (vector long long)vec_cmplt(vb, va);
   vector signed long long vr = vec_sub(vlt, vgt);
   return vec_extract(vr, 1);
}

Generates:

_Z4cmp2dd:
.LFB1:
        .cfi_startproc
        xxpermdi 1,1,1,0
        xxpermdi 2,2,2,0
        xvcmpgtdp 33,2,1
        xvcmpgtdp 32,1,2
        vsubudm 0,1,0
        xxpermdi 0,32,32,3
        mfvsrd 3,0
        extsw 3,3
        blr

The unnecessary xxpermdi for vec_promote are already reported in another bugzilla case.

mfvsrd can access all 64 vector registers directly and xxpermdi is not required:

        mfvsrd 3,32

expected instead of:

        xxpermdi 0,32,32,3
        mfvsrd 3,0
[Bug target/106769] New: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

            Bug ID: 106769
           Summary: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <altivec.h>

unsigned int extr(vector unsigned int v)
{
   return vec_extract(v, 2);
}

Generates:

_Z4extrDv4_j:
.LFB1:
        .cfi_startproc
        mfvsrwz 3,34
        rldicl 3,3,0,32
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

The rldicl is not necessary as mfvsrwz already wiped out the upper 32 bits of the register.
[Bug target/106701] New: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

            Bug ID: 106701
           Summary: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long subfic(unsigned long long a)
{
   if (a > 15) __builtin_unreachable();
   return 15 - a;
}

With clang on x86 the subtract from immediate gets translated to xor:

_Z6subficy:                             # @_Z6subficy
        mov rax, rdi
        xor rax, 15
        ret

Platforms like s390 and x86, which have no subtract-from-immediate instruction, would benefit from this optimization. gcc currently generates:

_Z6subficy:
        lghi %r1,15
        sgr %r1,%r2
        lgr %r2,%r1
        br %r14
[Bug target/106598] New: s390: Inefficient branchless conditionals for int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106598

            Bug ID: 106598
           Summary: s390: Inefficient branchless conditionals for int
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int lt(int a, int b) { return a < b; }

generates:

        cr %r2,%r3
        lhi %r1,1
        lhi %r2,0
        locrnl %r1,%r2
        lgfr %r2,%r1
        br %r14

int ltOpt(int a, int b)
{
   long long x = a;
   long long y = b;
   return ((unsigned long long)(x - y)) >> 63;
}

better:

        sgr %r2,%r3
        srlg %r2,%r2,63
        br %r14

int ltMask(int a, int b) { return -(a < b); }

generates:

        cr %r2,%r3
        lhi %r1,1
        lhi %r2,0
        locrnl %r1,%r2
        sllg %r1,%r1,63
        srag %r2,%r1,63

int ltMaskOpt(int a, int b)
{
   long long x = a;
   long long y = b;
   return (x - y) >> 63;
}

better:

        sgr %r2,%r3
        srag %r2,%r2,63
        br %r14

int leMask(int a, int b) { return -(a <= b); }

generates:

        cr %r2,%r3
        lhi %r1,1
        lhi %r2,0
        locrnle %r1,%r2
        sllg %r1,%r1,63
        srag %r2,%r1,63
        br %r14

int leMaskOpt(int a, int b)
{
   int c;
   __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
   // slbgr creates a 64-bit mask => lgfr would not be required
   return c;
}

better:

        cr %r2,%r3
        slbgr %r2,%r2
        lgfr %r2,%r2    <= not necessary
        br %r14

int le(int a, int b) { return a <= b; }

generates:

        cr %r2,%r3
        lhi %r1,1
        lhi %r2,0
        locrnle %r1,%r2
        lgfr %r2,%r1
        br %r14

int leOpt(int a, int b)
{
   unsigned long long c;
   __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
   return (c >> 63);
}

better:

        cr %r2,%r3
        slbgr %r2,%r2
        srlg %r2,%r2,63
        br %r14
[Bug target/106592] New: s390: Inefficient branchless conditionals for long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106592

            Bug ID: 106592
           Summary: s390: Inefficient branchless conditionals for long long
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 53443
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53443&action=edit
source code

long long gtRef(long long a, long long b) { return a > b; }

Generates:

        cgr %r2,%r3
        lghi %r1,0
        lghi %r2,1
        locgrnh %r2,%r1

Better sequence:

        cgr %r2,%r3
        lghi %r2,0
        alcgr %r2,%r2

long long leMaskRef(long long a, long long b) { return -(a <= b); }

Generates:

        cgr %r2,%r3
        lhi %r1,0
        lhi %r2,1
        locrnle %r2,%r1
        sllg %r2,%r2,63
        srag %r2,%r2,63

Better sequence:

        cgr %r2,%r3
        slbgr %r2,%r2

long long gtMaskRef(long long a, long long b) { return -(a > b); }

Generates:

        cgr %r2,%r3
        lhi %r1,0
        lhi %r2,1
        locrnh %r2,%r1
        sllg %r2,%r2,63
        srag %r2,%r2,63

Better sequence:

        cgr %r2,%r3
        lghi %r2,0
        alcgr %r2,%r2
        lcgr %r2,%r2
[Bug target/106536] New: P9: gcc does not detect setb pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106536

            Bug ID: 106536
           Summary: P9: gcc does not detect setb pattern
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int compare2(unsigned long long a, unsigned long long b)
{
   return (a > b ? 1 : (a < b ? -1 : 0));
}

Output:

_Z8compare2yy:
        cmpld 0,3,4
        bgt 0,.L5
        mfcr 3,128
        rlwinm 3,3,1,1
        neg 3,3
        blr
.L5:
        li 3,1
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

clang generates:

_Z8compare2yy:                          # @_Z8compare2yy
        cmpld 3, 4
        setb 3, 0
        extsw 3, 3
        blr
        .long 0
        .quad 0
[Bug target/106525] New: s390: Inefficient branchless conditionals for unsigned long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106525

            Bug ID: 106525
           Summary: s390: Inefficient branchless conditionals for unsigned long long
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 53409
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53409&action=edit
source code

1) -(a > b)

        clgr %r2,%r3
        lhi %r2,0
        alcr %r2,%r2
        sllg %r2,%r2,63
        srag %r2,%r2,63

Last 2 could be merged to LCDFR. But optimal is:

        slgrk %r2,%r3,%r2
        slbgr %r2,%r2
        lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

2) -(a <= b)

        slgr %r3,%r2
        lhi %r2,0
        alcr %r2,%r2
        sllg %r2,%r2,63
        srag %r2,%r2,63

Last 2 could be merged to LCDFR. But optimal is:

        clgr %r2,%r3
        slbgr %r2,%r2
        lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

3) unsigned 64-bit compare for qsort: (a > b) - (a < b)

        clgr %r2,%r3
        lhi %r1,0
        alcr %r1,%r1
        clgr %r3,%r2
        lhi %r2,0
        alcr %r2,%r2
        srk %r2,%r1,%r2
        lgfr %r2,%r2

Optimal:

        slgrk %r1,%r2,%r3
        slgrk 0,%r3,%r2
        slbgr %r2,%r3
        slbgr %r1,%r2
        lgfr %r2,%r1

Note: lgfr not required => 4 instructions only.
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 Jens Seifert changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #2 from Jens Seifert --- Also found in altivec.h
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 --- Comment #1 from Jens Seifert --- Found in documentation: https://gcc.gnu.org/onlinedocs/gcc-11.3.0/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html#PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1
[Bug c/106043] New: Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

            Bug ID: 106043
           Summary: Power10: lacking vec_blendv builtins
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Missing builtins for the vector instructions xxblendvb, xxblendvh, xxblendvw, xxblendvd.

#include <altivec.h>

vector int blendv(vector int a, vector int b, vector int c)
{
   return vec_blendv(a, b, c);
}
[Bug target/104268] New: 390: inefficient vec_popcnt for 16-bit for z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104268

            Bug ID: 104268
           Summary: 390: inefficient vec_popcnt for 16-bit for z13
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <vecintrin.h>

vector unsigned short popcnt(vector unsigned short a)
{
   return vec_popcnt(a);
}

Generates with -march=z13:

_Z6popcntDv8_t:
.LFB1:
        .cfi_startproc
        vzero %v0
        vpopct %v24,%v24,0
        vleib %v0,8,7
        vsrlb %v0,%v24,%v0
        vab %v24,%v24,%v0
        vgbm %v0,21845
        vn %v24,%v24,%v0
        br %r14
        .cfi_endproc

Optimal sequence would be:

vector unsigned short popcnt_opt(vector unsigned short a)
{
   vector unsigned short r = (vector unsigned short)vec_popcnt((vector unsigned char)a);
   vector unsigned short b = vec_rli(r, 8);
   r = r + b;
   r = r >> 8;
   return r;
}

_Z10popcnt_optDv8_t:
.LFB3:
        .cfi_startproc
        vpopct %v24,%v24,0
        verllh %v0,%v24,8
        vah %v24,%v0,%v24
        vesrlh %v24,%v24,8
        br %r14
        .cfi_endproc
[Bug target/103743] New: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103743

            Bug ID: 103743
           Summary: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int overflow();

int negOverflow(long long in)
{
   if (in == 0x8000LL)
   {
      return overflow();
   }
   return 0;
}

Generates:

negOverflow(long long):
        .quad .L.negOverflow(long long),.TOC.@tocbase,0
.L.negOverflow(long long):
        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9
        beq 0,.L10
        li 3,0
        blr
.L10:
        mflr 0
        std 0,16(1)
        stdu 1,-112(1)
        bl overflow()
        nop
        addi 1,1,112
        ld 0,16(1)
        mtlr 0
        blr
        .long 0
        .byte 0,9,0,1,128,0,0,0

Instead of:

        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9

Expected output:

        rotldi 3,3,1
        cmpdi 0,3,1

This should only be applied if the constant fits into 16 bits and if those 16 bits are in the first 32 bits.
[Bug target/103731] New: 390: inefficient 64-bit constant generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103731

            Bug ID: 103731
           Summary: 390: inefficient 64-bit constant generation
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long M8()
{
   return 0x;
}

Generates:

.LC0:
        .quad 0x
        .text
        .align 8
        .globl _Z2M8v
        .type _Z2M8v, @function
_Z2M8v:
.LFB0:
        .cfi_startproc
        lgrl %r2,.LC0
        br %r14
        .cfi_endproc

Expected 2 instructions: load immediate + insert immediate (IIHF) instead of a load from the literal pool.
[Bug target/103106] New: PPC: Missing builtin for P9 vmsumudm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103106

            Bug ID: 103106
           Summary: PPC: Missing builtin for P9 vmsumudm
           Product: gcc
           Version: 8.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

I can't find a builtin for the vmsumudm instruction. I also found nothing in the Power Vector Intrinsic Programming Reference:
https://openpowerfoundation.org/?resource_lib=power-vector-intrinsic-programming-reference
[Bug target/102265] New: s390: Inefficient code for __builtin_ctzll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102265

            Bug ID: 102265
           Summary: s390: Inefficient code for __builtin_ctzll
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long ctzll(unsigned long long x)
{
   return __builtin_ctzll(x);
}

creates:

        lcgr %r1,%r2
        ngr %r2,%r1
        lghi %r1,63
        flogr %r2,%r2
        sgrk %r2,%r1,%r2
        lgfr %r2,%r2
        br %r14

The optimal sequence for z15 uses population count; for all others, use ^ 63 instead of 63 -.

unsigned long long ctzll_opt(unsigned long long x)
{
#if __ARCH__ >= 13
   return __builtin_popcountll((x-1) & ~x);
#else
   return __builtin_clzll(x & -x) ^ 63;
#endif
}

< z15:

        lcgr %r1,%r2
        ngr %r2,%r1
        flogr %r2,%r2
        xilf %r2,63
        lgfr %r2,%r2
        br %r14

=> 1 instruction saved.

z15:

        .cfi_startproc
        lay %r1,-1(%r2)
        ncgrk %r2,%r1,%r2
        popcnt %r2,%r2,8
        br %r14
        .cfi_endproc

=> On z15 only 3 instructions required.
[Bug target/102117] s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117

--- Comment #1 from Jens Seifert ---
Sorry, small bug in the optimal sequence.

__int128 imul128_opt(long long a, long long b)
{
   unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
   unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
   unsigned long long t1 = (a >> 63) & b;
   unsigned long long t2 = (b >> 63) & a;
   unsigned __int128 u128 = x * y;
   unsigned long long hi = (u128 >> 64) - (t1 + t2);
   unsigned long long lo = (unsigned long long)u128;
   unsigned __int128 res = hi;
   res <<= 64;
   res |= lo;
   return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr %f2,%r12
        .cfi_register 12, 17
        ldgr %f0,%r13
        .cfi_register 13, 16
        lgr %r13,%r3
        mlgr %r12,%r4
        srag %r1,%r3,63
        ngr %r1,%r4
        srag %r4,%r4,63
        ngr %r4,%r3
        agr %r4,%r1
        sgrk %r4,%r12,%r4
        stg %r13,8(%r2)
        lgdr %r12,%f2
        .cfi_restore 12
        lgdr %r13,%f0
        .cfi_restore 13
        stg %r4,0(%r2)
        br %r14
        .cfi_endproc
[Bug target/102117] New: s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117 Bug ID: 102117 Summary: s390: Inefficient code for 64x64=128 signed multiply for <= z13 Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

__int128 imul128(long long a, long long b) { return (__int128)a * (__int128)b; }

creates a sequence with 3 multiplies:

_Z7imul128xx:
.LFB0:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        srag    %r1,%r3,63
        msgr    %r1,%r4
        srag    %r4,%r4,63
        msgr    %r4,%r3
        agr     %r4,%r1
        agr     %r12,%r4
        stmg    %r12,%r13,0(%r2)
        lgdr    %r13,%f0
        .cfi_restore 13
        lgdr    %r12,%f2
        .cfi_restore 12
        br      %r14
        .cfi_endproc

The following sequence only requires 1 multiply:

__int128 imul128_opt(long long a, long long b)
{
   unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
   unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
   unsigned long long t1 = (a >> 63) & a;
   unsigned long long t2 = (b >> 63) & b;
   unsigned __int128 u128 = x * y;
   unsigned long long hi = (u128 >> 64) - (t1 + t2);
   unsigned long long lo = (unsigned long long)u128;
   unsigned __int128 res = hi;
   res <<= 64;
   res |= lo;
   return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        lgr     %r1,%r3
        srag    %r3,%r3,63
        ngr     %r3,%r1
        srag    %r1,%r4,63
        ngr     %r4,%r1
        agr     %r3,%r4
        sgrk    %r3,%r12,%r3
        stg     %r13,8(%r2)
        lgdr    %r12,%f2
        .cfi_restore 12
        lgdr    %r13,%f0
        .cfi_restore 13
        stg     %r3,0(%r2)
        br      %r14
        .cfi_endproc
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #9 from Jens Seifert --- I know that if I used the vec_perm builtin as an end user, you would then need to conform to the LE specification, but you can always optimize the code as you like as long as it produces correct results afterwards. A "load constant" followed by xxlnor can always be transformed into loading the inverse constant.
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #7 from Jens Seifert --- Regarding vec_revb for vector unsigned int, I agree that

revb:
.LFB0:
        .cfi_startproc
        vspltish %v1,8
        vspltisw %v0,-16
        vrlh    %v2,%v2,%v1
        vrlw    %v2,%v2,%v0
        blr

works. But in this case I would prefer the vperm approach, assuming that the loaded constant for the permute vector can be re-used multiple times. But please get rid of the xxlnor 32,32,32; it does not make sense after loading a constant. Change the constant that needs to be loaded instead.
[Bug target/101041] New: z13: Inefficient handling of vector register passed to function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101041 Bug ID: 101041 Summary: z13: Inefficient handling of vector register passed to function Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

#include <vecintrin.h>

vector unsigned long long mul64(vector unsigned long long a, vector unsigned long long b)
{
   return a * b;
}

creates:

_Z5mul64Dv2_yS_:
.LFB9:
        .cfi_startproc
        ldgr    %f4,%r15
        .cfi_register 15, 18
        lay     %r15,-192(%r15)
        .cfi_def_cfa_offset 352
        vst     %v24,160(%r15),3
        vst     %v26,176(%r15),3
        lg      %r2,160(%r15)
        lg      %r1,176(%r15)
        lgr     %r4,%r2
        lg      %r0,168(%r15)
        lgr     %r2,%r1
        lg      %r1,184(%r15)
        lgr     %r5,%r0
        lgr     %r3,%r1
        vlvgp   %v2,%r4,%r5
        vlvgp   %v0,%r2,%r3
        vlgvg   %r4,%v2,0
        vlgvg   %r1,%v2,1
        vlgvg   %r2,%v0,0
        vlgvg   %r3,%v0,1
        msgr    %r2,%r4
        msgr    %r1,%r3
        lgdr    %r15,%f4
        .cfi_restore 15
        .cfi_def_cfa_offset 160
        vlvgp   %v24,%r2,%r1
        br      %r14

The code stores v24/v26 to the stack, then uses lg+lgr for all 4 parts, then constructs new vector registers v0 and v2, and then extracts the 4 elements again using vlgvg.

Expected: 4 * vlgvg + 2 * msgr + vlvgp.
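The desired semantics are simply an independent 64-bit multiply per lane, which GCC's generic vector extension expresses portably. A sketch (the type name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit lanes, as in the z13 `vector unsigned long long` case. */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* Elementwise multiply; each lane keeps the low 64 bits of its product. */
static v2du mul64(v2du a, v2du b) {
    return a * b;
}
```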
[Bug target/100930] New: PPC: Missing builtins for P9 vextsb2w, vextsb2w, vextsb2d, vextsh2d, vextsw2d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100930 Bug ID: 100930 Summary: PPC: Missing builtins for P9 vextsb2w, vextsb2w, vextsb2d, vextsh2d, vextsw2d Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Using the same names as xlC would be appreciated: vec_extsbd, vec_extsbw, vec_extshd, vec_extshw, vec_extswd
[Bug target/100926] New: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100926 Bug ID: 100926 Summary: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9 Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input:

vector unsigned short load_be(unsigned short *c)
{
   return vec_xl_be(0L, c);
}

creates:

_Z7load_bePt:
.LFB6:
        .cfi_startproc
.LCF6:
0:      addis 2,12,.TOC.-.LCF6@ha
        addi 2,2,.TOC.-.LCF6@l
        .localentry     _Z7load_bePt,.-_Z7load_bePt
        addis 9,2,.LC4@toc@ha
        lxvw4x 34,0,3
        addi 9,9,.LC4@toc@l
        lvx 0,0,9
        vperm 2,2,2,0
        blr

Optimal sequence:

vector unsigned short load_be_opt2(unsigned short *c)
{
   vector signed int vneg16;
   __asm__("vspltisw %0,-16" : "=v"(vneg16));
   vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
   tmp = vec_rl(tmp, (vector unsigned int)vneg16);
   return (vector unsigned short)tmp;
}

creates:

_Z12load_be_opt2Pt:
.LFB8:
        .cfi_startproc
        lxvw4x 34,0,3
#APP
 # 77 "vec.C" 1
        vspltisw 0,-16
 # 0 "" 2
#NO_APP
        vrlw 2,2,0
        blr

Rotate left by -16 equals rotate right by +16, as only the low 5 bits get evaluated.

Please note that the inline assembly is required because vec_splats(-16) gets converted into very inefficient constant generation:

vector unsigned short load_be_opt(unsigned short *c)
{
   vector signed int vneg16 = vec_splats(-16);
   vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
   tmp = vec_rl(tmp, (vector unsigned int)vneg16);
   return (vector unsigned short)tmp;
}

creates:

_Z11load_be_optPt:
.LFB7:
        .cfi_startproc
        li 9,48
        lxvw4x 34,0,3
        vspltisw 0,0
        mtvsrd 33,9
        xxspltw 33,33,1
        vsubuwm 0,0,1
        vrlw 2,2,0
        blr
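The -16 trick works because vrlw evaluates only the low 5 bits of each rotate count, so -16 and +16 coincide. A scalar sketch of that wrap-around (the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n, masking the count to 5 bits the
   way vrlw does per element; a "negative" count therefore wraps. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}
```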
[Bug target/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808 --- Comment #3 from Jens Seifert ---

> - Avoid additional "int"
>   unsigned long long int => unsigned long long
>
> Why? Those are exactly the same types!

Yes, but the rest of the documentation uses unsigned long long. This is just for consistency with the existing documentation.
[Bug target/100871] New: z14: vec_doublee maps to wrong builtin in vecintrin.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100871 Bug ID: 100871 Summary: z14: vec_doublee maps to wrong builtin in vecintrin.h Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

#include <vecintrin.h>

Input:

vector double doublee(vector float a)
{
   return vec_doublee(a);
}

causes a compile error:

vec.C: In function ‘__vector(2) double doublee(__vector(4) float)’:
vec.C:43:10: error: ‘__builtin_s390_vfll’ was not declared in this scope; did you mean ‘__builtin_s390_vflls’?
   43 |    return vec_doublee(a);
      |           ^~~~
      |           __builtin_s390_vflls

vec_doublee in vecintrin.h should call __builtin_s390_vflls:

vector double doublee_fix(vector float a)
{
   return __builtin_s390_vflls(a);
}
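For reference, vec_doublee widens the even-indexed float elements of its argument to doubles. A scalar model of that behavior (the function name is mine):

```c
#include <assert.h>

/* Widen elements 0 and 2 of a 4-float vector to two doubles,
   mirroring vec_doublee's even-element selection. */
static void doublee_ref(const float in[4], double out[2]) {
    out[0] = (double)in[0];
    out[1] = (double)in[2];
}
```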
[Bug target/100869] New: z13: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100869 Bug ID: 100869 Summary: z13: Inefficient code for vec_reve(vector double) Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input:

vector double reve(vector double a) { return vec_reve(a); }

creates:

_Z4reveDv2_d:
.LFB3:
        .cfi_startproc
        larl    %r5,.L12
        vl      %v0,.L13-.L12(%r5),3
        vperm   %v24,%v24,%v24,%v0
        br      %r14

Optimal code sequence:

vector double reve_z13(vector double a) { return vec_permi(a, a, 2); }

creates:

_Z6reve_2Dv2_d:
.LFB6:
        .cfi_startproc
        vpdi    %v24,%v24,%v24,4
        br      %r14
        .cfi_endproc
[Bug target/100868] New: PPC: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100868 Bug ID: 100868 Summary: PPC: Inefficient code for vec_reve(vector double) Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input:

vector double reve(vector double a) { return vec_reve(a); }

creates:

_Z4reveDv2_d:
.LFB3:
        .cfi_startproc
.LCF3:
0:      addis 2,12,.TOC.-.LCF3@ha
        addi 2,2,.TOC.-.LCF3@l
        .localentry     _Z4reveDv2_d,.-_Z4reveDv2_d
        addis 9,2,.LC2@toc@ha
        addi 9,9,.LC2@toc@l
        lvx 0,0,9
        xxlnor 32,32,32
        vperm 2,2,2,0
        blr

Optimal sequence would be:

vector double reve_pwr7(vector double a) { return vec_xxpermdi(a, a, 2); }

which creates:

_Z9reve_pwr7Dv2_d:
.LFB4:
        .cfi_startproc
        xxpermdi 34,34,34,2
        blr
[Bug target/100867] New: z13: Inefficient code for vec_revb(vector unsigned short)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100867 Bug ID: 100867 Summary: z13: Inefficient code for vec_revb(vector unsigned short) Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a) { return vec_revb(a); }

creates:

_Z4revbDv4_j:
.LFB1:
        .cfi_startproc
        larl    %r5,.L4
        vl      %v0,.L5-.L4(%r5),3
        vperm   %v24,%v24,%v24,%v0
        br      %r14

Optimal code sequence:

vector unsigned short revb_z13(vector unsigned short a) { return vec_rli(a, 8); }

creates:

_Z8revb_z13Dv8_t:
.LFB5:
        .cfi_startproc
        verllh  %v24,%v24,8
        br      %r14
        .cfi_endproc
[Bug target/100866] New: PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 Bug ID: 100866 Summary: PPC: Inefficient code for vec_revb(vector unsigned short) < P9 Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a) { return vec_revb(a); }

creates:

_Z4revbDv8_t:
.LFB1:
        .cfi_startproc
.LCF1:
0:      addis 2,12,.TOC.-.LCF1@ha
        addi 2,2,.TOC.-.LCF1@l
        .localentry     _Z4revbDv8_t,.-_Z4revbDv8_t
        addis 9,2,.LC1@toc@ha
        addi 9,9,.LC1@toc@l
        lvx 0,0,9
        xxlnor 32,32,32
        vperm 2,2,2,0
        blr

Optimal code sequence:

vector unsigned short revb_pwr7(vector unsigned short a)
{
   return vec_rl(a, vec_splats((unsigned short)8));
}

_Z9revb_pwr7Dv8_t:
.LFB2:
        .cfi_startproc
        .localentry     _Z9revb_pwr7Dv8_t,1
        vspltish 0,8
        vrlh 2,2,0
        blr
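The rotate replacement works because byte-swapping a 16-bit element is exactly a rotate by 8, which is all vec_revb has to do per halfword. A scalar sketch (the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Rotating a 16-bit value by 8 swaps its two bytes, matching what
   vrlh with a splat of 8 does in every lane. */
static uint16_t bswap16_via_rot(uint16_t x) {
    return (uint16_t)((x << 8) | (x >> 8));
}
```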
[Bug c/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808 --- Comment #1 from Jens Seifert --- https://gcc.gnu.org/onlinedocs/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html

vector unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

should be

unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

The vgnb instruction returns its result in a GPR.
[Bug c++/100809] PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809 --- Comment #1 from Jens Seifert --- Same applies to modulo.
[Bug c++/100809] New: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809 Bug ID: 100809 Summary: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- unsigned __int128 div(unsigned __int128 a, unsigned __int128 b) { return a/b; } __int128 div(__int128 a, __int128 b) { return a/b; } gcc -mcpu=power10 -save-temps -O2 int128.C Output: _Z3divoo: .LFB0: .cfi_startproc .localentry _Z3divoo,1 mflr 0 std 0,16(1) stdu 1,-32(1) .cfi_def_cfa_offset 32 .cfi_offset 65, 16 bl __udivti3@notoc addi 1,1,32 .cfi_def_cfa_offset 0 ld 0,16(1) mtlr 0 .cfi_restore 65 blr .long 0 .byte 0,9,0,1,128,0,0,0 .cfi_endproc .LFE0: .size _Z3divoo,.-_Z3divoo .globl __divti3 .align 2 .p2align 4,,15 .globl _Z3divnn .type _Z3divnn, @function _Z3divnn: .LFB1: .cfi_startproc .localentry _Z3divnn,1 mflr 0 std 0,16(1) stdu 1,-32(1) .cfi_def_cfa_offset 32 .cfi_offset 65, 16 bl __divti3@notoc addi 1,1,32 .cfi_def_cfa_offset 0 ld 0,16(1) mtlr 0 .cfi_restore 65 blr .long 0 .byte 0,9,0,1,128,0,0,0 .cfi_endproc Expected is the use of vdivsq/vdivuq. GCC version: /opt/rh/devtoolset-10/root/usr/bin/gcc -v Using built-in specs. 
COLLECT_GCC=/opt/rh/devtoolset-10/root/usr/bin/gcc COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-10/root/usr/libexec/gcc/ppc64le-redhat-linux/10/lto-wrapper Target: ppc64le-redhat-linux Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-10/root/usr --mandir=/opt/rh/devtoolset-10/root/usr/share/man --infodir=/opt/rh/devtoolset-10/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-10.2.1-20200804/obj-ppc64le-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux Thread model: posix Supported LTO compression algorithms: zlib gcc version 10.2.1 20200804 (Red Hat 10.2.1-2) (GCC)
[Bug c/100808] New: PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808 Bug ID: 100808 Summary: PPC: ISA 3.1 builtin documentation Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc/Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1.html#Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1

Please improve the documentation:
- Avoid the additional "int": unsigned long long int => unsigned long long
- Add missing line breaks between builtins
- Remove semicolons
[Bug target/100694] New: PPC: initialization of __int128 is very inefficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694 Bug ID: 100694 Summary: PPC: initialization of __int128 is very inefficient Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Initializing a __int128 from 2 64-bit integers is implemented very inefficiently. The most natural code, which works well on all other platforms, generates 2 additional "li 0" + 2 "or" instructions:

void test2(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
   unsigned __int128 i = hi;
   i <<= 64;
   i |= lo;
   *res = i;
}

_Z5test2Poyy:
.LFB15:
        .cfi_startproc
        li 8,0
        li 11,0
        or 10,5,8
        or 11,11,4
        std 10,0(3)
        std 11,8(3)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

While for the above sample "+" instead of "|" solves the issue, it generates addc+addze in other, more complicated scenarios. The ugliest workaround I can think of is what I now use:

void test4(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
   union
   {
      unsigned __int128 i;
      struct
      {
         unsigned long long lo;
         unsigned long long hi;
      } s;
   } u;
   u.s.lo = lo;
   u.s.hi = hi;
   *res = u.i;
}

This generates the expected code sequence in all cases I have looked at:

_Z5test4Poyy:
.LFB17:
        .cfi_startproc
        std 5,0(3)
        std 4,8(3)
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

Please fold "li 0" + "or" into a no-op.
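Both forms must compute the same value; the report is only about the quality of the generated code. A portable check, assuming a compiler with __int128 (the union variant additionally assumes a little-endian host, as on ppc64le):

```c
#include <assert.h>
#include <stdint.h>

/* The "natural" shift-or construction from the report. */
static unsigned __int128 make128_shift(uint64_t hi, uint64_t lo) {
    unsigned __int128 i = hi;
    i <<= 64;
    i |= lo;
    return i;
}

/* The union workaround; field order matches little-endian layout. */
static unsigned __int128 make128_union(uint64_t hi, uint64_t lo) {
    union {
        unsigned __int128 i;
        struct { uint64_t lo, hi; } s;
    } u;
    u.s.lo = lo;
    u.s.hi = hi;
    return u.i;
}
```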
[Bug target/100693] New: PPC: missing 64-bit addg6s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100693 Bug ID: 100693 Summary: PPC: missing 64-bit addg6s Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

gcc only provides

unsigned int __builtin_addg6s (unsigned int, unsigned int);

but addg6s is a 64-bit operation. I require

unsigned long long __builtin_addg6s (unsigned long long, unsigned long long);

For now I use inline assembly.
[Bug target/98020] PPC: mfvsrwz+extsw not merged to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020 Jens Seifert changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |INVALID --- Comment #2 from Jens Seifert --- I thought they are symmetric.
[Bug target/98124] New: Z: Load and test LTDBR instruction gets not used for comparison against 0.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98124 Bug ID: 98124 Summary: Z: Load and test LTDBR instruction gets not used for comparison against 0.0 Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

#include <math.h>

double sign(double in)
{
   return in == 0.0 ? 0.0 : copysign(1.0, in);
}

Command line: gcc -m64 -O2 -save-temps copysign.C

Output:

_Z4signd:
.LFB234:
        .cfi_startproc
        larl    %r5,.L8
        lzdr    %f2
        cdbr    %f0,%f2
        je      .L6
        ld      %f2,.L9-.L8(%r5)
        cpsdr   %f0,%f0,%f2
        br      %r14

Use of LTDBR is expected instead of lzdr %f2 + cdbr %f0,%f2.
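The function under discussion is itself straightforward; the report is only about the compare-against-zero codegen. A minimal restatement with its expected values:

```c
#include <assert.h>
#include <math.h>

/* 0.0 for zero input (both +0.0 and -0.0 compare equal to 0.0),
   otherwise +1.0 or -1.0 carrying the input's sign via copysign. */
static double sign(double in) {
    return in == 0.0 ? 0.0 : copysign(1.0, in);
}
```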
[Bug target/98020] New: PPC: mfvsrwz+extsw not merge to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020 Bug ID: 98020 Summary: PPC: mfvsrwz+extsw not merge to mtvsrwa Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

int extract(vector signed int v)
{
   return v[2];
}

Command line: gcc -mcpu=power8 -maltivec -m64 -O3 -save-temps extract.C

Output:

_Z7extractDv4_i:
.LFB0:
        .cfi_startproc
        mfvsrwz 3,34
        extsw 3,3
        blr
[Bug target/70928] Load simple float constants via VSX operations on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70928 Jens Seifert changed: What|Removed |Added CC||jens.seifert at de dot ibm.com --- Comment #4 from Jens Seifert ---

values -16.0..+15.0:
        vspltisw 0,
        xvcvsxwdp 32,32

values -16.0f..+15.0f:
        vspltisw 0,
        xvcvsxwsp 32,32

-0.0 / 0x8000:
        xxlxor 32,32,32
        xvnabsdp 32,32
or
        xvnegdp 32,32

-0.0f / 0x8000:
        xxlxor 32,32,32
        xvnabssp 32,32
or
        xvnegsp 32,32

0x7FFF:
        vspltisw 0,-1
        xvabsdp 32,32

0x7FFF:
        vspltisw 0,-1
        xvabssp 32,32
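The abs/neg-mask entries rely on the IEEE-754 layout: clearing the top bit of a double is fabs, and setting it negates the absolute value. A bit-level sketch (the full 64-bit mask constant, which the comment abbreviates, is written out here as an assumption about the intended value):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AND a double's bit pattern with a mask, as xvabsdp effectively
   does with a 0x7FFF... mask (xvnabsdp instead ORs in the
   0x8000... sign bit). memcpy is the portable way to type-pun. */
static double and_bits(double d, uint64_t mask) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits &= mask;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```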