[Bug target/115355] [12/13/14/15 Regression] vectorization exposes wrong code on P9 LE starting from r12-4496

2024-06-07 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355

--- Comment #10 from Jens Seifert  ---
Does this affect both loop vectorization and SLP vectorization?

-fno-tree-loop-vectorize prevents loop vectorization from being performed and
works around this issue. Does the same problem also affect SLP vectorization,
which does not take place in this sample?

In other words, do I need
-fno-tree-loop-vectorize
or
-fno-tree-vectorize
to work around this bug?

[Bug target/115355] PPCLE: Auto-vectorization creates wrong code for Power9

2024-06-05 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355

--- Comment #1 from Jens Seifert  ---
Same issue with gcc 13.2.1

[Bug target/115355] New: PPCLE: Auto-vectorization creates wrong code for Power9

2024-06-05 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355

Bug ID: 115355
   Summary: PPCLE: Auto-vectorization creates wrong code for
Power9
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input setToIdentity.C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void setToIdentityGOOD(unsigned long long *mVec, unsigned int mLen)
{
  for (unsigned long long i = 0; i < mLen; i++)
  {
mVec[i] = i;
  }
}

void setToIdentityBAD(unsigned long long *mVec, unsigned int mLen)
{
  for (unsigned int i = 0; i < mLen; i++)
  {
mVec[i] = i;
  }
}

unsigned long long vec1[100];
unsigned long long vec2[100];

int main(int argc, char *argv[])
{
  unsigned int l = argc > 1 ? atoi(argv[1]) : 29;
  setToIdentityGOOD(vec1, l);
  setToIdentityBAD(vec2, l);

  if (memcmp(vec1, vec2, l*sizeof(vec1[0])) != 0)
  {
 for (unsigned int i = 0; i < l; i++)
 {
printf("%llu %llu\n", vec1[i], vec2[i]);
 }
  }
  else
  {
 printf("match\n");
  }
  return 0;
}


Fails
gcc -O3 -mcpu=power9 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr9.exe
-mno-isel


Good:
gcc -O3 -mcpu=power8 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr8.exe
-mno-isel

"-mno-isel" is only specified to reduce the diff.


Failing output:

pwr9.exe
0 0
1 1
2 0
3 4294967296
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28

The elements at indices 2 and 3 contain wrong data.
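Note that both loops are well-defined C++: the 32-bit counter cannot wrap before the loop exits, and its zero-extension to the 64-bit stored value matches the 64-bit counter, so the two functions must fill identical arrays. A small Python model of the two loops (an illustration of the semantics, not the reproducer itself) shows the expected agreement:

```python
def set_to_identity_good(m_len):
    # 64-bit induction variable, as in setToIdentityGOOD
    return [i & 0xFFFFFFFFFFFFFFFF for i in range(m_len)]

def set_to_identity_bad(m_len):
    # 32-bit induction variable, zero-extended on each store,
    # as in setToIdentityBAD
    return [i & 0xFFFFFFFF for i in range(m_len)]

# For m_len < 2**32 neither mask ever bites, so both must match.
assert set_to_identity_good(29) == set_to_identity_bad(29)
```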

[Bug target/114376] New: s390: Inefficient __builtin_bswap16

2024-03-18 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114376

Bug ID: 114376
   Summary: s390: Inefficient __builtin_bswap16
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned short swap16(unsigned short in)
{
   return __builtin_bswap16(in);
}

generates -O3 -march=z196

swap16(unsigned short):
lrvr %r2,%r2
srl %r2,16
llghr   %r2,%r2
br  %r14

More efficient for 64-bit is:

unsigned short swap16_2(unsigned short in)
{
   return __builtin_bswap64(in) >> 48;
}

Which generates:

swap16_2(unsigned short):
lrvgr   %r2,%r2
srlg %r2,%r2,48
br  %r14

For 31-bit lrvr should be used.
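The two forms are equivalent: a 16-bit input occupies the low two bytes, bswap64 moves exactly those to the top two bytes, and shifting right by 48 leaves the byte-swapped halfword. A Python sketch of the identity:

```python
def bswap16(x):
    # byte-reverse a 16-bit value, like __builtin_bswap16
    return ((x & 0xFF) << 8) | (x >> 8)

def bswap64(x):
    # byte-reverse a 64-bit value, like __builtin_bswap64
    return int.from_bytes(x.to_bytes(8, "little"), "big")

# __builtin_bswap64(in) >> 48 == __builtin_bswap16(in) for all 16-bit in
for x in range(0x10000):
    assert bswap64(x) >> 48 == bswap16(x)
```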

[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones

2023-08-18 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

--- Comment #10 from Jens Seifert  ---
Looks like no patch in this area has been delivered. I did a small test:

unsigned long long c()
{
return 0xULL;
}

gcc 13.2.0:
li 3,0
ori 3,3,0x
sldi 3,3,32

expected:
li 3, -1
rldic 3, 3, 32, 16

All consecutive ones can be created with li + rldic.

The rotate eliminates the bits on the right and the clear eliminates the bits
on the left, as described below:

  li t,-1
  rldic d,t,63-ME,MB
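A Python model of rldic (rotate left doubleword immediate then clear, with Power's big-endian bit numbering, bit 0 = MSB) confirms that li -1 plus a single rldic can build every mask of consecutive ones; this is my restatement of the ISA semantics, not compiler output:

```python
M64 = (1 << 64) - 1

def rotl64(x, n):
    n &= 63
    return ((x << n) | (x >> ((64 - n) & 63))) & M64

def mask_be(mb, me):
    # ones in big-endian bit positions mb..me, as in the ISA mask notation
    return sum(1 << (63 - b) for b in range(mb, me + 1))

def rldic(s, sh, mb):
    # rotate left by sh, then AND with mask(mb, 63-sh)
    return rotl64(s, sh) & mask_be(mb, 63 - sh)

# li t,-1 ; rldic d,t,63-ME,MB produces ones exactly in bits MB..ME
for mb in range(64):
    for me in range(mb, 64):
        assert rldic(M64, 63 - me, mb) == mask_be(mb, me)

# the expected sequence above: rldic 3,3,32,16 on -1 keeps bits 16..31
assert rldic(M64, 32, 16) == 0x0000FFFF00000000
```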

[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones

2023-08-16 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

--- Comment #7 from Jens Seifert  ---
What happened? Still waiting for an improvement.

[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd

2023-02-27 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #6 from Jens Seifert  ---
The left part of a VSX register overlaps with a floating point register; that
is why no xxpermdi is required and mfvsrd can access the (left) part of all
VSX registers directly.
The xxpermdi x,y,y,3 indicates to me that gcc prefers the right part of the
register, which might also cause the xxpermdi at the beginning. In the end the
mystery is why gcc adds 3 xxpermdi instructions to the code.

[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd

2023-02-27 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #4 from Jens Seifert  ---
PPCLE with no special options means -mcpu=power8 -maltivec (altivecle to be
more precise).

vec_promote(, 1) should be a no-op on ppcle. But the value gets
splatted to both the left and right part of the vector register. => 2
unnecessary xxpermdi
The rest of the operations are done on the left and right parts.

vec_extract(, 1) should be a no-op on ppcle. But the value gets
taken from the right part of the register, which requires an xxpermdi.

Overall 3 unnecessary xxpermdi. I don't know why the right part of the
register gets "preferred".

[Bug c++/108560] New: builtin_va_arg_pack_len is documented to return size_t, but actually returns int

2023-01-26 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108560

Bug ID: 108560
   Summary: builtin_va_arg_pack_len is documented to return
size_t, but actually returns int
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <stddef.h>

bool test(const char *fmt, size_t numTokens, ...)
{
return __builtin_va_arg_pack_len() != numTokens;
}

Compiled with -Wsign-compare results in:
: In function 'bool test(const char*, size_t, ...)':
:5:40: warning: comparison of integer expressions of different
signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
5 | return __builtin_va_arg_pack_len() != numTokens;
  |^~~~
:5:37: error: invalid use of '__builtin_va_arg_pack_len ()'
5 | return __builtin_va_arg_pack_len() != numTokens;
  |~^~
Compiler returned: 1

Documentation:
https://gcc.gnu.org/onlinedocs/gcc/Constructing-Calls.html
indicates a size_t return type
Built-in Function: size_t __builtin_va_arg_pack_len ()

[Bug target/108396] New: PPCLE: vec_vsubcuq missing

2023-01-13 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

Bug ID: 108396
   Summary: PPCLE: vec_vsubcuq missing
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

#include <altivec.h>

vector unsigned __int128 vsubcuq(vector unsigned __int128 a, vector unsigned
__int128 b)
{
return vec_vsubcuq(a, b);
}

Command line:
gcc -m64 -O2 -maltivec -mcpu=power8 text.C

Output:
: In function '__vector unsigned __int128 vsubcuq(__vector unsigned
__int128, __vector unsigned __int128)':
:6:12: error: 'vec_vsubcuq' was not declared in this scope; did you
mean 'vec_vsubcuqP'?
6 | return vec_vsubcuq(a, b);
  |^~~
  |vec_vsubcuqP
Compiler returned: 1

[Bug target/108049] s390: Compiler adds extra zero extend after xoring 2 zero extended values

2022-12-10 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049

--- Comment #1 from Jens Seifert  ---
Sample above got compiled with -march=z196

[Bug target/108049] New: s390: Compiler adds extra zero extend after xoring 2 zero extended values

2022-12-10 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049

Bug ID: 108049
   Summary: s390: Compiler adds extra zero extend after xoring 2
zero extended values
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Same issue for PPC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

hash(unsigned char const*):
llgc %r4,1(%r2) <= zero extends to 64-bit
lgrl %r1,.LC0
llgc %r3,0(%r2) <= zero extends to 64-bit
xilf %r3,1
llgc %r3,0(%r3,%r1)
xr %r3,%r4 <= should be 64-bit xor
llgc %r4,2(%r2) <= zero extends to 64-bit
llgcr %r3,%r3 <= unnecessary
llgc %r2,3(%r2)
llgc %r3,0(%r3,%r1)
xr %r3,%r4 <= should be 64-bit xor
llgcr %r3,%r3 <= unnecessary
llgc %r3,0(%r3,%r1) <= zero extends to 64-bit
xrk %r2,%r3,%r2 <= should be 64-bit xor
llgcr %r2,%r2 <= unnecessary
llgc %r2,0(%r2,%r1)
br %r14

Smaller sample:
unsigned long long tiny2(const unsigned char *inp)
{
  unsigned long long a = inp[0];
  unsigned long long b = inp[1];
  return a ^ b;
}

tiny2(unsigned char const*):
llgc %r1,0(%r2)
llgc %r2,1(%r2)
xrk %r2,%r1,%r2
llgcr %r2,%r2
br %r14

[Bug rtl-optimization/107949] PPC: Unnecessary rlwinm after lbzx

2022-12-10 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

--- Comment #3 from Jens Seifert  ---
*** Bug 108048 has been marked as a duplicate of this bug. ***

[Bug target/108048] PPCLE: gcc does not recognize that lbzx does zero extend

2022-12-10 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

Jens Seifert  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Jens Seifert  ---
duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

*** This bug has been marked as a duplicate of bug 107949 ***

[Bug target/108048] New: PPCLE: gcc does not recognize that lbzx does zero extend

2022-12-10 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

Bug ID: 108048
   Summary: PPCLE: gcc does not recognize that lbzx does zero
extend
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

Generates:

hash(unsigned char const*):
.LCF0:
addi 2,2,.TOC.-.LCF0@l
lbz 9,0(3)
addis 10,2,.LC0@toc@ha
ld 10,.LC0@toc@l(10)
lbz 6,1(3)
lbz 7,2(3)
lbz 8,3(3)
xori 9,9,0x1
lbzx 9,10,9
xor 9,9,6
rlwinm 9,9,0,0xff <= unnecessary
lbzx 9,10,9
xor 9,9,7
rlwinm 9,9,0,0xff <= unnecessary
lbzx 9,10,9
xor 9,9,8
rlwinm 9,9,0,0xff <= unnecessary
lbzx 3,10,9
blr


All XOR operations are done on unsigned long long (64-bit). gcc adds an
unnecessary rlwinm. lbz and lbzx do zero extension, so no cleanup of the upper
bits is required.

[Bug target/107949] PPC: Unnecessary rlwinm after lbzx

2022-12-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

--- Comment #1 from Jens Seifert  ---
hash2 is only provided to show how the code should look (without rlwinm).

[Bug target/107949] New: PPC: Unnecessary rlwinm after lbzx

2022-12-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

Bug ID: 107949
   Summary: PPC: Unnecessary rlwinm after lbzx
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = magic1[((unsigned long long)inp[0]) ^ h1];
   h1 = magic1[((unsigned long long)inp[1]) ^ h1];
   h1 = magic1[((unsigned long long)inp[2]) ^ h1];
   h1 = magic1[((unsigned long long)inp[3]) ^ h1];
   return h1;
}

#ifdef __powerpc__
#define lbzx(b,c) ({ unsigned long long r; __asm__("lbzx %0,%1,%2":"=r"(r):"b"(b),"r"(c)); r; })
unsigned int hash2(const unsigned char inp[4])
{
   const unsigned long long INIT = 0x1ULL;
   unsigned long long h1 = INIT;
   h1 = lbzx(magic1, inp[0] ^ h1);
   h1 = lbzx(magic1, inp[1] ^ h1);
   h1 = lbzx(magic1, inp[2] ^ h1);
   h1 = lbzx(magic1, inp[3] ^ h1);
   return h1;
}
#endif

Extra rlwinm instructions get added.

hash(unsigned char const*):
.LCF0:
addi 2,2,.TOC.-.LCF0@l
lbz 9,0(3)
addis 10,2,.LC0@toc@ha
ld 10,.LC0@toc@l(10)
lbz 6,1(3)
lbz 7,2(3)
lbz 8,3(3)
xori 9,9,0x1
lbzx 9,10,9
xor 9,9,6
rlwinm 9,9,0,0xff <= not necessary
lbzx 9,10,9
xor 9,9,7
rlwinm 9,9,0,0xff <= not necessary
lbzx 9,10,9
xor 9,9,8
rlwinm 9,9,0,0xff <= not necessary
lbzx 3,10,9
blr
.long 0
.byte 0,9,0,0,0,0,0,0
hash2(unsigned char const*):
.LCF1:
addi 2,2,.TOC.-.LCF1@l
lbz 7,0(3)
lbz 8,1(3)
lbz 10,2(3)
lbz 6,3(3)
addis 9,2,.LC1@toc@ha
ld 9,.LC1@toc@l(9)
xori 7,7,0x1
lbzx 7,9,7
xor 8,8,7
lbzx 8,9,8
xor 10,10,8
lbzx 10,9,10
xor 10,6,10
lbzx 3,9,10
rldicl 3,3,0,32
blr

Tiny sample:
unsigned long long tiny(const unsigned char *inp)
{
  return inp[0] ^ inp[1];
}

tiny(unsigned char const*):
lbz 9,0(3)
lbz 10,1(3)
xor 3,9,10
rlwinm 3,3,0,0xff
blr
.long 0
.byte 0,9,0,0,0,0,0,0

unsigned long long tiny2(const unsigned char *inp)
{
  unsigned long long a = inp[0];
  unsigned long long b = inp[1];
  return a ^ b;
}

tiny2(unsigned char const*):
lbz 9,0(3)
lbz 10,1(3)
xor 3,9,10
rlwinm 3,3,0,0xff
blr
.long 0
.byte 0,9,0,0,0,0,0,0

lbz/lbzx create a value 0 <= x < 256. An xor of two such values does not
change the value range.
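The range argument is simple: both operands have all bits above bit 7 clear, and xor can never set a bit that is clear in both inputs. A quick exhaustive Python check:

```python
# lbz/lbzx zero-extend, so both xor operands lie in [0, 256);
# xor never sets a bit clear in both inputs, so the result does too,
# making the rlwinm (and the s390 llgcr) masking redundant.
def byte_xor_in_range():
    return all(0 <= (a ^ b) < 256 for a in range(256) for b in range(256))

assert byte_xor_in_range()
```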

[Bug target/107757] New: PPCLE: Inefficient vector constant creation

2022-11-18 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107757

Bug ID: 107757
   Summary: PPCLE: Inefficient vector constant creation
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Because vslw, vsld, vsrd, ... use only the shift amount modulo the element bit
width, combining them with an all-ones (0xFF..FF) vector can create vector
constants for:
vec_splats(-0.0) or vec_splats(1ULL << 63) and scalar -0.0
vec_splats(-0.0f) or vec_splats(1U << 31)
vec_splats((short)0x8000)
with only 2 2-cycle vector instructions.

Sample:

vector long long lsb64()
{
   return vec_splats(1LL);
}

creates:

lsb64():
.LCF5:
addi 2,2,.TOC.-.LCF5@l
addis 9,2,.LC12@toc@ha
addi 9,9,.LC12@toc@l
lvx 2,0,9
blr
.long 0
.byte 0,9,0,0,0,0,0,0

while:

vector long long lsb64_opt()
{
   vector long long a = vec_splats(~0LL);
   __asm__("vsrd %0,%0,%0":"=v"(a):"v"(a),"v"(a));
   return a;
}

creates:
lsb64_opt():
vspltisw 2,-1
vsrd 2,2,2
blr
.long 0
.byte 0,9,0,0,0,0,0,0
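The trick works because the vector shifts read only the low log2(width) bits of each shift-count element: an all-ones element supplies a count of 63 (doubleword) or 31 (word), and shifting the all-ones value by that count leaves exactly one bit. A Python check of the scalar arithmetic behind it:

```python
M64 = (1 << 64) - 1
ones = M64  # each doubleword of the vspltisw 2,-1 result

# vsrd/vsld use only the low 6 bits of the count: all-ones mod 64 = 63
count = ones & 63
assert count == 63
assert ones >> count == 1                    # lsb64: vec_splats(1LL)
assert (ones << count) & M64 == 1 << 63      # sign bit: vec_splats(-0.0)

# the same idea for 32-bit elements with vsrw/vslw
M32 = (1 << 32) - 1
assert (M32 << (M32 & 31)) & M32 == 1 << 31  # vec_splats(-0.0f) pattern
```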

[Bug target/86160] Implement isinf on PowerPC

2022-11-08 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86160

--- Comment #4 from Jens Seifert  ---
I am looking forward to get Power9 optimization using xststdcdp etc.

[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd

2022-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #2 from Jens Seifert  ---
vec_extract(vr, 1) should extract the left element. But xxpermdi x,x,x,3
extracts the right element.
Looks like a bug in vec_extract for PPCLE and not a problem regarding
unnecessary xxpermdi.

Using assembly for the subtract:
int cmp3(double a, double b)
{
vector double va = vec_promote(a, 0);
vector double vb = vec_promote(b, 0);
vector long long vlt = (vector long long)vec_cmplt(va, vb);
vector long long vgt = (vector long long)vec_cmplt(vb, va);
vector signed long long vr;
__asm__ volatile("vsubudm %0,%1,%2":"=v"(vr):"v"(vlt),"v"(vgt):);
//vector signed long long vr = vec_sub(vlt, vgt);

return vec_extract(vr, 1);
}

generates:

_Z4cmp3dd:
.LFB2:
.cfi_startproc
xxpermdi 1,1,1,0
xxpermdi 2,2,2,0
xvcmpgtdp 32,2,1
xvcmpgtdp 33,1,2
#APP
 # 34 "cmpdouble.C" 1
vsubudm 0,0,1
 # 0 "" 2
#NO_APP
mfvsrd 3,32
extsw 3,3

Looks like the compiler knows that vec_promote does a splat, and at the end it
extracts the non-preferred right element instead of the expected left element.

[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd

2022-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #1 from Jens Seifert  ---
vec_extract(vr, 1) should extract the left element. But xxpermdi x,x,x,3
extracts the right element.
Looks like a bug in vec_extract for PPCLE and not a problem regarding
unnecessary xxpermdi.

[Bug target/106770] New: PPCLE: Unnecessary xxpermdi before mfvsrd

2022-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

Bug ID: 106770
   Summary: PPCLE: Unnecessary xxpermdi before mfvsrd
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <altivec.h>

int cmp2(double a, double b)
{
vector double va = vec_promote(a, 1);
vector double vb = vec_promote(b, 1);
vector long long vlt = (vector long long)vec_cmplt(va, vb);
vector long long vgt = (vector long long)vec_cmplt(vb, va);
vector signed long long vr = vec_sub(vlt, vgt);

return vec_extract(vr, 1);
}

Generates:

_Z4cmp2dd:
.LFB1:
.cfi_startproc
xxpermdi 1,1,1,0
xxpermdi 2,2,2,0
xvcmpgtdp 33,2,1
xvcmpgtdp 32,1,2
vsubudm 0,1,0
xxpermdi 0,32,32,3
mfvsrd 3,0
extsw 3,3
blr

The unnecessary xxpermdi for vec_promote are already reported in another
bugzilla case.

mfvsrd can access all 64 vector registers directly, so the xxpermdi is not
required.
mfvsrd 3,32 is expected instead of xxpermdi 0,32,32,3 + mfvsrd 3,0.

[Bug target/106769] New: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz

2022-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

Bug ID: 106769
   Summary: PPCLE: vec_extract(vector unsigned int) unnecessary
rldicl after mfvsrwz
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <altivec.h>

unsigned int extr(vector unsigned int v)
{
   return vec_extract(v, 2);
}

Generates:

_Z4extrDv4_j:
.LFB1:
.cfi_startproc
mfvsrwz 3,34
rldicl 3,3,0,32
blr
.long 0
.byte 0,9,0,0,0,0,0,0
.cfi_endproc


The rldicl is not necessary, as mfvsrwz already zeroes the upper 32 bits of
the register.

[Bug target/106701] New: s390: Compiler does not take into account number range limitation to avoid subtract from immediate

2022-08-21 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

Bug ID: 106701
   Summary: s390: Compiler does not take into account number range
limitation to avoid subtract from immediate
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long subfic(unsigned long long a)
{
if (a > 15) __builtin_unreachable();
return 15 - a;
}

With clang on x86 subtract from immediate gets translated to xor:
_Z6subficy: # @_Z6subficy
mov rax, rdi
xor rax, 15
ret
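The rewrite relies on the range restriction: subtracting a from an all-ones low mask can never borrow, so the subtraction degenerates to bit inversion. A Python check of the identity clang exploits:

```python
# 15 is 0b1111; for 0 <= a <= 15, 15 - a clears exactly the set bits
# of a, so it equals 15 ^ a and needs no real subtraction.
def subfic_as_xor_holds():
    return all(15 - a == (15 ^ a) for a in range(16))

assert subfic_as_xor_holds()
```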

Platforms like s390 and x86, which have no subtract-from-immediate
instruction, would benefit from this optimization:

gcc currently generates:
_Z6subficy:
lghi %r1,15
sgr %r1,%r2
lgr %r2,%r1
br  %r14

[Bug target/106598] New: s390: Inefficient branchless conditionals for int

2022-08-12 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106598

Bug ID: 106598
   Summary: s390: Inefficient branchless conditionals for int
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int lt(int a, int b)
{
return a < b;
}

generates:
cr  %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl  %r1,%r2
lgfr %r2,%r1
br  %r14

int ltOpt(int a, int b)
{
long long x = a;
long long y = b;
return ((unsigned long long)(x - y)) >> 63;
}

better:
sgr %r2,%r3
srlg %r2,%r2,63
br  %r14

int ltMask(int a, int b)
{
return -(a < b);
}

generates:
cr  %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl  %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63


int ltMaskOpt(int a, int b)
{
long long x = a;
long long y = b;
return (x - y) >> 63;
}

better:
sgr %r2,%r3
srag %r2,%r2,63
br  %r14
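Both rewrites rest on the same fact: after widening 32-bit ints to 64 bits, x - y cannot overflow, so bit 63 of the difference is exactly the a < b flag. A logical shift extracts it as 0/1, an arithmetic shift smears it into a 0/-1 mask. A Python model (masks stand in for the 64-bit registers):

```python
from itertools import product

M64 = (1 << 64) - 1

def lt_opt(a, b):
    # ((unsigned long long)(x - y)) >> 63 with x, y sign-extended ints
    return ((a - b) & M64) >> 63

def lt_mask_opt(a, b):
    # (x - y) >> 63 with an arithmetic shift: result is 0 or -1
    return -(((a - b) & M64) >> 63)

vals = [-2**31, -2, -1, 0, 1, 2, 2**31 - 1]
for a, b in product(vals, vals):
    assert lt_opt(a, b) == int(a < b)
    assert lt_mask_opt(a, b) == -int(a < b)
```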

int leMask(int a, int b)
{
return -(a <= b);
}

generates:
cr  %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63
br  %r14

int leMaskOpt(int a, int b)
{
   int c;
   __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
   // slbgr create a 64-bit mask => lgfr would not be required
   return c;
}

better:
cr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2 <= not necessary
br  %r14


int le(int a, int b)
{
return a <= b;
}

generates:
cr  %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
lgfr %r2,%r1
br  %r14

int leOpt(int a, int b)
{
   unsigned long long c;
   __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
   return (c >> 63);
}

better:
cr %r2,%r3
slbgr %r2,%r2
srlg %r2,%r2,63
br  %r14

[Bug target/106592] New: s390: Inefficient branchless conditionals for long long

2022-08-12 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106592

Bug ID: 106592
   Summary: s390: Inefficient branchless conditionals for long
long
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 53443
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53443&action=edit
source code

long long gtRef(long long a, long long b)
{
   return a > b;
}

Generates:

cgr %r2,%r3
lghi %r1,0
lghi %r2,1
locgrnh %r2,%r1

Better sequence:
cgr %r2,%r3
lghi %r2,0
alcgr %r2,%r2


long long leMaskRef(long long a, long long b)
{
   return -(a <= b);
}

Generates:

cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnle %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:

cgr %r2,%r3
slbgr %r2,%r2

long long gtMaskRef(long long a, long long b)
{
   return -(a > b);
}

Generates:
cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnh  %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:
cgr   %r2,%r3
lghi  %r2,0
alcgr %r2,%r2
lcgr  %r2,%r2

[Bug target/106536] New: P9: gcc does not detect setb pattern

2022-08-05 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106536

Bug ID: 106536
   Summary: P9: gcc does not detect setb pattern
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int compare2(unsigned long long a, unsigned long long b)
{
return (a > b ? 1 : (a < b ? -1 : 0));
}

Output:
_Z8compare2yy:
cmpld 0,3,4
bgt 0,.L5
mfcr 3,128
rlwinm 3,3,1,1
neg 3,3
blr
.L5:
li 3,1
blr
.long 0
.byte 0,9,0,0,0,0,0,0

clang generates:

_Z8compare2yy:  # @_Z8compare2yy
cmpld   3, 4
setb 3, 0
extsw 3, 3
blr
.long   0
.quad   0
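For reference, setb reads one CR field and produces 1/-1/0 directly, which is exactly the three-way compare idiom. A minimal Python model of the pattern:

```python
def setb(lt, gt):
    # setb: -1 if the CR field's LT bit is set, 1 if GT is set, else 0
    return -1 if lt else (1 if gt else 0)

def compare2(a, b):
    # cmpld sets LT/GT from the unsigned compare; a single setb then
    # replaces the branch plus mfcr/rlwinm/neg sequence
    return setb(a < b, a > b)

assert compare2(2**64 - 1, 1) == 1
assert compare2(0, 5) == -1
assert compare2(7, 7) == 0
```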

[Bug target/106525] New: s390: Inefficient branchless conditionals for unsigned long long

2022-08-04 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106525

Bug ID: 106525
   Summary: s390: Inefficient branchless conditionals for unsigned
long long
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 53409
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53409&action=edit
source code

1)  -(a > b)

clgr %r2,%r3
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

Last 2 could be merged to LCDFR. But optimal is:

slgrk %r2,%r3,%r2
slbgr %r2,%r2
lgfr  %r2,%r2
Note: lgfr is not required => 2 instructions only.

2) -(a <= b)

slgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

Last 2 could be merged to LCDFR. But optimal is:

clgr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

3) unsigned 64-bit compare for qsort (a > b) - (a < b)

clgr %r2,%r3
lhi %r1,0
alcr %r1,%r1
clgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
srk %r2,%r1,%r2
lgfr %r2,%r2

Optimal:
slgrk %r1,%r2,%r3
slgrk 0,%r3,%r2
slbgr %r2,%r3
slbgr %r1,%r2
lgfr  %r2,%r1

Note: lgfr not required => 4 instructions only

[Bug target/106043] Power10: lacking vec_blendv builtins

2022-07-13 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

Jens Seifert  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #2 from Jens Seifert  ---
Also found in altivec.h

[Bug target/106043] Power10: lacking vec_blendv builtins

2022-07-13 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

--- Comment #1 from Jens Seifert  ---
Found in documentation:

https://gcc.gnu.org/onlinedocs/gcc-11.3.0/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html#PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1

[Bug c/106043] New: Power10: lacking vec_blendv builtins

2022-06-21 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

Bug ID: 106043
   Summary: Power10: lacking vec_blendv builtins
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Missing builtins for the vector instructions xxblendvb, xxblendvh, xxblendvw,
xxblendvd.


#include <altivec.h>

vector int blendv(vector int a, vector int b, vector int c)
{
return vec_blendv(a, b, c);
}

[Bug target/104268] New: 390: inefficient vec_popcnt for 16-bit for z13

2022-01-28 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104268

Bug ID: 104268
   Summary: 390: inefficient vec_popcnt for 16-bit for z13
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include <vecintrin.h>

vector unsigned short popcnt(vector unsigned short a)
{
   return vec_popcnt(a);
}

Generates with -march=z13

_Z6popcntDv8_t:
.LFB1:
.cfi_startproc
vzero   %v0
vpopct  %v24,%v24,0
vleib   %v0,8,7
vsrlb   %v0,%v24,%v0
vab %v24,%v24,%v0
vgbm %v0,21845
vn  %v24,%v24,%v0
br  %r14
.cfi_endproc


Optimal sequence would be:
vector unsigned short popcnt_opt(vector unsigned short a)
{
   vector unsigned short r = (vector unsigned short)vec_popcnt((vector unsigned
char)a);
   vector unsigned short b = vec_rli(r, 8);
   r = r + b;
   r = r >> 8;
   return r;
}

_Z10popcnt_optDv8_t:
.LFB3:
.cfi_startproc
vpopct  %v24,%v24,0
verllh  %v0,%v24,8
vah %v24,%v0,%v24
vesrlh  %v24,%v24,8
br  %r14
.cfi_endproc
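The optimized sequence counts bits per byte, then folds the two byte counts of each halfword: adding the byte-rotated copy puts lo+hi in the high byte (each count is at most 8, so no carry crosses the byte boundary), and the logical halfword shift right by 8 exposes it. A Python model of a single halfword lane:

```python
def popcnt(x):
    return bin(x).count("1")

def rotl16(x, n):
    return ((x << n) | (x >> (16 - n))) & 0xFFFF

def popcnt16_opt(a):
    # vpopct on the two byte elements of the halfword
    p = popcnt(a & 0xFF) | (popcnt(a >> 8) << 8)
    r = (p + rotl16(p, 8)) & 0xFFFF  # verllh by 8, then vah
    return r >> 8                    # vesrlh by 8

for a in range(0x10000):
    assert popcnt16_opt(a) == popcnt(a)
```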

[Bug target/103743] New: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part

2021-12-15 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103743

Bug ID: 103743
   Summary: PPC: Inefficient equality compare for large 64-bit
constants having only 16-bit relevant bits in high
part
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int overflow();
int negOverflow(long long in)
{
   if (in == 0x8000LL)
   {
  return overflow();
   }
   return 0;
}

Generates:
negOverflow(long long):
.quad   .L.negOverflow(long long),.TOC.@tocbase,0
.L.negOverflow(long long):
li 9,-1
rldicr 9,9,0,0
cmpd 0,3,9
beq 0,.L10
li 3,0
blr
.L10:
mflr 0
std 0,16(1)
stdu 1,-112(1)
bl overflow()
nop
addi 1,1,112
ld 0,16(1)
mtlr 0
blr
.long 0
.byte 0,9,0,1,128,0,0,0

Instead of:
li 9,-1
rldicr 9,9,0,0
cmpd 0,3,9

Expected output:
rotldi 3,3,1
cmpdi 0,3,1

This should only be applied if the constant fits into 16 bits and those 16
bits are in the first 32 bits.

[Bug target/103731] New: 390: inefficient 64-bit constant generation

2021-12-15 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103731

Bug ID: 103731
   Summary: 390: inefficient 64-bit constant generation
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long M8()
{
   return 0x;
}

Generates:

.LC0:
.quad   0x
.text
.align  8
.globl _Z2M8v
.type   _Z2M8v, @function
_Z2M8v:
.LFB0:
.cfi_startproc
lgrl %r2,.LC0
br  %r14
.cfi_endproc

Expected 2 instructions:
load immediate + insert immediate (IIHF) instead of a literal-pool load

[Bug target/103106] New: PPC: Missing builtin for P9 vmsumudm

2021-11-06 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103106

Bug ID: 103106
   Summary: PPC: Missing builtin for P9 vmsumudm
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

I can't find a builtin for the vmsumudm instruction.

I also found nothing in the Power Vector Intrinsic Programming Reference.
https://openpowerfoundation.org/?resource_lib=power-vector-intrinsic-programming-reference

[Bug target/102265] New: s390: Inefficient code for __builtin_ctzll

2021-09-09 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102265

Bug ID: 102265
   Summary: s390: Inefficient code for __builtin_ctzll
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned long long ctzll(unsigned long long x)
{
   return __builtin_ctzll(x);
}

creates:
lcgr %r1,%r2
ngr %r2,%r1
lghi %r1,63
flogr %r2,%r2
sgrk %r2,%r1,%r2
lgfr %r2,%r2
br %r14


The optimal sequence for z15 uses population count; all others should use
x ^ 63 instead of 63 - x.

unsigned long long ctzll_opt(unsigned long long x)
{
#if __ARCH__ >= 13
   return __builtin_popcountll((x-1) & ~x);
#else
   return __builtin_clzll(x & -x) ^ 63;
#endif
}

< z15:
lcgr%r1,%r2
ngr %r2,%r1
flogr   %r2,%r2
xilf%r2,63
lgfr%r2,%r2
br  %r14

=> 1 instruction saved.

z15:
.cfi_startproc
lay %r1,-1(%r2)
ncgrk   %r2,%r1,%r2
popcnt  %r2,%r2,8
br  %r14
.cfi_endproc

=> On z15 only 3 instructions required.
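Both suggested sequences rest on simple bit identities, which can be sanity-checked in portable C (a sketch; the function names are illustrative):

```c
#include <assert.h>

/* ctz(x) == clz(x & -x) ^ 63 for nonzero x: isolating the lowest set
   bit and counting leading zeros locates the same bit position. */
static int ctz_via_clz(unsigned long long x)
{
    return __builtin_clzll(x & -x) ^ 63;
}

/* ctz(x) == popcount((x-1) & ~x) for nonzero x: (x-1) & ~x is a mask
   of exactly the trailing-zero bits of x. */
static int ctz_via_popcount(unsigned long long x)
{
    return __builtin_popcountll((x - 1) & ~x);
}
```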

[Bug target/102117] s390: Inefficient code for 64x64=128 signed multiply for <= z13

2021-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117

--- Comment #1 from Jens Seifert  ---
Sorry, there was a small bug in the optimal sequence.

__int128 imul128_opt(long long a, long long b)
{
   unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
   unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
   unsigned long long t1 = (a >> 63) & b;
   unsigned long long t2 = (b >> 63) & a;
   unsigned __int128 u128 = x * y;
   unsigned long long hi = (u128 >> 64) - (t1 + t2);
   unsigned long long lo = (unsigned long long)u128;
   unsigned __int128 res = hi;
   res <<= 64;
   res |= lo;
   return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
.cfi_startproc
ldgr%f2,%r12
.cfi_register 12, 17
ldgr%f0,%r13
.cfi_register 13, 16
lgr %r13,%r3
mlgr%r12,%r4
srag%r1,%r3,63
ngr %r1,%r4
srag%r4,%r4,63
ngr %r4,%r3
agr %r4,%r1
sgrk%r4,%r12,%r4
stg %r13,8(%r2)
lgdr%r12,%f2
.cfi_restore 12
lgdr%r13,%f0
.cfi_restore 13
stg %r4,0(%r2)
br  %r14
.cfi_endproc
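The correction-term identity behind this sequence (with the corrected t1 = (a>>63)&b and t2 = (b>>63)&a) can be verified in portable C, since the signed 128-bit product equals the unsigned product minus (t1 + t2) shifted into the high word (a quick sanity sketch using GCC's __int128; the helper name is made up):

```c
#include <assert.h>

typedef __int128 i128;
typedef unsigned __int128 u128;

/* Signed 64x64->128 multiply built from one unsigned multiply plus the
   sign-correction terms, mirroring the sequence in the report. */
static i128 imul128_ref(long long a, long long b)
{
    u128 x = (unsigned long long)a;
    u128 y = (unsigned long long)b;
    unsigned long long t1 = (unsigned long long)(a >> 63) & (unsigned long long)b;
    unsigned long long t2 = (unsigned long long)(b >> 63) & (unsigned long long)a;
    u128 u = x * y;                                   /* one mlgr-style multiply */
    unsigned long long hi = (unsigned long long)(u >> 64) - (t1 + t2);
    unsigned long long lo = (unsigned long long)u;
    return (i128)(((u128)hi << 64) | lo);
}
```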

[Bug target/102117] New: s390: Inefficient code for 64x64=128 signed multiply for <= z13

2021-08-29 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117

Bug ID: 102117
   Summary: s390: Inefficient code for 64x64=128 signed multiply
for <= z13
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

__int128 imul128(long long a, long long b)
{
   return (__int128)a * (__int128)b;
}

creates sequence with 3 multiplies:

_Z7imul128xx:
.LFB0:
.cfi_startproc
ldgr%f2,%r12
.cfi_register 12, 17
ldgr%f0,%r13
.cfi_register 13, 16
lgr %r13,%r3
mlgr%r12,%r4
srag%r1,%r3,63
msgr%r1,%r4
srag%r4,%r4,63
msgr%r4,%r3
agr %r4,%r1
agr %r12,%r4
stmg%r12,%r13,0(%r2)
lgdr%r13,%f0
.cfi_restore 13
lgdr%r12,%f2
.cfi_restore 12
br  %r14
.cfi_endproc


The following sequence only requires 1 multiply:

__int128 imul128_opt(long long a, long long b)
{
   unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
   unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
   unsigned long long t1 = (a >> 63) & a;
   unsigned long long t2 = (b >> 63) & b;
   unsigned __int128 u128 = x * y;
   unsigned long long hi = (u128 >> 64) - (t1 + t2);
   unsigned long long lo = (unsigned long long)u128;
   unsigned __int128 res = hi;
   res <<= 64;
   res |= lo;
   return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
.cfi_startproc
ldgr%f2,%r12
.cfi_register 12, 17
ldgr%f0,%r13
.cfi_register 13, 16
lgr %r13,%r3
mlgr%r12,%r4
lgr %r1,%r3
srag%r3,%r3,63
ngr %r3,%r1
srag%r1,%r4,63
ngr %r4,%r1
agr %r3,%r4
sgrk%r3,%r12,%r3
stg %r13,8(%r2)
lgdr%r12,%f2
.cfi_restore 12
lgdr%r13,%f0
.cfi_restore 13
stg %r3,0(%r2)
br  %r14
.cfi_endproc

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-20 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #9 from Jens Seifert  ---
I know that if I use the vec_perm builtin as an end user, you then need to
conform to the LE specification, but you can always optimize the code as you
like as long as it produces correct results afterwards.

load constant
xxlnor constant

can always be transformed to 

load inverse constant.

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-18 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #7 from Jens Seifert  ---
Regarding vec_revb for vector unsigned int. I agree that
revb:
.LFB0:
.cfi_startproc
vspltish %v1,8
vspltisw %v0,-16
vrlh %v2,%v2,%v1
vrlw %v2,%v2,%v0
blr

works. But in this case, I would prefer the vperm approach assuming that the
loaded constant for the permute vector can be re-used multiple times.
But please get rid of the xxlnor 32,32,32; that does not make sense right after
loading a constant. Change the constant that needs to be loaded instead.

[Bug target/101041] New: z13: Inefficient handling of vector register passed to function

2021-06-12 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101041

Bug ID: 101041
   Summary: z13: Inefficient handling of vector register passed to
function
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include 
vector unsigned long long mul64(vector unsigned long long a, vector unsigned
long long b)
{
   return a * b;
}

creates:
_Z5mul64Dv2_yS_:
.LFB9:
.cfi_startproc
ldgr%f4,%r15
.cfi_register 15, 18
lay %r15,-192(%r15)
.cfi_def_cfa_offset 352
vst %v24,160(%r15),3
vst %v26,176(%r15),3
lg  %r2,160(%r15)
lg  %r1,176(%r15)
lgr %r4,%r2
lg  %r0,168(%r15)
lgr %r2,%r1
lg  %r1,184(%r15)
lgr %r5,%r0
lgr %r3,%r1
vlvgp   %v2,%r4,%r5
vlvgp   %v0,%r2,%r3
vlgvg   %r4,%v2,0
vlgvg   %r1,%v2,1
vlgvg   %r2,%v0,0
vlgvg   %r3,%v0,1
msgr%r2,%r4
msgr%r1,%r3
lgdr%r15,%f4
.cfi_restore 15
.cfi_def_cfa_offset 160
vlvgp   %v24,%r2,%r1
br  %r14

v24 and v26 are stored to the stack, then lg+lgr load all 4 parts, new vector
registers v0 and v2 are constructed, and the 4 elements are then extracted
again using vlgvg.

Expected 4 * vlgvg + 2 * msgr + vlvgp
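For reference, the intended operation is just a lane-wise 64-bit multiply, which GCC's generic vector extension expresses directly (a portable sketch; on s390 the expectation is that this lowers to the vlgvg/msgr/vlvgp pattern rather than a round trip through the stack):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t v2u64 __attribute__((vector_size(16)));

/* Lane-wise multiply of two 2x64-bit vectors via GCC vector extensions. */
static v2u64 mul64_ref(v2u64 a, v2u64 b)
{
    return a * b;
}
```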

[Bug target/100930] New: PPC: Missing builtins for P9 vextsb2w, vextsb2w, vextsb2d, vextsh2d, vextsw2d

2021-06-06 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100930

Bug ID: 100930
   Summary: PPC: Missing builtins for P9 vextsb2w, vextsb2w,
vextsb2d, vextsh2d, vextsw2d
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Using the same names as xlC would be appreciated:
vec_extsbd, vec_extsbw, vec_extshd, vec_extshw, vec_extswd

[Bug target/100926] New: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9

2021-06-05 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100926

Bug ID: 100926
   Summary: PPCLE: Inefficient code for vec_xl_be(unsigned short
*) < P9
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

vector unsigned short load_be(unsigned short *c)
{
   return vec_xl_be(0L, c);
}

creates:
_Z7load_bePt:
.LFB6:
.cfi_startproc
.LCF6:
0:  addis 2,12,.TOC.-.LCF6@ha
addi 2,2,.TOC.-.LCF6@l
.localentry _Z7load_bePt,.-_Z7load_bePt
addis 9,2,.LC4@toc@ha
lxvw4x 34,0,3
addi 9,9,.LC4@toc@l
lvx 0,0,9
vperm 2,2,2,0
blr


Optimal sequence:

vector unsigned short load_be_opt2(unsigned short *c)
{
   vector signed int vneg16;
   __asm__("vspltisw %0,-16":"=v"(vneg16));
   vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
   tmp = vec_rl(tmp, (vector unsigned int)vneg16);
   return (vector unsigned short)tmp;
}

creates:
_Z12load_be_opt2Pt:
.LFB8:
.cfi_startproc
lxvw4x 34,0,3
#APP
 # 77 "vec.C" 1
vspltisw 0,-16
 # 0 "" 2
#NO_APP
vrlw 2,2,0
blr

rotate left (-16) = rotate right (+16), as only the low 5 bits of the rotate
count are evaluated.
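That rotate-count claim can be checked with a scalar 32-bit model (a sketch; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate-left of one 32-bit lane; the count is reduced modulo 32, so a
   rotate left by (uint32_t)-16 behaves exactly like a rotate right by 16. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}
```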

Please note that the inline assembly is required, because vec_splats(-16) gets
converted into a very inefficient constant-generation sequence.

vector unsigned short load_be_opt(unsigned short *c)
{
   vector signed int vneg16 = vec_splats(-16);
   vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
   tmp = vec_rl(tmp, (vector unsigned int)vneg16);
   return (vector unsigned short)tmp;
}

creates:
_Z11load_be_optPt:
.LFB7:
.cfi_startproc
li 9,48
lxvw4x 34,0,3
vspltisw 0,0
mtvsrd 33,9
xxspltw 33,33,1
vsubuwm 0,0,1
vrlw 2,2,0
blr

[Bug target/100808] PPC: ISA 3.1 builtin documentation

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #3 from Jens Seifert  ---
> > - Avoid additional "int": unsigned long long int => unsigned long long
>
> Why?  Those are exactly the same types!

Yes, but the rest of the documentation uses unsigned long long.
This is just for consistency with the existing documentation.

[Bug target/100871] New: z14: vec_doublee maps to wrong builtin in vecintrin.h

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100871

Bug ID: 100871
   Summary: z14: vec_doublee maps to wrong builtin in vecintrin.h
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include 
Input:
vector double doublee(vector float a)
{
   return vec_doublee(a);
}

cause compile error:

vec.C: In function ‘__vector(2) double doublee(__vector(4) float)’:
vec.C:43:10: error: ‘__builtin_s390_vfll’ was not declared in this scope; did
you mean ‘__builtin_s390_vflls’?
   43 |return vec_doublee(a);
  |  ^~~~
  |  __builtin_s390_vflls

vec_doublee in vec_intrin.h should call __builtin_s390_vflls

vector double doublee_fix(vector float a)
{
   return __builtin_s390_vflls(a);
}

[Bug target/100869] New: z13: Inefficient code for vec_reve(vector double)

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100869

Bug ID: 100869
   Summary: z13: Inefficient code for vec_reve(vector double)
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

vector double reve(vector double a)
{
   return vec_reve(a);
}

creates:
_Z4reveDv2_d:
.LFB3:
.cfi_startproc
larl%r5,.L12
vl  %v0,.L13-.L12(%r5),3
vperm   %v24,%v24,%v24,%v0
br  %r14


Optimal code sequence:

vector double reve_z13(vector double a)
{
   return vec_permi(a,a,2);
}

creates:

_Z6reve_2Dv2_d:
.LFB6:
.cfi_startproc
vpdi%v24,%v24,%v24,4
br  %r14
.cfi_endproc

[Bug target/100868] New: PPC: Inefficient code for vec_reve(vector double)

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100868

Bug ID: 100868
   Summary: PPC: Inefficient code for vec_reve(vector double)
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

vector double reve(vector double a)
{
   return vec_reve(a);
}

creates:

_Z4reveDv2_d:
.LFB3:
.cfi_startproc
.LCF3:
0:  addis 2,12,.TOC.-.LCF3@ha
addi 2,2,.TOC.-.LCF3@l
.localentry _Z4reveDv2_d,.-_Z4reveDv2_d
addis 9,2,.LC2@toc@ha
addi 9,9,.LC2@toc@l
lvx 0,0,9
xxlnor 32,32,32
vperm 2,2,2,0
blr

Optimal sequence would be:

vector double reve_pwr7(vector double a)
{
   return vec_xxpermdi(a,a,2);
}

which creates:

_Z9reve_pwr7Dv2_d:
.LFB4:
.cfi_startproc
xxpermdi 34,34,34,2
blr

[Bug target/100867] New: z13: Inefficient code for vec_revb(vector unsigned short)

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100867

Bug ID: 100867
   Summary: z13: Inefficient code for vec_revb(vector unsigned
short)
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
   return vec_revb(a);
}

Creates:

_Z4revbDv4_j:
.LFB1:
.cfi_startproc
larl%r5,.L4
vl  %v0,.L5-.L4(%r5),3
vperm   %v24,%v24,%v24,%v0
br  %r14

Optimal code sequence:

vector unsigned short revb_z13(vector unsigned short a)
{
   return vec_rli(a, 8);
}

Creates:
_Z8revb_z13Dv8_t:
.LFB5:
.cfi_startproc
verllh  %v24,%v24,8
br  %r14
.cfi_endproc
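The equivalence this relies on (byte-reversing a 16-bit element is the same as rotating it by 8) can be checked with a scalar sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Rotating a 16-bit value by 8 swaps its two bytes, which is exactly
   the per-element byte reversal that verllh performs on the vector. */
static uint16_t rot8_16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}
```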

[Bug target/100866] New: PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-02 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

Bug ID: 100866
   Summary: PPC: Inefficient code for vec_revb(vector unsigned
short) < P9
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
   return vec_revb(a);
}

creates:

_Z4revbDv8_t:
.LFB1:
.cfi_startproc
.LCF1:
0:  addis 2,12,.TOC.-.LCF1@ha
addi 2,2,.TOC.-.LCF1@l
.localentry _Z4revbDv8_t,.-_Z4revbDv8_t
addis 9,2,.LC1@toc@ha
addi 9,9,.LC1@toc@l
lvx 0,0,9
xxlnor 32,32,32
vperm 2,2,2,0
blr


Optimal code sequence:
vector unsigned short revb_pwr7(vector unsigned short a)
{
   return vec_rl(a, vec_splats((unsigned short)8));
}

_Z9revb_pwr7Dv8_t:
.LFB2:
.cfi_startproc
.localentry _Z9revb_pwr7Dv8_t,1
vspltish 0,8
vrlh 2,2,0
blr

[Bug c/100808] PPC: ISA 3.1 builtin documentation

2021-05-28 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #1 from Jens Seifert  ---
https://gcc.gnu.org/onlinedocs/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html

vector unsigned long long int vec_gnb (vector unsigned __int128, const unsigned
char)

should be

unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

vgnb instruction returns result in GPR.

[Bug c++/100809] PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq

2021-05-28 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

--- Comment #1 from Jens Seifert  ---
Same applies to modulo.

[Bug c++/100809] New: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq

2021-05-28 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

Bug ID: 100809
   Summary: PPC: __int128 divide/modulo does not use P10
instructions vdivsq/vdivuq
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

unsigned __int128 div(unsigned __int128 a, unsigned __int128 b)
{
   return a/b;
}

__int128 div(__int128 a, __int128 b)
{
   return a/b;
}

gcc -mcpu=power10 -save-temps -O2 int128.C

Output:
_Z3divoo:
.LFB0:
.cfi_startproc
.localentry _Z3divoo,1
mflr 0
std 0,16(1)
stdu 1,-32(1)
.cfi_def_cfa_offset 32
.cfi_offset 65, 16
bl __udivti3@notoc
addi 1,1,32
.cfi_def_cfa_offset 0
ld 0,16(1)
mtlr 0
.cfi_restore 65
blr
.long 0
.byte 0,9,0,1,128,0,0,0
.cfi_endproc
.LFE0:
.size   _Z3divoo,.-_Z3divoo
.globl __divti3
.align 2
.p2align 4,,15
.globl _Z3divnn
.type   _Z3divnn, @function
_Z3divnn:
.LFB1:
.cfi_startproc
.localentry _Z3divnn,1
mflr 0
std 0,16(1)
stdu 1,-32(1)
.cfi_def_cfa_offset 32
.cfi_offset 65, 16
bl __divti3@notoc
addi 1,1,32
.cfi_def_cfa_offset 0
ld 0,16(1)
mtlr 0
.cfi_restore 65
blr
.long 0
.byte 0,9,0,1,128,0,0,0
.cfi_endproc

Expected is the use of vdivsq/vdivuq.

GCC version:

/opt/rh/devtoolset-10/root/usr/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-10/root/usr/bin/gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-10/root/usr/libexec/gcc/ppc64le-redhat-linux/10/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-10/root/usr
--mandir=/opt/rh/devtoolset-10/root/usr/share/man
--infodir=/opt/rh/devtoolset-10/root/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release
--enable-targets=powerpcle-linux --disable-multilib --with-system-zlib
--enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object
--enable-linker-build-id --with-gcc-major-version-only
--with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible
--enable-plugin --enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-10.2.1-20200804/obj-ppc64le-redhat-linux/isl-install
--disable-libmpx --enable-gnu-indirect-function --enable-secureplt
--with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8
--with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.1 20200804 (Red Hat 10.2.1-2) (GCC)

[Bug c/100808] New: PPC: ISA 3.1 builtin documentation

2021-05-28 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

Bug ID: 100808
   Summary: PPC: ISA 3.1 builtin documentation
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc/Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1.html#Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1

Please improve the documentation:
- Avoid the additional "int": unsigned long long int => unsigned long long
- Add missing line breaks between builtins
- Remove trailing semicolons

[Bug target/100694] New: PPC: initialization of __int128 is very inefficient

2021-05-20 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

Bug ID: 100694
   Summary: PPC: initialization of __int128 is very inefficient
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Initializing a __int128 from 2 64-bit integers is implemented very inefficiently.

The most natural code, which works well on all other platforms, generates 2
additional li 0 and 2 additional or instructions.

void test2(unsigned __int128* res, unsigned long long hi, unsigned long long
lo)
{
   unsigned __int128 i = hi;
   i <<= 64;
   i |= lo;
   *res = i;
}

_Z5test2Poyy:
.LFB15:
.cfi_startproc
li 8,0
li 11,0
or 10,5,8
or 11,11,4
std 10,0(3)
std 11,8(3)
blr
.long 0
.byte 0,9,0,0,0,0,0,0
.cfi_endproc


While "+" instead of "|" solves the issue for the above sample, it generates
addc+addze in other, more complicated scenarios.

The ugliest workaround I can think of is the one I now use:

void test4(unsigned __int128* res, unsigned long long hi, unsigned long long
lo)
{
   union
   { unsigned __int128 i;
struct
   {
 unsigned long long lo;
 unsigned long long hi;
   } s;
   } u;
   u.s.lo = lo;
   u.s.hi = hi;
   *res = u.i;
}

This generates the expected code sequence in all cases I have looked at.

_Z5test4Poyy:
.LFB17:
.cfi_startproc
std 5,0(3)
std 4,8(3)
blr
.long 0
.byte 0,9,0,0,0,0,0,0
.cfi_endproc
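Both constructions should of course produce identical values; a quick portable check (this sketch assumes a little-endian target for the union layout, matching the report):

```c
#include <assert.h>

typedef unsigned __int128 u128;

/* The natural shift/or construction from the report. */
static u128 make128_shift(unsigned long long hi, unsigned long long lo)
{
    u128 i = hi;
    i <<= 64;
    i |= lo;
    return i;
}

/* The union-based workaround; lo/hi member order assumes little endian. */
static u128 make128_union(unsigned long long hi, unsigned long long lo)
{
    union {
        u128 i;
        struct { unsigned long long lo, hi; } s;
    } u;
    u.s.lo = lo;
    u.s.hi = hi;
    return u.i;
}
```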

Please merge the li 0 + or pairs into a nop (i.e. eliminate them).

[Bug target/100693] New: PPC: missing 64-bit addg6s

2021-05-20 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100693

Bug ID: 100693
   Summary: PPC: missing 64-bit addg6s
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

gcc only provides

unsigned int __builtin_addg6s (unsigned int, unsigned int);

but addg6s is a 64-bit operation. I require

unsigned long long __builtin_addg6s (unsigned long long, unsigned long long);

I for now use inline assembly.

[Bug target/98020] PPC: mfvsrwz+extsw not merged to mtvsrwa

2020-12-08 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Jens Seifert  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |INVALID

--- Comment #2 from Jens Seifert  ---
I thought they are symmetric.

[Bug target/98124] New: Z: Load and test LTDBR instruction is not used for comparison against 0.0

2020-12-03 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98124

Bug ID: 98124
   Summary: Z: Load and test LTDBR instruction is not used for
comparison against 0.0
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

#include 
double sign(double in)
{
   return in == 0.0 ? 0.0 : copysign(1.0, in);
}

Command line:
gcc -m64 -O2 -save-temps copysign.C

Output:
_Z4signd:
.LFB234:
.cfi_startproc
larl%r5,.L8
lzdr%f2
cdbr%f0,%f2
je  .L6
ld  %f2,.L9-.L8(%r5)
cpsdr   %f0,%f0,%f2
br  %r14

Use of LTDBR expected instead of lzdr %f2 + cdbr %f0,%f2.
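For reference, a portable scalar version of the same function (identical in behavior to the source above, just testable anywhere; note that -0.0 compares equal to 0.0, so it also yields 0.0):

```c
#include <assert.h>
#include <math.h>

/* 0.0 for zero input, otherwise +1.0 or -1.0 matching the sign of in. */
static double sign_ref(double in)
{
    return in == 0.0 ? 0.0 : copysign(1.0, in);
}
```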

[Bug target/98020] New: PPC: mfvsrwz+extsw not merged to mtvsrwa

2020-11-26 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Bug ID: 98020
   Summary: PPC: mfvsrwz+extsw not merged to mtvsrwa
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

int extract(vector signed int v)
{
   return v[2];
}

Command line:
gcc -mcpu=power8 -maltivec -m64 -O3 -save-temps extract.C

Output:
_Z7extractDv4_i:
.LFB0:
.cfi_startproc
mfvsrwz 3,34
extsw 3,3
blr

[Bug target/70928] Load simple float constants via VSX operations on PowerPC

2020-11-14 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70928

Jens Seifert  changed:

   What|Removed |Added

 CC||jens.seifert at de dot ibm.com

--- Comment #4 from Jens Seifert  ---
values -16.0..+15.0.
vspltisw  0,
xvcvsxwdp 32,32

values -16.0f..+15.0f
vspltisw  0,
xvcvsxwsp 32,32

-0.0 / 0x8000
xxlxor 32,32,32
xvnabsdp 32,32 or xvnegdp 32,32

-0.0f / 0x8000
xxlxor 32,32,32
xvnabssp 32,32 or xvnegsp 32,32

0x7FFF
vspltisw 0,-1
xvabsdp 32,32

0x7FFF
vspltisw 0,-1
xvabssp 32,32