[Bug target/119702] PPCLE: Inefficient auto-vectorization for 64-bit shifts on Power9

avinashd at linux dot ibm.com via Gcc-bugs Mon, 28 Jul 2025 02:42:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119702


Avinash Jayakar <avinashd at linux dot ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |avinashd at linux dot ibm.com

--- Comment #2 from Avinash Jayakar <avinashd at linux dot ibm.com> ---
I am looking into this issue.

The following problem, which Peter mentioned, is no longer present on trunk:
2) vextsb2d is not required as PowerPC has a modulo shift. It does not matter
if additional bytes are set with shift amount.
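
To illustrate the modulo-shift point, here is a minimal C model of a single
64-bit lane. It is only a sketch under two assumptions: that the vector shift
uses just the low 6 bits of each doubleword of the shift amount, and that
vextsb2d sign-extends the low byte of each doubleword. All function names
below are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit lane of a modulo shift:
   only the low 6 bits of the shift amount are used. */
uint64_t modulo_shl64(uint64_t value, uint64_t amount)
{
  return value << (amount & 63);
}

/* Hypothetical model of sign-extending the low byte of a lane to 64 bits. */
uint64_t sext_byte_to_dword(uint64_t lane)
{
  uint64_t byte = lane & 0xff;
  return (byte & 0x80) ? (byte | ~(uint64_t)0xff) : byte;
}

/* Returns 1 if the extension never changes the shift result for any low
   byte, even when the upper bytes of the lane hold garbage. */
int extension_is_redundant(uint64_t value, uint64_t garbage_upper)
{
  for (uint64_t low = 0; low < 256; low++) {
    uint64_t lane = (garbage_upper & ~(uint64_t)0xff) | low;
    if (modulo_shl64(value, lane)
        != modulo_shl64(value, sext_byte_to_dword(lane)))
      return 0;
  }
  return 1;
}

/* Small self-check that can be called from any test driver. */
void check_modulo_shift(void)
{
  assert(extension_is_redundant(1, 0xdeadbeef00000000ULL));
}
```

Sign extension only rewrites bits 8 and above of the lane, so the low 6 bits
that the modelled shift reads are unchanged.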

I just wanted to understand the optimization opportunity a little better.
This change is mainly about reducing code size rather than execution time,
right?
I think a splat followed by a shift performs about the same as an add. I ran
a small benchmark with the 2 variants and saw only a very minimal difference
in execution time. Here is the synthetic benchmark:

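#include <stdio.h>

/* lshift1 is the function from the original report (defined elsewhere). */
void lshift1(unsigned long long *a);

int main() {
  unsigned long long a[2];
  a[0] = 1;
  a[1] = 2;
  for (long i=0; i<1e10; i++) lshift1(a);
  printf ("%llu\n", a[1]); // don't optimize away the loop
}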

Should the same code be generated for the following forms as well?

1. a[0] *= 2; a[1] *= 2;
2. a[0] += a[0]; a[1] += a[1];

All of these emit the same left-shift-by-1 instruction with current GCC
trunk.
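
For reference, the three source forms can be written side by side as below.
This is only a sketch: the function names are made up, and the shift form is
assumed to match what lshift1 from the original report does. On unsigned
64-bit values all three compute the same result modulo 2^64, which is why the
compiler is free to pick one instruction sequence for all of them.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Hypothetical names; each doubles both elements of a 2-element array. */
void double_by_shift(u64 *a) { a[0] <<= 1;   a[1] <<= 1; }
void double_by_mul(u64 *a)   { a[0] *= 2;    a[1] *= 2; }
void double_by_add(u64 *a)   { a[0] += a[0]; a[1] += a[1]; }

/* Returns 1 if all three forms give the same result for the pair (x, y). */
int forms_agree(u64 x, u64 y)
{
  u64 s[2] = { x, y };
  u64 m[2] = { x, y };
  u64 d[2] = { x, y };

  double_by_shift(s);
  double_by_mul(m);
  double_by_add(d);

  return s[0] == m[0] && m[0] == d[0]
      && s[1] == m[1] && m[1] == d[1];
}

/* Small self-check, including values whose top bit is shifted out. */
void check_forms(void)
{
  assert(forms_agree(1, 2));
  assert(forms_agree(0xffffffffffffffffULL, 0x8000000000000001ULL));
}
```

Compiling each function separately and comparing the generated assembly is
one way to check whether the same vector sequence is chosen for all three.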
