https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718
Bug ID: 117718
Summary: Inefficient address computation for d-form vector loads
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bergner at gcc dot gnu.org
Target Milestone: ---
If we compile some simple test cases that return a value from a global vector
array, we fail to fold the low 16 bits of the address (the @toc@l part) into the
displacement of the lxv instruction (new in Power9). Instead, the full address
of the anchor is computed outside of the load with addis/addi, and the lxv
carries only the element offset (0 or 16).
bergner@c643n10lp1:~$ cat vectorlong.c
#include <altivec.h>
vector long var[16];
vector long
foo (void)
{
  return var[0];
}
vector long
bar (void)
{
  return var[1];
}
bergner@c643n10lp1:~$ gcc -S -O2 -mcpu=power9 vectorlong.c
bergner@c643n10lp1:~$ cat vectorlong.s
foo:
[snip toc setup]
addis 9,2,.LANCHOR0@toc@ha
addi 9,9,.LANCHOR0@toc@l
lxv 34,0(9)
blr
bar:
[snip toc setup]
addis 9,2,.LANCHOR0@toc@ha
addi 9,9,.LANCHOR0@toc@l
lxv 34,16(9)
blr
However, for an equivalent test case using scalars (integer or fp), we do fold
the offset into the load, reducing the number of instructions from three to
two:
bergner@c643n10lp1:~$ cat long.c
long var[16];
long
foo (void)
{
  return var[0];
}
long
bar (void)
{
  return var[1];
}
bergner@c643n10lp1:~$ gcc -S -O2 -mcpu=power9 long.c
bergner@c643n10lp1:~$ cat long.s
foo:
[snip toc setup]
addis 9,2,.LANCHOR0@toc@ha
ld 3,.LANCHOR0@toc@l(9)
blr
bar:
[snip toc setup]
addis 9,2,.LANCHOR0+8@toc@ha
ld 3,.LANCHOR0+8@toc@l(9)
blr
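For reference, the @toc@ha / @toc@l split that the scalar sequence relies on can be sketched in C. This is only an illustration of the arithmetic, assuming the TOC-relative offset fits in 32 bits; the names toc_lo, toc_ha and toc_split_self_check are hypothetical and are not GCC or binutils functions, and the offset 0x18000 is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of a TOC-relative offset, sign-extended: the value that a
   displacement relocated with @toc@l contributes at run time.  */
int64_t
toc_lo (int64_t off)
{
  return ((off & 0xffff) ^ 0x8000) - 0x8000;
}

/* High-adjusted part: the 16-bit immediate that addis shifts left by 16.
   Subtracting the signed low part first compensates for its sign.  */
int64_t
toc_ha (int64_t off)
{
  return (off - toc_lo (off)) / 65536;
}

/* addis supplies toc_ha << 16; the load displacement supplies toc_lo.  */
void
toc_split_self_check (void)
{
  int64_t off = 0x18000;  /* hypothetical offset of var from the TOC pointer */
  assert (toc_lo (off) == -32768);
  assert (toc_ha (off) == 2);
  assert (toc_ha (off) * 65536 + toc_lo (off) == off);
}
```

Because the load itself adds the low part, the separate addi becomes unnecessary, which is where the three-to-two instruction saving comes from.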
We should perform the same optimization for vector loads/stores as we do for
scalar loads/stores.
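One detail any such fold would presumably have to respect: lxv and stxv are DQ-form instructions, so their displacement must be a multiple of 16 within the signed 16-bit range, whereas ld is DS-form and only needs a multiple of 4. A minimal sketch of that check follows; the function names are hypothetical and this is not how GCC expresses the constraint internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if D fits in a signed 16-bit displacement.  */
bool
fits_d16 (int64_t d)
{
  return d >= -32768 && d <= 32767;
}

/* DS-form (e.g. ld, std): the low 2 bits of the displacement must be zero.  */
bool
ds_form_ok (int64_t d)
{
  return fits_d16 (d) && d % 4 == 0;
}

/* DQ-form (e.g. lxv, stxv): the low 4 bits of the displacement must be zero.  */
bool
dq_form_ok (int64_t d)
{
  return fits_d16 (d) && d % 16 == 0;
}

/* The element offsets from the test case are fine for lxv; an offset of 8
   is acceptable for ld but not for lxv.  */
void
displacement_self_check (void)
{
  assert (dq_form_ok (0) && dq_form_ok (16));
  assert (ds_form_ok (8) && !dq_form_ok (8));
}
```

For the test case above this should not be an obstacle, since the 16-byte alignment of the vector type would normally keep the low part of the address a multiple of 16.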