[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-12 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Andrew Pinski  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-06-12
 Status|UNCONFIRMED |WAITING

--- Comment #1 from Andrew Pinski  ---
>  int ia[8] = {1, 2, 3, 4, 5, 6, 7, 8};
>  float fa[8] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0};

The way vec_ld works is is (a+b)&~0xf is the address which is being loaded. 

Both ia and fa are not specified as being aligned to the 16byte boundary so it
could be loading from before hand.

What happens if you do:
  int ia[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
  float fa[8] __attribute__((aligned(16))) = {10.0, 20.0, 30.0, 40.0, 50.0,
60.0, 70.0, 80.0};

Instead?

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-13 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #2 from Kewen Lin  ---
(In reply to Andrew Pinski from comment #1)
> >  int ia[8] = {1, 2, 3, 4, 5, 6, 7, 8};
> >  float fa[8] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0};
> 
> The way vec_ld works is is (a+b)&~0xf is the address which is being loaded. 
> 
> Both ia and fa are not specified as being aligned to the 16byte boundary so
> it could be loading from before hand.
> 
> What happens if you do:
>   int ia[8] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
>   float fa[8] __attribute__((aligned(16))) = {10.0, 20.0, 30.0, 40.0, 50.0,
> 60.0, 70.0, 80.0};
> 
> Instead?

Good point, I can't reproduce the reported issue on two LE machines, so I'd
leave Carl to confirm. But from the output (biasing two elements from the
expected), I believe this is the root cause.

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-13 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Segher Boessenkool  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #3 from Segher Boessenkool  ---
Since people still make this mistake, perhaps we need to make people more aware
of this fact in the documentation?  "The memory argument to vec_ld and vec_st
has to be 16-byte aligned", something short like that.  If people see it often
enough it will finally click.

Carl, can you add something maybe?  In some place you would have seen it :-)

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-13 Thread carll at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #4 from Carl Love  ---
comment 1

Yes, I can confirm if I add the alignment statement to the declarations it
works fine.  

I originally tried to use the built-in as part of a re-write of a function in
the Milvus AI source code.  The function fvec_L2sqr_batch_4_ref() accounts for
about 90% of the workload run time.  The code is:

fvec_L2sqr_batch_4_ref_v1(const float* x, const float* y0, const float* y1, 
   const float* y2, const float* y3, const size_t d, float& dis0, float& dis1,
   float& dis2, float& dis3) {

float d0 = 0;
float d1 = 0;
float d2 = 0;
float d3 = 0;
for (size_t i = 0; i < d; ++i) {
const float q0 = x[i] - y0[i];
const float q1 = x[i] - y1[i];
const float q2 = x[i] - y2[i];
const float q3 = x[i] - y3[i];

d0 += q0 * q0;
d1 += q1 * q1;
d2 += q2 * q2;
d3 += q3 * q3;
}
dis0 = d0;
dis1 = d1;
dis2 = d2;
dis3 = d3;
}

When compiled with -O3, it does generate vsx instructions.  But I noticed that
it was not using the vsx multiply add instructions.  So, I tried rewriting it
to explicitly load a vector with the vec_ld built-in, followed by vec_sub and
vec_madd.  Which by the way gives a 45% reduction in the execution time for my
standalone test.  

At this point, not sure where the arguments get defined so adding the alignment
to the declaration is not so easy. 

That said, Peter mentioned the vec_xl built-in which does seem to work.  The
vec_xl does not require the data to be aligned from what I see in the PVIPR.


comment 3, Segher

I was looking at the PVIPR document when I chose the built-in for my rewrite. 
Looking at the documentation, it does say "Load a 16-byte vector from the
memory address specified by the displacement and the pointer, ignoring the
low-order bits of the calculated address."  In retrospect, I should have picked
up on the ignoring of the low-order bits to imply the addresses needed  to be
aligned.  It would really be good if the documentation explicitly said the data
must be 16-bye aligned.  That said, my bad for not reading/understanding the
documentation well enough.

The vec_xl documentation does not say anything about ignoring the lower bits of
the address.  So, in my case that is a better load built-in to use so I don't
have to try and find all the declarations for the arrays that could be passed
to the function.

It would be great to update the PVIPR to be more explicit about the alignment
needs.


Sorry everyone for the noise.  I think we can close the issue as "User error".

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-13 Thread carll at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #6 from Carl Love  ---
comment 5, Segher

Yes, I have some PVIPR changes that we have talked about previously.  I will
create a patch to also make the changes discussed here.

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-06-13 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

--- Comment #5 from Segher Boessenkool  ---
The GCC documentation says

> Note that the 'vec_ld' and 'vec_st' built-in functions always generate
> the AltiVec 'LVX' and 'STVX' instructions even if the VSX instruction
> set is available. 

(which means it explicitly does the align-down thing, the &-16 thing),
which doesn't happen with VSX insns.

This comment is a great place to add an "including address masking" thing, if I
can tempt you :-)

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

2024-07-02 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466

Kewen Lin  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |INVALID

--- Comment #7 from Kewen Lin  ---
Per all the discussion above, resolving this as invalid.