On 2/26/20 8:31 AM, Jakub Jelinek wrote:
On Wed, Feb 26, 2020 at 07:55:53AM -0600, Bill Schmidt wrote:
The hope is that we can create a vectorized version that returns values
in registers rather than the by-ref parameters, and add code to GCC to
copy things around correctly following the call.  Ideally the signature of
the vectorized version would be something like

   struct retval {vector double, vector double};
   retval vecsincos (vector double);

In the typical case where calls to sincos are of the form

   sincos (val[i], &sinval[i], &cosval[i]);

this would allow us to only store the values in the caller upon return,
rather than store them in the callee and potentially reload them
immediately in the caller.  On some Power CPUs, the latter behavior can
result in somewhat costly stalls if the consecutive accesses hit a timing
window.
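
As a rough illustration of the by-value interface described above, here is a
minimal sketch using the GCC/Clang vector extension. The names v2df,
vecsincos_toy and sincos_loop are hypothetical, and the math is a small-angle
placeholder rather than a real sincos. Whether the two result vectors actually
come back in registers depends on the target ABI, which is exactly the point
under discussion; the sketch only shows the shape of the call and that the
stores happen in the caller.

```c
#include <string.h>

/* Two-lane double vector via the GCC/Clang vector extension. */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical by-value result type, mirroring "struct retval" above. */
struct retval { v2df sinv; v2df cosv; };

/* Placeholder math (small-angle approximation), NOT a real sincos. */
static struct retval
vecsincos_toy (v2df x)
{
  const v2df one = { 1.0, 1.0 };
  const v2df half = { 0.5, 0.5 };
  struct retval r;
  r.sinv = x;
  r.cosv = one - half * x * x;
  return r;
}

/* Caller side: the only stores to sinval/cosval happen here,
   after the call has returned.  Assumes n is a multiple of 2.  */
static void
sincos_loop (const double *val, double *sinval, double *cosval, int n)
{
  for (int i = 0; i + 2 <= n; i += 2)
    {
      v2df x;
      memcpy (&x, &val[i], sizeof x);
      struct retval r = vecsincos_toy (x);
      memcpy (&sinval[i], &r.sinv, sizeof r.sinv);
      memcpy (&cosval[i], &r.cosv, sizeof r.cosv);
    }
}
```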
But can't you do
#pragma omp declare simd linear(sinp, cosp)
void sincos (double x, double *sinp, double *cosp);
?
That is something the vectorizer code could handle and for
   for (int i = 0; i < 1024; i++)
     sincos (val[i], &sinval[i], &cosval[i]);
just vectorize it as
   for (int i = 0; i < 1024; i += vf)
     _ZGVbN8vl8l8_sincos (*(vector double *)&val[i], &sinval[i], &cosval[i]);
Anything else will need specialized code to handle sincos specially in the
vectorizer.
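
To make the linear(sinp, cosp) variant concrete, here is a scalar model in
plain C of what such a clone would do. It is only a sketch: VF, toy_sincos,
sincos_clone_model and vectorized_loop are hypothetical names, the math is a
placeholder, and a real clone would operate on vector registers. The point is
that lane k of the clone writes through sinp + k and cosp + k, so the stores
happen in the callee (presumably what the l8 steps in the mangled name
describe for double pointers).

```c
/* Assumed vectorization factor for the model. */
enum { VF = 2 };

/* Placeholder scalar routine (small-angle approximation), NOT a real sincos. */
static void
toy_sincos (double x, double *sinp, double *cosp)
{
  *sinp = x;
  *cosp = 1.0 - 0.5 * x * x;
}

/* Model of a SIMD clone with linear pointer arguments: the caller passes
   only the base pointers, and lane k stores to sinp + k and cosp + k.  */
static void
sincos_clone_model (const double x[VF], double *sinp, double *cosp)
{
  for (int k = 0; k < VF; k++)
    toy_sincos (x[k], sinp + k, cosp + k);
}

/* What the vectorized loop looks like from the caller's side.  */
static void
vectorized_loop (const double *val, double *sinval, double *cosval, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)
    sincos_clone_model (&val[i], &sinval[i], &cosval[i]);
  for (; i < n; i++)            /* scalar epilogue for leftover iterations */
    toy_sincos (val[i], &sinval[i], &cosval[i]);
}
```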

After reading all the discussion on this thread, yes, I agree for now.
It will be good for everybody if we can get the vectorized cexpi sorted
out at some point, which will give us a superior interface.

If you feel it isn't possible to do this, then we can abandon it.  Right
now my understanding is that GCC doesn't vectorize calls to sincos yet
for any targets, so it would be moot except that we really should define
what happens for the future.

This calling convention would also be useful in the future for vectorizing
functions that return complex values either by value or by reference.
Only by value.  You really don't know what the code does if something is
passed by reference: whether it is read, written into, or both, etc.
And for _Complex {float,double}, e.g. the Intel ABI already specifies how to
pass them, just GCC isn't able to do that right now.

Per the fork of the thread with Segher, I've cried uncle on the specifics
of the calling convention. :)


Well, as a matter of practicality, we don't have any of that implemented
in the rs6000 back end, and we don't have any free resources to do that
in GCC 11.  Is there any documentation about what needs to be done to
support this?  I've always been under the impression that vectorizing for
masking when there isn't any hardware support is a losing proposition, so
we've not investigated it.
You don't need to do pretty much anything, except set
clonei->mask_mode = VOIDmode.  I think the generic code should handle
everything beyond that, in particular add the mask argument and use it
both on the caller side and on the expansion of the to-be-vectorized clone.
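
For readers unfamiliar with masked clones, the following scalar model sketches
the idea of an extra mask argument on a target without hardware masking. It is
an assumption-laden illustration, not GCC's actual expansion: VF, toy_sincos,
sincos_masked_clone_model and conditional_loop are hypothetical names, and the
mask is modelled as one integer per lane (nonzero meaning active). Without
hardware support, each lane ends up being tested individually, which is the
cost being debated here.

```c
/* Assumed vectorization factor for the model. */
enum { VF = 2 };

/* Placeholder scalar routine (small-angle approximation), NOT a real sincos. */
static void
toy_sincos (double x, double *sinp, double *cosp)
{
  *sinp = x;
  *cosp = 1.0 - 0.5 * x * x;
}

/* Model of a masked clone: lanes whose mask element is zero do nothing,
   so memory for inactive lanes is left untouched.  */
static void
sincos_masked_clone_model (const double x[VF], double *sinp, double *cosp,
                           const long long mask[VF])
{
  for (int k = 0; k < VF; k++)
    if (mask[k])
      toy_sincos (x[k], sinp + k, cosp + k);
}

/* Caller side for a loop like
     if (val[i] > 0.0) sincos (val[i], &sinval[i], &cosval[i]);
   The condition is turned into a per-lane mask and passed to the clone.
   Assumes n is a multiple of VF.  */
static void
conditional_loop (const double *val, double *sinval, double *cosval, int n)
{
  for (int i = 0; i + VF <= n; i += VF)
    {
      long long mask[VF];
      for (int k = 0; k < VF; k++)
        mask[k] = val[i + k] > 0.0 ? -1 : 0;
      sincos_masked_clone_model (&val[i], &sinval[i], &cosval[i], mask);
    }
}
```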

But is this actually a good idea?  It seems to me this will generate lousy
code in the absence of hardware support.  Won't we be better off warning and
ignoring the directive, leaving the code in scalar form?

If and when we have hardware support for vector masking, I'll be happy to
remove this restriction, but I need more convincing to do it now.

Thanks,
Bill


        Jakub
