On Wed, Feb 26, 2020 at 07:55:53AM -0600, Bill Schmidt wrote:
> The hope is that we can create a vectorized version that returns values
> in registers rather than the by-ref parameters, and add code to GCC to
> copy things around correctly following the call.  Ideally the signature of
> the vectorized version would be sth like
> 
>   struct retval {vector double, vector double};
>   retval vecsincos (vector double);
> 
> In the typical case where calls to sincos are of the form
> 
>   sincos (val[i], &sinval[i], &cosval[i]);
> 
> this would allow us to only store the values in the caller upon return,
> rather than store them in the callee and potentially reload them
> immediately in the caller.  On some Power CPUs, the latter behavior can
> result in somewhat costly stalls if the consecutive accesses hit a timing
> window.
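
A minimal sketch of the return-in-registers shape proposed above, using the
GCC/Clang vector_size extension. Every name here (v2df, two_results,
vec_two_results, fill_results) is hypothetical, the two-lane width is an
assumption, and x*x and x*x*x stand in for sin and cos so the sketch does not
depend on libm. It shows only where the stores end up, not how GCC would
actually lower such a call.

```c
#include <assert.h>

/* Two doubles per vector; the real width depends on the target. */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for the proposed "struct retval". */
struct two_results
{
  v2df first;
  v2df second;
};

/* Hypothetical stand-in for vecsincos: both results come back by value,
   so the callee performs no stores through pointers. */
static struct two_results
vec_two_results (v2df x)
{
  struct two_results r;
  r.first = x * x;
  r.second = x * x * x;
  return r;
}

/* Caller side: the results are stored only here, after the call returns. */
void
fill_results (const double *val, double *first, double *second, int n)
{
  assert (n % 2 == 0);
  for (int i = 0; i < n; i += 2)
    {
      v2df x = { val[i], val[i + 1] };
      struct two_results r = vec_two_results (x);
      first[i] = r.first[0];
      first[i + 1] = r.first[1];
      second[i] = r.second[0];
      second[i + 1] = r.second[1];
    }
}
```

Calling fill_results (val, first, second, 4) on a four-element array runs two
vector iterations, with all stores issued by the caller.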

But can't you do
#pragma omp declare simd linear(sinp, cosp)
void sincos (double x, double *sinp, double *cosp);
?
That is something the vectorizer code could handle and for
  for (int i = 0; i < 1024; i++)
    sincos (val[i], &sinval[i], &cosval[i]);
just vectorize it as
  for (int i = 0; i < 1024; i += vf)
    _ZGVbN8vl8l8_sincos (*(vector double *)&val[i], &sinval[i], &cosval[i]);
Anything else will need specialized code to handle sincos specially in the
vectorizer.
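
As a rough, self-contained illustration of that pattern: square_and_cube below
is a hypothetical stand-in for sincos (it writes x*x and x*x*x so that no libm
is needed), and apply_all is a hypothetical caller. With -fopenmp-simd the
pragma asks the compiler to emit vector clones of the function; without that
flag the pragma is ignored and the code stays scalar. Whether the loop
actually ends up calling a clone depends on the target and the GCC version.

```c
#include <assert.h>

/* linear(sqp, cubep) tells the compiler that both pointers advance by one
   element per consecutive call, which matches &sq[i] and &cube[i] below. */
#pragma omp declare simd linear(sqp, cubep)
void
square_and_cube (double x, double *sqp, double *cubep)
{
  *sqp = x * x;
  *cubep = x * x * x;
}

/* Same call shape as sincos (val[i], &sinval[i], &cosval[i]). */
void
apply_all (const double *val, double *sq, double *cube, int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++)
    square_and_cube (val[i], &sq[i], &cube[i]);
}
```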

> If you feel it isn't possible to do this, then we can abandon it.  Right
> now my understanding is that GCC doesn't vectorize calls to sincos yet
> for any targets, so it would be moot except that we really should define
> what happens for the future.
> 
> This calling convention would also be useful in the future for vectorizing
> functions that return complex values either by value or by reference.

Only by value.  If something is passed by reference, you really don't know
what the code does with it: whether it is read, written into, or both.
And for _Complex {float,double}, e.g. the Intel ABI already specifies how to
pass them, just GCC isn't able to do that right now.

> Well, as a matter of practicality, we don't have any of that implemented
> in the rs6000 back end, and we don't have any free resources to do that
> in GCC 11.  Is there any documentation about what needs to be done to
> support this?  I've always been under the impression that vectorizing for
> masking when there isn't any hardware support is a losing proposition, so
> we've not investigated it.

You don't need to do pretty much anything, except set
clonei->mask_mode = VOIDmode.  I think the generic code should handle
everything beyond that, in particular adding the mask argument and using it
both on the caller side and in the expansion of the to-be-vectorized clone.

        Jakub
