On Fri, Jan 7, 2022 at 3:57 PM Paul A. Clarke <p...@us.ibm.com> wrote:
>
> On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> > +#ifdef __LITTLE_ENDIAN__
> > +  /* Sum across four integers with two integer results.  */
> > +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> > +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> > +     shifts are added that are not needed for this use-case.
> > +     A vector shift to correctly position the 32-bit integer results
> > +     (currently at [0] and [2]) to [1] and [3] would then need to be
> > +     swapped back again since the desired results are two 64-bit
> > +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> > +#else
> >    /* Sum across four integers with two integer results.  */
> >    result = vec_sum2s (vsum, (__vector signed int) zero);
> >
> > If little-endian adds shifts to correct for the position and
> > big-endian does not, why not use the inline asm without the shifts for
> > both?  It seems confusing to add the inline asm only for LE instead of
> > always using it with the appropriate comment.
> >
> > It's a good and valuable optimization for LE.  Fewer variants are less
> > fragile, easier to test and easier to maintain.  If you're going to go
> > to the trouble of using inline asm for LE, use it for both.
>
> BE (only) _does_ need a shift as seen on the next two lines after the
> code snippet above:
>     /* Sum across four integers with two integer results.  */
>     result = vec_sum2s (vsum, (__vector signed int) zero);
>     /* Rotate the sums into the correct position.  */
>     result = vec_sld (result, result, 6);
>
> So, when using {vec_sum2s;vec_sld}:
> - LE gets an implicit shift in vec_sum2s which just needs to be undone
>   by the vec_sld, and those shifts don't "cancel out" and get removed
>   by GCC.
> - BE does not get any implicit shifts, but needs one that comes from
>   vec_sld.
>
> Are you saying use the asm(vsum2sws) and then conditionally call
> vec_sld on BE only?
>
> I viewed this change as a temporary bandage unless and until GCC can
> remove the unnecessary swaps.  It seems like the preferred code is
> vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.
Nevermind.  I thought that these patches had not been reviewed.

Thanks, David
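For reference, the variant Paul asks about ("use the asm(vsum2sws) and then conditionally call vec_sld on BE only") would look roughly like the sketch below. The wrapper function sum2_across and the local zero are illustrative assumptions pieced together from the snippets quoted above, not the actual emmintrin.h code; only the asm statement and the BE-only vec_sld come from the thread.

    #include <altivec.h>

    static __vector signed int
    sum2_across (__vector signed int vsum)
    {
      const __vector signed int zero = vec_splats (0);
      __vector signed int result;

      /* Sum across four integers with two integer results.  The inline
         asm avoids the endian-correcting shifts that GCC wraps around
         vec_sum2s on little-endian.  */
      __asm__ ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
    #ifndef __LITTLE_ENDIAN__
      /* Big-endian still needs the rotate from the original code to put
         the two sums into the expected positions.  */
      result = vec_sld (result, result, 6);
    #endif
      return result;
    }

Compiling this for both powerpc64 and powerpc64le (e.g. with -O2 -maltivec) and inspecting the generated assembly would show whether the LE path really comes down to the single vsum2sws, which is the point of the optimization under discussion.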