Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8

2022-01-07 Thread David Edelsohn via Gcc-patches
On Fri, Jan 7, 2022 at 3:57 PM Paul A. Clarke  wrote:
>
> On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> > +#ifdef __LITTLE_ENDIAN__
> > +  /* Sum across four integers with two integer results.  */
> > +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> > +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> > + shifts are added that are not needed for this use-case.
> > + A vector shift to correctly position the 32-bit integer results
> > + (currently at [0] and [2]) to [1] and [3] would then need to be
> > + swapped back again since the desired results are two 64-bit
> > + integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> > +#else
> >    /* Sum across four integers with two integer results.  */
> >    result = vec_sum2s (vsum, (__vector signed int) zero);
> >
> > If little-endian adds shifts to correct for the position and
> > big-endian does not, why not use the inline asm without the shifts for
> > both?  It seems confusing to add the inline asm only for LE instead of
> > always using it with the appropriate comment.
> >
> > It's a good and valuable optimization for LE.  Fewer variants are less
> > fragile, easier to test and easier to maintain.  If you're going to go
> > to the trouble of using inline asm for LE, use it for both.
>
> BE (only) _does_ need a shift as seen on the next two lines after the
> code snippet above:
>   /* Sum across four integers with two integer results.  */
>   result = vec_sum2s (vsum, (__vector signed int) zero);
>   /* Rotate the sums into the correct position.  */
>   result = vec_sld (result, result, 6);
>
> So, when using {vec_sum2s;vec_sld}:
> - LE gets an implicit shift in vec_sum2s which then needs to be undone
>   by the vec_sld, and GCC does not recognize that those shifts "cancel
>   out" and remove them.
> - BE does not get any implicit shifts, but needs one that comes from
>   vec_sld.
>
> Are you saying use the asm(vsum2sws) and then conditionally call
> vec_sld on BE only?
>
> I viewed this change as a temporary bandage unless and until GCC can
> remove the unnecessary swaps.  It seems like the preferred code is
> vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

Never mind.  I thought that these patches had not been reviewed.

Thanks, David


Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8

2022-01-07 Thread Paul A. Clarke via Gcc-patches
On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> +#ifdef __LITTLE_ENDIAN__
> +  /* Sum across four integers with two integer results.  */
> +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> + shifts are added that are not needed for this use-case.
> + A vector shift to correctly position the 32-bit integer results
> + (currently at [0] and [2]) to [1] and [3] would then need to be
> + swapped back again since the desired results are two 64-bit
> + integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> +#else
>/* Sum across four integers with two integer results.  */
>result = vec_sum2s (vsum, (__vector signed int) zero);
> 
> If little-endian adds shifts to correct for the position and
> big-endian does not, why not use the inline asm without the shifts for
> both?  It seems confusing to add the inline asm only for LE instead of
> always using it with the appropriate comment.
> 
> It's a good and valuable optimization for LE.  Fewer variants are less
> fragile, easier to test and easier to maintain.  If you're going to go
> to the trouble of using inline asm for LE, use it for both.

BE (only) _does_ need a shift as seen on the next two lines after the
code snippet above:
  /* Sum across four integers with two integer results.  */
  result = vec_sum2s (vsum, (__vector signed int) zero);
  /* Rotate the sums into the correct position.  */
  result = vec_sld (result, result, 6);

So, when using {vec_sum2s;vec_sld}:
- LE gets an implicit shift in vec_sum2s which then needs to be undone
  by the vec_sld, and GCC does not recognize that those shifts "cancel
  out" and remove them.
- BE does not get any implicit shifts, but needs one that comes from
  vec_sld.

Are you saying use the asm(vsum2sws) and then conditionally call
vec_sld on BE only?
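
I.e. something like this (untested, keeping the variable names from the
current code):

  /* Sum across four integers with two integer results.  */
  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
#ifndef __LITTLE_ENDIAN__
  /* Big-endian still needs to rotate the sums into position;
     little-endian uses the vsum2sws result directly.  */
  result = vec_sld (result, result, 6);
#endif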

I viewed this change as a temporary bandage unless and until GCC can
remove the unnecessary swaps.  It seems like the preferred code is
vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

PC


Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8

2022-01-07 Thread David Edelsohn via Gcc-patches
+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+ shifts are added that are not needed for this use-case.
+ A vector shift to correctly position the 32-bit integer results
+ (currently at [0] and [2]) to [1] and [3] would then need to be
+ swapped back again since the desired results are two 64-bit
+ integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);

If little-endian adds shifts to correct for the position and
big-endian does not, why not use the inline asm without the shifts for
both?  It seems confusing to add the inline asm only for LE instead of
always using it with the appropriate comment.

It's a good and valuable optimization for LE.  Fewer variants are less
fragile, easier to test and easier to maintain.  If you're going to go
to the trouble of using inline asm for LE, use it for both.

Thanks, David


Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8

2021-11-19 Thread Segher Boessenkool
Hi!

On Fri, Oct 22, 2021 at 12:28:49PM -0500, Paul A. Clarke wrote:
> The Power9 ISA added the `vabsdub` instruction, which is realized in
> the `vec_absd` intrinsic.
> 
> Use `vec_absd` for the `_mm_sad_epu8` compatibility intrinsic when
> `_ARCH_PWR9` is defined.
> 
> Also, the realization of `vec_sum2s` on little-endian includes
> two shifts in order to position the input and output to match
> the semantics of `vec_sum2s`:
> - Shift the second input vector left 12 bytes. In the current usage,
>   that vector is `{0}`, so this shift is unnecessary, but is currently
>   not eliminated under optimization.

The vsum2sws implementation uses an unspec, so there is almost no chance
of anything with it being optimised :-(

It rotates it right by 4 bytes btw; it's not a shift.

> - Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
>   The two words within each doubleword of this (shifted) result must then
>   be explicitly swapped to match the semantics of `_mm_sad_epu8`,
>   effectively reversing this shift.  So, this shift (and a subsequent swap)
>   are unnecessary, but not currently removed under optimization.

Rotate left by 4 -- same thing once you consider words 0 and 2 are set
to zeroes by the vsum2sws.
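
I.e., roughly, in BE word numbering, with s0 and s1 the two sums:

  vsum2sws result      = {  0, s0,  0, s1 }
  rotate left 4 bytes  = { s0,  0, s1,  0 }
  shift left 4 bytes   = { s0,  0, s1,  0 }

The zero word the rotate brings around is the same as the zeroes a
shift would bring in.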

Not sure why it is not optimised; what do the dump files say?  Use -dap,
and I'd start looking at the combine dump.
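
(With a reduced testcase, say test.c, something like

  gcc -O2 -S -dap test.c

and then looking at the resulting test.c.*r.combine dump.)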

> Using `__builtin_altivec_vsum2sws` retains both shifts, so is not an
> option for removing the shifts.
> 
> For little-endian, use the `vsum2sws` instruction directly, and
> eliminate the explicit shift (swap).
> 
> 2021-10-22  Paul A. Clarke  
> 
> gcc
>   * config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd
>   when _ARCH_PWR9, optimize vec_sum2s when LE.

Please don't break changelog lines early.

> -  vmin = vec_min (a, b);
> -  vmax = vec_max (a, b);
> +#ifndef _ARCH_PWR9
> +  __v16qu vmin = vec_min (a, b);
> +  __v16qu vmax = vec_max (a, b);
>   vabsdiff = vec_sub (vmax, vmin);
> +#else
> +  vabsdiff = vec_absd (a, b);
> +#endif

So hrm, maybe we should have the vec_absd macro (or the builtin) always,
just expanding to three insns if necessary.
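
Something like this (untested):

  /* Pre-Power9 fallback: for unsigned elements,
     |a - b| == max (a, b) - min (a, b), i.e. three instructions.  */
  #ifndef _ARCH_PWR9
  #define vec_absd(a, b) vec_sub (vec_max ((a), (b)), vec_min ((a), (b)))
  #endif

Then callers like _mm_sad_epu8 would not need the #ifdef at all.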

Okay for trunk with appropriate changelog and commit message changes.
Thanks!


Segher