Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
On Fri, Jan 7, 2022 at 3:57 PM Paul A. Clarke wrote:
>
> On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> > +#ifdef __LITTLE_ENDIAN__
> > +  /* Sum across four integers with two integer results.  */
> > +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> > +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> > +     shifts are added that are not needed for this use-case.
> > +     A vector shift to correctly position the 32-bit integer results
> > +     (currently at [0] and [2]) to [1] and [3] would then need to be
> > +     swapped back again since the desired results are two 64-bit
> > +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> > +#else
> >    /* Sum across four integers with two integer results.  */
> >    result = vec_sum2s (vsum, (__vector signed int) zero);
> >
> > If little-endian adds shifts to correct for the position and
> > big-endian does not, why not use the inline asm without the shifts for
> > both?  It seems confusing to add the inline asm only for LE instead of
> > always using it with the appropriate comment.
> >
> > It's a good and valuable optimization for LE.  Fewer variants are less
> > fragile, easier to test and easier to maintain.  If you're going to go
> > to the trouble of using inline asm for LE, use it for both.
>
> BE (only) _does_ need a shift, as seen on the next two lines after the
> code snippet above:
>   /* Sum across four integers with two integer results.  */
>   result = vec_sum2s (vsum, (__vector signed int) zero);
>   /* Rotate the sums into the correct position.  */
>   result = vec_sld (result, result, 6);
>
> So, when using {vec_sum2s;vec_sld}:
> - LE gets an implicit shift in vec_sum2s which just needs to be undone
>   by the vec_sld, and those shifts don't "cancel out" and get removed
>   by GCC.
> - BE does not get any implicit shifts, but needs one that comes from
>   vec_sld.
>
> Are you saying use the asm(vsum2sws) and then conditionally call
> vec_sld on BE only?
>
> I viewed this change as a temporary bandage unless and until GCC can
> remove the unnecessary swaps.  It seems like the preferred code is
> vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

Nevermind.  I thought that these patches had not been reviewed.

Thanks, David
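For readers joining the thread here: `_mm_sad_epu8` takes two vectors of sixteen unsigned bytes and, for each 8-byte half, sums the absolute differences of the corresponding bytes; each sum lands in the low bits of the corresponding 64-bit result element. A minimal portable scalar model of that behavior (the function name is hypothetical, not part of the patch):

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm_sad_epu8: for each 8-byte half of
   the two 16-byte inputs, sum the absolute differences of the unsigned
   bytes.  Each 16-bit sum (at most 8 * 255) occupies the low bits of
   the corresponding 64-bit result element.  */
void
sad_epu8_ref (const uint8_t a[16], const uint8_t b[16], uint64_t result[2])
{
  for (int half = 0; half < 2; half++)
    {
      uint32_t sum = 0;
      for (int i = 0; i < 8; i++)
        {
          int d = a[half * 8 + i] - b[half * 8 + i];
          sum += d < 0 ? -d : d;
        }
      result[half] = sum;
    }
}
```

The vector implementation under review computes the per-byte absolute differences, then reduces them with sums across elements, which is where the `vsum2sws` element-positioning question arises.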
Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> +#ifdef __LITTLE_ENDIAN__
> +  /* Sum across four integers with two integer results.  */
> +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> +     shifts are added that are not needed for this use-case.
> +     A vector shift to correctly position the 32-bit integer results
> +     (currently at [0] and [2]) to [1] and [3] would then need to be
> +     swapped back again since the desired results are two 64-bit
> +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> +#else
>    /* Sum across four integers with two integer results.  */
>    result = vec_sum2s (vsum, (__vector signed int) zero);
>
> If little-endian adds shifts to correct for the position and
> big-endian does not, why not use the inline asm without the shifts for
> both?  It seems confusing to add the inline asm only for LE instead of
> always using it with the appropriate comment.
>
> It's a good and valuable optimization for LE.  Fewer variants are less
> fragile, easier to test and easier to maintain.  If you're going to go
> to the trouble of using inline asm for LE, use it for both.

BE (only) _does_ need a shift, as seen on the next two lines after the
code snippet above:
  /* Sum across four integers with two integer results.  */
  result = vec_sum2s (vsum, (__vector signed int) zero);
  /* Rotate the sums into the correct position.  */
  result = vec_sld (result, result, 6);

So, when using {vec_sum2s;vec_sld}:
- LE gets an implicit shift in vec_sum2s which just needs to be undone
  by the vec_sld, and those shifts don't "cancel out" and get removed
  by GCC.
- BE does not get any implicit shifts, but needs one that comes from
  vec_sld.

Are you saying use the asm(vsum2sws) and then conditionally call
vec_sld on BE only?

I viewed this change as a temporary bandage unless and until GCC can
remove the unnecessary swaps.  It seems like the preferred code is
vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

PC
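To make the element-positioning point above concrete, here is a rough scalar model of what `vsum2sws` computes, using big-endian word numbering (word 0 is most significant). The function name is hypothetical, and the model ignores the instruction's signed saturation:

```c
#include <stdint.h>

/* Rough scalar model of vsum2sws (saturation omitted), in big-endian
   word numbering: words 0 and 2 of the result are zeroed, and the two
   sums land in words 1 and 3.  Viewed with little-endian element
   numbering (LE index = 3 - BE index), those same sums sit at word
   indices 2 and 0 -- the low word of each 64-bit doubleword -- which
   is exactly where _mm_sad_epu8 wants its two results, hence no
   fix-up shift on LE.  */
void
vsum2sws_model (const int32_t a[4], const int32_t b[4], int32_t r[4])
{
  r[0] = 0;
  r[1] = a[0] + a[1] + b[1];
  r[2] = 0;
  r[3] = a[2] + a[3] + b[3];
}
```

On big-endian, by contrast, words 1 and 3 are the low words of nothing useful for this layout, so the `vec_sld` rotate in the generic path is genuinely needed there.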
Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);

If little-endian adds shifts to correct for the position and
big-endian does not, why not use the inline asm without the shifts for
both?  It seems confusing to add the inline asm only for LE instead of
always using it with the appropriate comment.

It's a good and valuable optimization for LE.  Fewer variants are less
fragile, easier to test and easier to maintain.  If you're going to go
to the trouble of using inline asm for LE, use it for both.

Thanks, David
Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
Hi!

On Fri, Oct 22, 2021 at 12:28:49PM -0500, Paul A. Clarke wrote:
> Power9 ISA added the `vabsdub` instruction, which is realized in the
> `vec_absd` intrinsic.
>
> Use `vec_absd` for the `_mm_sad_epu8` compatibility intrinsic, when
> `_ARCH_PWR9`.
>
> Also, the realization of `vec_sum2s` on little-endian includes
> two shifts in order to position the input and output to match
> the semantics of `vec_sum2s`:
> - Shift the second input vector left 12 bytes.  In the current usage,
>   that vector is `{0}`, so this shift is unnecessary, but is currently
>   not eliminated under optimization.

The vsum2sws implementation uses an unspec, so there is almost no
chance of anything with it being optimised :-(

It rotates it right by 4 bytes btw, it's not a shift.

> - Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
>   The two words within each doubleword of this (shifted) result must
>   then be explicitly swapped to match the semantics of `_mm_sad_epu8`,
>   effectively reversing this shift.  So, this shift (and a subsequent
>   swap) are unnecessary, but not currently removed under optimization.

Rotate left by 4 -- same thing, once you consider that words 0 and 2
are set to zeroes by the vsum2sws.

Not sure why it is not optimised; what do the dump files say?  -dap,
and I'd start looking at the combine dump.

> Using `__builtin_altivec_vsum2sws` retains both shifts, so is not an
> option for removing the shifts.
>
> For little-endian, use the `vsum2sws` instruction directly, and
> eliminate the explicit shift (swap).
>
> 2021-10-22  Paul A. Clarke
>
> gcc
> 	* config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd
> 	when _ARCH_PWR9, optimize vec_sum2s when LE.

Please don't break changelog lines early.

> -  vmin = vec_min (a, b);
> -  vmax = vec_max (a, b);
> +#ifndef _ARCH_PWR9
> +  __v16qu vmin = vec_min (a, b);
> +  __v16qu vmax = vec_max (a, b);
>    vabsdiff = vec_sub (vmax, vmin);
> +#else
> +  vabsdiff = vec_absd (a, b);
> +#endif

So hrm, maybe we should have the vec_absd macro (or the builtin)
always, just expanding to three insns if necessary.

Okay for trunk with appropriate changelog and commit message changes.
Thanks!


Segher
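The two formulations in the hunk above compute the same thing for unsigned bytes: `vec_absd (a, b)` on Power9 versus `vec_sub (vec_max (a, b), vec_min (a, b))` on older ISAs. A per-lane scalar sketch of the equivalence (function names hypothetical):

```c
#include <stdint.h>

/* One lane of vabsdub: absolute difference of two unsigned bytes.  */
uint8_t
absd_u8 (uint8_t a, uint8_t b)
{
  return a > b ? a - b : b - a;
}

/* The pre-Power9 fallback: max, min, then an unsigned subtract that
   can never wrap because the operands are ordered.  */
uint8_t
absd_u8_minmax (uint8_t a, uint8_t b)
{
  uint8_t hi = a > b ? a : b;
  uint8_t lo = a > b ? b : a;
  return hi - lo;
}
```

A naive `a - b` would wrap modulo 256 whenever `b > a`; ordering the operands through max/min is what makes the three-instruction fallback correct, which is why it is a plausible expansion for an always-available `vec_absd` macro.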