On Fri, Jan 7, 2022 at 3:57 PM Paul A. Clarke <p...@us.ibm.com> wrote:
>
> On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> > +#ifdef __LITTLE_ENDIAN__
> > +  /* Sum across four integers with two integer results.  */
> > +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> > +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> > +     shifts are added that are not needed for this use-case.
> > +     A vector shift to correctly position the 32-bit integer results
> > +     (currently at [0] and [2]) to [1] and [3] would then need to be
> > +     swapped back again since the desired results are two 64-bit
> > +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> > +#else
> >    /* Sum across four integers with two integer results.  */
> >    result = vec_sum2s (vsum, (__vector signed int) zero);
> >
> > If little-endian adds shifts to correct for the position and
> > big-endian does not, why not use the inline asm without the shifts for
> > both?  It seems confusing to add the inline asm only for LE instead of
> > always using it with the appropriate comment.
> >
> > It's a good and valuable optimization for LE.  Fewer variants are less
> > fragile, easier to test and easier to maintain.  If you're going to go
> > to the trouble of using inline asm for LE, use it for both.
>
> BE (only) _does_ need a shift as seen on the next two lines after the
> code snippet above:
>     /* Sum across four integers with two integer results.  */
>     result = vec_sum2s (vsum, (__vector signed int) zero);
>     /* Rotate the sums into the correct position.  */
>     result = vec_sld (result, result, 6);
>
> So, when using {vec_sum2s;vec_sld}:
> - LE gets an implicit shift in vec_sum2s which just needs to be undone
>   by the vec_sld, and those shifts don't "cancel out" and get removed
>   by GCC.
> - BE does not get any implicit shifts, but needs one that comes from
>   vec_sld.
>
> Are you saying use the asm(vsum2sws) and then conditionally call
> vec_sld on BE only?
>
> I viewed this change as a temporary bandage unless and until GCC can
> remove the unnecessary swaps.  It seems like the preferred code is
> vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.
Nevermind.  I thought that these patches had not been reviewed.

Thanks, David
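For reference, the variant Paul asks about ("use the asm(vsum2sws) and then conditionally call vec_sld on BE only") would look roughly like the sketch below. The wrapper function sum2_across and the local zero are illustrative assumptions pieced together from the snippets quoted above, not the actual emmintrin.h code; only the asm statement and the BE-only vec_sld come from the thread.

    #include <altivec.h>

    static __vector signed int
    sum2_across (__vector signed int vsum)
    {
      const __vector signed int zero = vec_splats (0);
      __vector signed int result;

      /* Sum across four integers with two integer results.  The inline
         asm avoids the endian-correcting shifts that GCC wraps around
         vec_sum2s on little-endian.  */
      __asm__ ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
    #ifndef __LITTLE_ENDIAN__
      /* Big-endian still needs the rotate from the original code to put
         the two sums into the expected positions.  */
      result = vec_sld (result, result, 6);
    #endif
      return result;
    }

Compiling this for both powerpc64 and powerpc64le (e.g. with -O2 -maltivec) and inspecting the generated assembly would show whether the LE path really comes down to the single vsum2sws, which is the point of the optimization under discussion.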