Re: [PATCH] Work around PR 50031, sphinx3 slowdown in powerpc on GCC 4.6 and GCC 4.7

Richard Guenther Wed, 10 Aug 2011 01:09:17 -0700

On Tue, Aug 9, 2011 at 8:07 PM, Michael Meissner
<meiss...@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slow down of sphinx3 in power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3  function in sphinx3 vector.c:
>
> {
>  int32 i, r;
>  float64 f;
>  int32 end, veclen;
>  float32 *m1, *m2, *v1, *v2;
>  float64 dval1, dval2, diff1, diff2;
>
>    /* ... */
>
>    for (i = 0; i < veclen; i++) {
>      diff1 = x[i] - m1[i];
>      dval1 -= diff1 * diff1 * v1[i];
>      diff2 = x[i] - m2[i];
>      dval2 -= diff2 * diff2 * v2[i];
>    }
>
>    /* ... */
>
> }
>
> In particular, GCC 4.6 and later vectorize this inner loop.  Because the
> compiler doesn't know the alignment of the float pointers, it generates code
> to use unaligned vector loads unconditionally, which on the powerpc involves
> a load of an aligned pointer followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes,
> and then the vector conversion.  Thus in the inner loop, there are:
>
>    4 vector loads
>    4 vector permutes to do the unaligned load
>    8 vector permutes to get things in the right registers for conversion
>    4 vector conversions


Are the arrays all well-aligned in practice?  Thus, would versioning the loop
for all-good-alignment help?
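
A minimal source-level sketch of what versioning for alignment means, written
by hand rather than taken from the vectorizer's output.  The helper names
(is_aligned16, eval_generic, eval_aligned, eval_versioned), the 16-byte
boundary, and the assumption that x is a float32 pointer like m1 and v1 are
illustrative and do not come from sphinx3 or the patch.  Whether a given
compiler actually uses the alignment hint is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Generic version: same arithmetic as the quoted loop, alignment unknown. */
static void eval_generic(const float *x, const float *m1, const float *m2,
                         const float *v1, const float *v2, int veclen,
                         double *dval1, double *dval2)
{
    double d1 = *dval1, d2 = *dval2;
    for (int i = 0; i < veclen; i++) {
        double diff1 = x[i] - m1[i];
        d1 -= diff1 * diff1 * v1[i];
        double diff2 = x[i] - m2[i];
        d2 -= diff2 * diff2 * v2[i];
    }
    *dval1 = d1;
    *dval2 = d2;
}

/* Aligned version: identical loop, but the pointers carry an alignment hint. */
static void eval_aligned(const float *x, const float *m1, const float *m2,
                         const float *v1, const float *v2, int veclen,
                         double *dval1, double *dval2)
{
    /* Precondition checked by the caller; asserted here for safety. */
    assert(is_aligned16(x) && is_aligned16(m1) && is_aligned16(m2) &&
           is_aligned16(v1) && is_aligned16(v2));
#if defined(__GNUC__)
    /* GCC/Clang builtin; a hint only, it does not change the result. */
    x  = __builtin_assume_aligned(x, 16);
    m1 = __builtin_assume_aligned(m1, 16);
    m2 = __builtin_assume_aligned(m2, 16);
    v1 = __builtin_assume_aligned(v1, 16);
    v2 = __builtin_assume_aligned(v2, 16);
#endif
    double d1 = *dval1, d2 = *dval2;
    for (int i = 0; i < veclen; i++) {
        double diff1 = x[i] - m1[i];
        d1 -= diff1 * diff1 * v1[i];
        double diff2 = x[i] - m2[i];
        d2 -= diff2 * diff2 * v2[i];
    }
    *dval1 = d1;
    *dval2 = d2;
}

/* Versioned entry point: one runtime test selects the loop copy. */
static void eval_versioned(const float *x, const float *m1, const float *m2,
                           const float *v1, const float *v2, int veclen,
                           double *dval1, double *dval2)
{
    if (is_aligned16(x) && is_aligned16(m1) && is_aligned16(m2) &&
        is_aligned16(v1) && is_aligned16(v2))
        eval_aligned(x, m1, m2, v1, v2, veclen, dval1, dval2);
    else
        eval_generic(x, m1, m2, v1, v2, veclen, dval1, dval2);
}
```

Both copies compute the same values; only the aligned copy gives the compiler
a chance to drop the permutes that implement the unaligned loads.  The permutes
needed for the sf->df conversion would remain in either copy.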

If we have 4 permutes and then 8 further ones - can we combine for example
an unaligned load permute and the following permute for the sf->df conversion?

Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
Maybe we need more fine-grained costs?

Thanks,
Richard.

> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.  Overall this is a win:
>
> GCC 4.6, 32-bit:
>    12% improvement, 464.h264ref
>     5% improvement, 450.soplex
>     3% regression,  465.tonto
>     2% improvement, 481.wrf
>     9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>     5% improvement, 456.hmmer
>     6% improvement, 464.h264ref
>    14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>     2% improvement, 437.leslie3d
>     9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
> desired.
>
> While I don't think this is the only solution to 50031, it at least helps us.
> It is encouraging that GCC 4.7 doesn't have the regression in tonto.
>
> I have bootstrapped and run make check on both 4.6 and 4.7 compilers with no
> regressions.  Is it ok to install in the 4.7 tree?  At present, I have made
> the default to generate the vectorized conversion, but it may make sense to
> flip the default.  Is this patch ok to apply?  Given it affects 4.6, did you
> want to see it in 4.6 as well?
>
> [gcc]
> 2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>
>
>        PR tree-optimization/50031
>        * doc/invoke.texi (RS/6000 and PowerPC Options): Add
>        -mvector-convert-32bit-to-64bit switch.
>
>        * config/rs6000/rs6000.md (vec_unpacks_lo_v4sf): Add conditions on
>        -mvector-convert-32bit-to-64bit switch.
>        (vec_unpacks_float_hi_v4s): Ditto.
>        (vec_unpacks_float_lo_v4s): Ditto.
>        (vec_unpacku_float_hi_v4s): Ditto.
>        (vec_unpacku_float_lo_v4s): Ditto.
>
>        * config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
>        switch to control whether the compiler does 32->64 bit conversions.
>
> [gcc/testsuite]
> 2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>
>
>        PR tree-optimization/50031
>        * gcc.target/powerpc/vsx-vector-7.c: New test for
>        -mvector-convert-32bit-to-64bit.
>        * gcc.target/powerpc/vsx-vector-8.c: Ditto.
>
> --
> Michael Meissner, IBM
> 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
> meiss...@linux.vnet.ibm.com     fax +1 (978) 399-6899
>
