On Tue, Aug 9, 2011 at 8:07 PM, Michael Meissner
<meiss...@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 in power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3 function in sphinx3 vector.c:
>
>   {
>     int32 i, r;
>     float64 f;
>     int32 end, veclen;
>     float32 *m1, *m2, *v1, *v2;
>     float64 dval1, dval2, diff1, diff2;
>
>     /* ... */
>
>     for (i = 0; i < veclen; i++) {
>       diff1 = x[i] - m1[i];
>       dval1 -= diff1 * diff1 * v1[i];
>       diff2 = x[i] - m2[i];
>       dval2 -= diff2 * diff2 * v2[i];
>     }
>
>     /* ... */
>
>   }
>
> In particular, the compiler, from 4.6 onward, vectorizes this inner loop.
> Because it doesn't know the alignment of the float pointers, it generates
> code to use unaligned vector loads unconditionally, which on the powerpc
> involves using a load of an aligned pointer, and then doing a vperm
> instruction to permute the bytes.  Since the code first does the calculation
> in 32-bit floating point and then converts it to 64-bit floating point, the
> compiler does a vector convert of V4SF to V2DF in the loop.  On the powerpc,
> this involves two more permutes, and then the vector conversion.  Thus in the
> inner loop, there are:
>
>    4 vector loads
>    4 vector permutes to do the unaligned load
>    8 vector permutes to get things in the right registers for conversion
>    4 vector conversions
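To make the float-to-double widening in the quoted loop easier to see, here is a minimal stand-alone sketch of it. The function name `gau_pair_eval`, its parameter list, and the two typedefs are assumptions for illustration; they are not taken from sphinx3 or from the patch.

```c
/* Assumed stand-ins for the sphinx3 typedefs. */
typedef float float32;
typedef double float64;

/* Hypothetical stand-alone version of the quoted inner loop.
   Inputs are single precision; the accumulators are double precision. */
static void gau_pair_eval(const float32 *x,
                          const float32 *m1, const float32 *v1,
                          const float32 *m2, const float32 *v2,
                          int veclen, float64 *dval1, float64 *dval2)
{
    float64 d1 = *dval1;
    float64 d2 = *dval2;

    for (int i = 0; i < veclen; i++) {
        /* The subtraction has float operands; its result is widened to
           double on assignment (assuming FLT_EVAL_METHOD is 0). */
        float64 diff1 = x[i] - m1[i];
        /* v1[i] is widened to double for the multiply. */
        d1 -= diff1 * diff1 * v1[i];

        float64 diff2 = x[i] - m2[i];
        d2 -= diff2 * diff2 * v2[i];
    }

    *dval1 = d1;
    *dval2 = d2;
}
```

Each iteration therefore needs four float-to-double conversions (two differences and two variance terms), which is where the V4SF to V2DF converts come from once the loop is vectorized.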
Are the arrays all well-aligned in practice?  If so, would versioning the loop
for all-good-alignment help?

If we have 4 permutes and then 8 further ones, can we combine, for example, an
unaligned-load permute and the following permute for the sf->df conversion?

Does ppc have a VSX-tuned cost model, and is it applied correctly in this
case?  Maybe we need more fine-grained costs?

Thanks,
Richard.

> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.  Overall this is a win:
>
>    GCC 4.6, 32-bit:
>        12% improvement, 464.h264ref
>         5% improvement, 450.soplex
>         3% regression,  465.tonto
>         2% improvement, 481.wrf
>         9% improvement, 482.sphinx3
>
>    GCC 4.6, 64-bit:
>         5% improvement, 456.hmmer
>         6% improvement, 464.h264ref
>        14% improvement, 482.sphinx3
>
>    GCC 4.7, 32-bit:
>         2% improvement, 437.leslie3d
>         9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
> desired.
>
> While I don't think this is the only solution to 50031, it at least helps us.
> It is encouraging that GCC 4.7 doesn't have the regression in tonto.
>
> I have bootstrapped and run make check on both 4.6 and 4.7 compilers with no
> regressions.  Is it ok to install in the 4.7 tree?  At present, I have made
> the default to generate the vectorized conversion, but it may make sense to
> flip the default.  Is this patch ok to apply?  Given it affects 4.6, did you
> want to see it in 4.6 as well?
>
> [gcc]
> 2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>
>
>        PR tree-optimization/50031
>        * doc/invoke.texi (RS/6000 and PowerPC Options): Add
>        -mnvsx-vector-32bit-to-64bit switch.
>
>        * config/rs6000/rs6000.md (vec_unpacks_lo_v4sf): Add conditions on
>        -mvector-convert-32bit-to-64bit switch.
>        (vec_unpacks_float_hi_v4s): Ditto.
>        (vec_unpacks_float_lo_v4s): Ditto.
>        (vec_unpacku_float_hi_v4s): Ditto.
>        (vec_unpacku_float_lo_v4s): Ditto.
>
>        * config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
>        switch to control whether the compiler does 32->64 bit conversions.
>
> [gcc/testsuite]
> 2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>
>
>        PR tree-optimization/50031
>        * gcc.target/powerpc/vsx-vector-7.c: New test for
>        -mvector-convert-32bit-to-64bit.
>        * gcc.target/powerpc/vsx-vector-8.c: Ditto.
>
> --
> Michael Meissner, IBM
> 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
> meiss...@linux.vnet.ibm.com     fax +1 (978) 399-6899
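
On the loop-versioning question raised in the reply: the idea can be sketched by hand at the source level, as below. This is only an illustration of what "versioning for alignment" means, not what the vectorizer emits and not part of the patch. The names `is_aligned16` and `weighted_sqdist` are hypothetical, and `__builtin_assume_aligned` is a GCC extension (also accepted by Clang), so the sketch assumes one of those compilers.

```c
#include <stdint.h>

/* Nonzero when p sits on a 16-byte boundary (one 128-bit vector). */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical kernel: sum of (x[i] - m[i])^2 * v[i], accumulated in double. */
static double weighted_sqdist(const float *x, const float *m,
                              const float *v, int n)
{
    double acc = 0.0;

    if (is_aligned16(x) && is_aligned16(m) && is_aligned16(v)) {
        /* Version 1: every pointer is aligned, so tell the compiler.
           It is then free to use aligned vector loads with no permutes. */
        const float *xa = __builtin_assume_aligned(x, 16);
        const float *ma = __builtin_assume_aligned(m, 16);
        const float *va = __builtin_assume_aligned(v, 16);
        for (int i = 0; i < n; i++) {
            double d = xa[i] - ma[i];
            acc += d * d * va[i];
        }
    } else {
        /* Version 2: alignment unknown; same arithmetic, generic loads. */
        for (int i = 0; i < n; i++) {
            double d = x[i] - m[i];
            acc += d * d * v[i];
        }
    }
    return acc;
}
```

Both branches compute the same value; the runtime check only selects which copy of the loop runs. Whether the aligned copy is actually faster on power7 would have to be measured, and it does nothing about the permutes needed for the sf->df conversion itself.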