On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor <igor.zamya...@intel.com> wrote: > Ccing Uros. Changes in i386.md could be related to the fix for PR57954.
> -----Original Message----- > From: Wei Mi [mailto:w...@google.com] > Sent: Thursday, September 12, 2013 2:51 AM > To: GCC Patches > Cc: David Li; Zamyatin, Igor > Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL > > For the following testcase 1.c, on westmere and sandybridge, performance with > the option -mtune=^use_vector_fp_converts is better (improves from 3.46s to > 2.83s). It means cvtss2sd is often better than > unpcklps+cvtps2pd on recent x86 platforms. > > 1.c: > float total = 0.2; > int k = 5; > > int main() { > int i; > > for (i = 0; i < 1000000000; i++) { > total += (0.5 + k); > } > > return total == 0.3; > } > > assembly generated by gcc-r201963 without -mtune=^use_vector_fp_converts > .L2: > unpcklps %xmm0, %xmm0 > subl $1, %eax > cvtps2pd %xmm0, %xmm0 > addsd %xmm1, %xmm0 > unpcklpd %xmm0, %xmm0 > cvtpd2ps %xmm0, %xmm0 > jne .L2 > > assembly generated by gcc-r201963 with -mtune=^use_vector_fp_converts > .L2: > cvtss2sd %xmm0, %xmm0 > subl $1, %eax > addsd %xmm1, %xmm0 > cvtsd2ss %xmm0, %xmm0 > jne .L2 > > But for testcase 2.c (Thanks to Igor Zamyatin for the testcase), performance > with the option -mtune=^use_vector_fp_converts is worse. > Analysis to the assembly shows the performance degradation comes from partial > reg stall caused by cvtsd2ss. Adding pxor %xmm0, %xmm0 before cvtsd2ss > b(,%rdx,8), %xmm0 gets the performance back. > > 2.c: > double b[1024]; > > float a[1024]; > > int main() > { > int i; > for(i = 0 ; i < 1024 * 1024 * 256; i++) > a[i & 1023] = a[i & 1023] * (float)b[i & 1023]; > return (int)a[512]; > } > > without -mtune-crtl=^use_vector_fp_converts > .L2: > movl %eax, %edx > addl $1, %eax > andl $1023, %edx > cmpl $268435456, %eax > movsd b(,%rdx,8), %xmm0 > cvtpd2ps %xmm0, %xmm0 ==> without partial reg stall > because of movsd. > mulss a(,%rdx,4), %xmm0 > movss %xmm0, a(,%rdx,4) > jne .L2 > > with -mtune-crtl=^use_vector_fp_converts > .L2: > movl %eax, %edx > addl $1, %eax > andl $1023, %edx > cmpl $268435456, %eax > cvtsd2ss b(,%rdx,8), %xmm0 ==> with partial reg > stall. Needs to insert "pxor %xmm0, %xmm0" before current insn. > mulss a(,%rdx,4), %xmm0 > movss %xmm0, a(,%rdx,4) > jne .L2 > > So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to use > cvtss2sd/cvtsd2ss directly, and add "pxor %xmmreg %xmmreg" before > cvtss2sd/cvtsd2ss to break partial reg stall (similar as what r201308 does > for cvtsi2ss/cvtsi2sd). bootstrap and regression pass. ok for trunk? > > Thanks, > Wei Mi. > > 2013-09-11 Wei Mi <w...@google.com> > > * config/i386/x86-tune.def (DEF_TUNE): Remove > m_CORE_ALL. > * config/i386/i386.md: Add define_peephole2 to > break partial reg stall for cvtss2sd/cvtsd2ss. You don't need reload_completed in peephole2 patterns. Otherwise the patch is OK. Thanks, Uros.