On Sun, Sep 22, 2013 at 2:29 AM, Uros Bizjak <ubiz...@gmail.com> wrote: > On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor <igor.zamya...@intel.com> > wrote: >> Ccing Uros. Changes in i386.md could be related to the fix for PR57954. > >> -----Original Message----- >> From: Wei Mi [mailto:w...@google.com] >> Sent: Thursday, September 12, 2013 2:51 AM >> To: GCC Patches >> Cc: David Li; Zamyatin, Igor >> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL >> >> For the following testcase 1.c, on westmere and sandybridge, performance >> with the option -mtune=^use_vector_fp_converts is better (improves from >> 3.46s to 2.83s). It means cvtss2sd is often better than >> unpcklps+cvtps2pd on recent x86 platforms. >> >> 1.c: >> float total = 0.2; >> int k = 5; >> >> int main() { >> int i; >> >> for (i = 0; i < 1000000000; i++) { >> total += (0.5 + k); >> } >> >> return total == 0.3; >> } >> >> assembly generated by gcc-r201963 without -mtune=^use_vector_fp_converts >> .L2: >> unpcklps %xmm0, %xmm0 >> subl $1, %eax >> cvtps2pd %xmm0, %xmm0 >> addsd %xmm1, %xmm0 >> unpcklpd %xmm0, %xmm0 >> cvtpd2ps %xmm0, %xmm0 >> jne .L2 >> >> assembly generated by gcc-r201963 with -mtune=^use_vector_fp_converts >> .L2: >> cvtss2sd %xmm0, %xmm0 >> subl $1, %eax >> addsd %xmm1, %xmm0 >> cvtsd2ss %xmm0, %xmm0 >> jne .L2 >> >> But for testcase 2.c (Thanks to Igor Zamyatin for the testcase), performance >> with the option -mtune=^use_vector_fp_converts is worse. >> Analysis to the assembly shows the performance degradation comes from >> partial reg stall caused by cvtsd2ss. Adding pxor %xmm0, %xmm0 before >> cvtsd2ss b(,%rdx,8), %xmm0 gets the performance back. >> >> 2.c: >> double b[1024]; >> >> float a[1024]; >> >> int main() >> { >> int i; >> for(i = 0 ; i < 1024 * 1024 * 256; i++) >> a[i & 1023] = a[i & 1023] * (float)b[i & 1023]; >> return (int)a[512]; >> } >> >> without -mtune-crtl=^use_vector_fp_converts >> .L2: >> movl %eax, %edx >> addl $1, %eax >> andl $1023, %edx >> cmpl $268435456, %eax >> movsd b(,%rdx,8), %xmm0 >> cvtpd2ps %xmm0, %xmm0 ==> without partial reg stall >> because of movsd. >> mulss a(,%rdx,4), %xmm0 >> movss %xmm0, a(,%rdx,4) >> jne .L2 >> >> with -mtune-crtl=^use_vector_fp_converts >> .L2: >> movl %eax, %edx >> addl $1, %eax >> andl $1023, %edx >> cmpl $268435456, %eax >> cvtsd2ss b(,%rdx,8), %xmm0 ==> with partial reg >> stall. Needs to insert "pxor %xmm0, %xmm0" before current insn. >> mulss a(,%rdx,4), %xmm0 >> movss %xmm0, a(,%rdx,4) >> jne .L2 >> >> So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to use >> cvtss2sd/cvtsd2ss directly, and add "pxor %xmmreg %xmmreg" before >> cvtss2sd/cvtsd2ss to break partial reg stall (similar as what r201308 does >> for cvtsi2ss/cvtsi2sd). bootstrap and regression pass. ok for trunk? >> >> Thanks, >> Wei Mi. >> >> 2013-09-11 Wei Mi <w...@google.com> >> >> * config/i386/x86-tune.def (DEF_TUNE): Remove >> m_CORE_ALL. >> * config/i386/i386.md: Add define_peephole2 to >> break partial reg stall for cvtss2sd/cvtsd2ss. > > You don't need reload_completed in peephole2 patterns. > > Otherwise the patch is OK. > > Thanks, > Uros.
Hi Wei Mi, Have you checked in your patch? -- H.J.