On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor <igor.zamya...@intel.com> wrote:
> Ccing Uros. Changes in i386.md could be related to the fix for PR57954.

> -----Original Message-----
> From: Wei Mi [mailto:w...@google.com]
> Sent: Thursday, September 12, 2013 2:51 AM
> To: GCC Patches
> Cc: David Li; Zamyatin, Igor
> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>
> For the following testcase 1.c, on westmere and sandybridge, performance with 
> the option -mtune=^use_vector_fp_converts is better (improves from 3.46s to 
> 2.83s). It means cvtss2sd is often better than
> unpcklps+cvtps2pd on recent x86 platforms.
>
> 1.c:
> float total = 0.2;
> int k = 5;
>
> int main() {
>  int i;
>
>  for (i = 0; i < 1000000000; i++) {
>    total += (0.5 + k);
>  }
>
>  return total == 0.3;
> }
>
> assembly generated by gcc-r201963 without -mtune=^use_vector_fp_converts
> .L2:
>         unpcklps        %xmm0, %xmm0
>         subl    $1, %eax
>         cvtps2pd        %xmm0, %xmm0
>         addsd   %xmm1, %xmm0
>         unpcklpd        %xmm0, %xmm0
>         cvtpd2ps        %xmm0, %xmm0
>         jne     .L2
>
> assembly generated by gcc-r201963 with -mtune=^use_vector_fp_converts
> .L2:
>         cvtss2sd        %xmm0, %xmm0
>         subl    $1, %eax
>         addsd   %xmm1, %xmm0
>         cvtsd2ss        %xmm0, %xmm0
>         jne     .L2
>
> But for testcase 2.c (Thanks to Igor Zamyatin for the testcase), performance 
> with the option -mtune=^use_vector_fp_converts is worse.
> Analysis to the assembly shows the performance degradation comes from partial 
> reg stall caused by cvtsd2ss. Adding pxor %xmm0, %xmm0 before cvtsd2ss 
> b(,%rdx,8), %xmm0 gets the performance back.
>
> 2.c:
> double b[1024];
>
> float a[1024];
>
> int main()
> {
>     int i;
>     for(i = 0 ; i < 1024 * 1024 * 256; i++)
>       a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>     return (int)a[512];
> }
>
> without -mtune-crtl=^use_vector_fp_converts
> .L2:
>         movl    %eax, %edx
>         addl    $1, %eax
>         andl    $1023, %edx
>         cmpl    $268435456, %eax
>         movsd   b(,%rdx,8), %xmm0
>         cvtpd2ps        %xmm0, %xmm0    ==> without partial reg stall
> because of movsd.
>         mulss   a(,%rdx,4), %xmm0
>         movss   %xmm0, a(,%rdx,4)
>         jne     .L2
>
> with -mtune-crtl=^use_vector_fp_converts
> .L2:
>         movl    %eax, %edx
>         addl    $1, %eax
>         andl    $1023, %edx
>         cmpl    $268435456, %eax
>         cvtsd2ss        b(,%rdx,8), %xmm0   ==> with partial reg
> stall. Needs to insert "pxor %xmm0, %xmm0" before current insn.
>         mulss   a(,%rdx,4), %xmm0
>         movss   %xmm0, a(,%rdx,4)
>         jne     .L2
>
> So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to use 
> cvtss2sd/cvtsd2ss directly,  and add "pxor %xmmreg %xmmreg" before 
> cvtss2sd/cvtsd2ss to break partial reg stall (similar as what r201308 does 
> for cvtsi2ss/cvtsi2sd). bootstrap and regression pass. ok for trunk?
>
> Thanks,
> Wei Mi.
>
> 2013-09-11  Wei Mi  <w...@google.com>
>
>         * config/i386/x86-tune.def (DEF_TUNE): Remove
>         m_CORE_ALL.
>         * config/i386/i386.md: Add define_peephole2 to
>         break partial reg stall for cvtss2sd/cvtsd2ss.

You don't need reload_completed in peephole2 patterns.

Otherwise the patch is OK.

Thanks,
Uros.

Reply via email to