On Sun, Sep 22, 2013 at 2:29 AM, Uros Bizjak <ubiz...@gmail.com> wrote:
> On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor <igor.zamya...@intel.com> 
> wrote:
>> Ccing Uros. Changes in i386.md could be related to the fix for PR57954.
>
>> -----Original Message-----
>> From: Wei Mi [mailto:w...@google.com]
>> Sent: Thursday, September 12, 2013 2:51 AM
>> To: GCC Patches
>> Cc: David Li; Zamyatin, Igor
>> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>>
>> For the following testcase 1.c, on westmere and sandybridge, performance 
>> with the option -mtune=^use_vector_fp_converts is better (improves from 
>> 3.46s to 2.83s). It means cvtss2sd is often better than
>> unpcklps+cvtps2pd on recent x86 platforms.
>>
>> 1.c:
>> float total = 0.2;
>> int k = 5;
>>
>> int main() {
>>  int i;
>>
>>  for (i = 0; i < 1000000000; i++) {
>>    total += (0.5 + k);
>>  }
>>
>>  return total == 0.3;
>> }
>>
>> assembly generated by gcc-r201963 without -mtune=^use_vector_fp_converts
>> .L2:
>>         unpcklps        %xmm0, %xmm0
>>         subl    $1, %eax
>>         cvtps2pd        %xmm0, %xmm0
>>         addsd   %xmm1, %xmm0
>>         unpcklpd        %xmm0, %xmm0
>>         cvtpd2ps        %xmm0, %xmm0
>>         jne     .L2
>>
>> assembly generated by gcc-r201963 with -mtune=^use_vector_fp_converts
>> .L2:
>>         cvtss2sd        %xmm0, %xmm0
>>         subl    $1, %eax
>>         addsd   %xmm1, %xmm0
>>         cvtsd2ss        %xmm0, %xmm0
>>         jne     .L2
>>
>> But for testcase 2.c (Thanks to Igor Zamyatin for the testcase), performance 
>> with the option -mtune=^use_vector_fp_converts is worse.
>> Analysis to the assembly shows the performance degradation comes from 
>> partial reg stall caused by cvtsd2ss. Adding pxor %xmm0, %xmm0 before 
>> cvtsd2ss b(,%rdx,8), %xmm0 gets the performance back.
>>
>> 2.c:
>> double b[1024];
>>
>> float a[1024];
>>
>> int main()
>> {
>>     int i;
>>     for(i = 0 ; i < 1024 * 1024 * 256; i++)
>>       a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>>     return (int)a[512];
>> }
>>
>> without -mtune-crtl=^use_vector_fp_converts
>> .L2:
>>         movl    %eax, %edx
>>         addl    $1, %eax
>>         andl    $1023, %edx
>>         cmpl    $268435456, %eax
>>         movsd   b(,%rdx,8), %xmm0
>>         cvtpd2ps        %xmm0, %xmm0    ==> without partial reg stall
>> because of movsd.
>>         mulss   a(,%rdx,4), %xmm0
>>         movss   %xmm0, a(,%rdx,4)
>>         jne     .L2
>>
>> with -mtune-crtl=^use_vector_fp_converts
>> .L2:
>>         movl    %eax, %edx
>>         addl    $1, %eax
>>         andl    $1023, %edx
>>         cmpl    $268435456, %eax
>>         cvtsd2ss        b(,%rdx,8), %xmm0   ==> with partial reg
>> stall. Needs to insert "pxor %xmm0, %xmm0" before current insn.
>>         mulss   a(,%rdx,4), %xmm0
>>         movss   %xmm0, a(,%rdx,4)
>>         jne     .L2
>>
>> So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to use 
>> cvtss2sd/cvtsd2ss directly,  and add "pxor %xmmreg %xmmreg" before 
>> cvtss2sd/cvtsd2ss to break partial reg stall (similar as what r201308 does 
>> for cvtsi2ss/cvtsi2sd). bootstrap and regression pass. ok for trunk?
>>
>> Thanks,
>> Wei Mi.
>>
>> 2013-09-11  Wei Mi  <w...@google.com>
>>
>>         * config/i386/x86-tune.def (DEF_TUNE): Remove
>>         m_CORE_ALL.
>>         * config/i386/i386.md: Add define_peephole2 to
>>         break partial reg stall for cvtss2sd/cvtsd2ss.
>
> You don't need reload_completed in peephole2 patterns.
>
> Otherwise the patch is OK.
>
> Thanks,
> Uros.

Hi Wei Mi,

Have you checked in your patch?

-- 
H.J.

Reply via email to