On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich....@gmail.com> wrote:
> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>> Hi Uros,
>>>
>>> Here is an initial patch to improve the performance of 64-bit integer
>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>> behind icc and clang on the rsa benchmark from the eembc2.0 suite.
>>> The problem function looks like:
>>>
>>> typedef unsigned long long ull;
>>> typedef unsigned long ul;
>>>
>>> ul mul_add (ul *rp, ul *ap, int num, ul w)
>>> {
>>>   ul c1 = 0;
>>>   ull t;
>>>   for (;;)
>>>     {
>>>       { t = (ull) w * ap[0] + rp[0] + c1;
>>>         rp[0] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[1] + rp[1] + c1;
>>>         rp[1] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[2] + rp[2] + c1;
>>>         rp[2] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[3] + rp[3] + c1;
>>>         rp[3] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       ap += 4;
>>>       rp += 4;
>>>     }
>>>   return c1;
>>> }
>>>
>>> If we apply the patch below, we get a +6% speed-up for rsa on Silvermont.
>>>
>>> The patch looks like this (not complete, since there are other 64-bit
>>> instructions, e.g. subtraction):
>>>
>>> Index: i386.md
>>> ===================================================================
>>> --- i386.md (revision 236181)
>>> +++ i386.md (working copy)
>>> @@ -5439,7 +5439,7 @@
>>>     (clobber (reg:CC FLAGS_REG))]
>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>    "#"
>>> -  "reload_completed"
>>> +  "1"
>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>                (compare:CCC
>>>                  (plus:DWIH (match_dup 1) (match_dup 2))
>>>
>>> What is your opinion?
>>
>> This splitter doesn't depend on hard registers, so there is no
>> technical obstacle for the proposed patch.
>> OTOH, this is a very old
>> splitter; it is possible that it was introduced to handle some
>> reload deficiencies. Maybe Jeff knows something about this approach.
>> We have LRA now, so perhaps we have to rethink the purpose of these
>> DImode splitters.
>
> The change doesn't spoil the splitter for the hard register case, so the
> splitter should still be able to handle any reload deficiencies. I think
> we should try to split all instructions working on multiword registers
> (not only the PLUS case) at earlier passes, to allow more optimizations
> on the split code and to relax register allocation (currently we need to
> allocate consecutive registers). Probably make a separate split right
> after STV? This should help with PR70321.
There are already pass_lower_subreg{,2}; I'm not sure whether x86 uses
them for splitting DImode ops, though.

Richard.

> Thanks,
> Ilya
>
>>
>> A pragmatic approach would be: if the patch shows a measurable benefit
>> and doesn't introduce regressions, then Stage 1 is the time to try it.
>>
>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>> pattern to copy the enable condition from the insn part. If there is
>> no condition, you should just use "".
>>
>> Uros.