On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich....@gmail.com> wrote:
> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>> Hi Uros,
>>>
>>> Here is an initial patch to improve the performance of 64-bit integer
>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>> behind icc and clang on the rsa benchmark from the eembc2.0 suite.
>>> The problem function looks like:
>>>
>>> typedef unsigned long long ull;
>>> typedef unsigned long ul;
>>>
>>> ul mul_add (ul *rp, ul *ap, int num, ul w)
>>> {
>>>   ul c1 = 0;
>>>   ull t;
>>>   for (;;)
>>>     {
>>>       { t = (ull) w * ap[0] + rp[0] + c1;
>>>         rp[0] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[1] + rp[1] + c1;
>>>         rp[1] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[2] + rp[2] + c1;
>>>         rp[2] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       { t = (ull) w * ap[3] + rp[3] + c1;
>>>         rp[3] = ((ul) t) & 0xffffffffL; c1 = ((ul) (t >> 32)) & 0xffffffffL; };
>>>       if (--num == 0) break;
>>>       ap += 4;
>>>       rp += 4;
>>>     }
>>>   return c1;
>>> }
>>>
>>> If we apply the patch below, we get a +6% speed-up for rsa on Silvermont.
>>>
>>> The patch looks like this (not complete, since there are other 64-bit
>>> instructions, e.g. subtraction):
>>>
>>> Index: i386.md
>>> ===================================================================
>>> --- i386.md (revision 236181)
>>> +++ i386.md (working copy)
>>> @@ -5439,7 +5439,7 @@
>>>     (clobber (reg:CC FLAGS_REG))]
>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>    "#"
>>> -  "reload_completed"
>>> +  "1"
>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>                (compare:CCC
>>>                  (plus:DWIH (match_dup 1) (match_dup 2))
>>>
>>> What is your opinion?
>>
>> This splitter doesn't depend on hard registers, so there is no
>> technical obstacle for the proposed patch.
>> OTOH, this is a very old
>> splitter; it is possible that it was introduced to handle some
>> reload deficiencies. Maybe Jeff knows something about this approach.
>> We have LRA now, so perhaps we have to rethink the purpose of these
>> DImode splitters.
>
> The change doesn't spoil the splitter for the hard register case, so the
> splitter should still be able to handle any reload deficiencies. I think
> we should try to split all instructions working on multiword registers
> (not only the PLUS case) at earlier passes, to allow more optimizations
> on the split code and to relax register allocation (currently we need to
> allocate consecutive registers). Probably make a separate split right
> after STV? This should help with PR70321.
There are already pass_lower_subreg{,2}; I'm not sure whether x86 uses
them for splitting DImode ops, though.

Richard.

> Thanks,
> Ilya
>
>>
>> A pragmatic approach would be: if the patch shows a measurable benefit
>> and doesn't introduce regressions, then Stage 1 is the time to try it.
>>
>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>> pattern to copy the enable condition from the insn part. If there is
>> no condition, you should just use "".
>>
>> Uros.