[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.5                       |4.9.3

--- Comment #17 from Richard Biener ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #16 from Jakub Jelinek ---
The #c10 issue went away with r204212 I believe.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[4.8/4.9/5 Regression]      |[4.8/4.9 Regression]
                   |[LRA,x86] Non-optimal code  |[LRA,x86] Non-optimal code
                   |for simple loop with LRA    |for simple loop with LRA

--- Comment #15 from Jeffrey A. Law ---
I've examined the various testcases and the complaints about the poor register
allocation in this BZ with a trunk compiler. I'm happy to report that I'm
seeing none of the issues raised in this BZ.

For c#0 (store-back part of the loop):

.L5:
        movl    %edi, %ecx
        addl    $4, %esi
        subl    %ecx, %eax
        subl    %ecx, %edx
        movzbl  3(%esp), %ecx
        movb    %al, -3(%esi)
        movl    %edi, %eax
        movb    %dl, -4(%esi)
        subl    %eax, %ecx
        movb    %cl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1

In c#2, the negation sequence is pointed out. We now get:

.L9:
        movzbl  (%ebx), %edx
        movzbl  1(%ebx), %eax
        addl    $3, %ebx
        movzbl  -1(%ebx), %ecx
        notl    %edx
        notl    %eax
        notl    %ecx
        cmpb    %al, %dl
        movb    %cl, 3(%esp)
        jb      .L13
        cmpb    3(%esp), %al
        movzbl  %al, %edi
        jbe     .L5
        movzbl  3(%esp), %edi
        jmp     .L5

For the 1st modified testcase -O2 -mcpu=atom -m32:

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebx, %ebp
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Then in c#10 (t1 testcase):

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Across the board we're not seeing objects spilled into the stack. The code
looks quite tight to me.

Clearing the regression marker for GCC 5. I didn't do any bisection work to
identify what changes fixed things.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #11 from Yuri Rumyantsev ---
Created attachment 30816
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30816&action=edit
test-case to reproduce

t1.c must be compiled on x86 with options:
-O2 -march=atom -mtune=atom -mfpmath=sse -m32
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #10 from Yuri Rumyantsev ---
After the fix in r202468 the assembly looks slightly better, but we ran into
another RA inefficiency. It can be illustrated with the attached test (t1.c)
compiled with the options "-march=atom -mtune=atom -m32 -O2": the upper bound
of the loop check is kept in a register, but the base register for "write" is
on the stack:

.L8:
        movzbl  3(%esp), %edx
        movl    %esi, %ecx
        cmpb    %cl, %dl
        movl    %esi, %edi
        cmovbe  %edx, %edi
.L4:
        movl    %esi, %edx
        movl    28(%esp), %esi      <-- why is "write" on the stack?
        movl    %edi, %ecx
        addl    $4, 28(%esp)        <-- "write" is incremented on the stack
        subl    %ecx, %edx
        subl    %ecx, %ebx
        movzbl  3(%esp), %ecx
        movb    %dl, (%esi)
        movl    %edi, %edx
        subl    %edx, %ecx
        movb    %bl, 1(%esi)
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax          <-- why is the upper bound in a register?
        movb    %dl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        notl    %ecx
        movb    %cl, 3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Is something wrong in the Atom cost model? In any case, I assume that keeping
the upper bound on the stack is much cheaper than loading and incrementing the
base from the stack.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #9 from Yuri Rumyantsev ---
The issue still exists in the 4.9 compiler, and we got another 30% degradation
after the r202165 fix. It can be reproduced with any 4.9 compiler using the
attached modified test-case; the code produced for the inner loop looks like:

.L8:
        movl    %esi, %ecx
        movl    %esi, %edi
        movzbl  3(%esp), %edx
        cmpb    %cl, %dl
        movl    %edx, %ecx
        cmovbe  %ecx, %edi
.L4:
        movl    %esi, %edx
        movl    %edi, %ecx
        subl    %ecx, %edx
        movl    28(%esp), %ecx
        movl    28(%esp), %esi
        addl    $4, 28(%esp)
        movb    %dl, (%ecx)
        movl    %edi, %ecx
        subl    %ecx, %ebx
        movl    %edi, %edx
        movzbl  3(%esp), %ecx
        movb    %bl, 1(%esi)
        subl    %edx, %ecx
        movl    %edi, %ebx
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax
        movb    %bl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        movb    %cl, 3(%esp)
        notb    3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Here you can see that (1) there are 2 additional moves at the top of the
blocks marked .L4 and .L8; (2) there are redundant spills/fills of the 'write'
base (28(%esp)) in the block marked .L4.

To reproduce, it is sufficient to compile the modified test-case with the
'-m32 -march=atom' options.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #8 from Yuri Rumyantsev ---
Created attachment 30751
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30751&action=edit
modified test-case

Modified test-case to reproduce sub-optimal register allocation.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Vladimir Makarov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #7 from Vladimir Makarov ---
(In reply to Yuri Rumyantsev from comment #2)
> The patching compiler produces better binaries but we still have -6%
> performance degradation on corei7. The main cause of it is that LRA compiler
> generates spill of 'pure' byte 'g' whereas old compiler generates spill for
> 'm' that is negation of 'g':
>
> gcc without LRA (assembly part, the head of loop)
>
> .L7:
>         movzbl  1(%edi), %edx
>         leal    3(%edi), %ebp
>         movzbl  (%edi), %ebx
>         movl    %ebp, %edi
>         notl    %edx            // perform negation on register
>         movb    %dl, 3(%esp)
>
> gcc with LRA
>
> .L7:
>         movzbl  (%edi), %ebx
>         leal    3(%edi), %ebp
>         movzbl  1(%edi), %ecx
>         movl    %ebp, %edi
>         movzbl  -1(%ebp), %edx
>         notl    %ebx
>         notl    %ecx
>         movb    %dl, (%esp)
>         cmpb    %cl, %bl
>         notb    (%esp)          // perform negation in memory
>
> i.e. we have redundant load and store from/to stack.
>
> I assume that this should be fixed also.

Fixing the problem with notl requires implementing new functionality in LRA:
creating reloads that stay if the reload pseudo got a hard register and was
inherited (in this case it is profitable). Otherwise the current code should
be generated (the reloads and reload pseudos should be removed, and the old
code should be restored).

I've started work on this, but it will not be fixed quickly, as implementing
the new functionality is not a trivial task.
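
For readers without access to the attachments, the loop shape that the listings in this thread imply (three byte loads, a bitwise NOT of each, a three-way minimum, then four byte stores) can be sketched in C roughly as below. This is an assumption reconstructed from the generated assembly, not the attached test case. Only the names 'g', 'm' and 'write' appear in the comments; `convert_pixels`, `min_byte`, `c`, `y`, `k` and the exact control flow are hypothetical.

```c
#include <stddef.h>

typedef unsigned char byte;

/* Hypothetical helper: smaller of two bytes. */
static byte min_byte(byte a, byte b)
{
    return a < b ? a : b;
}

/*
 * Sketch of a loop with the shape seen in the listings (assumed, not the
 * attached test case). Each iteration reads 3 bytes, negates them,
 * subtracts the minimum of the negated values, and writes 4 bytes.
 */
void convert_pixels(const byte *read, byte *write, size_t pixels)
{
    const byte *end = read + 3 * pixels;   /* the loop's upper bound */

    while (read != end) {
        /* ~x truncated to a byte equals 255 - x; this is the notl/notb. */
        byte c = (byte)~read[0];
        byte m = (byte)~read[1];           /* 'm' is the negation of 'g' */
        byte y = (byte)~read[2];
        read += 3;

        byte k = min_byte(c, min_byte(m, y));

        write[0] = (byte)(c - k);
        write[1] = (byte)(m - k);
        write[2] = (byte)(y - k);
        write[3] = k;
        write += 4;                        /* the 'write' base increment */
    }
}
```

The live values in such a loop are the two pointers, the upper bound, the three negated bytes and the minimum. That is seven values competing for the handful of general-purpose registers available with -m32, which is presumably why the choice of what to spill (a negated byte, the 'write' base, or the upper bound) matters so much in the listings above.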
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.1                       |4.8.2

--- Comment #6 from Jakub Jelinek ---
GCC 4.8.1 has been released.
[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.0                       |4.8.1

--- Comment #5 from Jakub Jelinek 2013-03-22 14:44:31 UTC ---
GCC 4.8.0 is being released, adjusting target milestone.