[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2015-06-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|4.8.5   |4.9.3

--- Comment #17 from Richard Biener  ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2015-02-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #16 from Jakub Jelinek  ---
The #c10 issue went away with r204212 I believe.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2015-02-11 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jeffrey A. Law  changed:

   What|Removed |Added

Summary|[4.8/4.9/5 Regression]  |[4.8/4.9 Regression]
   |[LRA,x86] Non-optimal code  |[LRA,x86] Non-optimal code
   |for simple loop with LRA|for simple loop with LRA

--- Comment #15 from Jeffrey A. Law  ---
I've examined the various testcases and the complaints about the poor register
allocation in this BZ with a trunk compiler.

I'm happy to report that I'm seeing none of the issues raised in this BZ.  

For c#0 (store-back part of the loop):
.L5:
        movl    %edi, %ecx
        addl    $4, %esi
        subl    %ecx, %eax
        subl    %ecx, %edx
        movzbl  3(%esp), %ecx
        movb    %al, -3(%esi)
        movl    %edi, %eax
        movb    %dl, -4(%esi)
        subl    %eax, %ecx
        movb    %cl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1

In c#2, the negation sequence is pointed out.  We now get:

.L9:
        movzbl  (%ebx), %edx
        movzbl  1(%ebx), %eax
        addl    $3, %ebx
        movzbl  -1(%ebx), %ecx
        notl    %edx
        notl    %eax
        notl    %ecx
        cmpb    %al, %dl
        movb    %cl, 3(%esp)
        jb      .L13
        cmpb    3(%esp), %al
        movzbl  %al, %edi
        jbe     .L5
        movzbl  3(%esp), %edi
        jmp     .L5

For the 1st modified testcase -O2 -mcpu=atom -m32:

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebx, %ebp
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Then in c#10 (t1 testcase):

.L11:
        movzbl  %al, %edi
        cmpb    %al, %cl
        cmovbe  %ecx, %edi
.L4:
        movl    %edi, %eax
        leal    4(%esi), %esi
        subl    %eax, %edx
        subl    %eax, %ecx
        movb    %dl, -3(%esi)
        movb    %cl, -4(%esi)
        movzbl  3(%esp), %edx
        subl    %eax, %edx
        movl    %edi, %eax
        movb    %dl, -2(%esi)
        cmpl    %ebp, %ebx
        movb    %al, -1(%esi)
        je      .L1
.L7:
        movzbl  (%ebx), %ecx
        leal    3(%ebx), %ebx
        movzbl  -2(%ebx), %edx
        notl    %ecx
        movzbl  -1(%ebx), %eax
        notl    %edx
        notl    %eax
        cmpb    %dl, %cl
        movb    %al, 3(%esp)
        jb      .L11
        movzbl  3(%esp), %eax
        movzbl  %al, %edi
        cmpb    %al, %dl
        cmovbe  %edx, %edi
        jmp     .L4


Across the board we're not seeing objects spilled into the stack.  The code
looks quite tight to me.

Clearing the regression marker for GCC 5.  I didn't do any bisection work to
identify what changes fixed things.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2014-04-06 Thread ubizjak at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-09-13 Thread ysrumyan at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #11 from Yuri Rumyantsev  ---
Created attachment 30816
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30816&action=edit
test-case to reproduce

t1.c must be compiled on x86 with options:

-O2 -march=atom -mtune=atom -mfpmath=sse -m32


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-09-13 Thread ysrumyan at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #10 from Yuri Rumyantsev  ---
After the fix in rev. 202468 the assembly looks slightly better, but we ran
into another RA inefficiency. It can be seen with the attached test (t1.c)
compiled with options "-march=atom -mtune=atom -m32 -O2": the upper bound of
the loop check is kept in a register, but the base register for "write" is on
the stack:

.L8:
        movzbl  3(%esp), %edx
        movl    %esi, %ecx
        cmpb    %cl, %dl
        movl    %esi, %edi
        cmovbe  %edx, %edi
.L4:
        movl    %esi, %edx
        movl    28(%esp), %esi  <-- why write is on stack
        movl    %edi, %ecx
        addl    $4, 28(%esp)  <-- perform write incrementation on stack
        subl    %ecx, %edx
        subl    %ecx, %ebx
        movzbl  3(%esp), %ecx
        movb    %dl, (%esi)
        movl    %edi, %edx
        subl    %edx, %ecx
        movb    %bl, 1(%esi)
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax  <-- why upper bound is in register?
        movb    %dl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        notl    %ecx
        movb    %cl, 3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Is something wrong in the Atom cost model? In any case, I assume that keeping
the upper bound on the stack is much cheaper than loading and incrementing the
"write" base on the stack.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-09-05 Thread ysrumyan at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #9 from Yuri Rumyantsev  ---
The issue still exists in the 4.9 compiler, and we got another 30% degradation
after the r202165 fix. It can be reproduced with the attached modified
test-case using any 4.9 compiler; the code produced for the inner loop looks
like:

.L8:
        movl    %esi, %ecx
        movl    %esi, %edi
        movzbl  3(%esp), %edx
        cmpb    %cl, %dl
        movl    %edx, %ecx
        cmovbe  %ecx, %edi
.L4:
        movl    %esi, %edx
        movl    %edi, %ecx
        subl    %ecx, %edx
        movl    28(%esp), %ecx
        movl    28(%esp), %esi
        addl    $4, 28(%esp)
        movb    %dl, (%ecx)
        movl    %edi, %ecx
        subl    %ecx, %ebx
        movl    %edi, %edx
        movzbl  3(%esp), %ecx
        movb    %bl, 1(%esi)
        subl    %edx, %ecx
        movl    %edi, %ebx
        movb    %cl, 2(%esi)
        movl    28(%esp), %esi
        cmpl    %ebp, %eax
        movb    %bl, -1(%esi)
        je      .L1
.L5:
        movzbl  (%eax), %esi
        leal    3(%eax), %eax
        movzbl  -2(%eax), %ebx
        notl    %esi
        notl    %ebx
        movl    %esi, %edx
        movzbl  -1(%eax), %ecx
        cmpb    %bl, %dl
        movb    %cl, 3(%esp)
        notb    3(%esp)
        jb      .L8
        movzbl  3(%esp), %edx
        movl    %ebx, %edi
        cmpb    %bl, %dl
        cmovbe  %edx, %edi
        jmp     .L4

Here you can see that (1) there are 2 additional moves at the top of the
blocks labeled .L4 and .L8; (2) there are redundant spills/fills of the 'write'
base in the block labeled .L4 (28(%esp)).
To reproduce, it is sufficient to compile the modified test-case with the
'-m32 -march=atom' options.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-09-05 Thread ysrumyan at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

--- Comment #8 from Yuri Rumyantsev  ---
Created attachment 30751
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30751&action=edit
modified test-case

Modified test-case to reproduce sub-optimal register allocation.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-06-06 Thread vmakarov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #7 from Vladimir Makarov  ---
(In reply to Yuri Rumyantsev from comment #2)
> The patched compiler produces better binaries but we still have -6%
> performance degradation on corei7. The main cause is that the LRA compiler
> generates a spill of the 'pure' byte 'g' whereas the old compiler generates
> a spill for 'm', which is the negation of 'g':
> 
> gcc without LRA (assembly for the head of the loop)
> 
> .L7:
>         movzbl  1(%edi), %edx
>         leal    3(%edi), %ebp
>         movzbl  (%edi), %ebx
>         movl    %ebp, %edi
>         notl    %edx   // perform negation on register
>         movb    %dl, 3(%esp)
> 
> gcc with LRA
> 
> .L7:
>         movzbl  (%edi), %ebx
>         leal    3(%edi), %ebp
>         movzbl  1(%edi), %ecx
>         movl    %ebp, %edi
>         movzbl  -1(%ebp), %edx
>         notl    %ebx
>         notl    %ecx
>         movb    %dl, (%esp)
>         cmpb    %cl, %bl
>         notb    (%esp) // perform negation in memory
> 
> i.e. we have a redundant load and store from/to the stack.
> 
> I assume that this should be fixed also.

Fixing the problem with notl requires new functionality in LRA: creating
reloads that are kept only if the reload pseudo gets a hard register and was
inherited (in which case it is profitable).  Otherwise the current code should
be generated (the reloads and reload pseudos should be removed and the old
code restored).  I've started work on this, but it will not be fixed quickly,
as implementing the new functionality is not a trivial task.
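
For readers without the attachments, the kind of loop being discussed can be sketched in C roughly as follows. This is only a reconstruction inferred from the assembly and from the names mentioned in this thread ('read', 'write', 'g', 'm'); the function name convert_image and the variables r, b, c, y, k are assumptions, and the attached test cases may differ in detail.

```c
#include <stddef.h>

typedef unsigned char byte;

/* Hypothetical reconstruction, not the attached test case: read three
   bytes per iteration, negate each one, take the minimum of the three
   negated values, and write four bytes back. */
void convert_image(const byte *in, byte *out, size_t size)
{
    const byte *read = in;   /* source pointer, advances by 3 per iteration */
    byte *write = out;       /* destination pointer, advances by 4 per iteration */

    for (size_t i = 0; i < size; i++) {
        byte r = *read++;
        byte g = *read++;
        byte b = *read++;

        /* 255 - x on an unsigned byte is a bitwise NOT (the notl/notb
           instructions in the listings). */
        byte c = (byte)(255 - r);
        byte m = (byte)(255 - g);
        byte y = (byte)(255 - b);

        /* Minimum of c, m, y (the cmpb/cmovbe sequences). */
        byte k;
        if (c < m)
            k = (c < y) ? c : y;
        else
            k = (m < y) ? m : y;

        *write++ = (byte)(c - k);
        *write++ = (byte)(m - k);
        *write++ = (byte)(y - k);
        *write++ = k;
    }
}
```

With only a handful of general-purpose registers available in 32-bit mode, the two pointers, the loop bound, the three negated bytes and the minimum compete for registers, which is presumably why the choice of what to spill (a negated byte, the 'write' base, or the loop bound) is so visible in the listings.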

[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-05-31 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|4.8.1   |4.8.2

--- Comment #6 from Jakub Jelinek  ---
GCC 4.8.1 has been released.


[Bug rtl-optimization/55342] [4.8/4.9 Regression] [LRA,x86] Non-optimal code for simple loop with LRA

2013-03-22 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55342

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|4.8.0   |4.8.1

--- Comment #5 from Jakub Jelinek  2013-03-22 14:44:31 UTC ---
GCC 4.8.0 is being released, adjusting target milestone.