[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-02-27 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

Alexandre Oliva  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from Alexandre Oliva  ---
Fixed

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-02-27 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #24 from Alexandre Oliva  ---
Author: aoliva
Date: Wed Feb 28 05:25:34 2018
New Revision: 258053

URL: https://gcc.gnu.org/viewcvs?rev=258053=gcc=rev
Log:
[PR81611] turn inc-and-use-of-dead-orig into auto-inc

When the addressing modes available on the machine don't allow offsets
in addresses, odds are that post-increments will be represented in
trees and RTL as:

  y <= x + 1
  ... *(x) ...
  x <= y

so deal with it by turning such RTL as:

  (set y (plus x n))
  ... (mem x) ...

without intervening uses of y into

  (set y x)
  ... (mem (post_add y n)) ...

so as to create auto-inc addresses that we'd otherwise miss.


for  gcc/ChangeLog

PR rtl-optimization/81611
* auto-inc-dec.c (attempt_change): Move dead note from
mem_insn if it's the next use of regno
(find_address): Take address use of reg holding
non-incremented value.  Add parm to limit search to the named
reg only.
(merge_in_block): Attempt to use a mem insn that is the next
use of the original regno.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/auto-inc-dec.c

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-02-15 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #23 from Jeffrey A. Law  ---
No, the DOM change is only a partial fix.  I've largely approved the auto-inc
change.  I recommend the additional tests in c#19 be pulled into a distinct BZ
and the gcc8 regression marker removed once the auto-inc changes are committed.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-02-15 Thread egallager at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

Eric Gallager  changed:

   What|Removed |Added

 CC||egallager at gcc dot gnu.org

--- Comment #22 from Eric Gallager  ---
(In reply to Alexandre Oliva from comment #21)
> Author: aoliva
> Date: Tue Jan 30 17:40:50 2018
> New Revision: 257194
> 
> URL: https://gcc.gnu.org/viewcvs?rev=257194=gcc=rev
> Log:
> [PR81611] accept copies in simple_iv_increment_p
> 
> If there are copies between the GIMPLE_PHI at the loop body and the
> increment that reaches it (presumably through a back edge), still
> regard it as a simple_iv_increment, so that we won't consider the
> value in the back edge eligible for forwprop.  Doing so would risk
> making the phi node and the incremented conflicting value live
> within the loop, and the phi node to be preserved for propagated
> uses after the loop.
> 
> for  gcc/ChangeLog
> 
>   PR tree-optimization/81611
>   * tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
>   copies.
> 
> Modified:
> trunk/gcc/ChangeLog
> trunk/gcc/tree-ssa-dom.c

Did this fix it?

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-30 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #21 from Alexandre Oliva  ---
Author: aoliva
Date: Tue Jan 30 17:40:50 2018
New Revision: 257194

URL: https://gcc.gnu.org/viewcvs?rev=257194=gcc=rev
Log:
[PR81611] accept copies in simple_iv_increment_p

If there are copies between the GIMPLE_PHI at the loop body and the
increment that reaches it (presumably through a back edge), still
regard it as a simple_iv_increment, so that we won't consider the
value in the back edge eligible for forwprop.  Doing so would risk
making the phi node and the incremented conflicting value live
within the loop, and the phi node to be preserved for propagated
uses after the loop.

for  gcc/ChangeLog

PR tree-optimization/81611
* tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
copies.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-dom.c

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-25 Thread gjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #20 from Georg-Johann Lay  ---
A bit of the bloat reported in PR81625 is also due to missed post-inc
addressing, so it might be worth a look if you are after more test cases.
(Current 8.0.1 perfomrs better than 8.0.0 I used back then: only bloat of +22%
compared to 3.4.6 instead of +26).

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-25 Thread gjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #19 from Georg-Johann Lay  ---
Hi, thanks for all that work and efforts.

I tried that patch for the following small test:

extern void foo (void);

extern char volatile vv;

void func2 (const int *p)
{
while (1)
{
int var = *p++;
if (var == 10)
return foo();
if (var == 0)
break;
}
}

void func3 (const int *p, const __flash char *f)
{
while (1)
{
int var = *p++;
if (var == 10)
return foo();
vv = *f++;
if (!vv)
break;
}
}


$ avr-gcc -Os -mmcu=avr5 inc.c -S -dp

Unfortunately, the code is still quote sub-optimal, in particular due to
reg-reg moves all over the place, apart from missing post-inc opportunities.

For example, func3 compiles as follows:

func3:
.L7:
movw r20,r24 ;  37  [c=4 l=1]  *movhi/0
subi r20,-2  ;  9   [c=4 l=2]  addhi3_clobber/1
sbci r21,-1
movw r30,r24 ;  38  [c=4 l=1]  *movhi/0
ld r24,Z ;  10  [c=8 l=2]  *movhi/2
ldd r25,Z+1
sbiw r24,10  ;  11  [c=12 l=1]  cmphi3/5
brne .L6 ;  12  [c=16 l=1]  branch
jmp foo  ;  14  [c=0 l=2]  call_insn/3
.L8:
movw r22,r26 ;  5   [c=4 l=1]  *movhi/0
rjmp .L7 ;  46  [c=4 l=1]  jump
.L6:
movw r26,r22 ;  39  [c=4 l=1]  *movhi/0
adiw r26,1   ;  18  [c=4 l=1]  addhi3_clobber/0
movw r30,r22 ;  40  [c=4 l=1]  *movhi/0
lpm r24,Z;  19  [c=4 l=1]  movqi_insn/3
sts vv,r24   ;  20  [c=4 l=2]  movqi_insn/2
lds r18,vv   ;  21  [c=4 l=2]  movqi_insn/3
movw r24,r20 ;  22  [c=4 l=1]  *movhi/0
cpse r18,__zero_reg__;  24  [c=0 l=1]  enable_interrupt-3
rjmp .L8
ret  ;  43  [c=0 l=1]  return

In particular, moving values back and forth and bad register selection is a
common and well known annoyance (insns 37, 38, 5, 39, 40, 22).

Just to give an impression of optimal code, which would read something like:

func3:
;; Use Z=r30/31 for F.  LPM can only use indirect and
;; post-inc with Z.
movw r30, r22
;; Use X=r26/27 for P.  X register can only use indirect and
;; post-inc addressing, which is fine for that purpose.
movw r26, r24
.L7:
;; var = *p++
ld r24,X+
ld r25,X+
;;  var == 10 ?
sbiw r24,10
brne .L6
jmp foo
.L6:
;; vv = *f++
lpm r24,Z+
sts vv,r24
;; if (!vv) break
lds r24,vv
cpse r24,__zero_reg__
rjmp .L7
ret


If uses 12 instructions instead of 12, operates faster (usually focus is on
code size) and has a register footprint of 6 whereas gcc needs 12.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-23 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #18 from Alexandre Oliva  ---
Vacations over, patches formatted and posted.
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01994.html

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-03 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #17 from Alexandre Oliva  ---
Created attachment 43025
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43025=edit
another complement to the initial partial patch, this one improving
auto-inc-dec

We already had code to turn the following into a post_inc:

(set (reg A) (plus (reg B) (const_int I)))

 ... (mem (plus (reg A) (const_int -I))) ...

so I've adjusted auto-inc-dec to also look for a (mem (reg B)) and deal with it
in the same way.  It worked beautifully for this one testcase.  Let's see how a
full bootstrap goes, on machines with auto-inc addressing modes...

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-03 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #16 from Alexandre Oliva  ---
Even if create_mem_ref_raw created a MEM_REF, we'd still allocate a new pseudo
for the reg - 1 at cfgexpand, and that ends up preventing the post_inc
addressing mode from being selected.

The more I think about it, the more I conclude we have to bite the bullet and
support post_inc addressing modes in auto_inc_dec, even with increments saved
on another pseudo before the memory uses.  Not just for iv uses, but also for
linear code, mainly because the IR we generate for post-inc, all the way from
the beginning, makes it more likely that the increment will be logically before
the use of the old value as an address.  I was trying to avoid that, but at
this point anything else feels like papering over the problem.

A smarter addressing-mode-aware middle end could tentatively generate off-by-1
base addresses so as to enable otherwise invalid addresses and subsequent
post_inc transformations, but...  if the transformation doesn't take place, we
may end up worse off.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-03 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #15 from Alexandre Oliva  ---
As we create_mem_ref within ivopts, create_mem_ref_raw requires a
valid_mem_ref_p, which in memory_address_addr_space_p calls
targetm.addr_space.legitimate_address_p, and that's
avr_addr_space_legitimate_address_p, that calls avr_legitimate_address_p, that
rejects negative offsets to a REG.  That's where avr loses, compared with x86
and arm.

Because of this rejected address offset, we end up creating _16 = str_11 +
65535.  That eventually gets simplified to _16 = str_6 (in forwprop, no less),
but the str_11 = str_6 + 1 stmt remains because of the use in the phi node, and
it appears before the MEM_REF.

Now, the reason we validate the address is that create_mem_ref calls
create_mem_ref_raw with verify enabled.  Even though the letter might end up
creating a machine-independent MEM_REF, we end up constraining it according to
machine-dependent addressing limitations anyway.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-03 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #14 from Jeffrey A. Law  ---
MEM_REF (as opposed to TARGET_MEM_REF) should be able to handle any simple
SSA_NAME + CONSTANT_OFFSET which are all we're really concerned with here.  THe
target's addressing modes don't really enter the picture (that's what
TARGET_MEM_REF is for).

I'd look for an issue in the types preventing forwprop from doing its job here.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-02 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #13 from Alexandre Oliva  ---
We do have such constant propagation on such ports as x86* and arm, but not on
avr.  Presumably (I haven't checked) it has to do with available addressing
modes, and gimple's avoiding, even in MEM_REFs, address expressions that don't
fit its more stringent requirements.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2018-01-02 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #12 from Jeffrey A. Law  ---
You know, I wonder if we're missing something bigger here.

ISTM we're potentially missing CSEs in memory addresses as well as forward
propagation opportunities in MEM_REF expressions.

I strongly suspect DOM doesn't look inside a MEM_REF to see if the computed
address is lying around in hash table.

Similarly I wouldn't be surprised if forward propagation misses the opportunity
to eliminate instructions that compute addresses by propagating the constant
offset part into a MEM_REF.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-27 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #11 from Alexandre Oliva  ---
Testing has revealed that the alternative complementary candidate patch
introduces a number of regressions, in tests intended specifically to detect
this kind of problem.  I don't see an easy way to delay the post-increment at
gimplification time, so I'm dropping this approach altogether.  Back to
exploring other possibilities, perhaps even extending auto_inc_dec, or... 
something else, I don't know.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

Alexandre Oliva  changed:

   What|Removed |Added

  Attachment #42971|0   |1
is obsolete||

--- Comment #10 from Alexandre Oliva  ---
Created attachment 42972
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42972=edit
2xupdated complementary candidate patch

It looked like the build was going so well, but...  No such luck.  Watch out
for CPU-eating zombies with the previous patch; it may loop endlessly moving a
stmt just before its successor over and over and over.  I suppose the old code
really didn't mean to move stuff within a single BB, otherwise it would have
checked this case.  Anyhow...  One more bug fixed, bootstrap well into
stage3...

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

Alexandre Oliva  changed:

   What|Removed |Added

  Attachment #42969|0   |1
is obsolete||

--- Comment #9 from Alexandre Oliva  ---
Created attachment 42971
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42971=edit
updated complementary candidate patch

This updated patch fixes a bug in the sinking patch, in the path that handled
all uses in a single stmt.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #8 from Alexandre Oliva  ---
Created attachment 42970
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42970=edit
alternative (?) complementary candidate patch

This addresses the concern of post-increment in non-loops.  It solves the
problem along with the initial partial patch, but I'm still undecided as to the
second.  I already know it is broken (there's a thinko in the case that handles
all uses in a single stmt), and once the postinc auto_inc addressing modes are
handled by this third patch, I'm not sure yet whether the SSA conflict
reduction the second one affords are worth the trouble any more.  I'll try to
actually measure the effects, once I verify that it (or rather a fixed version
thereof) works.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #7 from Alexandre Oliva  ---
Hmm, what the complementary patch won't do is improve the odds of auto_inc or
even saving a temp in spaghetti code, rather than in loops.  Maybe that's
important too?  I wonder if we should add the post-increment to the gimple stmt
seq after the use of the old value, like IIRC we used to.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #6 from Alexandre Oliva  ---
Created attachment 42969
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42969=edit
complementary candidate patch

This patch complements the earlier one.  

On AVR, unlike other ports, we had the increment computed into a separate
pseudo, then the memory store using the original pseudo, then the copy of the
separate pseudo to the original one.  auto_inc_dec can't deal with such a
sequence.  It could deal with it if the increment was after the memory access,
which would have already eliminated the temporary and the copy.

So, instead of extending auto_inc_dec to deal with a more complex scenario
involving 3 insns, I decided to try and move the increment down, past the
memory access, so as to benefit even ports without auto_inc addressing modes by
avoiding SSA conflicts and reducing the use of temporaries.

It moves IV increment stmts that are not used within their own blocks closer to
the end of the block, so as to reduce the likelihood of a SSA conflict with the
PHI node.  This also makes it more likely that the increment will appear after
memory uses with which it can be combined into a post_inc addressing mode.

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #5 from Alexandre Oliva  ---
Created attachment 42968
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42968=edit
partial candidate patch

Alas, although it restores good code for x86_64 and arm, it doesn't go as far
as enabling avr to use auto_inc (though auto_inc use is restored on arm). 
Still looking...

[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

2017-12-26 Thread aoliva at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

Alexandre Oliva  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
  Component|rtl-optimization|tree-optimization
   Assignee|unassigned at gcc dot gnu.org  |aoliva at gcc dot 
gnu.org

--- Comment #4 from Alexandre Oliva  ---
Mine; patch will be attached momentarily.