[Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies

2021-01-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #2 from Tamar Christina  ---
Just an update on what I know so far.

There seems to be no symmetry between the growth of the memory costs and that of
the caller-saved register costs.

In the case of the memory costs, the actual memory cost from the back end is
multiplied by the BB frequencies.  The total allocation frequency and memory
cost are a sum over all the usages of the register in a live range/BB.

In the case of the example above, the reg we're interested in is reg 104.

In create_insn_allocnos you can see that the memory costs for the register and
the total frequencies grow as follows:

Bad:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10
   ALLOCNO_FREQ 485, REF_FREQ_BB 485
   ALLOCNO_FREQ 989, REF_FREQ_BB 504

Good:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10  
   ALLOCNO_FREQ 495, REF_FREQ_BB 495
   ALLOCNO_FREQ 990, REF_FREQ_BB 495

Notice that in the bad case the total final frequency is actually lower.

The costs are created as BB_FREQ * mem-cost, where mem-cost comes from the
backend.  In the case of AArch64 that's BB_FREQ * 4.

So what we end up with in the cost calculations in scan_one_insn in ira-costs is:

Bad:

  add-cost 40, mem-cost 40
  add-cost 1940, mem-cost 1940
  add-cost 2016, mem-cost 3956

Good:

  add-cost 40, mem-cost 40
  add-cost 1980, mem-cost 1980
  add-cost 1980, mem-cost 3960

Counter-intuitively, this means that a higher probability for the BB gets you a
lower frequency, which in turn seems to give you a lower cost for memory
operations.

Finally this gets topped off somewhere with another small amount (10 * mem-cost,
where 10 looks to be the ratio BB_FREQ_MAX / REG_FREQ_MAX), which results in the
register costs that can be seen in an IRA dump.

Bad:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,4995 FP_LO_REGS:50,4995
FP_REGS:50,4995 POINTER_AND_FP_REGS:50,4995 MEM:40,3996

Good:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,5000 FP_LO_REGS:50,5000
FP_REGS:50,5000 POINTER_AND_FP_REGS:50,5000 MEM:40,4000

So overall a change of 0.1% in probability caused a decrease of
0.1% * (BB_FREQ_MAX / REG_FREQ_MAX) * mem-cost.

Now, based on the above, the base costs of using GENERAL_REGS for both of the
above are 0.

But in ira_tune_allocno_costs IRA applies a penalty to the cost of a register if
it's live across a call in the same BB.  This penalty is weighted by the
CALL_FREQ.

Bad:

  CALL_FREQ 504, FREQ_FROM_BB 504

Good:

  CALL_FREQ 495, FREQ_FROM_BB 495

So between the good and bad cases the BB-derived frequency (ALLOCNO_FREQ) went
down by 1, but the CALL_FREQ went up by 9.

The penalty that is applied is CALL_FREQ * (ira_memory_move_cost[0] +
ira_memory_move_cost[1]).  So in our case it's CALL_FREQ * (4 + 4).

So we end up with:

Bad:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4
   cost 4032, reg-cost 4032
   CALL_FREQ 504, ALLOCNO_FREQ 999

Good:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4
   cost 3960, reg-cost 3960
   CALL_FREQ 495, ALLOCNO_FREQ 1000

Which gives us a final cost of:

Bad:

   Memory: 3956
   Register: 4032

Good:

   Memory: 3960
   Register: 3960

This is all to say that the penalty grows far quicker than the BB frequency:
tiny changes in BB frequencies have a large effect on it.

This is why RA ends up thinking it's cheaper to go through memory.  It is
trying to apply a penalty to the registers to prevent their use, which is
understandable, but what isn't clear to me is why it doesn't take into account
whether the instruction is in a loop.  It can be far cheaper to spill at the
call site instead.

ira_tune_allocno_costs does have a callback hook

  IRA_HARD_REGNO_ADD_COST_MULTIPLIER

that targets can use to tweak the penalty applied to registers that are live
through a call.  The annoying thing about the hook is that the values you give
it are weighted by BB_FREQ and not CALL_FREQ, so you're not given enough
information to strike a balance between the growth of the CALL_FREQ and the
BB_FREQ.

It's done using

cost += ((ira_memory_move_cost[mode][rclass][0]
          + ira_memory_move_cost[mode][rclass][1])
         * ALLOCNO_FREQ (a)
         * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);

I have attempted to provide a backend hook in AArch64 that applies an inverse
penalty to caller-saved registers.  But because I am not given frequency
information I can only return a constant penalty, which means it's clearly
incorrect as it's specifically tuned for a loop in Exchange2.

Using this hook I was able to regain only about 40% of the regression, with no
losses in SPECINT CPU2017.

But I think the ratio between the growth of the costs and the penalty is off.
This is evident from the fact that the regression exists on multiple targets.

[Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies

2021-01-22 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

Feng Xue  changed:

           What    |Removed     |Added
 ----------------------------------------------------------------
             CC    |            |fxue at os dot amperecomputing.com

--- Comment #1 from Feng Xue  ---
The value "foo + 1024" is spilled in both cases, but in different ways.  In the
bad case it is spilled outside the loop and only reloaded inside, while in the
good case the spill/reload pair occurs around the call to "bar", which might
also account for the extra cost of using caller-saved registers.  It seems that
IRA has two different logics for handling spilling.

[Bad case]
foo:
stp x29, x30, [sp, -80]!
mov w5, 753
mov x29, sp
stp x19, x20, [sp, 16]
mov x19, x1
mul w1, w0, w5
stp x21, x22, [sp, 32]
mov w22, 5271
add w2, w1, 7
mov w21, w5
mul w3, w0, w22
mov w20, 760
mov w22, 0
str w0, [sp, 76]
add x0, x19, 1024
str x0, [sp, 64]  // Spill (foo + 1024)
.p2align 3,,7
.L5:
ldrb w0, [x19]
cbz w0, .L2
ldr w0, [sp, 76]
stp w1, w2, [sp, 56]
str w3, [sp, 72]
bl  bar
ldrb w0, [x19, 1]!
ldp w1, w2, [sp, 56]
add w21, w21, w0
ldr w3, [sp, 72]
mul w20, w20, w0
ldr x0, [sp, 64] // Reload (foo + 1024)
add w22, w22, w20
cmp x19, x0
bne .L5
b   .L4
.p2align 2,,3
.L2:
ldrb w0, [x19, 1]!
add w21, w21, w0
mul w20, w20, w0
ldr x0, [sp, 64] // Reload (foo + 1024)
add w22, w22, w20
cmp x0, x19
bne .L5
.L4:
add w0, w20, w21
add w0, w0, w22
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldp x29, x30, [sp], 80
ret

[Good case:]
foo:
stp x29, x30, [sp, -80]!
mov w5, 753
add x7, x1, 1024
mul w2, w0, w5
mov x29, sp
stp x21, x22, [sp, 32]
mov w21, 5271
mov w22, w5
stp x19, x20, [sp, 16]
mov x19, x1
mul w3, w0, w21
stp w2, w0, [sp, 72]  // Spill x(%w0)
add w2, w2, 7 // t2(%w2)
mov w21, 0
mov w20, 760
.p2align 3,,7
.L5:
ldrb w0, [x19]
cbz w0, .L2
ldp w1, w0, [sp, 72]  // Reload x 
stp w2, w3, [sp, 56]  // Spill t2
str x7, [sp, 64]  // Spill (foo + 1024)
bl  bar
ldrb w0, [x19, 1]!
ldr x7, [sp, 64]  // Reload (foo + 1024)
add w22, w22, w0
ldp w2, w3, [sp, 56]  // Reload t2
mul w20, w20, w0
add w21, w21, w20
cmp x19, x7
bne .L5
b   .L4
.p2align 2,,3
.L2:
ldrb w0, [x19, 1]!
add w22, w22, w0
mul w20, w20, w0
add w21, w21, w20
cmp x7, x19
bne .L5
.L4:
add w0, w20, w22
add w0, w0, w21
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldp x29, x30, [sp], 80
ret

Even in the good case we could expect better spill/reload generation.  Per the
comments above, "x" and "t2" are similar, both loop-invariant, but they are
handled differently.  Spilling "t2" inside the loop is worse than spilling it
outside, which is what IRA does for "x".

Both issues could be correlated to the same thing.

[Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies

2021-01-21 Thread jgreenhalgh at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

James Greenhalgh  changed:

           What    |Removed     |Added
 ----------------------------------------------------------------
          Status   |UNCONFIRMED |NEW
 Last reconfirmed  |            |2021-01-21
  Ever confirmed   |0           |1