Hi Richard,
On 8/7/24 10:47, Richard Sandiford wrote:
> I should probably start by saying that the "model" heuristic is now
> pretty old and was originally tuned for an in-order AArch32 core.
> The aim wasn't to *minimise* spilling, but to strike a better balance
> between parallelising with spills vs. sequentialising. At the time,
> scheduling without taking register pressure into account would overly
> parallelise things, whereas the original -fsched-pressure would overly
> serialise (i.e. was too conservative).
>
> There were specific workloads in, er, a formerly popular embedded
> benchmark that benefitted significantly from *some* spilling.
>
> This comment probably sums up the trade-off best:
>
> This pressure cost is deliberately timid. The intention has been
> to choose a heuristic that rarely interferes with the normal list
> scheduler in cases where that scheduler would produce good code.
> We simply want to curb some of its worst excesses.
>
> Because it was tuned for an in-order core, it was operating in an
> environment where instruction latencies were meaningful and realistic.
> So it still deferred to those to quite a big extent. This is almost
> certainly too conservative for out-of-order cores.
>
> So...
>
> Vineet Gupta <[email protected]> writes:
>> On 8/5/24 21:31, Jeff Law wrote:
>>> On 8/5/24 5:35 PM, Vineet Gupta wrote:
>>>> Hi Richard,
>>>>
>>>> I'm reaching out for some insight on PR/114729. Apologies in advance for
>>>> the long email.
>>>>
>>>> On RISC-V we are hitting sched1 pathology on SPEC2017 Cactu where
>>>> codegen spills are overwhelming the execution: disabling sched1 shaves
>>>> off 1.3 trillion dynamic icounts which is about half of total execution
>>>> (in user mode qemu run).
>>>>
>>>> I have a reduced test down to a single spill (w/ sched1 and non w/o).
>>>> The full sched1 verbose log for test is attached for reference (so is
>>>> the test case itself).
>>>>
>>>> The gist of the issue (from schedule pov) is annotated insn below: 46,
>>>> 54, 55
>>>>
>>>> ld a5,%lo(u)(s0)
>>>> fld fa5,%lo(f)(t6)
>>>> fcvt.l.d a4,fa4,rtz
>>>> srli a0,a5,16 # insn 46 (ideal order 1)
>>>> srli a1,a5,32 # insn 55 (ideal order 3)
>>>> sh a5,%lo(_Z1sv)(a3) # insn 44
>>>> slli a4,a4,3
>>>> srli a5,a5,48
>>>> fsd fa5,%lo(e)(t0)
>>>> add a4,s9,a4
>>>> sh a0,%lo(_Z1sv+2)(a3) # insn 54 (ideal order 2)
>>>> sh a1,%lo(_Z1sv+4)(a3) # insn 64
>>>> sh a5,%lo(_Z1sv+6)(a3)
>>>>
>>>> If insn 54 were scheduled ahead of insn 55, the corresponding reg need
>>>> not be allocated (and consequently not spilled) in the outer loop.
>>>> There are no uses of a0 after insn 54.
>>> So what does the ready queue look like at the start of whatever cycle
>>> insn 46 fires on? I would expect insn 46, 55, 44, the slli & fsd just
>>> based on data dependencies.
>> Dump just before insn 46 is scheduled (most of them are from tail end of
>> BB due to it processing insns from end)
>>
>> ;; Ready list after ready_sort:
>> 81:50(cost=0:prio=7:delay=2:idx=12)
>> 80:49(cost=0:prio=6:delay=1:idx=23) 65:44(cost=1:prio=7:idx=20)
>> 55:42(cost=1:prio=7:idx=18) 94:58(cost=0:prio=2:idx=0)
>> 92:56(cost=0:prio=5:idx=28) 88:54(cost=0:prio=5:idx=26)
>> 44:39(cost=0:prio=6:idx=15) 46:40(cost=0:prio=7:idx=16)
>> ;; 13--> b 0: i 46 r180=zxt(r170,0x10,0x10)
>> :alu:@GR_REGS+1(1)@FP_REGS+0(0)
>>
>> 46 is at the head hence it is scheduled.
>>
>>>> Looking into sched1 logs (#8916): schedule_insn () is called for insn
>>>> 46 and insn 54 added to ready q.
>>> Presumably those happen on different cycles? I would not expect 45 to
>>> enter the ready queue on some cycle N. Then on N+M insn 54 should enter
>>> the ready queue, prsumably with M == 1.
>> Nope they are in same cycle since 54 is in SD_LIST_FORW of insn 46.
>>
>> advance = schedule_insn(insn = 46)
>> for (sd_it = sd_iterator_start (insn, SD_LIST_FORW);
>> sd_iterator_cond (&sd_it, &dep);)
>> try_ready
>> fix_tick_ready
>> change_queue_index(insn = 54, delay=QUEUE_READY)
>>
>>>> Then we have ready_sort -> ready_sort_real () -> model_set_excess_costs
>>>> () called for insns in ready q.
>>>>
>>>> For insn 54, model_excess_cost () there is a reg dead, hence the -1 in
>>>> print below. However its model_excess_group_cost () is still 0,
>>>> disregarding the delta -1.
>>>>
>>>> ;; | 17 54 | 6 +1 | GR_REGS:[-1 base cost 0] FP_REGS:[0
>>>> base cost 0]
>>>>
>>>> Per comments around model_set_excess_costs () this seems intentional /
>>>> as designed "... negative costs are converted to zero ones ...".
>>>>
>>>> The subsequent qsort () w/ numerous gyrations of rank_for_schedule()
>>>> ends up moving 55 to top.
>>> Presumably due to the number of uses and the types of uses.
>> Its the 3 things in following order:
>>
>> RFS_PRESSURE_DELAY (low wins): ECC(tmp) + insn_delay(tmp) - ECC (tmp2)
>> - insn_delay (tmp2)
>> RFS_PRIORITY (high wins): INSN_PRIORITY (tmp2) - INSN_PRIORITY (tmp)
>> RFS_PRESSURE_INDEX (low wins): model_index (tmp) - model_index (tmp2)
>>
>>>> ;; Ready list (t = 14):
>>>> 81:50(cost=0:prio=7:delay=1:idx=12)
>>>> 94:58(cost=0:prio=2:idx=0)
>>>> 92:56(cost=0:prio=5:idx=28)
>>>> 88:54(cost=0:prio=5:idx=26)
>>>> 80:49(cost=0:prio=6:idx=23)
>>>> 54:41(cost=0:prio=6:idx=17)
>>>> 44:39(cost=0:prio=6:idx=15)
>>>> 65:44(cost=0:prio=7:idx=20)
>>>> 55:42(cost=0:prio=7:idx=18)
>>> Hard to know what to make of this, most of the insns referenced in the
>>> queue don't refer back to anything you've shown above.
>> Yes, as most of them are from towards end of BB. These are just ready,
>> but the one at the last is the sorted one which will be scheduled in the
>> insn stream.
>>
>>
>>>> This does change scheduling to move insn 44 ahead of insn 54 and in the
>>>> next cycle, ideal insn 54 is picked (ahead of insn 55) avoiding the spill.
>>>>
>>>> So it seems it all comes down to ready_sort () -> rank_for_schedule()
>>>> for deciding the winner (added prints in rfs_result)
>>>>
>>>> ...
>>>> ;; --1-- RFS_PRIORITY winner 55 looser 54
>>>> ;; --2-- RFS_PRIORITY winner 55 looser 54
>>> Which is correct. From a critical path standpoint it's more important
>>> to get insn 55 issued than it is to get insn 54 issued. But......
>>>
>>>> ...
>>>> ;; --1-- RFS_PRIORITY winner 55 looser 44
>>>> ;; --2-- RFS_PRIORITY winner 55 looser 44
>>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>> ;; --2-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>>
>>>> Note that priority of insn 55 (prio 7, alu insn) will always be higher
>>>> than insn 54 (prio 6, a store) since insn 55 SD_LIST_FWD has a
>>>> store/sink (insn 64, also with a prio 6).
>>>> And since ECC (model_set_excess_costs) calculations also include
>>>> priority, there's a bias towards priority in the ranking.
>>>> Meaning all else being constant, priority will invariably cause a win.
>>> Right, that's right bias in general. But with the priorities only
>>> differing by 1 unit, I would have expected the increase in register
>>> pressure to be enough to overcome that bias or at least make them equal.
>> The actual reg pressure delta is subsumed in the ECC value which tends
>> to convert negative biases to zero.
>>
>>>> I did find a potential issue where it seems to be disregarding prio tmp >
>>>> prio tmp2 (although this doesn't fix the issue)
>>> Unsure ;-)
>> Yeah sorry that was stupid.
>>
>>
>>> Note that in rank_for_schedule we check pressure state before we check
>>> priority. So it's a bit unclear why RFS_PRIORITY was selected when
>>> comparing insns 55 and 54. I guess that would tend to indicate that the
>>> ECC wasn't enough to make a difference:
>> Indeed. ECC delta would have to be neutral for it to be skipped. The
>> order of things checked is what I enumerated above.
>>
>> I'm currently pursuing a different trail which comes form observation
>> that initial model setup concludes that pressure is 28 so with 27
>> allocable regs we are bound to spill one.
>> More on that after I find something concrete.
> ...I think for OoO cores, this:
>
> baseECC (X) could itself be used as the ECC value described above.
> However, this is often too conservative, in the sense that it
> tends to make high-priority instructions that increase pressure
> wait too long in cases where introducing a spill would be better.
> For this reason the final ECC is a priority-adjusted form of
> baseECC (X). Specifically, we calculate:
>
> P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
> baseP = MAX { P (X) | baseECC (X) <= 0 }
>
> Then:
>
> ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
>
> Thus an instruction's effect on pressure is ignored if it has a high
> enough priority relative to the ones that don't increase pressure.
> Negative values of baseECC (X) do not increase the priority of X
> itself, but they do make it harder for other instructions to
> increase the pressure further.
>
> is probably not appropriate. We should probably just use the baseECC,
> as suggested by the first sentence in the comment. It looks like the hack:
>
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 1bc610f9a5f..9601e929a88 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> print_p = true;
> }
> cost = model_excess_cost (insns[i], print_p);
> - if (cost <= 0)
> + if (cost <= 0 && 0)
> {
> priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
> priority_base = MAX (priority_base, priority);
> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>
> /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
> instruction. */
> + if (0)
> for (i = 0; i < count; i++)
> {
> cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
>
> fixes things for me. Perhaps we should replace these && 0s
> with a query for an out-of-order core?
>
> I haven't benchmarked this. :) And I'm looking at the code for the
> first time in many years, so I'm certainly forgetting details.
Back to the original issue, I reran reduce with the hack and see the issue
still.
I've updated the PR but here's the gist essentially:
-fno-schedule-insns | -fschedule-insns
|
li t3,2 | li t5,2
li t1,1 | li t4,1
---xxx--- | sd s10,8(sp) # %sfp
.L2: | .L2:
---xxx--- | ld a5,8(sp) # %sfp
beq s10,zero,.L9 | beq a5,zero,.L9
mv a1,s6 | mv a1,a6
beq s6,zero,.L7 | beq a6,zero,.L7
mulw a2,s5,s6 | mulw a2,a0,a6
mv a0,s5 | mv s10,a0
slli a6,s5,3 | slli s11,a0,3
slli a3,a2,3 | slli a3,a2,3
.L6: | .L6:
fcvt.d.w fa5,a2 # 56 | fcvt.d.w fa5,a2 # 56
add a5,s4,a3 # 63 | add a4,s5,a3 # 63
fsd fa5,%lo(e)(t6) # 57 | add a5,s6,a3 # 58
fld fa4,0(a5) # 64 | fsd fa5,%lo(e)(t2) # 57
fld fa5,%lo(o)(t5) | fld fa3,0(a4) # 64
add a5,s3,a3 # 58 | fld fa4,0(a5) # 60
fmul.d fa4,fa4,fa5 | fld fa5,%lo(o)(t0)
fld fa5,0(a5) # 60 | fcvt.l.d a4,fa3,rtz
schedule_insn () is called as follows:
;; 0--> b 0: i 56 r210=flt(r185#0) :alu: GR_REGS+0(0) FP_REGS+1(1):model 0
;; 1--> b 0: i 63 r215=r141+r183 :alu: GR_REGS+1(1) FP_REGS+0(0)
;; 2--> b 0: i 58 r211=r138+r183 :alu:@GR_REGS+1(1)@FP_REGS+0(0)
The prints between insn 63 and insn 58 show the issue
;; point uid prio delay class : del ECC-gpr del ECC-fpr
ECC-tot
;;| 1 57 | 12 +2 | GR_REGS:[0 base cost 0] FP_REGS:[-1 base cost 0]ECC 0
;;| 10 61 | 4 +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0] ECC 0
;;| 8 58 | 5 +1 | GR_REGS:[1 base cost 0] FP_REGS:[0 base cost 0] ECC 0
;; --1-- RFS_PRESSURE_DELAY winner 58 looser 57
;; --2-- RFS_PRESSURE_DELAY winner 58 looser 57
(1) insn 57 delta is -1 (reduces pressure), while insn 58 delta is 1 (increases
pressure) yet ECC is 0 for both, meaning for pressure considerations
they are treated the same. This ECC "dilution" happens in model_spill_cost ().
(2). insn 57 is a store (so prio 2) while insn 58 is an add (so prio 1).
rank_for_schedule (): RFS_PRESSURE_DELAY chooses lower of (ECC + insn_delay)
and thus picks 58 over 57 causing the issue.
Thx,
-Vineet