----- Forwarded message from Richard Braun <[email protected]> -----

Date: Thu, 6 Nov 2025 14:38:10 +0100
From: Richard Braun <[email protected]>
To: Benoit Dinechin <[email protected]>
Subject: Re: Question about conditional functional unit reservation

On Thu, Nov 06, 2025 at 01:01:36PM +0000, Benoit Dinechin wrote:
> Correct, (define_bypass ...) only adjusts latencies (aka dependence cost in
> GCC).
>
> I had a quick look at sprufe8b.pdf. It seems that the 'dependence cost' of
> GCC would be Delay_Slots + 1, while the 'Functional Unit Latency' is the
> repeat rate of the unit. The other resource-related constraints are accesses
> to the read and write ports of the RF and the cross-path.
>
> I wrongly assumed that the 'latency' you mentioned was a 'dependence cost',
> not a 'repeat rate'.
>
> However, if parts of the instruction execution are delayed due to internal FU
> resource conflicts, such as mpyspdp preempting these resources from mpydp
> because the former has finished reading the register file earlier, then the
> dependence cost of mpydp indeed increases [by the amount of mpyspdp FU
> latency == 3].
>
> I do not see a simple way to address this in GCC, because the dependence cost
> is computed before scheduling starts, and is assumed constant while
> instruction scheduling proceeds.
>
> An approach I would try if I had this problem (but no idea if it will work)
> would be to rely on the target hook TARGET_SCHED_REORDER2:
> - Since mpyspdp must be scheduled just after mpydp to trigger the problem,
>   we can assume they are not dependent
> - If both become data-ready together (as seen from TARGET_SCHED_REORDER2),
>   schedule mpyspdp first
> - If mpyspdp becomes data-ready the cycle after mpydp is issued, delay it by
>   putting it at the end of the ready list and decreasing the number of ready
>   instructions.

There are similar (but different) constraints when mpyspdp is scheduled first.

I managed to get good results by defining pseudo functional units dedicated
to each instruction:

(define_cpu_unit "m1dp" "c6x_m1")
(define_cpu_unit "m1spdp" "c6x_m1")

For mpydp:

- "(m_N__CUNIT_)*4,nothing*4,m_N_w*2")
+ "(m_N__CUNIT_)*4+m_N_dp*4,m_N_spdp*3,nothing,m_N_w*2")

and for mpyspdp:

- "(m_N__CUNIT_)*2,nothing*3,m_N_w*2")
+ "(m_N__CUNIT_)*3+m_N_spdp*3,m_N_dp,nothing,m_N_w*2")

Note that this also corrects the functional unit latency of mpyspdp from 2
to 3, a separate fix that is unrelated to the conditional latencies I'm
asking about.

The solution doesn't scale, but the target doesn't emit any other instruction
that could be affected, with the exception of mpysp2dp, so it should remain
reasonable.

What do you think about that approach?

Also, why is the gcc mailing list not Cc'd in your reply?

Thanks.

-- 
Richard Braun

----- End forwarded message -----

-- 
Richard Braun
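
The effect of the pseudo-unit reservations in the forwarded message can be checked with a small model. The sketch below is not genautomata: it is a hypothetical Python helper that expands a reservation string into per-cycle unit sets, reading "," as "next cycle", "+" as "same cycle" and "*N" as repetition, and then looks for the first issue offset at which two instructions share no unit. It assumes _N_ = 1 and an empty _CUNIT_, so that m_N__CUNIT_ becomes m1; the function names expand, conflicts and first_free_offset are illustrative only.

```python
import re


def expand(reservation):
    """Expand a reservation string into a list of per-cycle unit sets."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[(),+*]", reservation)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def element():
        tok = take()
        if tok == "(":
            cycles = sequence()
            take()  # consume the closing ")"
            return cycles
        return [frozenset()] if tok == "nothing" else [frozenset({tok})]

    def repeat():
        cycles = element()
        if peek() == "*":
            take()
            cycles = cycles * int(take())
        return cycles

    def pad(cycles, length):
        return cycles + [frozenset()] * (length - len(cycles))

    def allof():
        # "+" overlays both operands, starting in the same cycle.
        cycles = repeat()
        while peek() == "+":
            take()
            other = repeat()
            length = max(len(cycles), len(other))
            cycles = [a | b for a, b in zip(pad(cycles, length), pad(other, length))]
        return cycles

    def sequence():
        # "," appends the next operand in the following cycles.
        cycles = allof()
        while peek() == ",":
            take()
            cycles = cycles + allof()
        return cycles

    return sequence()


def conflicts(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`, shares a unit."""
    return any(first[offset + i] & units
               for i, units in enumerate(second)
               if offset + i < len(first))


def first_free_offset(first, second):
    """Smallest offset >= 1 at which `second` can follow `first`."""
    offset = 1
    while conflicts(first, second, offset):
        offset += 1
    return offset


# Assumption: _N_ = 1 and _CUNIT_ is empty, so "m_N__CUNIT_" becomes "m1".
OLD = {"mpydp": "(m1)*4,nothing*4,m1w*2",
       "mpyspdp": "(m1)*2,nothing*3,m1w*2"}
NEW = {"mpydp": "(m1)*4+m1dp*4,m1spdp*3,nothing,m1w*2",
       "mpyspdp": "(m1)*3+m1spdp*3,m1dp,nothing,m1w*2"}

for label, table in (("old", OLD), ("new", NEW)):
    dp, spdp = expand(table["mpydp"]), expand(table["mpyspdp"])
    print(label, "mpydp then mpyspdp:", first_free_offset(dp, spdp))
    print(label, "mpyspdp then mpydp:", first_free_offset(spdp, dp))
```

Under this model, the new strings move the earliest issue of mpyspdp after mpydp from offset 5 to offset 7, and the earliest issue of mpydp after mpyspdp from offset 2 to offset 4. The automaton that GCC actually builds remains the authority; the model only makes the intent of the two pseudo units easier to see.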
