On 10/25/2018 7:31 AM, Valentin Schneider wrote: > > On 24/10/2018 20:27, Steven Sistare wrote: > [...] >> Hi Valentin, >> >> Asymmetric systems could maintain a separate bitmap for misfits; set a bit >> when a CPU goes on CPU, clear it going off. When a fast CPU goes new idle, >> it would first search the misfits mask, then search cfs_overload_cpus. >> The misfits logic would be conditionalized with CONFIG or sched feat static >> branches so symmetric systems do not incur extra overhead. > > That sounds reasonable - besides, misfit already introduces a > sched_asym_cpucapacity static key. I'll try to play around with that. > >>> We'd also lose the NOHZ update done in idle_balance(), though I think it's >>> not such a big deal - were were piggy-backing this on idle_balance() just >>> because it happened to be convenient, and we still have NOHZ_STATS_KICK >>> anyway. >> >> Agreed. >> >>> Another thing - in your test cases, what is the most prevalent cause of >>> failure to pull a task in idle_balance()? Is it the load_balance() itself >>> that fails to find a task (e.g. because the imbalance is not deemed big >>> enough), or is it the idle migration cost logic that prevents >>> load_balance() from running to completion? >> >> The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000", >> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000, >> so idle_balance bails at the top: >> if (this_rq->avg_idle < sysctl_sched_migration_cost || >> ... >> goto out >> >> For other tests, we get past that clause but bail from a domain: >> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) { >> ... >> break; >> >>> In the first case, try_steal() makes perfect sense to me. In the second >>> case, I'm not sure if we really want to pull something if we know (well, >>> we *think*) we're about to resume the execution of some other task. >> >> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go >> off CPU, particularly for chatty workloads like hackbench. The performance >> data bear this out. For the higher loads, the average timeslice for >> hackbench >> > > Thanks for the explanation. AIUI the big difference here is that try_steal() > is considerably cheaper than load_balance(), so the rq->avg_idle concerns > matter less (or at least, on a considerably smaller scale).
Right. >> Perhaps I could skip try_steal() if avg_idle is very small, although with >> hackbench I have seen average time slice as small as 10 microsec under >> high load and preemptions. I'll run some experiments. > > That might be a safe thing to do. In the same department, maybe we could > skip try_steal() if we bail out of idle_balance() because > !(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus > are decoupled, they should express the same thing here. I tried that in an earlier version of my code: new_tasks = idle_balance(rq, rf); if (new_tasks == 0 && rq->rd->overload) new_tasks = try_steal(rq, rf); but I did not see any performance improvement vs without the overload check, so I omitted it for simplicity. - Steve >>>> We could merge the stealing code into the idle_balance() code to get a >>>> union of the two, but IMO that would be less readable. >>>> >>>> We could remove the core and socket levels from idle_balance() >>> >>> I understand that as only doing load_balance() at DIE level in >>> idle_balance(), as that is what makes most sense to me (with big.LITTLE >>> those misfit migrations are done at DIE level), is that correct? >> >> Correct. >>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC >>> level, which could cause issues there. >> >> We could keep idle_balance for this level and fall back to stealing as in >> my patch, or you could extend the misfits bitmap to also include CPUs >> with reduced memory bandwidth and active tasks. (if I understand the >> asymmetry >> correctly). >> > > It's mostly µarch asymmetry, so by "asymmetry at MC level" I meant "we'll > see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing > to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY > in some way or another), that could work. > >>>> and let >>>> stealing handle those levels. I think that makes sense after stealing >>>> performance is validated on more architectures, but we would still have >>>> two different mechanisms. >>>> >>>> - Steve >>> >>> I'll try out those patches on top of the misfit series to see how the >>> whole thing behaves. >> >> Very good, thanks. >> >> - Steve >>