On 26/07/2019 10:01, Vincent Guittot wrote:
>> Huh, interesting. Why go for utilization?
>
> Mainly because that's what is used to detect a misfit task, not the load.
>
>> Right now we store the load of the task and use it to pick the "biggest"
>> misfit (in terms of load) when there is more than one misfit task to
>> choose from:
>
> But having a big load doesn't mean that you have a big utilization.
>
> So you can trigger the misfit case because of a task A with a big
> utilization that doesn't fit on its local CPU, but then select a task B
> in detach_tasks() that has a small utilization but a big weight and, as
> a result, a higher load. Task B will never trigger the misfit case by
> itself and should not steal the pulling opportunity of task A.
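
To make that load/util divergence concrete, here's a rough back-of-the-envelope
example - the numbers are made up, and it leans on the usual PELT intuition
that util_avg is weight-invariant while load_avg scales with the task weight:

,----
| /*
|  * util_avg ignores niceness; load_avg scales with the task's
|  * weight (see sched_prio_to_weight[]).
|  *
|  * Task A: nice 0   (weight 1024), util_avg ~= 819
|  *         -> misfit on a LITTLE CPU, load_avg ~= 819
|  * Task B: nice -10 (weight 9548), util_avg ~= 102
|  *         -> fits just fine,        load_avg ~= 955
|  *
|  * Maximizing the migrated load picks B, yet only A actually
|  * needs the move to a bigger CPU.
|  */
`----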
We can avoid this entirely by going straight for an active balance when we
are balancing misfit tasks (which we really should be doing, TBH).

If we *really* want to be surgical about misfit migration, we could track
the task itself via a pointer to its task_struct, but IIRC Morten purposely
avoided this due to all the fun synchronization issues that come with it.

With that out of the way, I still believe we should maximize the migrated
load when dealing with several misfit tasks - there's not much else you can
look at anyway to make a decision. It sort of makes sense, too: when e.g.
you have two misfit tasks stuck on LITTLE CPUs and a big CPU finally frees
up, it would seem fair to pick the one that's been "throttled" the longest
- at equal niceness, that would be the one with the highest load.

>> update_sd_pick_busiest():
>> ,----
>> | /*
>> |  * If we have more than one misfit sg go with the biggest misfit.
>> |  */
>> | if (sgs->group_type == group_misfit_task &&
>> |     sgs->group_misfit_task_load < busiest->group_misfit_task_load)
>> | 	return false;
>> `----
>>
>> I don't think it makes much sense to maximize utilization for misfit
>> tasks: they're over the capacity margin, which exactly means "I can't
>> really tell you much on that utilization other than it doesn't fit".
>>
>> At the very least, this rq field should be renamed "misfit_task_util".
>
> Yes, I agree that I should rename the field.
>
>> [...]
>>
>>> @@ -7060,12 +7048,21 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>>>
>>>  enum fbq_type { regular, remote, all };
>>>
>>>  enum group_type {
>>> -	group_other = 0,
>>> +	group_has_spare = 0,
>>> +	group_fully_busy,
>>>  	group_misfit_task,
>>> +	group_asym_capacity,
>>>  	group_imbalanced,
>>>  	group_overloaded,
>>>  };
>>>
>>> +enum group_migration {
>>> +	migrate_task = 0,
>>> +	migrate_util,
>>> +	migrate_load,
>>> +	migrate_misfit,
>>
>> Can't we have only 3 imbalance types (task, util, load), and make misfit
>> fall in that first one? Arguably it is a special kind of task balance,
>> since it would go straight for the active balance, but it would fit a
>> `migrate_task` imbalance with a "go straight for active balance" flag
>> somewhere.
>
> migrate_misfit uses its own special condition to detect the task that
> can be pulled, compared to the other types.

Since misfit is about migrating running tasks, a `migrate_task` imbalance
with a flag that goes straight to active balancing should work, no? See
the sketch below.
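
Roughly what I have in mind - just a sketch, and note that env->balance_type
and LBF_ACTIVE_BALANCE are made-up names here:

,----
| enum group_migration {
| 	migrate_task = 0,
| 	migrate_util,
| 	migrate_load,
| 	/* no migrate_misfit: folded into migrate_task */
| };
|
| /* somewhere in the calculate_imbalance() path */
| if (busiest->group_type == group_misfit_task) {
| 	/* A misfit task is running, so only active balance can move it */
| 	env->balance_type = migrate_task;
| 	env->flags |= LBF_ACTIVE_BALANCE;	/* hypothetical flag */
| 	env->imbalance = 1;
| }
`----

Since active balance migrates the running task directly, detach_tasks()
wouldn't need a misfit-specific pick condition at all.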
[...]

>> Rather than filling the local group, shouldn't we follow the same
>> strategy as for load, IOW try to reach an average without pushing local
>> above nor busiest below?
>
> But we don't know if this will be enough to make the busiest group not
> overloaded anymore.
>
> This is a transient state: a group is overloaded, another one has spare
> capacity. How to balance the system will depend on how much overload is
> in the group, and we don't know this value. The only solution is to:
> - try to pull as many tasks as possible to fill the spare capacity;
> - is the group still overloaded? Then use avg_load to balance the
>   system, because both groups will be overloaded;
> - is the group no longer overloaded? Then balance the number of idle
>   CPUs.
>
>> We could build an sds->avg_util similar to sds->avg_load.
>
> When there is spare capacity, we balance the number of idle CPUs.

What if there is spare capacity but no idle CPUs? In scenarios like this we
should balance utilization. We could wait for a newidle balance to happen,
but it'd be a shame to repeatedly do this when we could preemptively
balance utilization.
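
For illustration, here's what that could look like - again just a sketch;
sds->total_util, the avg_util fields and the idle_cpus check are all
hypothetical, mirroring the existing avg_load computation:

,----
| /* Hypothetical mirror of sds->avg_load, in capacity units */
| sds->avg_util = (SCHED_CAPACITY_SCALE * sds->total_util) /
| 		sds->total_capacity;
|
| if (busiest->group_type == group_has_spare && !local->idle_cpus) {
| 	/*
| 	 * Spare capacity but no idle CPU to hand out: pull
| 	 * utilization up to the domain average instead of
| 	 * balancing idle CPU counts.
| 	 */
| 	env->balance_type = migrate_util;
| 	env->imbalance = max_t(long, 0,
| 			(sds->avg_util - local->avg_util) *
| 			local->group_capacity / SCHED_CAPACITY_SCALE);
| }
`----

That would let a busy-but-not-overloaded domain even out utilization
without waiting for a CPU to go idle first.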