Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote: > and then > se->vruntime = max_vruntime(se->vruntime, vruntime) > First things first, I was wrong to assume a "boost" in the CFS code. So I dug a bit deeper and tried to pinpoint what the difference between CFS and EEVDF actually is. I found the following: Let's assume we have two tasks taking turns on a single CPU. Task 1 is always runnable. Task 2 gets woken up by task 1 and goes back to sleep when it is done. This means, task 1 runs, wakes up task 2, task 2 runs, goes to sleep and task 1 runs again and we repeat. Most of the time: runtime(task1) > runtime(task2) Rare occasions: runtime(task1) < runtime(task2) So, task 1 usually consumes more of its designated time slices until it gets rescheduled by the wakeup of task2 than task 2 does. But both never consume their full time slice. Rather the opposite, both run for low 5-digit ns or less. So something like this: task 1|--||-||--... task 2 || || This creates different behaviors under CFS and EEVDF: ### CFS In CFS the difference in runtimes means that task 2 cannot catch up with task 1 vruntime-wise With every exchange between task 1 and task 2, task 2 falls back more on vruntime. Once a difference in the magnitude of sysctl_sched_latency is established, the difference remains stable due to the max handling in place_entity. Occasionally, task 2 may run longer than task 1. In those cases, it will catch up slightly. But in the majority of cases, task 2 runs shorter, thereby increasing the difference in vruntime. This would explain why task 2 gets always scheduled immediately on wakeup. ### EEVDF ## The rare occasions where task 2 runs longer than task 1 seem to cause issues with EEVDF: In the regular case where task 1 runs longer than task 2. Task 2 gets a positive lag and is selected on wake up --> good. In the irregular case where task 2 runs longer than task 1 task 2 now gets a negative lag and is no longer chosen on wakeup --> bad (in some cases). This would explain why task 2 gets not selected on wake up occasionally. ### Summary So my wording, that a woken up task gets "boosted" was obviously wrong. Task 2 is not getting boosted in CFS, it gets "outrun" by task 1, with no chance of catching up. Leaving it with a smaller vruntime value. EEVDF on the other hand, does not allow lag to accumulate if an entity, like task 2 in this case, regularly dequeues itself. So it will always have a lag with an upper boundary of whatever difference it encountered in comparison to the runtime with task 1. The patch below, allows tasks to accumulate lag over time. This fixes the original regression, that made me stumble into this topic. But, this might of course come with arbitrary side effects. I'm not suggesting to actually implement this, but would like to confirm whether my understanding is correct that this is the aspect where CFS and EEVDF differ, where CFS is more aware of the past in this particular case than EEVDF is. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 03be0d1330a6..b83a72311d2a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) s64 lag, limit; SCHED_WARN_ON(!se->on_rq); - lag = avg_vruntime(cfs_rq) - se->vruntime; + lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime; limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se); se->vlag = clamp(lag, -limit, limit);
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote: > On Wed, 20 Mar 2024 at 08:04, Tobias Huschle wrote: > > There was no guarantee of course. place_entity was reducing the vruntime of > > woken up tasks though, giving it a slight boost, right?. For the scenario > > It was rather the opposite, It was ensuring that long sleeping tasks > will not get too much bonus because of vruntime too far in the past. > This is similar although not exactly the same intent as the lag. The > bonus was up to 24ms previously whereas it's not more than a slice now > I might have gotten this quite wrong then. I was looking at place_entity and saw that non-initial placements get their vruntime reduced via vruntime -= thresh; which would mean that the placed task would have a vruntime smaller than cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at: af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime If there was no such benefit for woken up tasks. Then the scenario I observed is just conincidentally worse with EEVDF, which can happen when exchanging an algorithm I suppose. Or EEVDF just exposes a so far hidden problem in that scenario.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote: > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote: > > > > On 2024-03-18 15:45, Luis Machado wrote: > > > On 3/14/24 13:45, Tobias Huschle wrote: > > >> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote: > > >>> On 2/28/24 16:10, Tobias Huschle wrote: > > >>>> > > >>>> Questions: > > >>>> 1. The kworker getting its negative lag occurs in the following > > >>>> scenario > > >>>>- kworker and a cgroup are supposed to execute on the same CPU > > >>>>- one task within the cgroup is executing and wakes up the > > >>>> kworker > > >>>>- kworker with 0 lag, gets picked immediately and finishes its > > >>>> execution within ~5000ns > > >>>>- on dequeue, kworker gets assigned a negative lag > > >>>>Is this expected behavior? With this short execution time, I > > >>>> would > > >>>>expect the kworker to be fine. > > >>> > > >>> That strikes me as a bit odd as well. Have you been able to determine > > >>> how a negative lag > > >>> is assigned to the kworker after such a short runtime? > > >>> > > >> > > >> I did some more trace reading though and found something. > > >> > > >> What I observed if everything runs regularly: > > >> - vhost and kworker run alternating on the same CPU > > >> - if the kworker is done, it leaves the runqueue > > >> - vhost wakes up the kworker if it needs it > > >> --> this means: > > >> - vhost starts alone on an otherwise empty runqueue > > >> - it seems like it never gets dequeued > > >> (unless another unrelated task joins or migration hits) > > >> - if vhost wakes up the kworker, the kworker gets selected > > >> - vhost runtime > kworker runtime > > >> --> kworker gets positive lag and gets selected immediately next > > >> time > > >> > > >> What happens if it does go wrong: > > >> From what I gather, there seem to be occasions where the vhost either > > >> executes suprisingly quick, or the kworker surprinsingly slow. If > > >> these > > >> outliers reach critical values, it can happen, that > > >>vhost runtime < kworker runtime > > >> which now causes the kworker to get the negative lag. > > >> > > >> In this case it seems like, that the vhost is very fast in waking up > > >> the kworker. And coincidentally, the kworker takes, more time than > > >> usual > > >> to finish. We speak of 4-digit to low 5-digit nanoseconds. > > >> > > >> So, for these outliers, the scheduler extrapolates that the kworker > > >> out-consumes the vhost and should be slowed down, although in the > > >> majority > > >> of other cases this does not happen. > > > > > > Thanks for providing the above details Tobias. It does seem like EEVDF > > > is strict > > > about the eligibility checks and making tasks wait when their lags are > > > negative, even > > > if just a little bit as in the case of the kworker. > > > > > > There was a patch to disable the eligibility checks > > > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/), > > > which would make EEVDF more like EVDF, though the deadline comparison > > > would > > > probably still favor the vhost task instead of the kworker with the > > > negative lag. > > > > > > I'm not sure if you tried it, but I thought I'd mention it. > > > > Haven't seen that one yet. Unfortunately, it does not help to ignore the > > eligibility. > > > > I'm inclined to rather propose propose a documentation change, which > > describes that tasks should not rely on woken up tasks being scheduled > > immediately. > > Where do you see such an assumption ? Even before eevdf, there were > nothing that ensures such behavior. When using CFS (legacy or eevdf) > tasks, you can't know if the newly wakeup task will run 1st or not > There was no guarantee of course. place_entity was reducing the vruntime of woken up tasks though, giving it a slight boost, right?. For the scenario that I observed, that boost was enough to make sure, that the woken up tasks gets scheduled consistently. This might still not be true for all scenarios, but in general EEVDF seems to be stricter with woken up tasks. Dismissing the lag on wakeup also does obviously not guarantee getting scheduled, as other tasks might still be involved. The question would be if it should be explicitly mentioned somewhere that, at this point, woken up tasks are not getting any special treatment and noone should rely on that boost for woken up tasks. > > > > Changing things in the code to address for the specific scenario I'm > > seeing seems to mostly create unwanted side effects and/or would require > > the definition of some magic cut-off values. > > > >
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On 2024-03-18 15:45, Luis Machado wrote: On 3/14/24 13:45, Tobias Huschle wrote: On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote: On 2/28/24 16:10, Tobias Huschle wrote: Questions: 1. The kworker getting its negative lag occurs in the following scenario - kworker and a cgroup are supposed to execute on the same CPU - one task within the cgroup is executing and wakes up the kworker - kworker with 0 lag, gets picked immediately and finishes its execution within ~5000ns - on dequeue, kworker gets assigned a negative lag Is this expected behavior? With this short execution time, I would expect the kworker to be fine. That strikes me as a bit odd as well. Have you been able to determine how a negative lag is assigned to the kworker after such a short runtime? I did some more trace reading though and found something. What I observed if everything runs regularly: - vhost and kworker run alternating on the same CPU - if the kworker is done, it leaves the runqueue - vhost wakes up the kworker if it needs it --> this means: - vhost starts alone on an otherwise empty runqueue - it seems like it never gets dequeued (unless another unrelated task joins or migration hits) - if vhost wakes up the kworker, the kworker gets selected - vhost runtime > kworker runtime --> kworker gets positive lag and gets selected immediately next time What happens if it does go wrong: From what I gather, there seem to be occasions where the vhost either executes suprisingly quick, or the kworker surprinsingly slow. If these outliers reach critical values, it can happen, that vhost runtime < kworker runtime which now causes the kworker to get the negative lag. In this case it seems like, that the vhost is very fast in waking up the kworker. And coincidentally, the kworker takes, more time than usual to finish. We speak of 4-digit to low 5-digit nanoseconds. So, for these outliers, the scheduler extrapolates that the kworker out-consumes the vhost and should be slowed down, although in the majority of other cases this does not happen. Thanks for providing the above details Tobias. It does seem like EEVDF is strict about the eligibility checks and making tasks wait when their lags are negative, even if just a little bit as in the case of the kworker. There was a patch to disable the eligibility checks (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/), which would make EEVDF more like EVDF, though the deadline comparison would probably still favor the vhost task instead of the kworker with the negative lag. I'm not sure if you tried it, but I thought I'd mention it. Haven't seen that one yet. Unfortunately, it does not help to ignore the eligibility. I'm inclined to rather propose propose a documentation change, which describes that tasks should not rely on woken up tasks being scheduled immediately. Changing things in the code to address for the specific scenario I'm seeing seems to mostly create unwanted side effects and/or would require the definition of some magic cut-off values.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote: > On 2/28/24 16:10, Tobias Huschle wrote: > > > > Questions: > > 1. The kworker getting its negative lag occurs in the following scenario > >- kworker and a cgroup are supposed to execute on the same CPU > >- one task within the cgroup is executing and wakes up the kworker > >- kworker with 0 lag, gets picked immediately and finishes its > > execution within ~5000ns > >- on dequeue, kworker gets assigned a negative lag > >Is this expected behavior? With this short execution time, I would > >expect the kworker to be fine. > > That strikes me as a bit odd as well. Have you been able to determine how a > negative lag > is assigned to the kworker after such a short runtime? > I did some more trace reading though and found something. What I observed if everything runs regularly: - vhost and kworker run alternating on the same CPU - if the kworker is done, it leaves the runqueue - vhost wakes up the kworker if it needs it --> this means: - vhost starts alone on an otherwise empty runqueue - it seems like it never gets dequeued (unless another unrelated task joins or migration hits) - if vhost wakes up the kworker, the kworker gets selected - vhost runtime > kworker runtime --> kworker gets positive lag and gets selected immediately next time What happens if it does go wrong: >From what I gather, there seem to be occasions where the vhost either executes suprisingly quick, or the kworker surprinsingly slow. If these outliers reach critical values, it can happen, that vhost runtime < kworker runtime which now causes the kworker to get the negative lag. In this case it seems like, that the vhost is very fast in waking up the kworker. And coincidentally, the kworker takes, more time than usual to finish. We speak of 4-digit to low 5-digit nanoseconds. So, for these outliers, the scheduler extrapolates that the kworker out-consumes the vhost and should be slowed down, although in the majority of other cases this does not happen. Therefore this particular usecase would profit from being able to ignore such outliers, or being able to ignore a certain amount of difference in the lag values, i.e. introduce some grace value around the average runtime for which lag is not accounted. But not sure if I like that idea. So the negative lag can be somewhat justified, but for this particular case it leads to a problem where one outlier can cause havoc. As mentioned in the vhost discussion, it could also be argued that the vhost should not rely on the fact that the kworker gets always scheduled on wake up, since these timing issues can always happen. Hence, the two options: - offer the alternative strategy which dismisses lag on wake up for workloads where we know that a task usually finishes faster than others but should not be punished by rare outliers (if that is predicatble, I don't know) - require vhost to adress this issue on their side (if possible without creating an armada of side effects) (plus the third one mentioned above, but that requires a magic cutoff value, meh) > I was looking at a different thread > (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.c...@intel.com/) > that > uncovers a potential overflow in the eligibility calculation. Though I don't > think that is the case for this particular > vhost problem. Yea, the numbers I see do not look very overflowy.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Thu, Feb 29, 2024 at 09:06:16AM +0530, K Prateek Nayak wrote: > (+ Xuewen Yan, Ke Wang) > > Hello Tobias, > <...> > > > > Questions: > > 1. The kworker getting its negative lag occurs in the following scenario > >- kworker and a cgroup are supposed to execute on the same CPU > >- one task within the cgroup is executing and wakes up the kworker > >- kworker with 0 lag, gets picked immediately and finishes its > > execution within ~5000ns > >- on dequeue, kworker gets assigned a negative lag > >Is this expected behavior? With this short execution time, I would > >expect the kworker to be fine. > >For a more detailed discussion on this symptom, please see: > >https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/ > > Does the lag clamping path from Xuewen Yan [1] work for the vhost case > mentioned in the thread? Instead of placing the task just behind the > 0-lag point, clamping the lag seems to be more principled approach since > EEVDF already does it in update_entity_lag(). > > If the lag is still too large, maybe the above coupled with Peter's > delayed dequeue patch can help [2] (Note: tree is prone to force > updates) > > [1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen@unisoc.com/ > [2] > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf=e62ef63a888c97188a977daddb72b61548da8417 I tried Peter's patches a while ago. Unfortunately, reducing the lag is not sufficient in this particular case. The calling entity expects the woken up kworker to run instantly. In order to have a chance that the woken up kworker is scheduled right away, the kworker must not have any negative lag. To guarantee it being scheduled it should even have a positive lag which allows it to pass all other entities on the queue. Therefore I proposed to just wipe the negative lag in these cases, which seems to map to strategy #2 of the underlying paper. The other way to think about this would be: The assumption, that woken up tasks get a high probability to run is no longer valid. In that case, the entity triggering the wake up has to explicitly give up the CPU. If there are no other tasks, apart from the 2 involved so far, has good chances of being scheduled. If the runqueue is busy, other tasks might intervene. I keep playing around with these options, but potential side effects are worrying me. > <...>
[RFC] sched/eevdf: avoid task starvation in cgroups
When running update_curr, it is checked whether the current task has missed its deadline (update_deadline). If the deadline has been crossed, the task is set to be rescheduled if there are other tasks available on its cfs_rq. This can cause task starvation in some cgroup configurations. Assume the following scenario: [ ] rq of CPU | [ ] cfs_rq1 / \ cfs_rq2 [ ] othertask | curr In this case, update_curr is called for all levels of the hierarchy, starting at the leaf. Assume that curr is a kernel task which does not give up the CPU voluntarily, i.e. loops indefinitely unless forced to give up the CPU. Assume further that curr has actually reached its deadline, the expected behavior would be, that the scheduler determines that curr would now start to overconsume and therefore should set the need_resched flag to nudge the current task to allow a reschedule in favor of othertask. To reach the expected behavior, it is not sufficient to check whether other tasks are queued in the cfs_rq, which curr is assigned to. In the configuration shown above, each run queue sees: rq_cpu: 2 tasks cfs_rq1: 2 tasks cfs_rq2: 1 task This means that cfs_rq2 will never reschedule on its own as it sees no other tasks that would be worth rescheduling for. Hence, the call to reschedule relies on the upper levels in the hierarchy. cfs_rq1 sees 1 additional task and could therefore take the desired decision. But, cfs_rq1 takes the additional task into account when computing its own deadline, which means, its deadline lies further in the future. This causes that its deadline is not being reached. Therefore cfs_rq1 does not even check for other potential tasks to be ran. It could now be assumed that cfs_rq1 will reach its deadline at some point. But, after curr has consumed its timeslice, all sched_entities in the cgroup tree get reweighted. This causes the deadlines of all entities in the hierarchy to be shifted further into the future. This means especially, that the deadline of cfs_rq1 also gets pushed into the future, causing it to never reach its deadline. The decision whether a reweight needs to be done depends on the weight of the sched_entity and a recalculated value of the shares the sched_entity shall expect (see calc_group_shares in fair.c). These values are, in the described scenario, always different, hence, reweight is done. The combination of these two circumstances can lead to curr running indefinitely unless interrupted by an external entity. Address this issue by rather checking for the nr_running value of the rq of the CPU itself, i.e. if there is any other task, somewhere on the CPU runqueue, give it a chance to execute. The scenario that made me discover this is being discussed here: https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/ Questions: 1. Is this behavior by design? I couldn't find an explanation why we only check for the local cfs_rq. Do we only want to allow take-overs from the same cgroup hierarchy level in that case? 2. This problem is referred to as the "hog"-problem by Ankur Arora here: https://lore.kernel.org/lkml/2024021304.1802415-24-ankur.a.ar...@oracle.com/ Might there be a connection, it's the same piece of code in the end. I'd also prefer the boolean passing to avoid calling resched_curr too often. 3. Which are the criteria for which I should expect a reweight of the cgroup hierarchy? It seems unintuitive too me, that the shares value keeps changing every time. Feedback and opinions would be highly appreciated. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61c4ef20a2f8..e9733ef7964a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1015,7 +1015,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { + if (rq_of(cfs_rq)->nr_running > 1) { resched_curr(rq_of(cfs_rq)); clear_buddies(cfs_rq, se); } -- 2.34.1
[RFC] sched/eevdf: sched feature to dismiss lag on wakeup
The previously used CFS scheduler gave tasks that were woken up an enhanced chance to see runtime immediately by deducting a certain value from its vruntime on runqueue placement during wakeup. This property was used by some, at least vhost, to ensure, that certain kworkers are scheduled immediately after being woken up. The EEVDF scheduler, does not support this so far. Instead, if such a woken up entitiy carries a negative lag from its previous execution, it will have to wait for the current time slice to finish, which affects the performance of the process expecting the immediate execution negatively. To address this issue, implement EEVDF strategy #2 for rejoining entities, which dismisses the lag from previous execution and allows the woken up task to run immediately (if no other entities are deemed to be preferred for scheduling by EEVDF). The vruntime is decremented by an additional value of 1 to make sure, that the woken up tasks gets to actually run. This is of course not following strategy #2 in an exact manner but guarantees the expected behavior for the scenario described above. Without the additional decrement, the performance goes south even more. So there are some side effects I could not get my head around yet. Questions: 1. The kworker getting its negative lag occurs in the following scenario - kworker and a cgroup are supposed to execute on the same CPU - one task within the cgroup is executing and wakes up the kworker - kworker with 0 lag, gets picked immediately and finishes its execution within ~5000ns - on dequeue, kworker gets assigned a negative lag Is this expected behavior? With this short execution time, I would expect the kworker to be fine. For a more detailed discussion on this symptom, please see: https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/ 2. The proposed code change of course only addresses the symptom. Am I assuming correctly that this is in general the exepected behavior and that the task waking up the kworker should rather do an explicit reschedule of itself to grant the kworker time to execute? In the vhost case, this is currently attempted through a cond_resched which is not doing anything because the need_resched flag is not set. Feedback and opinions would be highly appreciated. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 5 + kernel/sched/features.h | 1 + 2 files changed, 6 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 533547e3c90a..c20ae6d62961 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) lag = div_s64(lag, load); } + if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) { + se->vlag = 0; + lag = 1; + } + se->vruntime = vruntime - lag; /* diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 143f55df890b..d3118e7568b4 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -7,6 +7,7 @@ SCHED_FEAT(PLACE_LAG, true) SCHED_FEAT(PLACE_DEADLINE_INITIAL, true) SCHED_FEAT(RUN_TO_PARITY, true) +SCHED_FEAT(NOLAG_WAKEUP, true) /* * Prefer to schedule the task we woke last (assuming it failed -- 2.34.1
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-07-07 16:33, Shrikanth Hegde wrote: On 7/7/23 1:14 PM, Tobias Huschle wrote: On 2023-07-05 09:52, Vincent Guittot wrote: Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit : On 2023-05-16 15:36, Vincent Guittot wrote: > On Mon, 15 May 2023 at 13:46, Tobias Huschle > wrote: > > > > The current load balancer implementation implies that scheduler > > groups, > > within the same domain, all host the same number of CPUs. This is > > reflected in the condition, that a scheduler group, which is load > > balancing and classified as having spare capacity, should pull work > > from the busiest group, if the local group runs less processes than > > the busiest one. This implies that these two groups should run the > > same number of processes, which is problematic if the groups are not > > of the same size. > > > > The assumption that scheduler groups within the same scheduler domain > > host the same number of CPUs appears to be true for non-s390 > > architectures. Nevertheless, s390 can have scheduler groups of unequal > > size. > > > > This introduces a performance degredation in the following scenario: > > > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, > > the remaining 2 are located on another socket: > > > > Socket -1- -2- > > CPU 1 2 3 4 5 6 7 8 > > > > Placing some workload ( x = one task ) yields the following > > scenarios: > > > > The first 5 tasks are distributed evenly across the two groups. > > > > Socket -1- -2- > > CPU 1 2 3 4 5 6 7 8 > > x x x x x > > > > Adding a 6th task yields the following distribution: > > > > Socket -1- -2- > > CPU 1 2 3 4 5 6 7 8 > > SMT1 x x x x x > > SMT2 x > > Your description is a bit confusing for me. What you name CPU above > should be named Core, doesn' it ? > > Could you share with us your scheduler topology ? > You are correct, it should say core instead of CPU. One actual configuration from one of my test machines (lscpu -e): [...] So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. Thaks for the details If I run stress-ng with 8 cpu stressors on the original code I get a distribution like this: 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 x x x x x x x x Which means that the two cores in the smaller group are running into SMT while two cores in the larger group are still idle. This is caused by the prefer_sibling path which really wants to see both groups run the same number of tasks. yes and it considers that there are the same number of CPUs per group > > > > The task is added to the 2nd scheduler group, as the scheduler has the > > assumption that scheduler groups are of the same size, so they should > > also host the same number of tasks. This makes CPU 7 run into SMT > > thread, which comes with a performance penalty. This means, that in > > the window of 6-8 tasks, load balancing is done suboptimally, because > > SMT is used although there is no reason to do so as fully idle CPUs > > are still available. > > > > Taking the weight of the scheduler groups into account, ensures that > > a load balancing CPU within a smaller group will not try to pull tasks > > from a bigger group while the bigger group still has idle CPUs > > available. > > > > Signed-off-by: Tobias Huschle > > --- > > kernel/sched/fair.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 48b6f0ca13ac..b1307d7e4065 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -10426,7 +10426,8 @@ static struct sched_group > > *find_busiest_group(struct lb_env *env) > > * group's child domain. > > */ > > if (sds.prefer_sibling && local->group_type == > > group_has_spare && > > - busiest->sum_nr_running > local->sum_nr_running + 1) > > + busiest->sum_nr_running * local->group_weight > > > + local->sum_nr_running * > > busiest->group_weight + 1) what you want to test here is that moving 1 task from busiest to local group would help and balance the ratio of tasks per cpu (busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight which can be develop into busiest->sum_nr_running * local->group_weight >= local-&g
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-07-06 19:19, Shrikanth Hegde wrote: On 5/15/23 5:16 PM, Tobias Huschle wrote: The current load balancer implementation implies that scheduler groups, within the same domain, all host the same number of CPUs. This is reflected in the condition, that a scheduler group, which is load balancing and classified as having spare capacity, should pull work from the busiest group, if the local group runs less processes than the busiest one. This implies that these two groups should run the same number of processes, which is problematic if the groups are not of the same size. The assumption that scheduler groups within the same scheduler domain host the same number of CPUs appears to be true for non-s390 architectures. Nevertheless, s390 can have scheduler groups of unequal size. This introduces a performance degredation in the following scenario: Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, the remaining 2 are located on another socket: Socket -1--2- CPU 1 2 3 4 5 67 8 Placing some workload ( x = one task ) yields the following scenarios: The first 5 tasks are distributed evenly across the two groups. Socket -1--2- CPU 1 2 3 4 5 67 8 x x x x x Adding a 6th task yields the following distribution: Socket -1--2- CPU 1 2 3 4 5 67 8 SMT1 x x x x x SMT2x The task is added to the 2nd scheduler group, as the scheduler has the assumption that scheduler groups are of the same size, so they should also host the same number of tasks. This makes CPU 7 run into SMT thread, which comes with a performance penalty. This means, that in the window of 6-8 tasks, load balancing is done suboptimally, because SMT is used although there is no reason to do so as fully idle CPUs are still available. Taking the weight of the scheduler groups into account, ensures that a load balancing CPU within a smaller group will not try to pull tasks from a bigger group while the bigger group still has idle CPUs available. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 48b6f0ca13ac..b1307d7e4065 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type == group_has_spare && - busiest->sum_nr_running > local->sum_nr_running + 1) + busiest->sum_nr_running * local->group_weight > + local->sum_nr_running * busiest->group_weight + 1) goto force_balance; I assume its SMT2 here. sched group is enclosed in [busy_cpus/idle_cpus/weight] Lets take an example: we will begin the with the said imbalance. [3/9/12] -- local group [3/1/4] -- busy group. we will evaluate 3*12 > (4*(3+1)) is true and proceeds further to balance. but calculate_imbalance will lead to zero as the imbalance no? in case of prefer sibling case it find the difference of sum_nr_running of the two groups. It will be 3-3 = 0 this would need modifications to calculate_imbalance. Maybe something like this will help? NOT TESTED AT ALL. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..e027d4086edc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s return; } - if (busiest->group_weight == 1 || sds->prefer_sibling) { + if (busiest->group_weight == 1 || + (sds->prefer_sibling && +busiest->group_weight == local->group_weight)) { unsigned int nr_diff = busiest->sum_nr_running; /* * When prefer sibling, evenly spread running tasks on @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s env->migration_type = migrate_task; lsub_positive(_diff, local->sum_nr_running); env->imbalance = nr_diff; + } + if (sds->prefer_sibling && + busiest->group_weight != local->group_weight) { + env->migration_type = migrate_task; + env->imbalance = 1; } else { I also had a look at this when Vincent pointed out that this part is missing. The formula proposed by Vincent works pretty well, it is not even necessary to add additional if-statements here.
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-07-04 15:40, Peter Zijlstra wrote: On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote: The current load balancer implementation implies that scheduler groups, within the same domain, all host the same number of CPUs. This is reflected in the condition, that a scheduler group, which is load balancing and classified as having spare capacity, should pull work from the busiest group, if the local group runs less processes than the busiest one. This implies that these two groups should run the same number of processes, which is problematic if the groups are not of the same size. The assumption that scheduler groups within the same scheduler domain host the same number of CPUs appears to be true for non-s390 architectures. Mostly; there's historically the cpuset case where we can create lopsided groups like that. And today we're growing all these hybrid things that will also tickle this, except they're looking towards different balancer extentions to deal with the IPC difference so might not be immediately caring about this here issue. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 48b6f0ca13ac..b1307d7e4065 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type == group_has_spare && - busiest->sum_nr_running > local->sum_nr_running + 1) + busiest->sum_nr_running * local->group_weight > + local->sum_nr_running * busiest->group_weight + 1) Should that not be: busiest->group_weight * (local->sum_nr_running + 1) ? I agree, adding the brackets makes more sense and is clearer on what's intended by this check while also preserving the original behavior for local->group_weight == busiest->group_weight I'm not opposed to this -- it seems fairly straight forward. Appreciated, I will go ahead and send a patch once I incorporated the other feedback I got. Thanks. goto force_balance; if (busiest->group_type != group_overloaded) { -- 2.34.1
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-07-05 09:52, Vincent Guittot wrote: Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit : On 2023-05-16 15:36, Vincent Guittot wrote: > On Mon, 15 May 2023 at 13:46, Tobias Huschle > wrote: > > > > The current load balancer implementation implies that scheduler > > groups, > > within the same domain, all host the same number of CPUs. This is > > reflected in the condition, that a scheduler group, which is load > > balancing and classified as having spare capacity, should pull work > > from the busiest group, if the local group runs less processes than > > the busiest one. This implies that these two groups should run the > > same number of processes, which is problematic if the groups are not > > of the same size. > > > > The assumption that scheduler groups within the same scheduler domain > > host the same number of CPUs appears to be true for non-s390 > > architectures. Nevertheless, s390 can have scheduler groups of unequal > > size. > > > > This introduces a performance degredation in the following scenario: > > > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, > > the remaining 2 are located on another socket: > > > > Socket -1--2- > > CPU 1 2 3 4 5 67 8 > > > > Placing some workload ( x = one task ) yields the following > > scenarios: > > > > The first 5 tasks are distributed evenly across the two groups. > > > > Socket -1--2- > > CPU 1 2 3 4 5 67 8 > > x x x x x > > > > Adding a 6th task yields the following distribution: > > > > Socket -1--2- > > CPU 1 2 3 4 5 67 8 > > SMT1 x x x x x > > SMT2x > > Your description is a bit confusing for me. What you name CPU above > should be named Core, doesn' it ? > > Could you share with us your scheduler topology ? > You are correct, it should say core instead of CPU. One actual configuration from one of my test machines (lscpu -e): [...] So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. Thaks for the details If I run stress-ng with 8 cpu stressors on the original code I get a distribution like this: 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 x x x x x x x x Which means that the two cores in the smaller group are running into SMT while two cores in the larger group are still idle. This is caused by the prefer_sibling path which really wants to see both groups run the same number of tasks. yes and it considers that there are the same number of CPUs per group > > > > The task is added to the 2nd scheduler group, as the scheduler has the > > assumption that scheduler groups are of the same size, so they should > > also host the same number of tasks. This makes CPU 7 run into SMT > > thread, which comes with a performance penalty. This means, that in > > the window of 6-8 tasks, load balancing is done suboptimally, because > > SMT is used although there is no reason to do so as fully idle CPUs > > are still available. > > > > Taking the weight of the scheduler groups into account, ensures that > > a load balancing CPU within a smaller group will not try to pull tasks > > from a bigger group while the bigger group still has idle CPUs > > available. > > > > Signed-off-by: Tobias Huschle > > --- > > kernel/sched/fair.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 48b6f0ca13ac..b1307d7e4065 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -10426,7 +10426,8 @@ static struct sched_group > > *find_busiest_group(struct lb_env *env) > > * group's child domain. > > */ > > if (sds.prefer_sibling && local->group_type == > > group_has_spare && > > - busiest->sum_nr_running > local->sum_nr_running + 1) > > + busiest->sum_nr_running * local->group_weight > > > + local->sum_nr_running * > > busiest->group_weight + 1) what you want to test here is that moving 1 task from busiest to local group would help and balance the ratio of tasks per cpu (busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight which can be develop into busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight and
Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-05-16 18:35, Dietmar Eggemann wrote: On 15/05/2023 13:46, Tobias Huschle wrote: The current load balancer implementation implies that scheduler groups, within the same scheduler domain, all host the same number of CPUs. This appears to be valid for non-s390 architectures. Nevertheless, s390 can actually have scheduler groups of unequal size. Arm (classical) big.Little had this for years before we switched to flat scheduling (only MC sched domain) over CPU capacity boundaries for Arm DynamIQ. Arm64 Juno platform in mainline: root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}' domain0 39 <-- domain1 3f domain0 06 <-- domain1 3f domain0 06 domain1 3f domain0 39 domain1 3f domain0 39 domain1 3f domain0 39 domain1 3f root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name MC DIE But we don't have SMT on the mobile processors. It looks like you are only interested to get group_weight dependency into this 'prefer_sibling' condition of find_busiest_group()? Sorry, looks like your reply hit some bad filter of my mail program. Let me answer, although it's a bit late. Yes, I would like to get the group_weight into the prefer_sibling path. Unfortunately, we cannot go for a flat hierarchy as the s390 hardware allows to have CPUs to be pretty far apart (cache-wise), which means, the load balancer should avoid to move tasks back and forth between those CPUs if possible. We can't remove SD_PREFER_SIBLING either, as this would cause the load balancer to aim for having the same number of idle CPUs in all groups, which is a problem as well in asymmetric groups, for example: With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 x x x x x x x x Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 x x x x x x x x Hence the idea to add the group_weight to the prefer_sibling path. I was wondering if this would be the right place to address this issue or if I should go down another route. We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove SD_PREFER_SIBLING from sd->child so we don't run this condition. The current scheduler behavior causes some s390 configs to use SMT while some cores are still idle, leading to a performance degredation under certain levels of workload. Please refer to the patch's commit message for more details and an example. This patch is a proposal on how to integrate the size of scheduler groups into the decision process. This patch is the most basic approach to address this issue and does not claim to be perfect as-is. Other ideas that also proved to address the problem but are more complex but also potentially more precise: 1. On scheduler group building, count the number of CPUs within each group that are first in their sibling mask. This represents the number of CPUs that can be used before running into SMT. This should be slightly more accurate than using the full group weight if the number of available SMT threads per core varies. 2. Introduce a new scheduler group classification (smt_busy) in between of fully_busy and has_spare. This classification would indicate that a group still has spare capacity, but will run into SMT when using that capacity. This would make the load balancer prefer groups with fully idle CPUs over ones that are about to run into SMT. Feedback would be greatly appreciated. Tobias Huschle (1): sched/fair: Consider asymmetric scheduler groups in load balancer kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 2023-05-16 15:36, Vincent Guittot wrote: On Mon, 15 May 2023 at 13:46, Tobias Huschle wrote: The current load balancer implementation implies that scheduler groups, within the same domain, all host the same number of CPUs. This is reflected in the condition, that a scheduler group, which is load balancing and classified as having spare capacity, should pull work from the busiest group, if the local group runs less processes than the busiest one. This implies that these two groups should run the same number of processes, which is problematic if the groups are not of the same size. The assumption that scheduler groups within the same scheduler domain host the same number of CPUs appears to be true for non-s390 architectures. Nevertheless, s390 can have scheduler groups of unequal size. This introduces a performance degredation in the following scenario: Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, the remaining 2 are located on another socket: Socket -1--2- CPU 1 2 3 4 5 67 8 Placing some workload ( x = one task ) yields the following scenarios: The first 5 tasks are distributed evenly across the two groups. Socket -1--2- CPU 1 2 3 4 5 67 8 x x x x x Adding a 6th task yields the following distribution: Socket -1--2- CPU 1 2 3 4 5 67 8 SMT1 x x x x x SMT2x Your description is a bit confusing for me. What you name CPU above should be named Core, doesn' it ? Could you share with us your scheduler topology ? You are correct, it should say core instead of CPU. One actual configuration from one of my test machines (lscpu -e): CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS 00 00 00 0:0:0 yes yeshorizontal 0 10 00 00 1:1:0 yes yeshorizontal 1 20 00 01 2:2:0 yes yeshorizontal 2 30 00 01 3:3:0 yes yeshorizontal 3 40 00 02 4:4:0 yes yeshorizontal 4 50 00 02 5:5:0 yes yeshorizontal 5 60 00 03 6:6:0 yes yeshorizontal 6 70 00 03 7:7:0 yes yeshorizontal 7 80 00 04 8:8:0 yes yeshorizontal 8 90 00 04 9:9:0 yes yeshorizontal 9 100 00 05 10:10:0 yes yeshorizontal 10 110 00 05 11:11:0 yes yeshorizontal 11 120 00 16 12:12:0 yes yeshorizontal 12 130 00 16 13:13:0 yes yeshorizontal 13 140 00 17 14:14:0 yes yeshorizontal 14 150 00 17 15:15:0 yes yeshorizontal 15 So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. If I run stress-ng with 8 cpu stressors on the original code I get a distribution like this: 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 x x x x x x x x Which means that the two cores in the smaller group are running into SMT while two cores in the larger group are still idle. This is caused by the prefer_sibling path which really wants to see both groups run the same number of tasks. The task is added to the 2nd scheduler group, as the scheduler has the assumption that scheduler groups are of the same size, so they should also host the same number of tasks. This makes CPU 7 run into SMT thread, which comes with a performance penalty. This means, that in the window of 6-8 tasks, load balancing is done suboptimally, because SMT is used although there is no reason to do so as fully idle CPUs are still available. Taking the weight of the scheduler groups into account, ensures that a load balancing CPU within a smaller group will not try to pull tasks from a bigger group while the bigger group still has idle CPUs available. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 48b6f0ca13ac..b1307d7e4065 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type == group_has_spare && - busiest->sum_nr_running > local->sum_nr_running + 1) + busiest->sum_nr_running * local->group_weight > + local->sum_nr_running * busiest->group_weight + 1) This is the prefer_sibling path. Could it be that you should
[RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
The current load balancer implementation implies that scheduler groups, within the same domain, all host the same number of CPUs. This is reflected in the condition, that a scheduler group, which is load balancing and classified as having spare capacity, should pull work from the busiest group, if the local group runs less processes than the busiest one. This implies that these two groups should run the same number of processes, which is problematic if the groups are not of the same size. The assumption that scheduler groups within the same scheduler domain host the same number of CPUs appears to be true for non-s390 architectures. Nevertheless, s390 can have scheduler groups of unequal size. This introduces a performance degredation in the following scenario: Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, the remaining 2 are located on another socket: Socket -1--2- CPU 1 2 3 4 5 67 8 Placing some workload ( x = one task ) yields the following scenarios: The first 5 tasks are distributed evenly across the two groups. Socket -1--2- CPU 1 2 3 4 5 67 8 x x x x x Adding a 6th task yields the following distribution: Socket -1--2- CPU 1 2 3 4 5 67 8 SMT1 x x x x x SMT2x The task is added to the 2nd scheduler group, as the scheduler has the assumption that scheduler groups are of the same size, so they should also host the same number of tasks. This makes CPU 7 run into SMT thread, which comes with a performance penalty. This means, that in the window of 6-8 tasks, load balancing is done suboptimally, because SMT is used although there is no reason to do so as fully idle CPUs are still available. Taking the weight of the scheduler groups into account, ensures that a load balancing CPU within a smaller group will not try to pull tasks from a bigger group while the bigger group still has idle CPUs available. Signed-off-by: Tobias Huschle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 48b6f0ca13ac..b1307d7e4065 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * group's child domain. */ if (sds.prefer_sibling && local->group_type == group_has_spare && - busiest->sum_nr_running > local->sum_nr_running + 1) + busiest->sum_nr_running * local->group_weight > + local->sum_nr_running * busiest->group_weight + 1) goto force_balance; if (busiest->group_type != group_overloaded) { -- 2.34.1
[RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer
The current load balancer implementation implies that scheduler groups, within the same scheduler domain, all host the same number of CPUs. This appears to be valid for non-s390 architectures. Nevertheless, s390 can actually have scheduler groups of unequal size. The current scheduler behavior causes some s390 configs to use SMT while some cores are still idle, leading to a performance degredation under certain levels of workload. Please refer to the patch's commit message for more details and an example. This patch is a proposal on how to integrate the size of scheduler groups into the decision process. This patch is the most basic approach to address this issue and does not claim to be perfect as-is. Other ideas that also proved to address the problem but are more complex but also potentially more precise: 1. On scheduler group building, count the number of CPUs within each group that are first in their sibling mask. This represents the number of CPUs that can be used before running into SMT. This should be slightly more accurate than using the full group weight if the number of available SMT threads per core varies. 2. Introduce a new scheduler group classification (smt_busy) in between of fully_busy and has_spare. This classification would indicate that a group still has spare capacity, but will run into SMT when using that capacity. This would make the load balancer prefer groups with fully idle CPUs over ones that are about to run into SMT. Feedback would be greatly appreciated. Tobias Huschle (1): sched/fair: Consider asymmetric scheduler groups in load balancer kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- 2.34.1