Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-04-09 Thread Tobias Huschle
On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote:
> and then
> se->vruntime = max_vruntime(se->vruntime, vruntime)
> 

First things first, I was wrong to assume a "boost" in the CFS code. So I
dug a bit deeper and tried to pinpoint what the difference between CFS and
EEVDF actually is. I found the following:

Let's assume we have two tasks taking turns on a single CPU.
Task 1 is always runnable.
Task 2 gets woken up by task 1 and goes back to sleep when it is done.
This means, task 1 runs, wakes up task 2, task 2 runs, goes to sleep and
task 1 runs again and we repeat.
Most of the time: runtime(task1) > runtime(task2)
Rare occasions:   runtime(task1) < runtime(task2)
So, task 1 usually consumes more of its designated time slice before it gets
rescheduled by the wakeup of task 2 than task 2 does. But neither ever consumes
its full time slice. Rather the opposite, both run for low 5-digit ns or less.

So something like this:

task 1|--||-||--...
task 2   || ||

This creates different behaviors under CFS and EEVDF:

### CFS ###

In CFS, the difference in runtimes means that task 2 cannot catch up with
task 1 vruntime-wise.

With every exchange between task 1 and task 2, task 2 falls further behind on
vruntime. Once a difference in the magnitude of sysctl_sched_latency is
established, the difference remains stable due to the max handling in
place_entity.

Occasionally, task 2 may run longer than task 1. In those cases, it
will catch up slightly. But in the majority of cases, task 2 runs
shorter, thereby increasing the difference in vruntime.

This would explain why task 2 always gets scheduled immediately on wakeup.
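
To make the mechanism concrete, here is a throwaway user-space model of the
exchange described above (made-up runtimes, an assumed 24ms sched_latency and
a grossly simplified min_vruntime; this is an illustration, not kernel code):

/* toy_vruntime_drift.c - build with: gcc -o drift toy_vruntime_drift.c */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SCHED_LATENCY_NS 24000000LL	/* assumed sysctl_sched_latency */

static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

int main(void)
{
	int64_t v1 = 0, v2 = 0;	/* vruntimes of task 1 and task 2 */

	for (int i = 0; i < 1000; i++) {
		/* most exchanges task 1 runs longer; rarely the opposite */
		int64_t r1 = (i % 50 == 0) ?  8000 : 20000;
		int64_t r2 = (i % 50 == 0) ? 20000 :  8000;

		v1 += r1;	/* task 1 runs, then wakes task 2 */
		/* CFS-like wakeup placement: at most sched_latency behind
		 * min_vruntime (here simply task 1's vruntime) */
		v2 = max64(v2, v1 - SCHED_LATENCY_NS);
		v2 += r2;	/* task 2 runs, then goes back to sleep */
	}
	printf("vruntime(task1) - vruntime(task2) = %" PRId64 " ns\n", v1 - v2);
	return 0;
}

The gap grows until the max handling caps it around sched_latency and then
stays there, which is the "outrunning" described above.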

### EEVDF ###

The rare occasions where task 2 runs longer than task 1 seem to cause
issues with EEVDF:

In the regular case, where task 1 runs longer than task 2, task 2 gets
a positive lag and is selected on wakeup --> good.
In the irregular case, where task 2 runs longer than task 1, task 2
now gets a negative lag and is no longer chosen on wakeup --> bad (in some
cases).

This would explain why task 2 occasionally does not get selected on wakeup.

### Summary ###

So my wording, that a woken up task gets "boosted", was obviously wrong.
Task 2 is not getting boosted in CFS, it gets "outrun" by task 1, with
no chance of catching up, leaving it with a smaller vruntime value.

EEVDF, on the other hand, does not allow lag to accumulate if an entity, like
task 2 in this case, regularly dequeues itself. So its lag will always be
bounded by whatever runtime difference it encountered in its most recent
exchange with task 1.

The patch below allows tasks to accumulate lag over time. This fixes the
original regression that made me stumble into this topic. But this might
of course come with arbitrary side effects.

I'm not suggesting to actually implement this, but would like to confirm 
whether my understanding is correct that this is the aspect where CFS and 
EEVDF differ, where CFS is more aware of the past in this particular case
than EEVDF is.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 lag, limit;
 
 	SCHED_WARN_ON(!se->on_rq);
-	lag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
 	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
 	se->vlag = clamp(lag, -limit, limit);


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-21 Thread Tobias Huschle
On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> On Wed, 20 Mar 2024 at 08:04, Tobias Huschle  wrote:
> > There was no guarantee of course. place_entity was reducing the vruntime of
> > woken up tasks though, giving it a slight boost, right?. For the scenario
> 
> It was rather the opposite, It was ensuring that long sleeping tasks
> will not get too much bonus because of vruntime too far in the past.
> This is similar although not exactly the same intent as the lag. The
> bonus was up to 24ms previously whereas it's not more than a slice now
> 

I might have gotten this quite wrong then. I was looking at place_entity
and saw that non-initial placements get their vruntime reduced via

vruntime -= thresh;

which would mean that the placed task would have a vruntime smaller than
cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:

   af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime
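
For reference, the relevant part of the pre-EEVDF place_entity looked roughly
like this (reproduced from memory and simplified, so take it as a sketch
rather than the verbatim code):

	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		/* halved for GENTLE_FAIR_SLEEPERS, which is the default */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* ensure we never gain time by being placed backwards */
	se->vruntime = max_vruntime(se->vruntime, vruntime);

So whether a woken up task actually ends up below cfs_rq->min_vruntime
depends on that max_vruntime clamp at the end.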

If there was no such benefit for woken up tasks, then the scenario I observed
is just coincidentally worse with EEVDF, which can happen when exchanging an
algorithm I suppose. Or EEVDF just exposes a so-far hidden problem in that
scenario.


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-20 Thread Tobias Huschle
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
> >
> > On 2024-03-18 15:45, Luis Machado wrote:
> > > On 3/14/24 13:45, Tobias Huschle wrote:
> > >> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> > >>> On 2/28/24 16:10, Tobias Huschle wrote:
> > >>>>
> > >>>> Questions:
> > >>>> 1. The kworker getting its negative lag occurs in the following
> > >>>> scenario
> > >>>>- kworker and a cgroup are supposed to execute on the same CPU
> > >>>>- one task within the cgroup is executing and wakes up the
> > >>>> kworker
> > >>>>- kworker with 0 lag, gets picked immediately and finishes its
> > >>>>  execution within ~5000ns
> > >>>>- on dequeue, kworker gets assigned a negative lag
> > >>>>Is this expected behavior? With this short execution time, I
> > >>>> would
> > >>>>expect the kworker to be fine.
> > >>>
> > >>> That strikes me as a bit odd as well. Have you been able to determine
> > >>> how a negative lag
> > >>> is assigned to the kworker after such a short runtime?
> > >>>
> > >>
> > >> I did some more trace reading though and found something.
> > >>
> > >> What I observed if everything runs regularly:
> > >> - vhost and kworker run alternating on the same CPU
> > >> - if the kworker is done, it leaves the runqueue
> > >> - vhost wakes up the kworker if it needs it
> > >> --> this means:
> > >>   - vhost starts alone on an otherwise empty runqueue
> > >>   - it seems like it never gets dequeued
> > >> (unless another unrelated task joins or migration hits)
> > >>   - if vhost wakes up the kworker, the kworker gets selected
> > >>   - vhost runtime > kworker runtime
> > >> --> kworker gets positive lag and gets selected immediately next
> > >> time
> > >>
> > >> What happens if it does go wrong:
> > >> From what I gather, there seem to be occasions where the vhost either
> > >> executes suprisingly quick, or the kworker surprinsingly slow. If
> > >> these
> > >> outliers reach critical values, it can happen, that
> > >>vhost runtime < kworker runtime
> > >> which now causes the kworker to get the negative lag.
> > >>
> > >> In this case it seems like, that the vhost is very fast in waking up
> > >> the kworker. And coincidentally, the kworker takes, more time than
> > >> usual
> > >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> > >>
> > >> So, for these outliers, the scheduler extrapolates that the kworker
> > >> out-consumes the vhost and should be slowed down, although in the
> > >> majority
> > >> of other cases this does not happen.
> > >
> > > Thanks for providing the above details Tobias. It does seem like EEVDF
> > > is strict
> > > about the eligibility checks and making tasks wait when their lags are
> > > negative, even
> > > if just a little bit as in the case of the kworker.
> > >
> > > There was a patch to disable the eligibility checks
> > > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > > which would make EEVDF more like EVDF, though the deadline comparison
> > > would
> > > probably still favor the vhost task instead of the kworker with the
> > > negative lag.
> > >
> > > I'm not sure if you tried it, but I thought I'd mention it.
> >
> > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > eligibility.
> >
> > I'm inclined to rather propose a documentation change, which
> > describes that tasks should not rely on woken up tasks being scheduled
> > immediately.
> 
> Where do you see such an assumption ? Even before eevdf, there were
> nothing that ensures such behavior. When using CFS (legacy or eevdf)
> tasks, you can't know if the newly wakeup task will run 1st or not
> 

There was no guarantee of course. place_entity was reducing the vruntime of
woken up tasks though, giving them a slight boost, right? For the scenario
that I observed, that boost was enough to make sure that the woken up task
gets scheduled consistently. This might still not be true for all scenarios,
but in general EEVDF seems to be stricter with woken up tasks.

Dismissing the lag on wakeup obviously does not guarantee getting scheduled
either, as other tasks might still be involved.

The question would be whether it should be explicitly mentioned somewhere that,
at this point, woken up tasks are not getting any special treatment and
no one should rely on such a boost for woken up tasks.

> >
> > Changing things in the code to address for the specific scenario I'm
> > seeing seems to mostly create unwanted side effects and/or would require
> > the definition of some magic cut-off values.
> >
> >


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-19 Thread Tobias Huschle

On 2024-03-18 15:45, Luis Machado wrote:

On 3/14/24 13:45, Tobias Huschle wrote:

On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:

On 2/28/24 16:10, Tobias Huschle wrote:


Questions:
1. The kworker getting its negative lag occurs in the following 
scenario

   - kworker and a cgroup are supposed to execute on the same CPU
   - one task within the cgroup is executing and wakes up the 
kworker

   - kworker with 0 lag, gets picked immediately and finishes its
 execution within ~5000ns
   - on dequeue, kworker gets assigned a negative lag
   Is this expected behavior? With this short execution time, I 
would

   expect the kworker to be fine.


That strikes me as a bit odd as well. Have you been able to determine 
how a negative lag

is assigned to the kworker after such a short runtime?



I did some more trace reading though and found something.

What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
  - vhost starts alone on an otherwise empty runqueue
  - it seems like it never gets dequeued
(unless another unrelated task joins or migration hits)
  - if vhost wakes up the kworker, the kworker gets selected
  - vhost runtime > kworker runtime
--> kworker gets positive lag and gets selected immediately next 
time


What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes suprisingly quick, or the kworker surprinsingly slow. If 
these

outliers reach critical values, it can happen, that
   vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.

In this case it seems like, that the vhost is very fast in waking up
the kworker. And coincidentally, the kworker takes, more time than 
usual

to finish. We speak of 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the 
majority

of other cases this does not happen.


Thanks for providing the above details Tobias. It does seem like EEVDF 
is strict

about the eligibility checks and making tasks wait when their lags are
negative, even
if just a little bit as in the case of the kworker.

There was a patch to disable the eligibility checks
(https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
which would make EEVDF more like EVDF, though the deadline comparison 
would

probably still favor the vhost task instead of the kworker with the
negative lag.

I'm not sure if you tried it, but I thought I'd mention it.


Haven't seen that one yet. Unfortunately, it does not help to ignore the 
eligibility.


I'm inclined to rather propose a documentation change, which
describes that tasks should not rely on woken up tasks being scheduled 
immediately.


Changing things in the code to address the specific scenario I'm
seeing seems to mostly create unwanted side effects and/or would require 
the definition of some magic cut-off values.





Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-14 Thread Tobias Huschle
On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> > 
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >- kworker and a cgroup are supposed to execute on the same CPU
> >- one task within the cgroup is executing and wakes up the kworker
> >- kworker with 0 lag, gets picked immediately and finishes its
> >  execution within ~5000ns
> >- on dequeue, kworker gets assigned a negative lag
> >Is this expected behavior? With this short execution time, I would
> >expect the kworker to be fine.
> 
> That strikes me as a bit odd as well. Have you been able to determine how a 
> negative lag
> is assigned to the kworker after such a short runtime?
> 

I did some more trace reading though and found something.

What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
  - vhost starts alone on an otherwise empty runqueue
  - it seems like it never gets dequeued
(unless another unrelated task joins or migration hits)
  - if vhost wakes up the kworker, the kworker gets selected
  - vhost runtime > kworker runtime 
--> kworker gets positive lag and gets selected immediately next time

What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
   vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.

In this case it seems like the vhost is very fast in waking up the kworker,
and coincidentally the kworker takes more time than usual to finish. We are
talking about 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker 
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.

Therefore this particular use case would benefit from being able to ignore
such outliers, or from being able to ignore a certain amount of difference in
the lag values, i.e. introduce some grace value around the average runtime
within which lag is not accounted. But I'm not sure I like that idea.
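
Purely to illustrate what such a grace value could look like (this is not a
proposal; SCHED_LAG_GRACE_NS is a made-up cutoff, i.e. exactly the kind of
magic value I would rather avoid):

/* illustrative sketch only, the cutoff value is invented */
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	s64 lag, limit;

	SCHED_WARN_ON(!se->on_rq);
	lag = avg_vruntime(cfs_rq) - se->vruntime;

	/* ignore small deviations around zero instead of recording them */
	if (abs(lag) < SCHED_LAG_GRACE_NS)
		lag = 0;

	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
	se->vlag = clamp(lag, -limit, limit);
}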

So the negative lag can be somewhat justified, but for this particular case
it leads to a problem where one outlier can cause havoc. As mentioned in the
vhost discussion, it could also be argued that the vhost should not rely on
the fact that the kworker always gets scheduled on wakeup, since these
timing issues can always happen.

Hence, the two options:
- offer the alternative strategy which dismisses lag on wakeup for workloads
  where we know that a task usually finishes faster than others but should
  not be punished by rare outliers (if that is predictable, I don't know)
- require vhost to address this issue on their side (if possible without
  creating an armada of side effects)

(plus the third one mentioned above, but that requires a magic cutoff value, 
meh)

> I was looking at a different thread 
> (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.c...@intel.com/) 
> that
> uncovers a potential overflow in the eligibility calculation. Though I don't 
> think that is the case for this particular
> vhost problem.

Yea, the numbers I see do not look very overflowy.


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-06 Thread Tobias Huschle
On Thu, Feb 29, 2024 at 09:06:16AM +0530, K Prateek Nayak wrote:
> (+ Xuewen Yan, Ke Wang)
> 
> Hello Tobias,
> 
<...>
> > 
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >- kworker and a cgroup are supposed to execute on the same CPU
> >- one task within the cgroup is executing and wakes up the kworker
> >- kworker with 0 lag, gets picked immediately and finishes its
> >  execution within ~5000ns
> >- on dequeue, kworker gets assigned a negative lag
> >Is this expected behavior? With this short execution time, I would
> >expect the kworker to be fine.
> >For a more detailed discussion on this symptom, please see:
> >https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
> 
> Does the lag clamping path from Xuewen Yan [1] work for the vhost case
> mentioned in the thread? Instead of placing the task just behind the
> 0-lag point, clamping the lag seems to be more principled approach since
> EEVDF already does it in update_entity_lag().
> 
> If the lag is still too large, maybe the above coupled with Peter's
> delayed dequeue patch can help [2] (Note: tree is prone to force
> updates)
> 
> [1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen@unisoc.com/
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf=e62ef63a888c97188a977daddb72b61548da8417

I tried Peter's patches a while ago. Unfortunately, reducing the lag
is not sufficient in this particular case. The calling entity expects
the woken up kworker to run instantly.

In order to have a chance that the woken up kworker is scheduled right
away, the kworker must not have any negative lag. To guarantee that it gets
scheduled, it would even need a positive lag which allows it to pass
all other entities on the queue.

Therefore I proposed to just wipe the negative lag in these cases, 
which seems to map to strategy #2 of the underlying paper.

The other way to think about this would be:
The assumption that woken up tasks have a high probability to run
is no longer valid. In that case, the entity triggering the wake
up has to explicitly give up the CPU. If there are no other tasks,
apart from the two involved so far, the woken up task has good chances
of being scheduled. If the runqueue is busy, other tasks might intervene.
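
Expressed as (untested, purely illustrative) code, the waker giving up the
CPU explicitly could look something like this on the vhost side:

	wake_up_process(kworker);	/* kworker enqueued, possibly with negative lag */
	set_tsk_need_resched(current);	/* make the following check actually reschedule */
	cond_resched();			/* scheduler may now pick the kworker */

That is hacky of course, and it still does not guarantee that the kworker is
picked if other entities are runnable.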

I keep playing around with these options, but potential side effects
are worrying me.

> 
<...>


[RFC] sched/eevdf: avoid task starvation in cgroups

2024-02-28 Thread Tobias Huschle
When running update_curr, it is checked whether the current task has
missed its deadline (update_deadline). If the deadline has been crossed,
the task is set to be rescheduled if there are other tasks available on
its cfs_rq.
This can cause task starvation in some cgroup configurations.

Assume the following scenario:

       [ ]  rq of CPU
        |
       [ ]  cfs_rq1
      /    \
    [ ]     othertask
  cfs_rq2
     |
    curr

In this case, update_curr is called for all levels of the hierarchy,
starting at the leaf.
Assume that curr is a kernel task which does not give up the CPU
voluntarily, i.e. loops indefinitely unless forced to give up the CPU.
Assume further that curr has actually reached its deadline. The expected
behavior would be that the scheduler determines that curr would now
start to overconsume and therefore sets the need_resched flag to
nudge the current task into allowing a reschedule in favor of othertask.

To reach the expected behavior, it is not sufficient to check whether
other tasks are queued in the cfs_rq, which curr is assigned to.

In the configuration shown above, each run queue sees:
rq_cpu:  2 tasks
cfs_rq1: 2 tasks
cfs_rq2: 1 task

This means that cfs_rq2 will never reschedule on its own, as it sees
no other tasks that would be worth rescheduling for. Hence, the
call to reschedule relies on the upper levels in the hierarchy.
cfs_rq1 sees 1 additional task and could therefore take the desired
decision. But cfs_rq1 takes the additional task into account when
computing its own deadline, which means its deadline lies further in
the future. As a consequence, that deadline is never reached.
Therefore cfs_rq1 does not even check for other potential tasks to be
run.
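
For reference, the walk that triggers these checks looks roughly like this
(condensed from task_tick_fair/entity_tick, not verbatim kernel code); it
shows why each level only consults its own cfs_rq:

/* condensed sketch of the tick path, not verbatim kernel code */
static void task_tick_fair_sketch(struct rq *rq, struct task_struct *curr)
{
	struct sched_entity *se = &curr->se;

	/* walks from the leaf (cfs_rq2) up to the root cfs_rq */
	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		update_curr(cfs_rq);	/* may call update_deadline(cfs_rq, se) */
	}
}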

It could now be assumed that cfs_rq1 will reach its deadline at some
point. But, after curr has consumed its timeslice, all sched_entities
in the cgroup tree get reweighted. This causes the deadlines of all
entities in the hierarchy to be shifted further into the future.
In particular, the deadline of cfs_rq1 also gets pushed into the
future, so it never reaches its deadline.
Whether a reweight needs to be done depends on the weight of the
sched_entity and a recalculated value of the shares the
sched_entity shall expect (see calc_group_shares in fair.c). In the
described scenario these values always differ, hence a reweight
is done.

The combination of these two circumstances can lead to curr
running indefinitely unless interrupted by an external entity.

Address this issue by checking the nr_running value of the CPU's own
runqueue instead, i.e. if there is any other task, somewhere on the
CPU runqueue, give it a chance to execute.

The scenario that made me discover this is being discussed here:
https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/

Questions:
1. Is this behavior by design? I couldn't find an explanation why we
   only check for the local cfs_rq. Do we only want to allow take-overs
   from the same cgroup hierarchy level in that case?
2. This problem is referred to as the "hog"-problem by Ankur Arora here:
   
https://lore.kernel.org/lkml/2024021304.1802415-24-ankur.a.ar...@oracle.com/
   Might there be a connection, it's the same piece of code in the end. I'd also
   prefer the boolean passing to avoid calling resched_curr too often.
3. What are the criteria for when I should expect a reweight of the
   cgroup hierarchy? It seems unintuitive to me that the shares value
   keeps changing every time.

Feedback and opinions would be highly appreciated.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c4ef20a2f8..e9733ef7964a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1015,7 +1015,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * The task has consumed its request, reschedule.
 	 */
-	if (cfs_rq->nr_running > 1) {
+	if (rq_of(cfs_rq)->nr_running > 1) {
 		resched_curr(rq_of(cfs_rq));
 		clear_buddies(cfs_rq, se);
 	}
-- 
2.34.1



[RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-02-28 Thread Tobias Huschle
The previously used CFS scheduler gave tasks that were woken up an
enhanced chance to see runtime immediately by deducting a certain value
from its vruntime on runqueue placement during wakeup.

This property was used by some, at least vhost, to ensure that certain
kworkers are scheduled immediately after being woken up. The EEVDF
scheduler does not support this so far. Instead, if such a woken up
entity carries a negative lag from its previous execution, it will have
to wait for the current time slice to finish, which affects the
performance of the process expecting the immediate execution negatively.

To address this issue, implement EEVDF strategy #2 for rejoining
entities, which dismisses the lag from previous execution and allows
the woken up task to run immediately (if no other entities are deemed
to be preferred for scheduling by EEVDF).

The vruntime is decremented by an additional value of 1 to make sure
that the woken up task actually gets to run. This is of course not
following strategy #2 in an exact manner but guarantees the expected
behavior for the scenario described above. Without the additional
decrement, the performance goes south even more. So there are some
side effects I could not get my head around yet.

Questions:
1. The kworker getting its negative lag occurs in the following scenario
   - kworker and a cgroup are supposed to execute on the same CPU
   - one task within the cgroup is executing and wakes up the kworker
   - kworker with 0 lag, gets picked immediately and finishes its
 execution within ~5000ns
   - on dequeue, kworker gets assigned a negative lag
   Is this expected behavior? With this short execution time, I would
   expect the kworker to be fine.
   For a more detailed discussion on this symptom, please see:
   https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
2. The proposed code change of course only addresses the symptom. Am I
   assuming correctly that this is in general the expected behavior and
   that the task waking up the kworker should rather do an explicit
   reschedule of itself to grant the kworker time to execute?
   In the vhost case, this is currently attempted through a cond_resched
   which is not doing anything because the need_resched flag is not set
   (see the simplified sketch below).
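
For illustration, the flow in question looks roughly like this (heavily
simplified, not the actual vhost code):

	/* heavily simplified sketch, not actual vhost code */
	wake_up_process(kworker);	/* kworker may be enqueued with negative lag */
	cond_resched();			/* no-op: need_resched is not set, so vhost
					 * keeps running and the kworker has to wait */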

Feedback and opinions would be highly appreciated.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..c20ae6d62961 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 			lag = div_s64(lag, load);
 	}
 
+	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
+		se->vlag = 0;
+		lag = 1;
+	}
+
 	se->vruntime = vruntime - lag;
 
 	/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..d3118e7568b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(NOLAG_WAKEUP, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
-- 
2.34.1



Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-07 Thread Tobias Huschle

On 2023-07-07 16:33, Shrikanth Hegde wrote:

On 7/7/23 1:14 PM, Tobias Huschle wrote:

On 2023-07-05 09:52, Vincent Guittot wrote:

Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :

On 2023-05-16 15:36, Vincent Guittot wrote:
> On Mon, 15 May 2023 at 13:46, Tobias Huschle 
> wrote:
> >
> > The current load balancer implementation implies that scheduler
> > groups,
> > within the same domain, all host the same number of CPUs. This is
> > reflected in the condition, that a scheduler group, which is load
> > balancing and classified as having spare capacity, should pull work
> > from the busiest group, if the local group runs less processes than
> > the busiest one. This implies that these two groups should run the
> > same number of processes, which is problematic if the groups are not
> > of the same size.
> >
> > The assumption that scheduler groups within the same scheduler
domain
> > host the same number of CPUs appears to be true for non-s390
> > architectures. Nevertheless, s390 can have scheduler groups of
unequal
> > size.
> >
> > This introduces a performance degredation in the following scenario:
> >
> > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > the remaining 2 are located on another socket:
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> >
> > Placing some workload ( x = one task ) yields the following
> > scenarios:
> >
> > The first 5 tasks are distributed evenly across the two groups.
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> >  x x x  x x
> >
> > Adding a 6th task yields the following distribution:
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> > SMT1 x x x  x x
> > SMT2    x
>
> Your description is a bit confusing for me. What you name CPU above
> should be named Core, doesn' it ?
>
> Could you share with us your scheduler topology ?
>

You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):



[...]



So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.


Thaks for the details



If I run stress-ng with 8 cpu stressors on the original code I get a
distribution
like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
    x x x x  x  x  x  x

Which means that the two cores in the smaller group are running into 
SMT

while two
cores in the larger group are still idle. This is caused by the
prefer_sibling path
which really wants to see both groups run the same number of tasks.


yes and it considers that there are the same number of CPUs per group



> >
> > The task is added to the 2nd scheduler group, as the scheduler
has the
> > assumption that scheduler groups are of the same size, so they
should
> > also host the same number of tasks. This makes CPU 7 run into SMT
> > thread, which comes with a performance penalty. This means, that in
> > the window of 6-8 tasks, load balancing is done suboptimally,
because
> > SMT is used although there is no reason to do so as fully idle CPUs
> > are still available.
> >
> > Taking the weight of the scheduler groups into account, ensures that
> > a load balancing CPU within a smaller group will not try to pull
tasks
> > from a bigger group while the bigger group still has idle CPUs
> > available.
> >
> > Signed-off-by: Tobias Huschle 
> > ---
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..b1307d7e4065 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10426,7 +10426,8 @@ static struct sched_group
> > *find_busiest_group(struct lb_env *env)
> >  * group's child domain.
> >  */
> > if (sds.prefer_sibling && local->group_type ==
> > group_has_spare &&
> > -   busiest->sum_nr_running > local->sum_nr_running + 1)
> > +   busiest->sum_nr_running * local->group_weight >
> > +   local->sum_nr_running *
> > busiest->group_weight + 1)



what you want to test here is that moving 1 task from busiest to 
local

group
would help and balance the ratio of tasks per cpu

(busiest->sum_nr_running - 1) / busiest->group_weight >
(local->sum_nr_running + 1) / local->group_weight

which can be develop into

busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
* busiest->group_weight + busiest->group_weight + local->group_weight

Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-07 Thread Tobias Huschle

On 2023-07-06 19:19, Shrikanth Hegde wrote:

On 5/15/23 5:16 PM, Tobias Huschle wrote:
The current load balancer implementation implies that scheduler 
groups,

within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degredation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8
     x x x          x x

Adding a 6th task yields the following distribution:

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8
SMT1 x x x          x x
SMT2                x

The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT
thread, which comes with a performance penalty. This means, that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group 
*find_busiest_group(struct lb_env *env)

 * group's child domain.
 */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
-   busiest->sum_nr_running > local->sum_nr_running + 1)
+   busiest->sum_nr_running * local->group_weight >
+   local->sum_nr_running * busiest->group_weight + 1)
goto force_balance;




I assume its SMT2 here. sched group is enclosed in 
[busy_cpus/idle_cpus/weight]


Lets take an example:  we will begin the with the said imbalance.
[3/9/12] -- local group
[3/1/4] -- busy group.
we will evaluate 3*12 > (4*(3+1)) is true and proceeds further to 
balance.

but calculate_imbalance will lead to zero as the imbalance no? in case
of prefer sibling
case it find the difference of sum_nr_running of the two groups. It
will be 3-3 = 0

this would need modifications to calculate_imbalance.
Maybe something like this will help? NOT TESTED AT ALL.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 			return;
 		}
 
-		if (busiest->group_weight == 1 || sds->prefer_sibling) {
+		if (busiest->group_weight == 1 ||
+		    (sds->prefer_sibling &&
+		     busiest->group_weight == local->group_weight)) {
 			unsigned int nr_diff = busiest->sum_nr_running;
 			/*
 			 * When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff;
+		}
+		if (sds->prefer_sibling &&
+		    busiest->group_weight != local->group_weight) {
+			env->migration_type = migrate_task;
+			env->imbalance = 1;
 		} else {



I also had a look at this when Vincent pointed out that this part is 
missing.
The formula proposed by Vincent works pretty well, it is not even 
necessary

to add additional if-statements here.
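
To make the difference concrete, here is a throwaway user-space check with
the numbers from the example above (local group 12 CPUs / 3 tasks, busiest
group 4 CPUs / 3 tasks; pure illustration, not kernel code):

/* balance_check.c - build with: gcc -o balance_check balance_check.c */
#include <stdio.h>

int main(void)
{
	unsigned int local_weight = 12, local_nr = 3;	 /* 12 CPUs, 3 tasks */
	unsigned int busiest_weight = 4, busiest_nr = 3; /*  4 CPUs, 3 tasks */

	/* current upstream condition: compare raw task counts */
	int upstream = busiest_nr > local_nr + 1;

	/* condition from this RFC: weight the counts by group size */
	int rfc = busiest_nr * local_weight >
		  local_nr * busiest_weight + 1;

	/* Vincent's version: does moving one task improve the tasks-per-CPU ratio? */
	int ratio = busiest_nr * local_weight >=
		    local_nr * busiest_weight + busiest_weight + local_weight;

	printf("upstream: %d  rfc: %d  ratio: %d\n", upstream, rfc, ratio);
	return 0;
}

This prints "upstream: 0  rfc: 1  ratio: 1", i.e. the upstream check considers
the groups balanced while both weighted variants would force a balance.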



Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-07 Thread Tobias Huschle

On 2023-07-04 15:40, Peter Zijlstra wrote:

On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
The current load balancer implementation implies that scheduler 
groups,

within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures.


Mostly; there's historically the cpuset case where we can create
lopsided groups like that. And today we're growing all these hybrid
things that will also tickle this, except they're looking towards
different balancer extentions to deal with the IPC difference so might
not be immediately caring about this here issue.



Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group 
*find_busiest_group(struct lb_env *env)

 * group's child domain.
 */
if (sds.prefer_sibling && local->group_type == group_has_spare &&
-   busiest->sum_nr_running > local->sum_nr_running + 1)
+   busiest->sum_nr_running * local->group_weight >
+   local->sum_nr_running * busiest->group_weight + 1)


Should that not be: busiest->group_weight * (local->sum_nr_running + 1) 
?


I agree, adding the brackets makes more sense and is clearer on what's
intended by this check while also preserving the original behavior for
local->group_weight == busiest->group_weight



I'm not opposed to this -- it seems fairly straight forward.


Appreciated, I will go ahead and send a patch once I incorporated the 
other feedback I got.

Thanks.




goto force_balance;

if (busiest->group_type != group_overloaded) {
--
2.34.1



Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-07 Thread Tobias Huschle

On 2023-07-05 09:52, Vincent Guittot wrote:

Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :

On 2023-05-16 15:36, Vincent Guittot wrote:
> On Mon, 15 May 2023 at 13:46, Tobias Huschle 
> wrote:
> >
> > The current load balancer implementation implies that scheduler
> > groups,
> > within the same domain, all host the same number of CPUs. This is
> > reflected in the condition, that a scheduler group, which is load
> > balancing and classified as having spare capacity, should pull work
> > from the busiest group, if the local group runs less processes than
> > the busiest one. This implies that these two groups should run the
> > same number of processes, which is problematic if the groups are not
> > of the same size.
> >
> > The assumption that scheduler groups within the same scheduler domain
> > host the same number of CPUs appears to be true for non-s390
> > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > size.
> >
> > This introduces a performance degredation in the following scenario:
> >
> > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > the remaining 2 are located on another socket:
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> >
> > Placing some workload ( x = one task ) yields the following
> > scenarios:
> >
> > The first 5 tasks are distributed evenly across the two groups.
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> >      x x x          x x
> >
> > Adding a 6th task yields the following distribution:
> >
> > Socket   -1-    -2-
> > CPU  1 2 3 4 5 6    7 8
> > SMT1 x x x          x x
> > SMT2                x
>
> Your description is a bit confusing for me. What you name CPU above
> should be named Core, doesn' it ?
>
> Could you share with us your scheduler topology ?
>

You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):



[...]



So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.


Thaks for the details



If I run stress-ng with 8 cpu stressors on the original code I get a
distribution
like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x x x x  x  x  x  x

Which means that the two cores in the smaller group are running into 
SMT

while two
cores in the larger group are still idle. This is caused by the
prefer_sibling path
which really wants to see both groups run the same number of tasks.


yes and it considers that there are the same number of CPUs per group



> >
> > The task is added to the 2nd scheduler group, as the scheduler has the
> > assumption that scheduler groups are of the same size, so they should
> > also host the same number of tasks. This makes CPU 7 run into SMT
> > thread, which comes with a performance penalty. This means, that in
> > the window of 6-8 tasks, load balancing is done suboptimally, because
> > SMT is used although there is no reason to do so as fully idle CPUs
> > are still available.
> >
> > Taking the weight of the scheduler groups into account, ensures that
> > a load balancing CPU within a smaller group will not try to pull tasks
> > from a bigger group while the bigger group still has idle CPUs
> > available.
> >
> > Signed-off-by: Tobias Huschle 
> > ---
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..b1307d7e4065 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10426,7 +10426,8 @@ static struct sched_group
> > *find_busiest_group(struct lb_env *env)
> >  * group's child domain.
> >  */
> > if (sds.prefer_sibling && local->group_type ==
> > group_has_spare &&
> > -   busiest->sum_nr_running > local->sum_nr_running + 1)
> > +   busiest->sum_nr_running * local->group_weight >
> > +   local->sum_nr_running *
> > busiest->group_weight + 1)



what you want to test here is that moving 1 task from busiest to local 
group

would help and balance the ratio of tasks per cpu

(busiest->sum_nr_running - 1) / busiest->group_weight >
(local->sum_nr_running + 1) / local->group_weight

which can be develop into

busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
* busiest->group_weight + busiest->group_weight + local->group_weight

and 

Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-04 Thread Tobias Huschle

On 2023-05-16 18:35, Dietmar Eggemann wrote:

On 15/05/2023 13:46, Tobias Huschle wrote:
The current load balancer implementation implies that scheduler 
groups,

within the same scheduler domain, all host the same number of CPUs.

This appears to be valid for non-s390 architectures. Nevertheless, 
s390

can actually have scheduler groups of unequal size.


Arm (classical) big.Little had this for years before we switched to 
flat

scheduling (only MC sched domain) over CPU capacity boundaries for Arm
DynamIQ.

Arm64 Juno platform in mainline:

root@juno:~# cat 
/sys/devices/system/cpu/cpu*/topology/cluster_cpus_list

0,3-5
1-2
1-2
0,3-5
0,3-5
0,3-5

root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'

domain0 39 <--
domain1 3f
domain0 06 <--
domain1 3f
domain0 06
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f

root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
DIE

But we don't have SMT on the mobile processors.

It looks like you are only interested to get group_weight dependency
into this 'prefer_sibling' condition of find_busiest_group()?


Sorry, looks like your reply hit some bad filter of my mail program.
Let me answer, although it's a bit late.

Yes, I would like to get the group_weight into the prefer_sibling path.
Unfortunately, we cannot go for a flat hierarchy as the s390 hardware
allows CPUs to be pretty far apart (cache-wise), which means the
load balancer should avoid moving tasks back and forth between
those CPUs if possible.

We can't remove SD_PREFER_SIBLING either, as this would cause the load
balancer to aim for having the same number of idle CPUs in all groups,
which is a problem as well in asymmetric groups, for example:

With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x x x x  x  x  x  x

Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x  x  x x  x x x  x


Hence the idea to add the group_weight to the prefer_sibling path.

I was wondering if this would be the right place to address this issue
or if I should go down another route.


We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
SD_PREFER_SIBLING from sd->child so we don't run this condition.


The current scheduler behavior causes some s390 configs to use SMT
while some cores are still idle, leading to a performance degredation
under certain levels of workload.

Please refer to the patch's commit message for more details and an
example. This patch is a proposal on how to integrate the size of
scheduler groups into the decision process.

This patch is the most basic approach to address this issue and does
not claim to be perfect as-is.

Other ideas that also proved to address the problem but are more
complex but also potentially more precise:
  1. On scheduler group building, count the number of CPUs within each
 group that are first in their sibling mask. This represents the
 number of CPUs that can be used before running into SMT. This
 should be slightly more accurate than using the full group weight
 if the number of available SMT threads per core varies.
  2. Introduce a new scheduler group classification (smt_busy) in
 between of fully_busy and has_spare. This classification would
 indicate that a group still has spare capacity, but will run
 into SMT when using that capacity. This would make the load
 balancer prefer groups with fully idle CPUs over ones that are
 about to run into SMT.

Feedback would be greatly appreciated.

Tobias Huschle (1):
  sched/fair: Consider asymmetric scheduler groups in load balancer

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)





Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-06-05 Thread Tobias Huschle

On 2023-05-16 15:36, Vincent Guittot wrote:
On Mon, 15 May 2023 at 13:46, Tobias Huschle  
wrote:


The current load balancer implementation implies that scheduler 
groups,

within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degredation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -1--2-
CPU  1 2 3 4 5 67 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -1--2-
CPU  1 2 3 4 5 67 8
 x x x  x x

Adding a 6th task yields the following distribution:

Socket   -1--2-
CPU  1 2 3 4 5 67 8
SMT1 x x x  x x
SMT2x


Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn' it ?

Could you share with us your scheduler topology ?



You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0      yes    yes        horizontal         0
  1    0      0    0      0    0 1:1:0      yes    yes        horizontal         1
  2    0      0    0      0    1 2:2:0      yes    yes        horizontal         2
  3    0      0    0      0    1 3:3:0      yes    yes        horizontal         3
  4    0      0    0      0    2 4:4:0      yes    yes        horizontal         4
  5    0      0    0      0    2 5:5:0      yes    yes        horizontal         5
  6    0      0    0      0    3 6:6:0      yes    yes        horizontal         6
  7    0      0    0      0    3 7:7:0      yes    yes        horizontal         7
  8    0      0    0      0    4 8:8:0      yes    yes        horizontal         8
  9    0      0    0      0    4 9:9:0      yes    yes        horizontal         9
 10    0      0    0      0    5 10:10:0    yes    yes        horizontal        10
 11    0      0    0      0    5 11:11:0    yes    yes        horizontal        11
 12    0      0    0      1    6 12:12:0    yes    yes        horizontal        12
 13    0      0    0      1    6 13:13:0    yes    yes        horizontal        13
 14    0      0    0      1    7 14:14:0    yes    yes        horizontal        14
 15    0      0    0      1    7 15:15:0    yes    yes        horizontal        15


So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a
distribution like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x x x x  x  x  x  x

Which means that the two cores in the smaller group are running into SMT
while two cores in the larger group are still idle. This is caused by the
prefer_sibling path which really wants to see both groups run the same
number of tasks.



The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT
thread, which comes with a performance penalty. This means, that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group 
*find_busiest_group(struct lb_env *env)

 * group's child domain.
 */
if (sds.prefer_sibling && local->group_type == group_has_spare 
&&

-   busiest->sum_nr_running > local->sum_nr_running + 1)
+   busiest->sum_nr_running * local->group_weight >
+   local->sum_nr_running * busiest->group_weight 
+ 1)


This is the prefer_sibling path. Could it be that you should

[RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-05-15 Thread Tobias Huschle
The current load balancer implementation implies that scheduler groups,
within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not 
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390 
architectures. Nevertheless, s390 can have scheduler groups of unequal 
size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8
     x x x          x x

Adding a 6th task yields the following distribution:

Socket   -1-    -2-
CPU  1 2 3 4 5 6    7 8
SMT1 x x x          x x
SMT2                x

The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT 
thread, which comes with a performance penalty. This means, that in 
the window of 6-8 tasks, load balancing is done suboptimally, because 
SMT is used although there is no reason to do so as fully idle CPUs 
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    busiest->sum_nr_running * local->group_weight >
+			local->sum_nr_running * busiest->group_weight + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.34.1



[RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-05-15 Thread Tobias Huschle
The current load balancer implementation implies that scheduler groups,
within the same scheduler domain, all host the same number of CPUs. 

This appears to be valid for non-s390 architectures. Nevertheless, s390
can actually have scheduler groups of unequal size.
The current scheduler behavior causes some s390 configs to use SMT
while some cores are still idle, leading to a performance degradation
under certain levels of workload.

Please refer to the patch's commit message for more details and an
example. This patch is a proposal on how to integrate the size of
scheduler groups into the decision process.

This patch is the most basic approach to address this issue and does 
not claim to be perfect as-is.

Other ideas that also proved to address the problem, but are more
complex and potentially more precise:
  1. On scheduler group building, count the number of CPUs within each
     group that are first in their sibling mask. This represents the
     number of CPUs that can be used before running into SMT. This
     should be slightly more accurate than using the full group weight
     if the number of available SMT threads per core varies (a rough
     sketch of this idea follows below).
  2. Introduce a new scheduler group classification (smt_busy) between
     fully_busy and has_spare. This classification would indicate that
     a group still has spare capacity, but will run into SMT when using
     that capacity. This would make the load balancer prefer groups with
     fully idle CPUs over ones that are about to run into SMT.
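
For idea 1, a rough sketch of what counting these CPUs could look like
(illustrative only and untested; the function name and its placement are
made up):

/* count the CPUs of a group that are first in their SMT sibling mask,
 * i.e. the number of tasks the group can take before running into SMT */
static unsigned int group_weight_before_smt(struct sched_group *sg)
{
	unsigned int cpu, cores = 0;

	for_each_cpu(cpu, sched_group_span(sg)) {
		if (cpu == cpumask_first(cpu_smt_mask(cpu)))
			cores++;
	}
	return cores;
}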

Feedback would be greatly appreciated.

Tobias Huschle (1):
  sched/fair: Consider asymmetric scheduler groups in load balancer

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.34.1