On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman <brendan.jack...@arm.com> wrote:
> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>> Hi Brendan,
>
> Hi Joel,
>
> Thanks for taking a look :)
>
>> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
>> <brendan.jack...@arm.com> wrote:
>>> This patch adds a parameter to select_task_rq, sibling_count_hint,
>>> allowing the caller, where it has this information, to inform the
>>> sched_class of the number of tasks that are being woken up as part
>>> of the same event.
>>>
>>> The wake_q mechanism is one case where this information is available.
>>>
>>> select_task_rq_fair can then use the information to detect that it
>>> needs to widen the search space for task placement in order to avoid
>>> overloading the last-level cache domain's CPUs.
>>>
>>>                * * *
>>>
>>> The reason I am investigating this change is the following use case
>>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>>> all repeatedly do X amount of work then pthread_barrier_wait
>>> (i.e. sleep until the last task finishes its X and hits the
>>> barrier). On big.LITTLE, the tasks which get a "big" CPU finish
>>> faster, and then those CPUs pull over the tasks that are still
>>> running:
>>>
>>>       v CPU v           ->time->
>>>
>>>                    -------------
>>>      0 (big)       11111  /333
>>>                    -------------
>>>      1 (big)       22222 /444|
>>>                    -------------
>>>      2 (LITTLE)    333333/
>>>                    -------------
>>>      3 (LITTLE)    444444/
>>>                    -------------
>>
>> This is because of the newly idle balancing right?
>
> Probably not: before CPUs 0 and 1 will pull tasks, their utilization
> (tasks 1 and 2's blocked load) needs to decay enough that they have
> more spare capacity than CPUs 2 and 3 (with imbalance_pct etc.), so
> it's more likely to be a nohz idle balance. We also need
> need_active_balance to be satisfied (assuming 3 and 4 are running
> constantly) - in Android this is hastened by a special case [1], in
> mainline it takes longer.
>
> [1] https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/fair.c#8865

Ok, I agree.
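As an aside, for anyone wanting to reproduce this: the use case being
described is essentially the following. This is just a minimal sketch -
NTASKS, ITERS and do_work() are placeholders I made up, not anything
from the patch:

    #include <pthread.h>

    #define NTASKS 4     /* 1 task per CPU */
    #define ITERS  1000

    static pthread_barrier_t barrier;

    /* "X amount of work" - stand-in busy loop */
    static void do_work(void)
    {
            for (volatile int i = 0; i < 1000000; i++)
                    ;
    }

    static void *task_fn(void *arg)
    {
            for (int i = 0; i < ITERS; i++) {
                    do_work();
                    /* Sleep until the last task finishes its X; the
                     * last arrival wakes everyone at once (FUTEX_WAKE). */
                    pthread_barrier_wait(&barrier);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tasks[NTASKS];

            pthread_barrier_init(&barrier, NULL, NTASKS);
            for (int i = 0; i < NTASKS; i++)
                    pthread_create(&tasks[i], NULL, task_fn, NULL);
            for (int i = 0; i < NTASKS; i++)
                    pthread_join(tasks[i], NULL);
            pthread_barrier_destroy(&barrier);
            return 0;
    }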
>>>
>>> Now when task 4 hits the barrier (at |) and wakes the others up,
>>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>>> prev_cpu=<little>. want_affine therefore means that we'll only look
>>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>>> on the bigs until the next load balance, something like this:
>>>
>>>       v CPU v           ->time->
>>>
>>>                    ------------------------
>>>      0 (big)       11111  /333  31313\33333
>>>                    ------------------------
>>>      1 (big)       22222 /444|424\4444444
>>>                    ------------------------
>>>      2 (LITTLE)    333333/        \222222
>>>                    ------------------------
>>>      3 (LITTLE)    444444/         \1111
>>>                    ------------------------
>>>                                ^^^
>>>                         underutilization
>>
>> I wonder if this problem can be better solved by not having the first
>> newly-idle-balance pull happen based on some condition, only to be
>> corrected later by another balancing act?
>
> Well the fact that CPUs 0 and 1 pull 3 and 4 is desirable - we want to
> use the big CPUs as much as possible for these always-running
> tasks. Moving load back to the LITTLE CPUs is not "correcting" a
> previous mistake, it's responding to a change in conditions (i.e. we
> have more runnable tasks than big CPUs now) and this patch is about
> trying to do that response ASAP.

Agreed.

>>>
>>> So, I'm trying to get want_affine = 0 for these tasks.
>>>
>>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>>> us here because which task is waker and which tasks are wakees
>>> generally changes with each iteration.
>>>
>>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>>> nice property that we know exactly how many tasks are being woken, so
>>> we can cheat.
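Just to check I follow: IIUC the "cheat" amounts to something like the
below in wake_wide() - this is my sketch of the idea from reading the
patch description, not your actual code. sibling_count_hint is the new
parameter; the rest is wake_wide() as it stands in ~v4.14:

    static int wake_wide(struct task_struct *p, int sibling_count_hint)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int factor = this_cpu_read(sd_llc_size);

            /* New bit: if this wakeup is part of an event waking at
             * least an LLC's worth of tasks, return 1 so that
             * want_affine ends up 0 for each wakee and we search
             * beyond sd_llc. */
            if (sibling_count_hint >= factor)
                    return 1;

            /* Existing wakee_flips heuristic, unchanged. */
            if (master < slave)
                    swap(master, slave);
            if (slave < factor || master < slave * factor)
                    return 0;
            return 1;
    }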
>>>
>>> It might be a disadvantage that we "widen" _every_ task that's
>>> woken in an event, while select_idle_sibling would work fine for
>>> the first sd_llc_size - 1 tasks.
>>>
>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>> necessary on SMP systems, so it might be best guarded by the presence
>>
>> Actually wake_affine doesn't check for balance if previous/next cpu
>> are within the same shared cache domain. The difference is that some
>> time ago it would return true for shared cache, but as of 4.14-rc1 it
>> returns false:
>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>
>> Since it would return false in the above wake up cases for tasks 1 and
>> 2, it would then run select_idle_sibling on the previous CPU, which is
>> also within the big cluster, so I don't think it will make a
>> difference in this case... In fact, what it returns probably doesn't
>> matter.
>
> So my paragraph here was making a leap in reasoning, let me try to fill
> the gap: on SMP these tasks never need to move around. If by some
> chance they did get coscheduled, the first load balance would spread
> them out, and then every time they wake up from then on, prev_cpu is
> the sensible choice. So it will look something like:
>
>        v CPU v           ->time->
>
>                      -------------
>        { 0 (SAME)    11111111111
>  cache {             -------------
>        { 1 (SAME)    222222222222|
>                      -------------
>        { 2 (SAME)    33333333333
>  cache {             -------------
>        { 3 (SAME)    44444444444
>                      -------------
>
> So here, task 2 wakes up the other guys, and when it's waking tasks 3
> and 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC
> it's basically wake_affine's job to decide between prev_cpu and
> smp_processor_id(). So "if wake_affine is behaving correctly" the
> problem that this patch aims to solve (i.e. the fact that we overload
> the waker's LLC domain because of bias towards prev_cpu) does not
> arise on SMP.

Yes SMP, but your patch is for solving a problem for non-SMP. So your
original statement about wake_affine solving any problem for SMP is not
relevant, I feel :-P. I guess you can just kill this para from the
commit message to prevent confusion.

>>> of SD_ASYM_CPUCAPACITY?
>>>
>>>                * * *
>>>
>>> Final note..
>>>
>>> In order to observe "perfect" behaviour for this use case, I also had
>>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>>> above we are working through the wake queue and have placed tasks 3
>>> and 2, and are about to place task 1:
>>>
>>>       v CPU v           ->time->
>>>
>>>                    --------------
>>>      0 (big)       11111  /333  3
>>>                    --------------
>>>      1 (big)       22222 /444|4
>>>                    --------------
>>>      2 (LITTLE)    333333/     2
>>>                    --------------
>>>      3 (LITTLE)    444444/       <- Task 1 should go here
>>>                    --------------
>>>
>>> If TTWU_QUEUE is enabled, we will not yet have enqueued task 2
>>> (having instead sent a reschedule IPI) or attached its load to CPU 2.
>>> So we are likely to also place task 1 on CPU 2. Disabling TTWU_QUEUE
>>> means that we enqueue task 2 before placing task 1, solving this
>>> issue. TTWU_QUEUE is there to minimise rq lock contention, and I
>>> guess that this contention is less of an issue on big.LITTLE systems
>>> since they have relatively few CPUs, which suggests the trade-off
>>> makes sense here.
>>
>> Yeah, OTOH probably once those IPIs are sent, the select path should,
>> I feel, be smart enough to factor in those pending enqueues; not sure
>> how hard or meaningful it is to implement something like that..
>
> Yeah that should be possible - in fact I've just noticed that
> find_idlest_cpu already checks idle_cpu(), which takes pending
> enqueues (rq->wake_list) into account, and I think find_idlest_cpu
> should be used in the case I'm talking about here.. I must have been
> doing something wrong when testing.

Oh ok! thanks,

- Joel

[..]
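P.S. For reference, the idle_cpu() check mentioned above does indeed
account for pending remote wakeups - roughly the following (paraphrased
from ~v4.13 kernel/sched/core.c; details may differ by version):

    int idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (rq->curr != rq->idle)
                    return 0;

            if (rq->nr_running)
                    return 0;

    #ifdef CONFIG_SMP
            /* Wakeups queued via TTWU_QUEUE IPIs but not yet enqueued
             * sit on rq->wake_list, so such a CPU is not "idle". */
            if (!llist_empty(&rq->wake_list))
                    return 0;
    #endif

            return 1;
    }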