Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Joel Fernandes Wed, 20 Sep 2017 22:51:08 -0700

On Wed, Sep 20, 2017 at 2:17 PM, Atish Patra <atish.pa...@oracle.com> wrote:
> On 09/20/2017 03:23 PM, Joel Fernandes wrote:
>>
>> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
>> <brendan.jack...@arm.com> wrote:
>>>
>>>
>>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>>>
>>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>>> <brendan.jack...@arm.com> wrote:
>>>>>>
>>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>>>
>>>> [..]
>>>>>>>>
>>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>>> necessary on SMP systems, so it might be best guarded by the
>>>>>>>> presence
>>>>>>>
>>>>>>>
>>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>>> are within the same shared cache domain. The difference is some time
>>>>>>> ago it would return true for shared cache but now it returns false
as
>>>>>>> of 4.14-rc1:
>>>>>>>
>>>>>>>
http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>>
>>>>>>> Since it would return false in the above wake up cases for task 1
and
>>>>>>> 2, it would then run select_idle_sibling on the previous CPU which
is
>>>>>>> also within the big cluster, so I don't think it will make a
>>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>>> matter.
>>>>>>
>>>>>>
>>>>>> So my paragraph here was making a leap in reasoning, let me try to
>>>>>> fill
>>>>>> the gap: On SMP these tasks never need to move around. If by some
>>>>>> chance
>>>>>> they did get coscheduled, the first load balance would spread them
out
>>>>>> and
>>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>>> choice. So it will look something like:
>>>>>>
>>>>>>             v CPU v           ->time->
>>>>>>
>>>>>>                             -------------
>>>>>>          {  0  (SAME)       11111111111
>>>>>>   cache  {                  -------------
>>>>>>          {  1  (SAME)       222222222222|
>>>>>>                             -------------
>>>>>>          {  2  (SAME)       33333333333
>>>>>>   cache  {                  -------------
>>>>>>          {  3  (SAME)       44444444444
>>>>>>                             -------------
>>>>>>
>>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3
>>>>>> and
>>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>>> the waker's LLC domain because of bias towards prev_cpu) does not
>>>>>> arise
>>>>>> on SMP.
>>>>>>
>>>>>
>>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>>> original statement about wake_affine solving any problem for SMP is
>>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>>> commit message to prevent confusion.
>>>>
>>>>
>>>> Ok I take that back, you were talking about guarding this feature by
>>>> the SD_ASYM_CPUCAPACITY flag.
>>>>
>>>> I don't think that protection would be helpful because you can have
>>>> the same issue if the tasks do different amount of work on SMP. So in
>>>> that case some threads might still complete before the others and you
>>>> run into the same thing.
>>>
>>>
>>> Well assuming we're still talking about one task per CPU, if you have
>>> tasks doing different amount of work there's still no reason to move the
>>> longer-running threads around. The only reason that happens in my
>>> example is because of the asym capacity.
>>
>>
>> Yes but you can very well have RT pressure and things that temporarily
>> change the capacity equality. Also this is a simple benchmark and for
>> any reason you have more than 1 task running on those other CPUs and
>> then the idle CPUs run some of the tasks and you run into a similar
>> situation that might need your patch..
>>
> The patch would be helpful only if it doesn't cross NUMA boundary. right ?
>
> If NUMA comes into picture, not sure searching across NUMA may hurt more
> than help, especially in this case.


I don't understand what you mean by "searching across NUMA", :-(, do you
mean the slow path?

As I said, if the SD_BALANCE_WAKE flag for the sched domain flag is not
set, then a full wide search isn't done anyway. You have this code that
sets the sd variable:

                if (tmp->flags & sd_flag)
                        sd = tmp;

Since sd = NULL if the sched domain (tmp->flags) isn't set, you will
always have select_idle_sibling running and not doing the full search
if I understand correctly.

Further adding the ASYM protection isn't sensible if capacities are
affected by RT and IRQ time etc anyway. Does that make sense?

I am glad I understand the code a bit better now after staring at it
for quite some time but I think some more staring is needed.

- Joel

Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Reply via email to