On Wed, Oct 14, 2020 at 10:33:21AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 02:13:28PM -0300, Marcelo Tosatti wrote:
> 
> > > Yes but if the task isn't running, run_posix_cpu_timers() doesn't have
> > > anything to elapse. So indeed we can spare the IPI if the task is not
> > > running. Provided ordering makes sure that the task sees the new 
> > > dependency
> > > when it schedules in of course.
> > 
> > True.
> > 
> >  * p->on_cpu <- { 0, 1 }:
> >  *
> >  *   is set by prepare_task() and cleared by finish_task() such that it will be
> >  *   set before p is scheduled-in and cleared after p is scheduled-out, both
> >  *   under rq->lock. Non-zero indicates the task is running on its CPU.
> > 
> > 
> > CPU-0 (tick_set_dep)            CPU-1 (task switch)
> > 
> > STORE p->tick_dep_mask
> > smp_mb() (atomic_fetch_or())
> > LOAD p->on_cpu
> > 
> > 
> >                                 context_switch(prev, next)
> >                                 STORE next->on_cpu = 1
> >                                 ...                             [*]
> > 
> >                                 LOAD current->tick_dep_mask
> > 
> 
> That load is in tick_nohz_task_switch() right? (which BTW is placed
> completely wrong) You could easily do something like the below I
> suppose.
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 81632cd5e3b7..2a5fafe66bb0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -410,6 +410,14 @@ void __tick_nohz_task_switch(void)
>       ts = this_cpu_ptr(&tick_cpu_sched);
>  
>       if (ts->tick_stopped) {
> +             /*
> +              * tick_set_dep()               (this)
> +              *
> +              * STORE p->tick_dep_mask       STORE p->on_cpu
> +              * smp_mb()                     smp_mb()
> +              * LOAD p->on_cpu               LOAD p->tick_dep_mask
> +              */
> +             smp_mb();
>               if (atomic_read(&current->tick_dep_mask) ||
>                   atomic_read(&current->signal->tick_dep_mask))
>                       tick_nohz_full_kick();

It would then need to be unconditional (whatever the value of ts->tick_stopped).
Assuming the tick isn't stopped, we may well have an interrupt firing right
after schedule() which doesn't see the new value of tick_dep_mask.
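
I suppose the barrier would then have to move out of the ts->tick_stopped
block. Untested sketch on top of your diff, just to show the placement:

	ts = this_cpu_ptr(&tick_cpu_sched);

	/*
	 * Pairs with the smp_mb() implied by atomic_fetch_or() in
	 * tick_set_dep(). Has to be done whatever the value of
	 * ts->tick_stopped: even when the tick runs, an interrupt right
	 * after schedule() must see the new tick_dep_mask.
	 */
	smp_mb();

	if (ts->tick_stopped) {
		if (atomic_read(&current->tick_dep_mask) ||
		    atomic_read(&current->signal->tick_dep_mask))
			tick_nohz_full_kick();
	}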

Alternatively, we could rely on p->on_rq, which is set to TASK_ON_RQ_QUEUED
at wake up time, prior to the schedule() full barrier. Of course that doesn't
mean the task is actually the one running on the CPU, but it's a good hint,
considering that we are running in nohz_full mode, which is usually optimized
for the single task case.
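
Something in that direction, as a sketch (tick_nohz_kick_task() is a made up
helper name here, tick_nohz_full_kick_cpu() is the existing kick):

/* Called from tick_set_dep() when the target task isn't current. */
static void tick_nohz_kick_task(struct task_struct *tsk)
{
	int cpu;

	/*
	 * If the task is not queued, run_posix_cpu_timers() has nothing
	 * to elapse, so the IPI can be spared. Ordering relies on the
	 * wake up storing p->on_rq before the schedule() full barrier,
	 * after which the target reads its tick_dep_mask:
	 *
	 *   STORE p->on_rq = QUEUED      STORE p->tick_dep_mask
	 *   smp_mb() (schedule())        smp_mb() (atomic_fetch_or())
	 *   LOAD p->tick_dep_mask        LOAD p->on_rq
	 */
	if (READ_ONCE(tsk->on_rq) != TASK_ON_RQ_QUEUED)
		return;

	cpu = task_cpu(tsk);
	if (cpu_online(cpu))
		tick_nohz_full_kick_cpu(cpu);
}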

Also, setting a remote task's tick dependency is only done by the posix CPU
timer code, in case the user has the bad taste to enqueue a timer on a task
running in nohz_full mode. That shouldn't deserve an unconditional full
barrier in the schedule path.

If the target is current, as in the RCU case, I guess we can keep the special
treatment.
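
I.e. keep the direct kick when the target is current and only go through the
on_rq check above for remote targets. Rough sketch, reusing the hypothetical
helper from above:

void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
{
	if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
		if (tsk == current) {
			/*
			 * RCU & co.: the task sets its own dependency,
			 * kick the local CPU directly.
			 */
			preempt_disable();
			tick_nohz_full_kick();
			preempt_enable();
		} else {
			/* Remote target: spare the IPI if not queued. */
			tick_nohz_kick_task(tsk);
		}
	}
}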

> re tick_nohz_task_switch() being placed wrong, it should probably be
> placed before finish_lock_switch(). Something like so.
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cf044580683c..5c92c959824f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4084,6 +4084,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>       vtime_task_switch(prev);
>       perf_event_task_sched_in(prev, current);
>       finish_task(prev);
> +     tick_nohz_task_switch();
>       finish_lock_switch(rq);
>       finish_arch_post_lock_switch();
>       kcov_finish_switch(current);
> @@ -4121,7 +4122,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>               put_task_struct_rcu_user(prev);
>       }
>  
> -     tick_nohz_task_switch();

IIRC, we wanted to keep it outside the rq lock because it shouldn't need it...
