On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruption at all, or want
> to minimize them.
> 
> The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> outsource these scheduler ticks to the global workqueue so that a
> housekeeping CPU handles those remotely.
> 
> Note that in the case of using isolcpus, it's still up to the user to
> affine the global workqueues to the housekeeping CPUs through
> /sys/devices/virtual/workqueue/cpumask or domains isolation
> "isolcpus=nohz,domain".

I would very much like a few words on why sched_class::task_tick() is
safe to call remotely. From a quick look I think it actually is, but it
would be good to have that spelled out in the changelog.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..c79500c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
>  
>       return jiffies_to_nsecs(next - now);
>  }
> -#endif
> +
> +struct tick_work {
> +     int                     cpu;
> +     struct delayed_work     work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> +     struct delayed_work *dwork = to_delayed_work(work);
> +     struct tick_work *twork = container_of(dwork, struct tick_work, work);
> +     int cpu = twork->cpu;
> +     struct rq *rq = cpu_rq(cpu);
> +     struct rq_flags rf;
> +
> +     /*
> +      * Handle the tick only if it appears the remote CPU is running
> +      * in full dynticks mode. The check is racy by nature, but
> +      * missing a tick or having one too much is no big deal.
> +      */
> +     if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> +             rq_lock_irq(rq, &rf);
> +             update_rq_clock(rq);
> +             rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> +             rq_unlock_irq(rq, &rf);
> +     }
> +
> +     queue_delayed_work(system_unbound_wq, dwork, HZ);

Do we want something that tracks the actual inter-arrival time of
this work, such that we can detect and warn if the housekeeping side is
failing to keep up?
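
To make the question concrete, here is a rough userspace sketch of the
kind of tracking I mean; it is not kernel code. struct tick_track,
tick_track_update() and the limit parameter are all made up for
illustration. In the kernel the timestamp would presumably come from
something like ktime_get_ns(), and the "late" result would feed a
one-time warning.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping; none of these names are in the patch. */
struct tick_track {
	bool     seen;        /* true once at least one run was recorded */
	uint64_t last_ns;     /* timestamp of the previous run */
	uint64_t max_gap_ns;  /* largest inter-arrival gap observed so far */
};

/*
 * Record one run of the remote tick at time now_ns.
 * Returns true when the gap since the previous run exceeds limit_ns,
 * i.e. the work arrived later than we are willing to tolerate.
 * The first call only records the timestamp and never reports lateness.
 */
static bool tick_track_update(struct tick_track *t, uint64_t now_ns,
			      uint64_t limit_ns)
{
	bool late = false;

	if (t->seen) {
		uint64_t gap = now_ns - t->last_ns;

		if (gap > t->max_gap_ns)
			t->max_gap_ns = gap;
		late = gap > limit_ns;
	}

	t->seen = true;
	t->last_ns = now_ns;
	return late;
}
```

With a nominal 1s period, a limit of a few seconds would leave room for
ordinary workqueue latency while still catching a housekeeper that has
effectively stopped servicing the isolated CPU.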

> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> +     struct tick_work *twork;
> +
> +     if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> +             return;

This all looks very static :-( Is there really no way to reconfigure
this nohz_full crud after boot?

> +     WARN_ON_ONCE(!tick_work_cpu);
> +
> +     twork = per_cpu_ptr(tick_work_cpu, cpu);
> +     twork->cpu = cpu;
> +     INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +     queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}

Similarly, I think we want a few words about how unbound workqueues are
expected to behave vs NUMA.

AFAICT unbound workqueues by default prefer to run on a CPU in the same
node, but if no CPU is available there, they don't go looking for the
nearest node that does have one; they just punt to whatever random CPU.
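
For reference, by "nearest node that does have a CPU" I mean something
like the selection below. This is a userspace sketch under my own
assumptions, not a description of what the workqueue code does: NR_NODES,
the dist[][] table (standing in for node_distance()) and the ncpus[]
array of usable housekeeping CPUs per node are all hypothetical.

```c
#include <limits.h>

#define NR_NODES 4	/* hypothetical node count for the example */

/*
 * Return the node closest to 'from' that has at least one usable CPU,
 * or -1 if no node has one. dist[a][b] plays the role of
 * node_distance(a, b); ncpus[n] is the number of usable CPUs on node n.
 * Ties are resolved in favour of the lowest-numbered node.
 */
static int nearest_node_with_cpu(int from,
				 const int dist[NR_NODES][NR_NODES],
				 const int ncpus[NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	for (int n = 0; n < NR_NODES; n++) {
		if (ncpus[n] > 0 && dist[from][n] < best_dist) {
			best = n;
			best_dist = dist[from][n];
		}
	}

	return best;
}
```

If the local node has a usable CPU it wins automatically, since the
distance from a node to itself is the smallest entry in its row.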

