Hi Andrew, On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton <a...@linux-foundation.org> wrote: > > On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <han...@cmpxchg.org> wrote: > > > psi has provisions to shut off the periodic aggregation worker when > > there is a period of no task activity - and thus no data that needs > > aggregating. However, while developing psi monitoring, Suren noticed > > that the aggregation clock currently won't stay shut off for good. > > > > Debugging this revealed a flaw in the idle design: an aggregation run > > will see no task activity and decide to go to sleep; shortly > > thereafter, the kworker thread that executed the aggregation will go > > idle and cause a scheduling change, during which the psi callback will > > kick the !pending worker again. This will ping-pong forever, and is > > equivalent to having no shut-off logic at all (but with more code!) > > > > Fix this by exempting aggregation workers from psi's clock waking > > logic when the state change is them going to sleep. To do this, tag > > workers with the last work function they executed, and if in psi we > > see a worker going to sleep after aggregating psi data, we will not > > reschedule the aggregation work item. > > > > What if the worker is also executing other items before or after? > > > > Any psi state times that were incurred by work items preceding the > > aggregation work will have been collected from the per-cpu buckets > > during the aggregation itself. If there are work items following the > > aggregation work, the worker's last_func tag will be overwritten and > > the aggregator will be kept alive to process this genuine new activity. > > > > If the aggregation work is the last thing the worker does, and we > > decide to go idle, the brief period of non-idle time incurred between > > the aggregation run and the kworker's dequeue will be stranded in the > > per-cpu buckets until the clock is woken by later activity. But that > > should not be a problem. The buckets can hold 4s worth of time, and > > future activity will wake the clock with a 2s delay, giving us 2s > > worth of data we can leave behind when disabling aggregation. If it > > takes a worker more than two seconds to go idle after it finishes its > > last work item, we likely have bigger problems in the system, and > > won't notice one sample that was averaged with a bogus per-CPU weight. > > > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, > > int cpu, > > groupc->tasks[t]++; > > > > write_seqcount_end(&groupc->seq); > > - > > - if (!delayed_work_pending(&group->clock_work)) > > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > > } > > > > static struct psi_group *iterate_groups(struct task_struct *task, void > > **iter) > > This breaks Suren's "psi: introduce psi monitor": > > --- kernel/sched/psi.c~psi-introduce-psi-monitor > +++ kernel/sched/psi.c > @@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_ > > write_seqcount_end(&groupc->seq); > > - if (!delayed_work_pending(&group->clock_work)) > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > + /* > + * Polling flag resets to 0 at the max rate of once per update window > + * (at least 500ms interval). smp_wmb is required after group->polling > + * 0-to-1 transition to order groupc->times and group->polling writes > + * because stall detection logic in the slowpath relies on > groupc->times > + * changing before group->polling. Explicit smp_wmb is missing because > + * cmpxchg() implies smp_mb. > + */ > + if ((state_mask & group->trigger_mask) && > + atomic_cmpxchg(&group->polling, 0, 1) == 0) { > + /* > + * Start polling immediately even if the work is already > + * scheduled > + */ > + mod_delayed_work(system_wq, &group->clock_work, 1); > + } else { > + if (!delayed_work_pending(&group->clock_work)) > + schedule_delayed_work(&group->clock_work, PSI_FREQ); > + } > } > > and I'm too lazy to go in and figure out how to fix it. > > If we're sure about "psi: fix aggregation idle shut-off" (and I am not) > then can I ask for a redo of "psi: introduce psi monitor"?
I resolved the conflict with "psi: introduce psi monitor" patch and posted v4 at https://lore.kernel.org/lkml/20190206023446.177362-1-sur...@google.com, however please be advised that it also includes additional cleanup changes yet to be reviewed. The first 4 patches in this series are already in linux-next, so this one should apply cleanly there. Please let me know if it creates any other issues. Thanks, Suren.