Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-23 Thread Vinayak Menon

On 5/23/2018 6:47 PM, Johannes Weiner wrote:
> On Wed, May 09, 2018 at 04:33:24PM +0530, Vinayak Menon wrote:
>> On 5/8/2018 2:31 AM, Johannes Weiner wrote:
>>> +   /* Kick the stats aggregation worker if it's gone to sleep */
>>> +   if (!delayed_work_pending(&group->clock_work))
>> This causes a crash when the work is scheduled before system_wq is up. In my 
>> case when the first
>> schedule was called from kthreadd. And I had to do this to make it work.
>> if (keventd_up() && !delayed_work_pending(&group->clock_work))
>>
>>> +   schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
> I was trying to figure out how this is possible, and it didn't make
> sense because we do initialize the system_wq way before kthreadd.
>
> Did you by any chance backport this to a pre-4.10 kernel which does
> not have 3347fa092821 ("workqueue: make workqueue available early
> during boot") yet?

Sorry I did not mention that. I was trying on 4.9 kernel. It's clear now. 
Thanks.

>>> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
>>> +{
>>> +   struct cgroup *cgroup, *parent;
>> unused variables
> They're used in the next patch, I'll fix that up.
>
> Thanks



Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-23 Thread Johannes Weiner
On Wed, May 09, 2018 at 04:33:24PM +0530, Vinayak Menon wrote:
> On 5/8/2018 2:31 AM, Johannes Weiner wrote:
> > +   /* Kick the stats aggregation worker if it's gone to sleep */
> > +   if (!delayed_work_pending(&group->clock_work))
> 
> This causes a crash when the work is scheduled before system_wq is up. In my 
> case when the first
> schedule was called from kthreadd. And I had to do this to make it work.
> if (keventd_up() && !delayed_work_pending(&group->clock_work))
>
> > +   schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);

I was trying to figure out how this is possible, and it didn't make
sense because we do initialize the system_wq way before kthreadd.

Did you by any chance backport this to a pre-4.10 kernel which does
not have 3347fa092821 ("workqueue: make workqueue available early
during boot") yet?

> > +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> > +{
> > +   struct cgroup *cgroup, *parent;
> 
> unused variables

They're used in the next patch, I'll fix that up.

Thanks


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-14 Thread Peter Zijlstra
On Thu, May 10, 2018 at 09:41:32AM -0400, Johannes Weiner wrote:
> So there is a reason I'm tracking productivity states per-cpu and not
> globally. Consider the following example periods on two CPUs:
> 
> CPU 0
> Task 1: | EXECUTING  | memstalled |
> Task 2: | runqueued  | EXECUTING  |
> 
> CPU 1
> Task 3: | memstalled | EXECUTING  |
> 
> If we tracked only the global number of stalled tasks, similarly to
> nr_uninterruptible, the number would be elevated throughout the whole
> sampling period, giving a pressure value of 100% for "some stalled".
> And, since there is always something executing, a "full stall" of 0%.

But if you read the comment about SMP IO-wait; see commit:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from 
io_schedule_timeout() into scheduler")

you'll see that per-cpu accounting has issues too.

Also, note that in your example above you have 1 memstalled task (at any
one time), but _2_ CPUs. So at most you should end up with a 50% value.
There is no way 1 task could consume 2 CPUs worth of time.

Furthermore, associating a blocked task to any particular CPU is
fundamentally broken and I'll hard NAK anything that relies on it.

> Now consider what happens when the Task 3 sequence is the other way
> around:
> 
> CPU 0
> Task 1: | EXECUTING  | memstalled |
> Task 2: | runqueued  | EXECUTING  |
> 
> CPU 1
> Task 3: | EXECUTING  | memstalled |
> 
> Here the number of stalled tasks is elevated only during half of the
> sampling period, this time giving a pressure reading of 50% for "some"
> (and again 0% for "full").

That entirely depends on your averaging; an exponentially decaying
average would not typically result in 50% for the above case. But I
think we can agree that this results in one 0% and one 100% sample -- we
have two stalled tasks and two CPUs.

> That's a different measurement, but in terms of workload progress, the
> sequences are functionally equivalent. In both scenarios the same
> amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
> and the same amount of potentially productive CPU time is lost due to
> the contention of memory. We really ought to read the same pressure.

And you do -- subject to the averaging used, as per the above.

The first gives two 50% samples, the second gives 0%, 100%.

> So what I'm doing is calculating the productivity loss on each CPU in
> a sampling period as if they were independent time slices. It doesn't
> matter how you slice and dice the sequences within each one - if used
> CPU time and lost CPU time have the same proportion, we have the same
> pressure.

I'm still thinking you can do basically the same without the stong CPU
relation.

> To illustrate:
> 
> CPU X
> 1234
> Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
> Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
> Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |
> 
> You can clearly see the 50% of walltime in which *somebody* isn't
> advancing (2 and 4), and the 25% of walltime in which *no* tasks are
> (3). Same amount of work, same memory stalls, same pressure numbers.
> 
> Globalized state tracking would produce those numbers on the single
> CPU (obviously), but once concurrency gets into the mix, it's
> questionable what its results mean. It certainly isn't able to
> reliably detect equivalent slowdowns of individual tasks ("some" is
> all over the place), and in this example wasn't able to capture the
> impact of contention on overall work completion ("full" is 0%).
> 
> * CPU 0: some = 50%, full =  0%
>   CPU 1: some = 50%, full = 50%
> avg: some = 50%, full = 25%

I'm not entirely sure I get your point here; but note that a task
doesn't sleep on a CPU. When it sleeps it is not strictly associated
with a CPU, only when it runs does it have an association.

What is the value of accounting a sleep state to a particular CPU if the
task when wakes up on another? Where did the sleep take place?

All we really can say is that a task slept, and if we can reduce the
reason for its sleeping (IO, reclaim, whatever) then it could've ran
sooner. And then you can make predictions based on the number of CPUs
and global idle time, how much that could improve things.



Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:21:00PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   local_irq_disable();
> > +   rq = this_rq();
> > +   raw_spin_lock(&rq->lock);
> > +   rq_pin_lock(rq, &rf);
> 
> Given that churn in sched.h, you seen rq_lock() and friends.
> 
> Either write this like:
> 
>   local_irq_disable();
>   rq = this_rq();
>   rq_lock(rq, &rf);
> 
> Or instroduce "rq = this_rq_lock_irq()", which we could also use in
> do_sched_yield().

Sounds good, I'll add that.

> > +   update_rq_clock(rq);
> > +
> > +   current->flags |= PF_MEMSTALL;
> > +   psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +   rq_unpin_lock(rq, &rf);
> > +   raw_spin_unlock(&rq->lock);
> > +   local_irq_enable();
> 
> That's called rq_unlock_irq().

I'll use that. This code was first written against a kernel that
didn't have 8a8c69c32778 ("sched/core: Add rq->lock wrappers.") yet ;)


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:14:54PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 15750c222ca2..1658477466d5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
[...]
> What's all this churn about?

The psi callbacks in kernel/sched/stat.h use these rq lock functions
from this file, but sched.h includes stat.h before those definitions.

I'll move this into a separate patch with a proper explanation.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:05:51PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> 
> > +   some[r] /= max(nonidle_total, 1UL);
> > +   full[r] /= max(nonidle_total, 1UL);
> 
> That's a bare 64bit divide.. that typically failed to build on 32bit
> archs.

Ah yes, I'll switch that to do_div(). Thanks


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:04:55PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +static void psi_clock(struct work_struct *work)
> > +{
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> > +   unsigned long nonidle_total = 0;
> > +   unsigned long missed_periods;
> > +   struct delayed_work *dwork;
> > +   struct psi_group *group;
> > +   unsigned long expires;
> > +   int cpu;
> > +   int r;
> > +
> > +   dwork = to_delayed_work(work);
> > +   group = container_of(dwork, struct psi_group, clock_work);
> > +
> > +   /*
> > +* Calculate the sampling period. The clock might have been
> > +* stopped for a while.
> > +*/
> > +   expires = group->period_expires;
> > +   missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> > +   group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> > +
> > +   /*
> > +* Aggregate the per-cpu state into a global state. Each CPU
> > +* is weighted by its non-idle time in the sampling period.
> > +*/
> > +   for_each_online_cpu(cpu) {
> 
> Typically when using online CPU state, you also need hotplug notifiers
> to deal with changes in the online set.
> 
> You also typically need something like cpus_read_lock() around an
> iteration of online CPUs, to avoid the set changing while you're poking
> at them.
> 
> The lack for neither is evident or explained.

The per-cpu state we access is allocated for each possible CPU, so
that is safe (and state being all 0 is semantically sound, too). In a
race with onlining, we might miss some per-cpu samples, but would
catch them the next time. In a race with offlining, we may never
consider the final up to 2s state history of the disappearing CPU; we
could have an offlining callback to flush the state, but I'm not sure
this would be an actual problem in the real world since the error is
small (smallest averaging window is 5 sampling periods) and then would
age out quickly.

I can certainly add a comment explaining this at least.

> > +   struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> > +   unsigned long nonidle;
> > +
> > +   nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> > +   groupc->nonidle_time = 0;
> > +   nonidle_total += nonidle;
> > +
> > +   for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +   struct psi_resource *res = &groupc->res[r];
> > +
> > +   some[r] += (res->times[0] + res->times[1]) * nonidle;
> > +   full[r] += res->times[1] * nonidle;
> > +
> > +   /* It's racy, but we can tolerate some error */
> > +   res->times[0] = 0;
> > +   res->times[1] = 0;
> > +   }
> > +   }


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 11:59:38AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > new file mode 100644
> > index ..b22b0ffc729d
> > --- /dev/null
> > +++ b/include/linux/psi_types.h
> > @@ -0,0 +1,84 @@
> > +#ifndef _LINUX_PSI_TYPES_H
> > +#define _LINUX_PSI_TYPES_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_PSI
> > +
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +   NR_RUNNING,
> > +   NR_IOWAIT,
> > +   NR_MEMSTALL,
> > +   NR_PSI_TASK_COUNTS,
> > +};
> > +
> > +/* Task state bitmasks */
> > +#define TSK_RUNNING(1 << NR_RUNNING)
> > +#define TSK_IOWAIT (1 << NR_IOWAIT)
> > +#define TSK_MEMSTALL   (1 << NR_MEMSTALL)
> > +
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +   PSI_CPU,
> > +   PSI_MEM,
> > +   PSI_IO,
> > +   NR_PSI_RESOURCES,
> > +};
> > +
> > +/* Pressure states for a group of tasks */
> > +enum psi_state {
> > +   PSI_NONE,   /* No stalled tasks */
> > +   PSI_SOME,   /* Stalled tasks & working tasks */
> > +   PSI_FULL,   /* Stalled tasks & no working tasks */
> > +   NR_PSI_STATES,
> > +};
> > +
> > +struct psi_resource {
> > +   /* Current pressure state for this resource */
> > +   enum psi_state state;
> > +
> > +   /* Start of current state (cpu_clock) */
> > +   u64 state_start;
> > +
> > +   /* Time sampling buckets for pressure states (ns) */
> > +   u64 times[NR_PSI_STATES - 1];
> 
> Fails to explain why no FULL.

It's NONE that's excluded. I'll add a comment.

> > +struct psi_group_cpu {
> > +   /* States of the tasks belonging to this group */
> > +   unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +
> 
> AFAICT there's a hole here, that would fit the @nonidle member. Which
> also avoids the later hole generated by it.

Good spot, I'll reshuffle this accordingly.

> > +   /* Per-resource pressure tracking in this group */
> > +   struct psi_resource res[NR_PSI_RESOURCES];
> > +
> > +   /* There are runnable or D-state tasks */
> > +   bool nonidle;
> 
> Mandatory complaint about using _Bool in composites goes here.

int it is.

Thanks


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > 
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> > > state, int wake_flags)
> > >   cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > >   if (task_cpu(p) != cpu) {
> > >   wake_flags |= WF_MIGRATED;
> > > + psi_ttwu_dequeue(p);
> > >   set_task_cpu(p, cpu);
> > >   }
> > >  
> > 
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + /*
> > > +  * Is the task being migrated during a wakeup? Make sure to
> > > +  * deregister its sleep-persistent psi states from the old
> > > +  * queue, and let psi_enqueue() know it has to requeue.
> > > +  */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, &rf);
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, &rf);
> > > + }
> > > +}
> > 
> > Yeah, no... not happening.
> > 
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> > 
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
> 
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
> 
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.

So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | memstalled | EXECUTING  |

If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.

Now consider what happens when the Task 3 sequence is the other way
around:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | EXECUTING  | memstalled |

Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").

That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.

So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices. It doesn't
matter how you slice and dice the sequences within each one - if used
CPU time and lost CPU time have the same proportion, we have the same
pressure.

In both scenarios above, this method will give a pressure reading of
some=50% and full=25% of "normalized walltime", which is the time loss
the work would experience on a single CPU executing it serially.

To illustrate:

CPU X
1234
Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |

You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(3). Same amount of work, same memory stalls, same pressure numbers.

Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).

* CPU 0: some = 50%, full =  0%
  CPU 1: some = 50%, full = 50%
avg: some = 50%, full = 25%


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> 
> > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> > state, int wake_flags)
> > cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > if (task_cpu(p) != cpu) {
> > wake_flags |= WF_MIGRATED;
> > +   psi_ttwu_dequeue(p);
> > set_task_cpu(p, cpu);
> > }
> >  
> 
> > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > +{
> > +   /*
> > +* Is the task being migrated during a wakeup? Make sure to
> > +* deregister its sleep-persistent psi states from the old
> > +* queue, and let psi_enqueue() know it has to requeue.
> > +*/
> > +   if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +   int clear = 0;
> > +
> > +   if (p->in_iowait)
> > +   clear |= TSK_IOWAIT;
> > +   if (p->flags & PF_MEMSTALL)
> > +   clear |= TSK_MEMSTALL;
> > +
> > +   rq = __task_rq_lock(p, &rf);
> > +   update_rq_clock(rq);
> > +   psi_task_change(p, rq_clock(rq), clear, 0);
> > +   p->sched_psi_wake_requeue = 1;
> > +   __task_rq_unlock(rq, &rf);
> > +   }
> > +}
> 
> Yeah, no... not happening.
> 
> We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> was the one pushing for that, so he should very well know this.
> 
> The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> accounting being useless makes it even worse), adding significant remote
> prodding is just really bad.

Also, since all you need is the global number, I don't think you
actually need any of this. See what we do for nr_uninterruptible.

In general I think you want to (re)read loadavg.c some more, and maybe
reuse a bit more of that.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Vinayak Menon

On 5/8/2018 2:31 AM, Johannes Weiner wrote:
> +static void psi_group_update(struct psi_group *group, int cpu, u64 now,
> +  unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;
> +
> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d 
> idx=%d tasks=[%u %u %u %u]\n",
> + cpu, idx, tasks[0], tasks[1],
> + tasks[2], tasks[3]);
> + psi_bug = 1;
> + }
> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;
> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(&groupc->res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_IO], state, now);
> +
> + /* Time in which tasks are non-idle, to weigh the CPU in summaries */
> + if (groupc->nonidle)
> + groupc->nonidle_time += now - groupc->nonidle_start;
> + groupc->nonidle = tasks[NR_RUNNING] ||
> + tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> + if (groupc->nonidle)
> + groupc->nonidle_start = now;
> +
> + /* Kick the stats aggregation worker if it's gone to sleep */
> + if (!delayed_work_pending(&group->clock_work))

This causes a crash when the work is scheduled before system_wq is up. In my 
case when the first
schedule was called from kthreadd. And I had to do this to make it work.
if (keventd_up() && !delayed_work_pending(&group->clock_work))

> + schedule_delayed_work(&group->clock_work, MY_LOAD_FREQ);
> +}
> +
> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> +{
> + struct cgroup *cgroup, *parent;

unused variables

Thanks,
Vinayak


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:

> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>   return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>   extern int delayacct_on;
> - return delayacct_on;
> -#else
> - return 0;
> + if (delayacct_on)
> + return 1;
> +#elif defined(CONFIG_PSI)
> + extern int psi_disabled;
> + if (!psi_disabled)
> + return 1;
>  #endif
> + return 0;
>  }

> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8aea199a39b4..cb4a68bcf37a 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -55,12 +55,90 @@ static inline void rq_sched_info_depart  (struct rq *rq, 
> unsigned long long delt
>  # define   schedstat_val_or_zero(var)0
>  #endif /* CONFIG_SCHEDSTATS */
>  
> +#ifdef CONFIG_PSI
> +/*
> + * PSI tracks state that persists across sleeps, such as iowaits and
> + * memory stalls. As a result, it has to distinguish between sleeps,
> + * where a task's runnable state changes, and requeues, where a task
> + * and its state are being moved between CPUs and runqueues.
> + */
> +static inline void psi_enqueue(struct task_struct *p, u64 now)
> +{
> + int clear = 0, set = TSK_RUNNING;
> +
> + if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
> + if (p->flags & PF_MEMSTALL)
> + set |= TSK_MEMSTALL;
> + p->sched_psi_wake_requeue = 0;
> + } else {
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +static inline void psi_dequeue(struct task_struct *p, u64 now)
> +{
> + int clear = TSK_RUNNING, set = 0;
> +
> + if (p->state == TASK_RUNNING) {
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> + } else {
> + if (p->in_iowait)
> + set |= TSK_IOWAIT;
> + }
> +
> + psi_task_change(p, now, clear, set);
> +}
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + /*
> +  * Is the task being migrated during a wakeup? Make sure to
> +  * deregister its sleep-persistent psi states from the old
> +  * queue, and let psi_enqueue() know it has to requeue.
> +  */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, &rf);
> + }
> +}

That all seems to be missing psi_disabled tests.. Yes I know it's
burried down in psi_task_change() somewhere, but that's really (too)
late.

(also, you seem to be conserving whitespace; typically we have an empty
lines between functions)


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:

> @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> state, int wake_flags)
>   cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>   if (task_cpu(p) != cpu) {
>   wake_flags |= WF_MIGRATED;
> + psi_ttwu_dequeue(p);
>   set_task_cpu(p, cpu);
>   }
>  

> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> + /*
> +  * Is the task being migrated during a wakeup? Make sure to
> +  * deregister its sleep-persistent psi states from the old
> +  * queue, and let psi_enqueue() know it has to requeue.
> +  */
> + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> + struct rq_flags rf;
> + struct rq *rq;
> + int clear = 0;
> +
> + if (p->in_iowait)
> + clear |= TSK_IOWAIT;
> + if (p->flags & PF_MEMSTALL)
> + clear |= TSK_MEMSTALL;
> +
> + rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> + psi_task_change(p, rq_clock(rq), clear, 0);
> + p->sched_psi_wake_requeue = 1;
> + __task_rq_unlock(rq, &rf);
> + }
> +}

Yeah, no... not happening.

We spend a lot of time to never touch the old rq->lock on wakeups. Mason
was the one pushing for that, so he should very well know this.

The one cross-cpu atomic (iowait) is already a problem (the whole iowait
accounting being useless makes it even worse), adding significant remote
prodding is just really bad.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +static void psi_clock(struct work_struct *work)
> +{

> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +

> +
> + /* Keep the clock ticking only when there is action */
> + if (nonidle_total)
> + schedule_delayed_work(dwork, MY_LOAD_FREQ);
> +}

Note that this doesn't generate a stable frequency for the callback.
The (nondeterministic) time spend doing the actual work is added to each
period, this gives an unconditional downward bias to the frequency, but
also makes it very unstable.

You want explicit management of timer->expires, and add MY_LOAD_FREQ
(which is a misnomer) to it and not reset it based on jiffies.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +/**
> + * psi_memstall_enter - mark the beginning of a memory stall section
> + * @flags: flags to handle nested sections
> + *
> + * Marks the calling task as being stalled due to a lack of memory,
> + * such as waiting for a refault or performing reclaim.
> + */
> +void psi_memstall_enter(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + *flags = current->flags & PF_MEMSTALL;
> + if (*flags)
> + return;
> + /*
> +  * PF_MEMSTALL setting & accounting needs to be atomic wrt
> +  * changes to the task's scheduling state, otherwise we can
> +  * race with CPU migration.
> +  */
> + local_irq_disable();
> + rq = this_rq();
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, &rf);

Given that churn in sched.h, you seen rq_lock() and friends.

Either write this like:

local_irq_disable();
rq = this_rq();
rq_lock(rq, &rf);

Or instroduce "rq = this_rq_lock_irq()", which we could also use in
do_sched_yield().

> + update_rq_clock(rq);
> +
> + current->flags |= PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> +
> + rq_unpin_lock(rq, &rf);
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();

That's called rq_unlock_irq().

> +}
> +
> +/**
> + * psi_memstall_leave - mark the end of an memory stall section
> + * @flags: flags to handle nested memdelay sections
> + *
> + * Marks the calling task as no longer stalled due to lack of memory.
> + */
> +void psi_memstall_leave(unsigned long *flags)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (*flags)
> + return;
> + /*
> +  * PF_MEMSTALL clearing & accounting needs to be atomic wrt
> +  * changes to the task's scheduling state, otherwise we could
> +  * race with CPU migration.
> +  */
> + local_irq_disable();
> + rq = this_rq();
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, &rf);
> +
> + update_rq_clock(rq);
> +
> + current->flags &= ~PF_MEMSTALL;
> + psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
> +
> + rq_unpin_lock(rq, &rf);
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();
> +}

Idem.




Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 15750c222ca2..1658477466d5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h

> @@ -919,6 +921,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  #define cpu_curr(cpu)(cpu_rq(cpu)->curr)
>  #define raw_rq() raw_cpu_ptr(&runqueues)
>  
> +extern void update_rq_clock(struct rq *rq);
> +
>  static inline u64 __rq_clock_broken(struct rq *rq)
>  {
>   return READ_ONCE(rq->clock);
> @@ -1037,6 +1041,86 @@ static inline void rq_repin_lock(struct rq *rq, struct 
> rq_flags *rf)
>  #endif
>  }
>  
> +struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> + __acquires(rq->lock);
> +
> +struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> + __acquires(p->pi_lock)
> + __acquires(rq->lock);
> +
> +static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> +}
> +
> +static inline void
> +task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> + __releases(rq->lock)
> + __releases(p->pi_lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> + raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> +}
> +
> +static inline void
> +rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock_irqsave(&rq->lock, rf->flags);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_lock_irq(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock_irq(&rq->lock);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_lock(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock(&rq->lock);
> + rq_pin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_relock(struct rq *rq, struct rq_flags *rf)
> + __acquires(rq->lock)
> +{
> + raw_spin_lock(&rq->lock);
> + rq_repin_lock(rq, rf);
> +}
> +
> +static inline void
> +rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
> +}
> +
> +static inline void
> +rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock_irq(&rq->lock);
> +}
> +
> +static inline void
> +rq_unlock(struct rq *rq, struct rq_flags *rf)
> + __releases(rq->lock)
> +{
> + rq_unpin_lock(rq, rf);
> + raw_spin_unlock(&rq->lock);
> +}
> +
>  #ifdef CONFIG_NUMA
>  enum numa_topology_type {
>   NUMA_DIRECT,
> @@ -1670,8 +1754,6 @@ static inline void sub_nr_running(struct rq *rq, 
> unsigned count)
>   sched_update_tick_dependency(rq);
>  }
>  
> -extern void update_rq_clock(struct rq *rq);
> -
>  extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
>  extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
>  
> @@ -1752,86 +1834,6 @@ static inline void sched_rt_avg_update(struct rq *rq, 
> u64 rt_delta) { }
>  static inline void sched_avg_update(struct rq *rq) { }
>  #endif
>  
> -struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> - __acquires(rq->lock);
> -
> -struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> - __acquires(p->pi_lock)
> - __acquires(rq->lock);
> -
> -static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock(&rq->lock);
> -}
> -
> -static inline void
> -task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> - __releases(rq->lock)
> - __releases(p->pi_lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock(&rq->lock);
> - raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> -}
> -
> -static inline void
> -rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock_irqsave(&rq->lock, rf->flags);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_lock_irq(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock_irq(&rq->lock);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_lock(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock(&rq->lock);
> - rq_pin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_relock(struct rq *rq, struct rq_flags *rf)
> - __acquires(rq->lock)
> -{
> - raw_spin_lock(&rq->lock);
> - rq_repin_lock(rq, rf);
> -}
> -
> -static inline void
> -rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
> - __releases(rq->lock)
> -{
> - rq_unpin_lock(rq, rf);
> - raw_spin_unlock_irqrestore(&rq->lock, rf

Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };

> + some[r] /= max(nonidle_total, 1UL);
> + full[r] /= max(nonidle_total, 1UL);

That's a bare 64bit divide.. that typically failed to build on 32bit
archs.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> +static void psi_clock(struct work_struct *work)
> +{
> + u64 some[NR_PSI_RESOURCES] = { 0, };
> + u64 full[NR_PSI_RESOURCES] = { 0, };
> + unsigned long nonidle_total = 0;
> + unsigned long missed_periods;
> + struct delayed_work *dwork;
> + struct psi_group *group;
> + unsigned long expires;
> + int cpu;
> + int r;
> +
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +
> + /*
> +  * Calculate the sampling period. The clock might have been
> +  * stopped for a while.
> +  */
> + expires = group->period_expires;
> + missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> + group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> +
> + /*
> +  * Aggregate the per-cpu state into a global state. Each CPU
> +  * is weighted by its non-idle time in the sampling period.
> +  */
> + for_each_online_cpu(cpu) {

Typically when using online CPU state, you also need hotplug notifiers
to deal with changes in the online set.

You also typically need something like cpus_read_lock() around an
iteration of online CPUs, to avoid the set changing while you're poking
at them.

The lack for neither is evident or explained.

> + struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> + unsigned long nonidle;
> +
> + nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> + groupc->nonidle_time = 0;
> + nonidle_total += nonidle;
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + struct psi_resource *res = &groupc->res[r];
> +
> + some[r] += (res->times[0] + res->times[1]) * nonidle;
> + full[r] += res->times[1] * nonidle;
> +
> + /* It's racy, but we can tolerate some error */
> + res->times[0] = 0;
> + res->times[1] = 0;
> + }
> + }
> +
> + for (r = 0; r < NR_PSI_RESOURCES; r++) {
> + /* Finish the weighted aggregation */
> + some[r] /= max(nonidle_total, 1UL);
> + full[r] /= max(nonidle_total, 1UL);
> +
> + /* Accumulate stall time */
> + group->some[r] += some[r];
> + group->full[r] += full[r];
> +
> + /* Calculate recent pressure averages */
> + calc_avgs(group->avg_some[r], some[r], missed_periods);
> + calc_avgs(group->avg_full[r], full[r], missed_periods);
> + }
> +
> + /* Keep the clock ticking only when there is action */
> + if (nonidle_total)
> + schedule_delayed_work(dwork, MY_LOAD_FREQ);
> +}


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-09 Thread Peter Zijlstra
On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> new file mode 100644
> index ..b22b0ffc729d
> --- /dev/null
> +++ b/include/linux/psi_types.h
> @@ -0,0 +1,84 @@
> +#ifndef _LINUX_PSI_TYPES_H
> +#define _LINUX_PSI_TYPES_H
> +
> +#include 
> +
> +#ifdef CONFIG_PSI
> +
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING  (1 << NR_RUNNING)
> +#define TSK_IOWAIT   (1 << NR_IOWAIT)
> +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> + PSI_NONE,   /* No stalled tasks */
> + PSI_SOME,   /* Stalled tasks & working tasks */
> + PSI_FULL,   /* Stalled tasks & no working tasks */
> + NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> + /* Current pressure state for this resource */
> + enum psi_state state;
> +
> + /* Start of current state (cpu_clock) */
> + u64 state_start;
> +
> + /* Time sampling buckets for pressure states (ns) */
> + u64 times[NR_PSI_STATES - 1];

Fails to explain why no FULL.

> +};
> +
> +struct psi_group_cpu {
> + /* States of the tasks belonging to this group */
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> +

AFAICT there's a hole here, that would fit the @nonidle member. Which
also avoids the later hole generated by it.

> + /* Per-resource pressure tracking in this group */
> + struct psi_resource res[NR_PSI_RESOURCES];
> +
> + /* There are runnable or D-state tasks */
> + bool nonidle;

Mandatory complaint about using _Bool in composites goes here.

> + /* Start of current non-idle state (cpu_clock) */
> + u64 nonidle_start;
> +
> + /* Time sampling bucket for non-idle state (ns) */
> + u64 nonidle_time;
> +};
> +
> +struct psi_group {
> + struct psi_group_cpu *cpus;
> +
> + struct delayed_work clock_work;
> + unsigned long period_expires;
> +
> + u64 some[NR_PSI_RESOURCES];
> + u64 full[NR_PSI_RESOURCES];
> +
> + unsigned long avg_some[NR_PSI_RESOURCES][3];
> + unsigned long avg_full[NR_PSI_RESOURCES][3];
> +};


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-08 Thread Johannes Weiner
On Mon, May 07, 2018 at 05:42:36PM -0700, Randy Dunlap wrote:
> On 05/07/2018 02:01 PM, Johannes Weiner wrote:
> > + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> > + * windows. Cumluative stall times are tracked and exported as well to
> 
>Cumulative
> 

> > +/**
> > + * psi_memstall_leave - mark the end of an memory stall section
> 
> end of a memory

Thanks Randy, I'll get those fixed.


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-08 Thread Johannes Weiner
On Tue, May 08, 2018 at 11:04:09AM +0800, kbuild test robot wrote:
>118#else /* CONFIG_PSI */
>119static inline void psi_enqueue(struct task_struct *p, u64 now)
>120{
>121}
>122static inline void psi_dequeue(struct task_struct *p, u64 now)
>123{
>124}
>125static inline void psi_ttwu_dequeue(struct task_struct *p) {}
>  > 126{
>127}

Stupid last-minute cleanup reshuffling. v2 will have:

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index cb4a68bcf37a..ff6256b3d216 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -122,7 +122,7 @@ static inline void psi_enqueue(struct task_struct *p, u64 
now)
 static inline void psi_dequeue(struct task_struct *p, u64 now)
 {
 }
-static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p)
 {
 }
 #endif /* CONFIG_PSI */


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread kbuild test robot
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: x86_64-randconfig-x012-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from kernel/livepatch/../sched/sched.h:1317:0,
from kernel/livepatch/transition.c:27:
>> kernel/livepatch/../sched/stats.h:126:1: error: expected identifier or '(' 
>> before '{' token
{
^

vim +126 kernel/livepatch/../sched/stats.h

57  
58  #ifdef CONFIG_PSI
59  /*
60   * PSI tracks state that persists across sleeps, such as iowaits and
61   * memory stalls. As a result, it has to distinguish between sleeps,
62   * where a task's runnable state changes, and requeues, where a task
63   * and its state are being moved between CPUs and runqueues.
64   */
65  static inline void psi_enqueue(struct task_struct *p, u64 now)
66  {
67  int clear = 0, set = TSK_RUNNING;
68  
69  if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70  if (p->flags & PF_MEMSTALL)
71  set |= TSK_MEMSTALL;
72  p->sched_psi_wake_requeue = 0;
73  } else {
74  if (p->in_iowait)
75  clear |= TSK_IOWAIT;
76  }
77  
78  psi_task_change(p, now, clear, set);
79  }
80  static inline void psi_dequeue(struct task_struct *p, u64 now)
81  {
82  int clear = TSK_RUNNING, set = 0;
83  
84  if (p->state == TASK_RUNNING) {
85  if (p->flags & PF_MEMSTALL)
86  clear |= TSK_MEMSTALL;
87  } else {
88  if (p->in_iowait)
89  set |= TSK_IOWAIT;
90  }
91  
92  psi_task_change(p, now, clear, set);
93  }
94  static inline void psi_ttwu_dequeue(struct task_struct *p)
95  {
96  /*
97   * Is the task being migrated during a wakeup? Make sure to
98   * deregister its sleep-persistent psi states from the old
99   * queue, and let psi_enqueue() know it has to requeue.
   100   */
   101  if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
   102  struct rq_flags rf;
   103  struct rq *rq;
   104  int clear = 0;
   105  
   106  if (p->in_iowait)
   107  clear |= TSK_IOWAIT;
   108  if (p->flags & PF_MEMSTALL)
   109  clear |= TSK_MEMSTALL;
   110  
   111  rq = __task_rq_lock(p, &rf);
   112  update_rq_clock(rq);
   113  psi_task_change(p, rq_clock(rq), clear, 0);
   114  p->sched_psi_wake_requeue = 1;
   115  __task_rq_unlock(rq, &rf);
   116  }
   117  }
   118  #else /* CONFIG_PSI */
   119  static inline void psi_enqueue(struct task_struct *p, u64 now)
   120  {
   121  }
   122  static inline void psi_dequeue(struct task_struct *p, u64 now)
   123  {
   124  }
   125  static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 > 126  {
   127  }
   128  #endif /* CONFIG_PSI */
   129  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread kbuild test robot
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: i386-randconfig-x073-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from kernel/sched/sched.h:1317:0,
from kernel/sched/core.c:8:
>> kernel/sched/stats.h:126:1: error: expected identifier or '(' before '{' 
>> token
{
^

vim +126 kernel/sched/stats.h

57  
58  #ifdef CONFIG_PSI
59  /*
60   * PSI tracks state that persists across sleeps, such as iowaits and
61   * memory stalls. As a result, it has to distinguish between sleeps,
62   * where a task's runnable state changes, and requeues, where a task
63   * and its state are being moved between CPUs and runqueues.
64   */
65  static inline void psi_enqueue(struct task_struct *p, u64 now)
66  {
67  int clear = 0, set = TSK_RUNNING;
68  
69  if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70  if (p->flags & PF_MEMSTALL)
71  set |= TSK_MEMSTALL;
72  p->sched_psi_wake_requeue = 0;
73  } else {
74  if (p->in_iowait)
75  clear |= TSK_IOWAIT;
76  }
77  
78  psi_task_change(p, now, clear, set);
79  }
80  static inline void psi_dequeue(struct task_struct *p, u64 now)
81  {
82  int clear = TSK_RUNNING, set = 0;
83  
84  if (p->state == TASK_RUNNING) {
85  if (p->flags & PF_MEMSTALL)
86  clear |= TSK_MEMSTALL;
87  } else {
88  if (p->in_iowait)
89  set |= TSK_IOWAIT;
90  }
91  
92  psi_task_change(p, now, clear, set);
93  }
94  static inline void psi_ttwu_dequeue(struct task_struct *p)
95  {
96  /*
97   * Is the task being migrated during a wakeup? Make sure to
98   * deregister its sleep-persistent psi states from the old
99   * queue, and let psi_enqueue() know it has to requeue.
   100   */
   101  if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
   102  struct rq_flags rf;
   103  struct rq *rq;
   104  int clear = 0;
   105  
   106  if (p->in_iowait)
   107  clear |= TSK_IOWAIT;
   108  if (p->flags & PF_MEMSTALL)
   109  clear |= TSK_MEMSTALL;
   110  
   111  rq = __task_rq_lock(p, &rf);
   112  update_rq_clock(rq);
   113  psi_task_change(p, rq_clock(rq), clear, 0);
   114  p->sched_psi_wake_requeue = 1;
   115  __task_rq_unlock(rq, &rf);
   116  }
   117  }
   118  #else /* CONFIG_PSI */
   119  static inline void psi_enqueue(struct task_struct *p, u64 now)
   120  {
   121  }
   122  static inline void psi_dequeue(struct task_struct *p, u64 now)
   123  {
   124  }
   125  static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 > 126  {
   127  }
   128  #endif /* CONFIG_PSI */
   129  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Randy Dunlap
On 05/07/2018 02:01 PM, Johannes Weiner wrote:
> 
> Signed-off-by: Johannes Weiner 
> ---
>  Documentation/accounting/psi.txt |  73 ++
>  include/linux/psi.h  |  27 ++
>  include/linux/psi_types.h|  84 ++
>  include/linux/sched.h|  10 +
>  include/linux/sched/stat.h   |  10 +-
>  init/Kconfig |  16 ++
>  kernel/fork.c|   4 +
>  kernel/sched/Makefile|   1 +
>  kernel/sched/core.c  |   3 +
>  kernel/sched/psi.c   | 424 +++
>  kernel/sched/sched.h | 166 ++--
>  kernel/sched/stats.h |  91 ++-
>  mm/compaction.c  |   5 +
>  mm/filemap.c |  15 +-
>  mm/page_alloc.c  |  10 +
>  mm/vmscan.c  |  13 +
>  16 files changed, 859 insertions(+), 93 deletions(-)
>  create mode 100644 Documentation/accounting/psi.txt
>  create mode 100644 include/linux/psi.h
>  create mode 100644 include/linux/psi_types.h
>  create mode 100644 kernel/sched/psi.c
> 
> diff --git a/Documentation/accounting/psi.txt 
> b/Documentation/accounting/psi.txt
> new file mode 100644
> index ..e051810d5127
> --- /dev/null
> +++ b/Documentation/accounting/psi.txt
> @@ -0,0 +1,73 @@

Looks good to me.


> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> new file mode 100644
> index ..052c529a053b
> --- /dev/null
> +++ b/kernel/sched/psi.c
> @@ -0,0 +1,424 @@
> +/*
> + * Measure workload productivity impact from overcommitting CPU, memory, IO
> + *
> + * Copyright (c) 2017 Facebook, Inc.
> + * Author: Johannes Weiner 
> + *
> + * Implementation
> + *
> + * Task states -- running, iowait, memstall -- are tracked through the
> + * scheduler and aggregated into a system-wide productivity state. The
> + * ratio between the times spent in productive states and delays tells
> + * us the overall productivity of the workload.
> + *
> + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> + * windows. Cumluative stall times are tracked and exported as well to

   Cumulative

> + * allow detection of latency spikes and custom time averaging.
> + *
> + * Multiple CPUs
> + *
> + * To avoid cache contention, times are tracked local to the CPUs. To
> + * get a comprehensive view of a system or cgroup, we have to consider
> + * the fact that CPUs could be unevenly loaded or even entirely idle
> + * if the workload doesn't have enough threads. To avoid artifacts
> + * caused by that, when adding up the global pressure ratio, the
> + * CPU-local ratios are weighed according to their non-idle time:
> + *
> + *   Time the CPU had stalled tasksTime the CPU was non-idle
> + *   -- * ---
> + *WalltimeTime all CPUs were non-idle
> + */


> +
> +/**
> + * psi_memstall_leave - mark the end of an memory stall section

end of a memory

> + * @flags: flags to handle nested memdelay sections
> + *
> + * Marks the calling task as no longer stalled due to lack of memory.
> + */
> +void psi_memstall_leave(unsigned long *flags)
> +{



-- 
~Randy


[PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

   cpu: some tasks are runnable but not executing on a CPU
   memory: tasks are reclaiming, or waiting for swapin or thrashing cache
   io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

Signed-off-by: Johannes Weiner 
---
 Documentation/accounting/psi.txt |  73 ++
 include/linux/psi.h  |  27 ++
 include/linux/psi_types.h|  84 ++
 include/linux/sched.h|  10 +
 include/linux/sched/stat.h   |  10 +-
 init/Kconfig |  16 ++
 kernel/fork.c|   4 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  |   3 +
 kernel/sched/psi.c   | 424 +++
 kernel/sched/sched.h | 166 ++--
 kernel/sched/stats.h |  91 ++-
 mm/compaction.c  |   5 +
 mm/filemap.c |  15 +-
 mm/page_alloc.c  |  10 +
 mm/vmscan.c  |  13 +
 16 files changed, 859 insertions(+), 93 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index ..e051810d5127
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,73 @@
+
+PSI - Pressure Stall Information
+
+
+:Date: April, 2018
+:Author: Johannes Weiner 
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+In both cases, the format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but