On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote:
> On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > And as said before, we can compress the state from 12 bytes, to 6 bits
> > (or 1 byte), giving another 11 bytes for 59 bytes free.
> >
> > Leaving us just 5 bytes short of needing a single cacheline :/
> 
> Do you actually need 64 bits for the times?
> 
> That's the big cost. And it seems ridiculous, if you actually care about size.
> 
> You already have a 64-bit start time. Everything else is some
> cumulative relative time. Do those really need 64-bit and nanosecond
> resolution?
> 
> Maybe a 32-bit microsecond would be ok - would you ever account more
> than 35 minutes of anything without starting anew?

D'oh, you're right, the per-cpu buckets don't need to be this big at
all. In fact, we flush those deltas out every 2 seconds when there is
activity to maintain the running averages. Since we get 4.2s worth of
nanoseconds into a u32, we don't even need to divide in the hotpath.

Something along the lines of this here should work:

static void psi_group_change(struct psi_group *group, int cpu, u64 now,
                             unsigned int clear, unsigned int set)
{
        struct psi_group_cpu *groupc;
        unsigned int *tasks;
        unsigned int t;
        u32 delta;

        groupc = per_cpu_ptr(group->cpus, cpu);
        tasks = groupc->tasks;

        /* Time since last task change on this runqueue */
        delta = now - groupc->last_time;
        groupc->last_time = now;

        /* Tasks waited for IO? */
        if (tasks[NR_IOWAIT]) {
                if (!tasks[NR_RUNNING])
                        groupc->full_time[PSI_IO] += delta;
                else
                        groupc->some_time[PSI_IO] += delta;
        }

        /* Tasks waited for memory? */
        if (tasks[NR_MEMSTALL]) {
                if (!tasks[NR_RUNNING] ||
                    (cpu_curr(cpu)->flags & PF_MEMSTALL))
                        groupc->full_time[PSI_MEM] += delta;
                else
                        groupc->some_time[PSI_MEM] += delta;
        }

        /* Tasks waited for the CPU? */
        if (tasks[NR_RUNNING] > 1)
                groupc->some_time[PSI_CPU] += delta;

        /* Tasks were generally non-idle? To weigh the CPU in summaries */
        if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
                groupc->nonidle_time += delta;

        /* Update task counts according to the set/clear bitmasks */
        for (t = 0; clear; clear &= ~(1 << t), t++)
                if (clear & (1 << t))
                        groupc->tasks[t]--;
        for (t = 0; set; set &= ~(1 << t), t++)
                if (set & (1 << t))
                        groupc->tasks[t]++;

        /* Kick the stats aggregation worker if it's gone to sleep */
        if (!delayed_work_pending(&group->clock_work))
                schedule_delayed_work(&group->clock_work, PSI_FREQ);
}

And then we can pack it down to one cacheline:

struct psi_group_cpu {
        /* States of the tasks belonging to this group */
        unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3

        /* Time sampling bucket for pressure states - no FULL for CPU */
        u32 some_time[NR_PSI_RESOURCES];
        u32 full_time[NR_PSI_RESOURCES - 1];

        /* Time sampling bucket for non-idle state (ns) */
        u32 nonidle_time;

        /* Time of last task change in this group (rq_clock) */
        u64 last_time;
};

I'm going to go test with this.

Thanks

Reply via email to