On Tue, Jul 17, 2018 at 7:45 AM Michael Ellerman <m...@ellerman.id.au> wrote:
>
> Interesting. I don't see anything as high as 18%, it's more spread out:
>
>    7.81%  context_switch  [kernel.kallsyms]  [k] cgroup_rstat_updated
Oh, see that's the difference. You're running in a non-root cgroup, I
think.

That also means that your scheduler overhead has way more spinlocks,
and in particular, you have that

        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
        ..
        raw_spin_lock_irqsave(cpu_lock, flags);

there too. So you have at least twice the spinlocks that my case had,
and yes, the costs are way more spread out because your case has all
that cgroup accounting too.

That said, I don't understand the powerpc memory ordering. I thought
the rules were "isync on lock, lwsync on unlock".

That's what the AIX docs imply, at least. In particular, I find:

  "isync is not a memory barrier instruction, but the
   load-compare-conditional branch-isync sequence can provide this
   ordering property"

so why are you doing "sync/lwsync", when it sounds like "isync/lwsync"
(for lock/unlock) is the right thing and would already give memory
barrier semantics?

               Linus