On Fri, 2014-05-02 at 02:08 -0400, Rik van Riel wrote:
> On 05/02/2014 01:58 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-02 at 07:32 +0200, Mike Galbraith wrote:
> >> On Fri, 2014-05-02 at 00:42 -0400, Rik van Riel wrote:
> >>> Currently sync wakeups from the wake_affine code cannot work as
> >>> designed, because the task doing the sync wakeup from the target
> >>> cpu will block its wakee from selecting that cpu.
> >>>
> >>> This is despite the fact that whether or not the wakeup is sync
> >>> determines whether or not we want to do an affine wakeup...
> >>
> >> If the sync hint really did mean we ARE going to schedule RSN, waking
> >> local would be a good thing.  It is all too often a big fat lie.
> >
> > One example of that is say pgbench.  The mother of all work (server
> > thread) for that load wakes with sync hint.  Let the server wake the
> > first of a small herd CPU affine, and that first wakee then preempts
> > the server (mother of all work) that drives the entire load.
> >
> > Byebye throughput.
> >
> > When there's only one wakee, and there's really not enough overlap to
> > at least break even, waking CPU affine is a great idea.  Even when
> > your wakees only run for a short time, if you wake/get_preempted on
> > repeat, the load will serialize.
> 
> I see a similar issue with specjbb2013, with 4 backend and
> 4 frontend JVMs on a 4 node NUMA system.
> 
> The NUMA balancing code nicely places the memory of each JVM
> on one NUMA node, but then the wake_affine code will happily
> run all of the threads anywhere on the system, totally ruining
> memory locality.
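[For readers without kernel/sched/fair.c in front of them, here is a rough
sketch of the kind of decision being argued about.  The struct, the names
(toy_cpu, toy_wake_affine) and the load arithmetic are invented for
illustration and are not the actual kernel code; only the role of the sync
hint mirrors the discussion above.]

	/*
	 * Illustrative sketch of a wake_affine()-style decision.  Not the
	 * real kernel code; the load math is deliberately naive.
	 */
	struct toy_cpu {
		unsigned long load;		/* runnable load on this cpu */
		unsigned long curr_weight;	/* weight of the running task */
	};

	/* Return 1 to wake the task on the waker's cpu, 0 to leave it put. */
	static int toy_wake_affine(struct toy_cpu *this_cpu,
				   struct toy_cpu *prev_cpu,
				   unsigned long wakee_weight, int sync)
	{
		unsigned long this_load = this_cpu->load;
		unsigned long prev_load = prev_cpu->load;

		/*
		 * The sync hint claims the waker is about to sleep, so
		 * pretend its weight is already gone from the waking cpu.
		 * If the hint is a "big fat lie" (the waker keeps running),
		 * the waking cpu looks emptier than it is, and the wakee
		 * ends up preempting its own waker.
		 */
		if (sync)
			this_load -= (this_cpu->curr_weight > this_load) ?
					this_load : this_cpu->curr_weight;

		/* Pull the wakee only if the waking cpu still looks cheaper. */
		return this_load + wakee_weight <= prev_load;
	}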
Hm, I thought numasched got excessive pull crap under control.  For
steady hefty loads, you want to kill all but periodic load balancing
once the thing gets cranked up.  The less you move tasks, the better
the load will perform.  Bursty loads exist too though, damn the bad
luck.

> The front end and back end only exchange a few hundred messages
> a second, over loopback tcp, so the switching rate between
> threads is quite low...
> 
> I wonder if it would make sense for wake_affine to be off by
> default, and only switch on when the right conditions are
> detected, instead of having it on by default like we have now?

Not IMHO, but I have seen situations where that was exactly what I
recommended to fix the throughput problem the user was having.  The
reason is that that case was on a box where FAIR_SLEEPERS is disabled
by default, meaning there is no such thing as wakeup preemption.
Guess what happens when you don't have a shared LLC for a fast/light
wakee to escape to when the waker is a pig.  The worst thing possible
in that case is to wake affine.  Leave the poor thing wherever it was,
else it takes a latency hit it need not have taken.

> I have some ideas on that, but I should probably catch some
> sleep before trying to code them up :)

Yeah, there are many aspects to ponder.

> Meanwhile, the test patch that I posted may help us figure out
> whether the "sync" option in the current wake_affine code does
> anything useful.

If I had a NAK stamp and a digital ink pad, that patch wouldn't be
readable, much less applicable ;-)

	-Mike
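[A hypothetical sketch of Rik's "off by default, on only when conditions
look right" idea, folding in the caveats Mike raises.  None of these names
exist in the kernel; the checks and the threshold parameter are invented
purely to show the shape such a gate might take.]

	/* Invented context describing one wakeup; not a kernel structure. */
	struct wake_ctx {
		int sync;			/* waker passed the sync hint */
		int waker_has_single_wakee;	/* not waking a herd of workers */
		int shares_llc_with_prev;	/* waking cpu shares cache with prev_cpu */
		int wakeup_preemption_enabled;	/* FAIR_SLEEPERS-style preemption on */
		unsigned long waker_avg_run_ns;	/* is the waker a "pig"? */
	};

	static int want_affine_wakeup(const struct wake_ctx *ctx,
				      unsigned long short_run_ns)
	{
		/*
		 * Without wakeup preemption a light wakee cannot escape a
		 * piggish waker, so parking it on the waker's cpu only adds
		 * latency it need not have taken.
		 */
		if (!ctx->wakeup_preemption_enabled &&
		    ctx->waker_avg_run_ns > short_run_ns)
			return 0;

		/*
		 * A sync hint from a waker with a single wakee is the
		 * believable case: the waker really should be gone by the
		 * time the wakee runs.
		 */
		if (ctx->sync && ctx->waker_has_single_wakee)
			return 1;

		/* Otherwise only pull when cache locality is preserved. */
		return ctx->shares_llc_with_prev;
	}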