This isn't strictly related to Open MPI, but all of us here care about NUMA, locality, and performance, so I thought I'd pass along something that Brice forwarded to the hwloc-devel list.
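For context before the forward: the "nexttouch" Brice mentions below is hwloc's HWLOC_MEMBIND_NEXTTOUCH policy, which asks the OS to migrate pages to whichever NUMA node next touches them. Here's a minimal, untested sketch of requesting it through the hwloc 1.x API; on OSes without next-touch support (which is most of them today) the call simply fails with errno set to ENOSYS or EXDEV:

    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* Does this OS claim next-touch support at all? */
        const struct hwloc_topology_support *sup =
            hwloc_topology_get_support(topo);
        if (!sup->membind->nexttouch_membind)
            fprintf(stderr, "no next-touch support here; expect failure\n");

        size_t len = 1UL << 20;
        void *buf = malloc(len);

        /* Ask that pages in [buf, buf+len) migrate to the node of
         * whichever thread touches them next, anywhere in the machine. */
        if (hwloc_set_area_membind(topo, buf, len,
                                   hwloc_topology_get_complete_cpuset(topo),
                                   HWLOC_MEMBIND_NEXTTOUCH, 0) < 0)
            perror("hwloc_set_area_membind");

        free(buf);
        hwloc_topology_destroy(topo);
        return 0;
    }

Brice's point is that the MPOL_MF_LAZY bits in the patch set below would finally give Linux something hwloc could implement this policy on top of.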
See Brice's note below, and the original mail to the LKML below that.

Begin forwarded message:

> From: Brice Goglin <brice.gog...@inria.fr>
> Subject: [hwloc-devel] possible membind changes coming in the Linux kernel
> Date: March 16, 2012 11:11:35 AM EDT
> To: hwloc development <hwloc-de...@open-mpi.org>
> Reply-To: Hardware locality development list <hwloc-de...@open-mpi.org>
>
> We'll have to check the compatibility of this thing with the hwloc membind
> API if/when it gets merged in the Linux kernel. Lee Schermerhorn's
> Migrate-on-Fault is supposed to be hwloc_membind_nexttouch; that would be
> very good news.
>
> Brice
>
>
> -------- Original Message --------
> Subject: [RFC][PATCH 00/26] sched/numa
> Date: Fri, 16 Mar 2012 15:40:28 +0100
> From: Peter Zijlstra <a.p.zijls...@chello.nl>
> To: Linus Torvalds <torva...@linux-foundation.org>, Andrew Morton
> <a...@linux-foundation.org>, Thomas Gleixner <t...@linutronix.de>, Ingo
> Molnar <mi...@elte.hu>, Paul Turner <p...@google.com>, Suresh Siddha
> <suresh.b.sid...@intel.com>, Mike Galbraith <efa...@gmx.de>, "Paul E.
> McKenney" <paul...@linux.vnet.ibm.com>, Lai Jiangshan <la...@cn.fujitsu.com>,
> Dan Smith <da...@us.ibm.com>, Bharata B Rao <bharata....@gmail.com>, Lee
> Schermerhorn <lee.schermerh...@hp.com>, Andrea Arcangeli
> <aarca...@redhat.com>, Rik van Riel <r...@redhat.com>, Johannes Weiner
> <han...@cmpxchg.org>
> Cc: linux-ker...@vger.kernel.org, linux...@kvack.org
>
> Hi All,
>
> While the current scheduler has knowledge of the machine topology, including
> NUMA (although there's room for improvement there as well [1]), it is
> completely insensitive to which nodes a task's memory actually is on.
>
> Current upstream task memory allocation prefers the node the task is
> currently running on (unless explicitly told otherwise, see
> mbind()/set_mempolicy()), and with the scheduler free to move the task
> about at will, the task's memory can end up spread all over the machine's
> nodes.
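The explicit-placement interface Peter mentions there is mbind(2)/set_mempolicy(2). For anyone who hasn't used it, here's a minimal sketch of pinning an anonymous mapping to node 0 through libnuma's <numaif.h> wrapper (illustrative only; link with -lnuma):

    #include <numaif.h>      /* mbind(), MPOL_*; link with -lnuma */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* One bit per node; bit 0 == node 0.  MPOL_MF_MOVE also migrates
         * any pages that were already faulted in on another node. */
        unsigned long nodemask = 1UL << 0;
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0)
            perror("mbind");

        munmap(buf, len);
        return 0;
    }

Note that even with memory bound like this, the scheduler is still free to run your threads anywhere, which is exactly the disconnect the patch set is trying to close. Back to Peter's mail: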
> While the scheduler does a reasonable job of keeping short-running tasks on
> a single node (by means of simply not doing the cross-node migration very
> often), it completely blows for long-running processes with a large memory
> footprint.
>
> This patch-set aims at improving this situation. It does so by assigning a
> preferred, or home, node to every process/thread_group. Memory allocation
> is then directed by this preference instead of the node the task might
> actually be running on momentarily. The load-balancer is also modified to
> prefer running the task on its home-node, although not at the cost of
> letting CPUs go idle or at the cost of execution fairness.
>
> On top of this a new NUMA balancer is introduced, which can change a
> process' home-node the hard way. This heavy process migration is driven by
> two factors: either tasks are running away from their home-node, or memory
> is being allocated away from the home-node. In either case, it tries to
> move processes around to make the 'problem' go away.
>
> The home-node migration handles both cpu and memory (anonymous only for
> now) in an integrated fashion. The memory migration uses migrate-on-fault
> to avoid doing a lot of work from the actual numa balancer kernel thread
> and only migrates the active memory.
>
> For processes that have more tasks than would fit on a node and which want
> to split their activity in a useful fashion, the patch-set introduces two
> new syscalls: sys_numa_tbind()/sys_numa_mbind(). These syscalls can be used
> to create {thread}x{vma} groups which are then scheduled as a unit instead
> of the entire process.
>
> That said, it's still early days and there's lots of improvements to make.
>
> On to the actual patches...
>
> The first two are generic cleanups:
>
> [01/26] mm, mpol: Re-implement check_*_range() using walk_page_range()
> [02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
>
> The second set is a rework of Lee Schermerhorn's Migrate-on-Fault
> patches [2]:
>
> [03/26] mm, mpol: add MPOL_MF_LAZY ...
> [04/26] mm, mpol: add MPOL_MF_NOOP
> [05/26] mm, mpol: Check for misplaced page
> [06/26] mm: Migrate misplaced page
> [07/26] mm: Handle misplaced anon pages
> [08/26] mm, mpol: Simplify do_mbind()
>
> The third set implements the basic numa balancing:
>
> [09/26] sched, mm: Introduce tsk_home_node()
> [10/26] mm, mpol: Make mempolicy home-node aware
> [11/26] mm, mpol: Lazy migrate a process/vma
> [12/26] sched, mm: sched_{fork,exec} node assignment
> [13/26] sched: Implement home-node awareness
> [14/26] sched, numa: Numa balancer
> [15/26] sched, numa: Implement hotplug hooks
> [16/26] sched, numa: Abstract the numa_entity
>
> The next three patches are a band-aid; Lai Jiangshan (and Paul McKenney)
> are doing a proper implementation. The reverts are me being lazy about
> forward-porting my call_srcu() implementation.
>
> [17/26] srcu: revert1
> [18/26] srcu: revert2
> [19/26] srcu: Implement call_srcu()
>
> The last bits implement the new syscalls:
>
> [20/26] mm, mpol: Introduce vma_dup_policy()
> [21/26] mm, mpol: Introduce vma_put_policy()
> [22/26] mm, mpol: Split and expose some mempolicy functions
> [23/26] sched, numa: Introduce sys_numa_{t,m}bind()
> [24/26] mm, mpol: Implement numa_group RSS accounting
> [25/26] sched, numa: Only migrate long-running entities
> [26/26] sched, numa: A few debug bits
>
>
> And a few numbers...
>
> On my WSM-EP (2 nodes, 6 cores/node, 2 threads/core), running 48 stream
> benchmarks [3] (modified to use ~230MB and run long).
>
> Without these patches it degrades into 50-50 local/remote memory accesses:
>
>  Performance counter stats for 'sleep 2':
>
>        259,668,750 r01b7@500b:u    [100.00%]
>        262,170,142 r01b7@200b:u
>
>        2.010446121 seconds time elapsed
>
> With the patches there's a significant improvement in locality:
>
>  Performance counter stats for 'sleep 2':
>
>        496,860,345 r01b7@500b:u    [100.00%]
>         78,292,565 r01b7@200b:u
>
>        2.010707488 seconds time elapsed
>
> (The perf events are a bit magical and not supported in an actual perf
> release -- but the first one is L3 misses to local dram, the second is
> L3 misses to remote dram.)
>
> If you look at those numbers you can also see that the sum is greater in
> the second case; this means we can service L3 misses at a higher rate,
> which translates into a performance gain.
>
> These numbers also show that while there's a marked improvement, there's
> still some gain to be had. The current numa balancer is still somewhat
> fickle.
>
> ~ Peter
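If you want to get a feel for the local-vs-remote gap behind Peter's counters without the magic perf events, a toy libnuma sketch along these lines works (illustrative only; assumes a machine with at least 2 NUMA nodes, link with -lnuma):

    #include <numa.h>        /* libnuma; link with -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static double touch_ms(char *p, size_t len)
    {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (size_t i = 0; i < len; i += 64)   /* one write per cache line */
            p[i]++;
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "need a NUMA machine with >= 2 nodes\n");
            return 1;
        }
        size_t len = 256UL << 20;
        char *buf = numa_alloc_onnode(len, 0);  /* memory lives on node 0 */
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        memset(buf, 0, len);                    /* fault the pages in */

        numa_run_on_node(0);                    /* run next to the memory */
        printf("local : %.1f ms\n", touch_ms(buf, len));
        numa_run_on_node(1);                    /* now run across the link */
        printf("remote: %.1f ms\n", touch_ms(buf, len));

        numa_free(buf, len);
        return 0;
    }

The gap you measure is whatever your interconnect costs; the point of the balancer above is that long-running tasks should stop paying it by accident. Peter's footnotes: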
> [1] - http://marc.info/?l=linux-kernel&m=130218515520540
>       now that we have SD_OVERLAP it should be fairly easy to do.
>
> [2] - http://markmail.org/message/mdwbcitql5ka4uws
>
> [3] - https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/memory/stream.tar

-- 
Jeff Squyres
jsquy...@cisco.com