On Feb 28, 2011, at 3:47 PM, David Singleton wrote:

> I don't think you can avoid the problem.  Unless it has changed very recently, 
> Linux swapin_readahead is the main culprit in messing with NUMA locality on 
> that platform.  Faulting a single page causes 8 or 16 or whatever contiguous 
> pages to be read from swap.  An arbitrary contiguous range of pages in swap 
> may not even come from the same process, far less the same NUMA node.  My 
> understanding is that since there is no NUMA info with the swap entry, the 
> only policy that can be applied is that of the faulting vma in the 
> faulting process.  The faulted page will have the desired NUMA placement, but 
> possibly not the rest.  So swapping mixes different processes' NUMA policies, 
> leading to a "NUMA diffusion process".

That is terrible!
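
(Aside: as far as I can tell, the size of that readahead window is 
2^vm.page-cluster pages -- the default of 3 giving the "8" above -- so at 
least you can see what a given box will do.  A minimal check, assuming the 
usual procfs path:)

/* Print the swap readahead window: the kernel reads 2^vm.page-cluster
 * contiguous pages per swap-in attempt (default 3 -> 8 pages).
 * Minimal sketch; assumes the usual /proc/sys/vm/page-cluster path. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/page-cluster", "r");
    int cluster;

    if (f == NULL) {
        perror("fopen /proc/sys/vm/page-cluster");
        return 1;
    }
    if (fscanf(f, "%d", &cluster) != 1) {
        fclose(f);
        fprintf(stderr, "could not parse page-cluster\n");
        return 1;
    }
    fclose(f);
    printf("vm.page-cluster = %d -> swap readahead window = %d pages\n",
           cluster, 1 << cluster);
    return 0;
}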

Is the only way to avoid this to pin the memory so that it doesn't get swapped 
out?  (which is evil in its own way)
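
(I guess the brute-force version of that pinning would look roughly like the 
sketch below -- mlockall() needs CAP_IPC_LOCK or a large enough 
RLIMIT_MEMLOCK, and of course it just trades the NUMA problem for an 
unswappable footprint.)

/* Minimal sketch of "pin everything so it never hits swap".
 * Requires CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock all current and future pages of this process into RAM. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* ... allocate and use memory as usual; none of it can be swapped ... */

    munlockall();
    return 0;
}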

> Here's a contrived example on a 2.6.27 kernel.
> 
> #  Grab 3 lots of 10000MB on a 24GB Nehalem node:
> 
> v1100:~ > numactl --membind=0 ./memory_grabber 10000 &
> [1] 434
> v1100:~ > numactl --membind=1 ./memory_grabber 10000 &
> [2] 435
> v1100:~ > ./memory_grabber 10000  &
> [3] 436
> 
> # Time sequence of NUMA page locality for the 3 processes:
> 
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=2184075 dirty=2184075 active=1104219 N0=2184075
> 7ffd861da000 bind:1 anon=1709350 dirty=1709350 active=918142 N1=1709350
> 7ffd861da000 default anon=2086028 dirty=2086028 active=1194354 N0=774151 
> N1=1311877
> 
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1777593 dirty=1678821 swapcache=98772 active=744021 
> N0=1777524 N1=69
> 7ffd861da000 bind:1 anon=1649256 dirty=1649256 active=797862 N1=1649256
> 7ffd861da000 default anon=2313532 dirty=2143102 swapcache=170430 
> active=1928372 N0=982483 N1=1331049
> 
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1619803 dirty=1521031 swapcache=98772 active=652729 
> N0=1617878 N1=1925
> 7ffd861da000 bind:1 anon=1616983 dirty=1616983 active=771814 N1=1616983
> 7ffd861da000 default anon=2393655 dirty=2223225 swapcache=170430 
> active=2147908 N0=1052167 N1=1341488
> 
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1490293 dirty=1391521 swapcache=98772 active=679807 
> N0=1482914 N1=7379
> 7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 active=996836 
> N0=256407 N1=1594468
> 7ffd861da000 default anon=2484496 dirty=2314066 swapcache=170430 
> active=2396456 N0=1083215 N1=1401281
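
(For anyone wanting to reproduce this: I assume memory_grabber is just a 
trivial allocate-and-keep-touching loop, something like the sketch below -- 
that's a guess at a minimal equivalent, not David's actual program.)

/* Guess at a minimal memory_grabber equivalent: allocate the requested
 * number of MB and keep touching every page so the memory stays in play.
 * (Not David's actual program.) */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 1024;
    size_t bytes = mb * 1024 * 1024;
    char *buf = malloc(bytes);

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    for (;;) {
        /* Touch every page so it is really instantiated and keeps
         * competing for residency. */
        for (size_t off = 0; off < bytes; off += 4096)
            buf[off]++;
    }
    return 0;
}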

I'm sorry; I'm not too familiar with the output of /proc/*/numa_maps -- what is 
this showing?  I see swapcache=X fields appearing and the anon/active counts 
dropping, presumably meaning that those pages have been swapped out...?
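
(Alternatively, I suppose one could ask the kernel directly which node each 
page of a buffer landed on via move_pages(2) with a NULL target-node array -- 
a minimal sketch using libnuma's numaif.h, link with -lnuma; the buffer size 
here is just a placeholder:)

/* Query (not move) the NUMA node of each page in a buffer: calling
 * move_pages() with nodes == NULL fills 'status' with the node of each
 * page (or a negative errno).  Link with -lnuma. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages = 16;                         /* placeholder size */
    char *buf = malloc(npages * page_size);
    void **pages = malloc(npages * sizeof(void *));
    int *status = malloc(npages * sizeof(int));

    if (!buf || !pages || !status)
        return 1;

    for (size_t i = 0; i < npages; i++) {
        buf[i * page_size] = 1;                 /* fault the page in */
        pages[i] = buf + i * page_size;
    }

    /* nodes == NULL means "just report current placement". */
    if (move_pages(0 /* this process */, npages, pages, NULL, status, 0) != 0) {
        perror("move_pages");
        return 1;
    }
    for (size_t i = 0; i < npages; i++)
        printf("page %zu: node %d\n", i, status[i]);
    return 0;
}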

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

