Hi David,

Have you tried setting /proc/sys/vm/zone_reclaim_mode to 3 or 7?
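For reference, zone_reclaim_mode is a bitmask (per Documentation/sysctl/vm.txt): 1 enables zone reclaim, 2 additionally allows writing out dirty pages during reclaim, and 4 additionally allows swapping. So 3 makes a node reclaim its own page cache (with writeback) before borrowing memory from the remote node, and 7 adds swap on top of that. A quick way to experiment (which value to settle on is workload-dependent):

    # check the current setting
    cat /proc/sys/vm/zone_reclaim_mode

    # enable local reclaim plus dirty-page writeback during reclaim
    sysctl -w vm.zone_reclaim_mode=3

    # to persist across reboots, add to /etc/sysctl.conf:
    #   vm.zone_reclaim_mode = 3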
Cheers,

--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88  Fax: +47 23 16 71 80  Skype: spersvold

> On 09 Jul 2015, at 20:44, mathog <[email protected]> wrote:
>
> Reran the generators and that did make the system slow again, so at least
> this problem can be reproduced.
>
> After those ran, memory is definitely in short supply; pretty much
> everything is in file cache. For whatever reason, the system seems loath
> to release memory from file cache for other uses. I think that is the
> problem.
>
> Here is some data; this is a bit long...
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
> node 0 size: 262098 MB
> node 0 free: 18372 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
> node 1 size: 262144 MB
> node 1 free: 2829 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> CPU-specific tests were done on CPU 20, so NUMA node 0. None of the tests
> comes close to using up all the physical memory in a "node", which is
> 262 GB.
>
> When cache has been cleared and the test programs run fast:
>
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:        525988868 kB
> Buffers:             5428 kB
> Cached:             46544 kB
> SwapCached:           556 kB
> Active:             62220 kB
> Inactive:          121316 kB
> Active(anon):       26596 kB
> Inactive(anon):    109456 kB
> Active(file):       35624 kB
> Inactive(file):     11860 kB
>
> Run one test and it jumps up to:
>
> MemTotal:       529231456 kB
> MemFree:        491812500 kB
> Buffers:            10644 kB
> Cached:          34139976 kB
> SwapCached:           556 kB
> Active:          34152592 kB
> Inactive:          130400 kB
> Active(anon):       27560 kB
> Inactive(anon):    109316 kB
> Active(file):    34125032 kB
> Inactive(file):     21084 kB
>
> and the next test is still quick. After running the generators, but when
> nothing much is running, it starts like this:
>
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:         19606616 kB
> Buffers:            46704 kB
> Cached:         493107268 kB
> SwapCached:           556 kB
> Active:          34229020 kB
> Inactive:       459056372 kB
> Active(anon):         712 kB
> Inactive(anon):    135508 kB
> Active(file):    34228308 kB
> Inactive(file): 458920864 kB
>
> Then when a test job is run it drops quickly to this and sticks. Note the
> MemFree value. I think this is where the "Events/20" process kicks in:
>
> cat /proc/meminfo | head -11
> MemTotal:       529231456 kB
> MemFree:           691740 kB
> Buffers:            46768 kB
> Cached:         493056968 kB
> SwapCached:           556 kB
> Active:          53164328 kB
> Inactive:       459006232 kB
> Active(anon):    18936048 kB
> Inactive(anon):    135608 kB
> Active(file):    34228280 kB
> Inactive(file): 458870624 kB
>
> Kill the process and the system "recovers" to the preceding memory
> configuration in a few seconds.
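A minimal sketch for logging those /proc/meminfo fields while a test runs (plain shell; the 5-second interval is arbitrary, interrupt with Ctrl-C):

    # print a timestamp and the cache-related fields every 5 seconds
    while sleep 5; do
        date +%T
        grep -E '^(MemFree|Cached|Active\(file\)|Inactive\(file\)):' /proc/meminfo
    done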
> Similarly, /proc/zoneinfo values from before the generators were run, when
> the system was fast:
>
> extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10 -ifonly
> Node 0, zone DMA
>     pages free       3931
>           min        0
>           low        0
>           high       0
>           scanned    0
>           spanned    4095
>           present    3832
>     nr_free_pages    3931
>     nr_inactive_anon 0
>     nr_active_anon   0
> Node 0, zone DMA32
>     pages free       105973
>           min        139
>           low        173
>           high       208
>           scanned    0
>           spanned    1044480
>           present    822056
>     nr_free_pages    105973
>     nr_inactive_anon 0
>     nr_active_anon   0
> Node 0, zone Normal
>     pages free       50199731
>           min        11122
>           low        13902
>           high       16683
>           scanned    0
>           spanned    66256896
>           present    65351040
>     nr_free_pages    50199731
>     nr_inactive_anon 16490
>     nr_active_anon   7191
> Node 1, zone Normal
>     pages free       57596396
>           min        11265
>           low        14081
>           high       16897
>           scanned    0
>           spanned    67108864
>           present    66191360
>     nr_free_pages    57596396
>     nr_inactive_anon 10839
>     nr_active_anon   1772
>
> and after the generators were run (slow):
>
> Node 0, zone DMA
>     pages free       3931
>           min        0
>           low        0
>           high       0
>           scanned    0
>           spanned    4095
>           present    3832
>     nr_free_pages    3931
>     nr_inactive_anon 0
>     nr_active_anon   0
> Node 0, zone DMA32
>     pages free       105973
>           min        139
>           low        173
>           high       208
>           scanned    0
>           spanned    1044480
>           present    822056
>     nr_free_pages    105973
>     nr_inactive_anon 0
>     nr_active_anon   0
> Node 0, zone Normal
>     pages free       23045
>           min        11122
>           low        13902
>           high       16683
>           scanned    0
>           spanned    66256896
>           present    65351040
>     nr_free_pages    23045
>     nr_inactive_anon 16486
>     nr_active_anon   5839
> Node 1, zone Normal
>     pages free       33726
>           min        11265
>           low        14081
>           high       16897
>           scanned    0
>           spanned    67108864
>           present    66191360
>     nr_free_pages    33726
>     nr_inactive_anon 10836
>     nr_active_anon   1065
>
> Looking the same way at /proc/zoneinfo while a test is running showed the
> "pages free" and "nr_free_pages" values oscillating downward to a low of
> about 28000 for Node 0, zone Normal. The rest of the values were
> essentially stable.
>
> Looking the same way at /proc/meminfo while a test is running gave values
> that differed in only minor ways from the "after" table shown above.
> MemFree varied in a range from about 680000 to 720000 kB. Cached dropped
> to ~482407184 kB and then barely budged.
>
> Finally, the last few lines from "sar -B":
>
>              pgpgin/s pgpgout/s  fault/s majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s  %vmeff
> 10:30:03 AM   5810.55 301475.26    95.99     0.05  51710.29  48086.79      0.00  48084.94  100.00
> 10:40:01 AM   3404.90 185502.87    96.67     0.01  47267.84  44816.30      0.00  44816.30  100.00
> 10:50:02 AM      9.13     13.32   192.24     0.11   4592.56     48.54   3149.01   3197.55  100.00
> 11:00:01 AM    191.78      9.97   347.56     0.13  16760.51      0.00   3683.21   3683.21  100.00
> 11:10:01 AM     11.64      7.75   342.59     0.09  18528.24      0.00   1699.66   1699.66  100.00
> 11:20:01 AM      0.00      6.75    96.87     0.00     43.97      0.00      0.00      0.00    0.00
>
> The generators finished at 10:35. At the 10:30 data point (while they were
> still running), pgscank/s and pgsteal/s jumped from 0 to high values. When
> later tests were run, the former fell to almost nothing but the latter
> stayed high. Additionally, the test runs made after the generators pushed
> pgscand/s from 0 to several thousand per second. The last row covers a
> 10-minute span in which no tests were run, and these values all dropped
> back to zero.
>
> Since excessive file cache seems to be implicated, I did this:
>
> echo 3 > /proc/sys/vm/drop_caches
>
> and reran the test on CPU 20. It was fast.
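One caveat on drop_caches: only clean pages can be dropped, so it is worth running sync first, and the written value selects what is dropped (1 = page cache, 2 = dentries and inodes, 3 = both):

    sync                               # write out dirty pages so they become droppable
    echo 3 > /proc/sys/vm/drop_caches  # drop page cache plus dentry/inode caches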
> I guess the question now is: what parameter(s) control the conversion of
> memory in file cache to memory needed for other purposes when free memory
> is in short supply and there is substantial demand? It seems the OS isn't
> releasing the cache. Or maybe it isn't flushing it to disk. I don't think
> it's the latter, because iotop and iostat don't show any activity during a
> "slow" read.
>
> Thanks,
>
> David Mathog
> [email protected]
> Manager, Sequence Analysis Facility, Biology Division, Caltech

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
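On the closing question, the tunables usually pointed at for this behaviour are the ones below, worth inspecting before changing anything. The pgscand/s spikes in the sar output above suggest allocations are stalling in direct reclaim, which raising the watermarks can mitigate:

    # does a node reclaim its own page cache before going off-node?
    sysctl vm.zone_reclaim_mode

    # how readily dentry/inode caches are reclaimed (default 100;
    # larger values reclaim the VFS caches more aggressively)
    sysctl vm.vfs_cache_pressure

    # raising this raises the zone watermarks, waking kswapd sooner so
    # reclaim happens in the background rather than in the allocating task
    sysctl vm.min_free_kbytes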
