Probably worth trying to upgrade. Your version does not balance the AIO threads; 7.x does. That can definitely explain the NUMA memory imbalance in some cases.
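
In the meantime, if you want to keep an eye on the imbalance while you plan the upgrade: the per-node meminfo files under /sys/devices/system/node/ (or "numastat -m") break out the same counters as your /proc/zoneinfo dump, per node. Here is a minimal sketch that just summarizes MemFree and the slab counters for each node - illustrative only, assuming a stock sysfs layout where values are reported in kB:

#!/usr/bin/env python
# Summarize free and slab memory per NUMA node from sysfs.
# /sys/devices/system/node/nodeN/meminfo reports values in kB.
import glob
import re

for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
    stats = {}
    with open(path) as f:
        for line in f:
            # Lines look like: "Node 1 MemFree:   40896 kB"
            m = re.match(r"Node\s+\d+\s+(\w+(?:\(\w+\))?):\s+(\d+)", line)
            if m:
                stats[m.group(1)] = int(m.group(2))
    node = path.split("/")[-2]
    print("%s: MemFree %8d MB  SReclaimable %8d MB  SUnreclaim %6d MB"
          % (node,
             stats.get("MemFree", 0) // 1024,
             stats.get("SReclaimable", 0) // 1024,
             stats.get("SUnreclaim", 0) // 1024))
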
-- Leif

> On Jan 18, 2017, at 7:12 PM, Kapil Sharma (kapsharm) <[email protected]> wrote:
>
> 5.3.2
>
>> On Jan 18, 2017, at 9:11 PM, Leif Hedstrom <[email protected]> wrote:
>>
>> Which version are you on?
>>
>> -- Leif
>>
>> On Jan 18, 2017, at 3:21 PM, Kapil Sharma (kapsharm) <[email protected]> wrote:
>>
>>> We are seeing interesting behavior at high memory usage. Apologies for a
>>> long and detailed email.
>>>
>>> Our ATS caches have been running for many months and have reached a point
>>> where ATS has allocated a huge amount of memory to the freelist pools (we
>>> can confirm this by dumping the mempools). I understand that this is a
>>> known ATS behavior/limitation/issue, where freelist mempools, once
>>> allocated, are never reclaimed (and that there may be a patch that adds
>>> reclamation support).
>>>
>>> But my question is about an interesting side issue that we observe when an
>>> ATS cache reaches this stage. We see our ATS caches allocating a large
>>> amount of memory for the slab cache - primarily the "dentry" cache. This
>>> is probably okay, because even though the kernel does greedy allocation
>>> for its internal caches (page cache, dentry cache, inode cache, etc.), all
>>> of that memory is reclaimable under memory pressure. Now, the more
>>> interesting behavior is that in this low-memory state, only one of the
>>> NUMA zones is exhausting its pages, and this particular ATS cache has been
>>> in this state for several days.
>>>
>>> Here is snipped output from /proc/zoneinfo:
>>>
>>> <snip>
>>> Node 0, zone Normal
>>>   pages free     6320539    <<< roughly 25GB free (the system has 512GB total)
>>>         min      8129
>>>         low      10161
>>>         high     12193
>>>         scanned  0
>>>         spanned  66584576
>>>         present  65674240
>>>     nr_free_pages         6320539
>>>     nr_inactive_anon      71
>>>     nr_active_anon        79274
>>>     nr_inactive_file      1720428
>>>     nr_active_file        4580107
>>>     nr_unevictable        39168773
>>>     nr_mlock              39168773
>>>     nr_anon_pages         39239109
>>>     nr_mapped             13298
>>>     nr_file_pages         6309563
>>>     nr_dirty              91
>>>     nr_writeback          0
>>>     nr_slab_reclaimable   4581560    <<< ~18G
>>>     nr_slab_unreclaimable 16047
>>> <snip>
>>> Node 1, zone Normal
>>>   pages free     10224      <<< check this - it is below the low watermark!
>>>         min      8193
>>>         low      10241
>>>         high     12289
>>>         scanned  0
>>>         spanned  67108864
>>>         present  66191360
>>>     nr_free_pages         10224
>>>     nr_inactive_anon      64
>>>     nr_active_anon        20886
>>>     nr_inactive_file      42840
>>>     nr_active_file        330486
>>>     nr_unevictable        45630255
>>>     nr_mlock              45630255
>>>     nr_anon_pages         45649954
>>>     nr_mapped             2151
>>>     nr_file_pages         374576
>>>     nr_dirty              9
>>>     nr_writeback          0
>>>     nr_slab_reclaimable   11939312   <<< ~48G
>>>     nr_slab_unreclaimable 17135
>>> <snip>
>>>
>>> It would appear that page allocations for slab (per slabtop, it is pretty
>>> much all dentry) are disproportionately hitting NUMA node 1. Under these
>>> conditions, my guess is that zone/node 1 will be constantly under low
>>> memory pressure, causing page scan/reclaim to run constantly. Without
>>> knowing much about the Linux kernel MM, I am guessing this may be
>>> suboptimal?
>>>
>>> Please correct my (wild) assumptions about why we may be observing this:
>>> - My guess is that a dentry is created for each new "accepted" connection
>>>   socket.
>>> - There is only one ACCEPT thread handling port 80 requests in our cache
>>>   configuration. The ACCEPT thread is responsible for opening FDs for the
>>>   accepted socket connections.
>>> - The ACCEPT thread is confined to run on a cpuset belonging to one NUMA
>>>   zone only... (a rough way to check this is sketched below)
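>>>
>>> To sanity-check that last point, here is a rough sketch (illustrative
>>> only) that maps each traffic_server thread's Cpus_allowed_list onto the
>>> NUMA node cpulists from sysfs, assuming the build exposes thread names
>>> via /proc/<pid>/task/<tid>/comm:
>>>
>>> import glob
>>> import sys
>>>
>>> def expand(cpulist):
>>>     # "0-7,16-23" -> {0, 1, ..., 7, 16, ..., 23}
>>>     cpus = set()
>>>     for part in cpulist.strip().split(","):
>>>         lo, _, hi = part.partition("-")
>>>         cpus.update(range(int(lo), int(hi or lo) + 1))
>>>     return cpus
>>>
>>> # CPU -> NUMA node map from sysfs
>>> node_cpus = {}
>>> for d in glob.glob("/sys/devices/system/node/node[0-9]*"):
>>>     node_cpus[d.rsplit("node", 1)[1]] = expand(open(d + "/cpulist").read())
>>>
>>> pid = sys.argv[1]   # traffic_server PID
>>> for task in sorted(glob.glob("/proc/%s/task/*" % pid)):
>>>     name = open(task + "/comm").read().strip()
>>>     allowed = set()
>>>     for line in open(task + "/status"):
>>>         if line.startswith("Cpus_allowed_list:"):
>>>             allowed = expand(line.split(":", 1)[1])
>>>     nodes = sorted(n for n, cpus in node_cpus.items() if allowed & cpus)
>>>     print("%-16s can run on node(s) %s" % (name, ",".join(nodes)))
>>>
>>> If the cpuset theory is right, the accept thread should come back confined
>>> to a single node while the rest of the threads span both.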
>>> (I am connecting a lot of dots here.)
>>>
>>> Any insight will be appreciated.
>>>
>>> thanks
>>> Kapil
