Probably worth trying to upgrade. Your version does not balance the AIO 
threads, but 7.x does. That can definitely explain the NUMA memory imbalance in 
some cases.
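
If you want to sanity-check where the process memory itself is landing (as 
opposed to the kernel slab), a rough sketch like the following sums the 
per-node resident page counts from /proc/<pid>/numa_maps. This is just an 
illustration in Python, assuming 4 KiB base pages and taking the 
traffic_server PID as its only argument:

#!/usr/bin/env python3
# Sum per-NUMA-node resident page counts from /proc/<pid>/numa_maps.
import re
import sys
from collections import Counter

pid = sys.argv[1]  # e.g. the traffic_server PID
pages = Counter()
with open(f"/proc/{pid}/numa_maps") as f:
    for line in f:
        # Fields like "N0=1234 N1=5678" are pages resident on each node.
        for node, count in re.findall(r"\bN(\d+)=(\d+)", line):
            pages[int(node)] += int(count)

for node in sorted(pages):
    gib = pages[node] * 4096 / 2**30  # assumes 4 KiB base pages
    print(f"node {node}: {pages[node]} pages (~{gib:.1f} GiB)")

If one node dominates there as well, the skew is not just the dentry slab.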

-- Leif 

> On Jan 18, 2017, at 7:12 PM, Kapil Sharma (kapsharm) <[email protected]> 
> wrote:
> 
> 5.3.2
> 
> 
>> On Jan 18, 2017, at 9:11 PM, Leif Hedstrom <[email protected]> wrote:
>> 
>> Which version are you on?
>> 
>> -- Leif 
>> 
>> On Jan 18, 2017, at 3:21 PM, Kapil Sharma (kapsharm) <[email protected]> 
>> wrote:
>> 
>>> We are seeing some interesting behavior at high memory usage. Apologies for 
>>> the long and detailed email.
>>> 
>>> Our ATS caches have been running for many months and have reached a point 
>>> where ATS has allocated a huge amount of memory to its freelist pools (we can 
>>> confirm this by dumping the mempools). I understand that this is a known ATS 
>>> behavior/limitation/issue, where freelist mempools, once allocated, are never 
>>> reclaimed (and that there may be a patch adding reclamation support). 
>>> 
>>> But my question is about an interesting side issue we observe once the ATS 
>>> cache reaches this stage. We see our ATS caches allocating a large amount of 
>>> memory to the slab cache - primarily the “dentry” cache. This is probably 
>>> okay, because even though the kernel allocates greedily for its internal 
>>> caches (page cache, dentry, inode cache, etc.), all of that memory is 
>>> reclaimable under memory pressure. Now, the more interesting behavior is that 
>>> in this low-memory state, only one of the NUMA zones is exhausting its pages! 
>>> This particular ATS cache has been in that state for several days.
>>> Here is snipped output from /proc/zoneinfo:
>>> 
>>> <snip>
>>> Node 0, zone   Normal
>>>   pages free     6320539   <<< Roughly 25GB free (Note, System has a total 
>>> of 512GB)
>>>         min      8129
>>>         low      10161
>>>         high     12193
>>>         scanned  0
>>>         spanned  66584576
>>>         present  65674240
>>>     nr_free_pages 6320539
>>>     nr_inactive_anon 71
>>>     nr_active_anon 79274
>>>     nr_inactive_file 1720428
>>>     nr_active_file 4580107
>>>     nr_unevictable 39168773
>>>     nr_mlock     39168773
>>>     nr_anon_pages 39239109
>>>     nr_mapped    13298
>>>     nr_file_pages 6309563
>>>     nr_dirty     91
>>>     nr_writeback 0
>>>     nr_slab_reclaimable 4581560  <<< ~18G.
>>>     nr_slab_unreclaimable 16047
>>>  <snip>
>>> Node 1, zone   Normal
>>>   pages free     10224        <<<< Check this. It is below the low watermark!!!
>>>         min      8193
>>>         low      10241
>>>         high     12289
>>>         scanned  0
>>>         spanned  67108864
>>>         present  66191360
>>>     nr_free_pages 10224
>>>     nr_inactive_anon 64
>>>     nr_active_anon 20886
>>>     nr_inactive_file 42840
>>>     nr_active_file 330486
>>>     nr_unevictable 45630255
>>>     nr_mlock     45630255
>>>     nr_anon_pages 45649954
>>>     nr_mapped    2151
>>>     nr_file_pages 374576
>>>     nr_dirty     9
>>>     nr_writeback 0
>>>     nr_slab_reclaimable 11939312   <<< ~48G 
>>>     nr_slab_unreclaimable 17135   
>>> <snip>
>>> 
>>> It would appear that page allocations for slab (which, from slabtop, is 
>>> pretty much all dentry) are disproportionately hitting NUMA zone 1. Under 
>>> these conditions, my guess is that node 1's memory will be under constant 
>>> low-memory pressure, causing page scan/reclaim to run continuously. Without 
>>> knowing much about the Linux kernel MM, I am guessing this may be suboptimal?
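>>> 
>>> For what it's worth, here is roughly how the per-node numbers above can be 
>>> pulled out of /proc/zoneinfo - a quick Python sketch, assuming 4 KiB base 
>>> pages and looking only at each node's Normal zone:
>>> 
>>> #!/usr/bin/env python3
>>> # Summarize free pages, the low watermark and reclaimable slab for the
>>> # Normal zone of each NUMA node, straight from /proc/zoneinfo.
>>> import re
>>> 
>>> PAGE = 4096  # assumes 4 KiB base pages
>>> node = None
>>> stats = {}
>>> with open("/proc/zoneinfo") as f:
>>>     for line in f:
>>>         if line.startswith("Node"):
>>>             m = re.match(r"Node (\d+), zone\s+(\w+)", line)
>>>             node = int(m.group(1)) if m and m.group(2) == "Normal" else None
>>>             continue
>>>         if node is None:
>>>             continue
>>>         m = re.match(r"\s*(pages free|low|nr_slab_reclaimable)\s+(\d+)", line)
>>>         if m:
>>>             stats.setdefault(node, {})[m.group(1)] = int(m.group(2))
>>> 
>>> for n, s in sorted(stats.items()):
>>>     free = s.get("pages free", 0) * PAGE / 2**30
>>>     low = s.get("low", 0) * PAGE / 2**30
>>>     slab = s.get("nr_slab_reclaimable", 0) * PAGE / 2**30
>>>     print(f"node {n}: free {free:.1f} GiB "
>>>           f"(low watermark {low:.2f} GiB), reclaimable slab {slab:.1f} GiB")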
>>> 
>>> Please correct my (wild) assumptions about why we may be observing this:
>>> - My guess is that a dentry is created for each newly accepted connection 
>>> socket.
>>> - There is only one ACCEPT thread handling port 80 requests in our cache 
>>> configuration. The ACCEPT thread is responsible for opening FDs for the 
>>> accepted socket connections.
>>> - The ACCEPT thread is confined to run on a cpuset belonging to one NUMA 
>>> zone only….
>>> (I am connecting a lot of dots here - see the affinity check sketched below.)
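>>> 
>>> To check that last point, this small Python sketch lists each 
>>> traffic_server thread's name and allowed CPUs from /proc (whether the 
>>> accept thread shows up with an "[ACCEPT ...]"-style name is an assumption 
>>> on my part, but the Cpus_allowed_list column shows the pinning either way):
>>> 
>>> #!/usr/bin/env python3
>>> # Print thread name and allowed-CPU list for every thread of a process,
>>> # to see whether the accept thread is pinned to one node's CPUs.
>>> import glob
>>> import sys
>>> 
>>> pid = sys.argv[1]  # traffic_server PID
>>> for task in sorted(glob.glob(f"/proc/{pid}/task/*")):
>>>     try:
>>>         with open(f"{task}/comm") as f:
>>>             name = f.read().strip()
>>>         cpus = "?"
>>>         with open(f"{task}/status") as f:
>>>             for line in f:
>>>                 if line.startswith("Cpus_allowed_list:"):
>>>                     cpus = line.split(":", 1)[1].strip()
>>>                     break
>>>     except OSError:
>>>         continue  # thread exited while we were reading
>>>     tid = task.rsplit("/", 1)[-1]
>>>     print(f"{tid:>8}  {name:<24}  cpus={cpus}")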
>>> 
>>> Any insight will be appreciated.
>>> 
>>> thanks
>>> Kapil
>>> 
> 
