Hi.

On Mon, Apr 15, 2019 at 04:40:56PM +0200, Martin Schwarz wrote:
> The system from my previous example has already been rebooted, sorry!
Kind of expected. It's useful nevertheless.

> But here's from another system that currently starts showing the same
> problem and has an equally small workload:
>
> root@rad-wgv-srv01:~# free -thwl

Nothing out of the ordinary here.

> root@rad-wgv-srv01:~# cat /proc/meminfo
> MemTotal:        1010976 kB
> MemFree:           73980 kB
> MemAvailable:      38756 kB
> Buffers:            9964 kB
> Cached:            50340 kB

It's not the file cache that ate the memory.

> SwapCached:         2728 kB

And it's not the swap caching.

> Active(anon):      11068 kB
> Inactive(anon):     3696 kB

Memory consumption cannot be attributed to tmpfs. I know, you've posted
'df' output earlier, but that does not take mount namespaces into
account.

> Mapped:            19904 kB

Much to my disappointment, the problem cannot be explained by excessive
use of the mmap(2) syscall. It would have been easy otherwise.

> Shmem:              1120 kB

It's not the shared memory segments.

> Slab:              90744 kB
> SReclaimable:      13100 kB
> SUnreclaim:        77644 kB

And it's not the dentry cache (I've seen that thing grow once or twice;
it was ugly).

> AnonHugePages:         0 kB
> ShmemHugePages:        0 kB
> ShmemPmdMapped:        0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0

And last, but not least, there are no hugepages in use.

> root@rad-wgv-srv01:~# smem -tm | tail
> /bin/bash                                    3    358    1076
> /lib/systemd/systemd                         3    386    1158
> /lib/x86_64-linux-gnu/libc-2.24.so          33     54    1783
> /usr/lib/x86_64-linux-gnu/libcrypto.so.1     5    386    1933
> /usr/bin/python2.7                           1   2220    2220
> /lib/systemd/libsystemd-shared-232.so        5    544    2723
> <anonymous>                                 33    146    4848
> [heap]                                      33    304   10060
> -----------------------------------------------------------------
>                                  179    922   11110   41011

Moreover, none of the currently running visible processes consume the
memory. I suspect that this host does not utilize them much anyway.

In short: I do believe that this is happening, but I've never seen
anything like it. I cannot imagine a scenario that could lead to this,
as long as we're talking real hardware, a.k.a. big iron.
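For what it's worth, the size of the hole is easy to quantify from the
numbers you posted. A rough sketch (values copied from the quoted
/proc/meminfo; treating everything left over as unaccounted kernel-side
memory is my assumption, since the categories overlap a little):

```shell
#!/bin/sh
# Values in kB, copied verbatim from the quoted /proc/meminfo output.
total=1010976   # MemTotal

# MemFree + Buffers + Cached + SwapCached
#   + Active(anon) + Inactive(anon) + Slab
accounted=$((73980 + 9964 + 50340 + 2728 + 11068 + 3696 + 90744))

missing=$((total - accounted))
echo "unaccounted: ${missing} kB ($((100 * missing / total))% of RAM)"
# -> unaccounted: 768456 kB (76% of RAM)
```

Three quarters of the machine's RAM invisible to every userspace
accounting tool is what points the finger at kernel space.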
What I suspect is happening here is runaway memory allocation by a
kernel module (at least one of them), and said kernel module is likely
to be VMware-specific. It could be vmxnet3 (network). It could be that
LSI kernel module, or whatever they're using for SCSI these days
(vmw_pvscsi?).

And that means 'perf top', or better yet, 'perf record'.

Reco
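P.S. Before reaching for perf, a quick (and admittedly crude) look at
the loaded modules can be had from /proc/modules. Note the size column
is only each module's static footprint, not what it allocates at
runtime, so this cannot catch a leak by itself; it merely tells you
which VMware modules are actually loaded:

```shell
# Loaded modules sorted by static size, largest first.
# Field 2 of /proc/modules is the module size in bytes.
sort -k2 -rn /proc/modules | awk '{printf "%8d kB  %s\n", $2/1024, $1}' | head
```

If vmxnet3 or vmw_pvscsi shows up there, 'perf record -a -g' while the
leak is growing is the next step.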