Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-26 Thread Gerhard Wiesinger

On 23.03.2017 09:38, Mike Galbraith wrote:

On Thu, 2017-03-23 at 08:16 +0100, Gerhard Wiesinger wrote:

On 21.03.2017 08:13, Mike Galbraith wrote:

On Tue, 2017-03-21 at 06:59 +0100, Gerhard Wiesinger wrote:


Is this the correct information?

Incomplete, but enough to reiterate cgroup_disable=memory
suggestion.


How to collect complete information?

If Michal wants specifics, I suspect he'll ask.  I posted only to pass
along a speck of information, and offer a test suggestion.. twice.


Still OOM with cgroup_disable=memory on kernel 4.11.0-0.rc3.git0.2.fc26.x86_64. I set vm.min_free_kbytes = 10240 in these tests.

# Full config
grep "vm\." /etc/sysctl.d/*
/etc/sysctl.d/00-dirty_background_ratio.conf:vm.dirty_background_ratio = 3
/etc/sysctl.d/00-dirty_ratio.conf:vm.dirty_ratio = 15
/etc/sysctl.d/00-kernel-vm-min-free-kbyzes.conf:vm.min_free_kbytes = 10240
/etc/sysctl.d/00-overcommit_memory.conf:vm.overcommit_memory = 2
/etc/sysctl.d/00-overcommit_ratio.conf:vm.overcommit_ratio = 80
/etc/sysctl.d/00-swappiness.conf:vm.swappiness=10

[31880.623557] sa1: page allocation stalls for 10942ms, order:0, 
mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)

[31880.623623] sa1 cpuset=/ mems_allowed=0
[31880.623630] CPU: 1 PID: 17112 Comm: sa1 Not tainted 
4.11.0-0.rc3.git0.2.fc26.x86_64 #1
[31880.623724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.3 04/01/2014

[31880.623819] Call Trace:
[31880.623893]  dump_stack+0x63/0x84
[31880.623971]  warn_alloc+0x10c/0x1b0
[31880.624046]  __alloc_pages_slowpath+0x93d/0xe60
[31880.624142]  ? get_page_from_freelist+0x122/0xbf0
[31880.624225]  ? unmap_region+0xf7/0x130
[31880.624312]  __alloc_pages_nodemask+0x290/0x2b0
[31880.624388]  alloc_pages_vma+0xa0/0x2b0
[31880.624463]  __handle_mm_fault+0x4d0/0x1160
[31880.624550]  handle_mm_fault+0xb3/0x250
[31880.624628]  __do_page_fault+0x23f/0x4c0
[31880.624701]  trace_do_page_fault+0x41/0x120
[31880.624781]  do_async_page_fault+0x51/0xa0
[31880.624866]  async_page_fault+0x28/0x30
[31880.624941] RIP: 0033:0x7f9218d4914f
[31880.625032] RSP: 002b:7ffe0d1376a8 EFLAGS: 00010206
[31880.625153] RAX: 7f9218d2a314 RBX: 7f9218f4e658 RCX: 7f9218d2a354
[31880.625235] RDX: 05ec RSI:  RDI: 7f9218d2a314
[31880.625324] RBP: 7ffe0d137950 R08: 7f9218d2a900 R09: 00027000
[31880.625423] R10: 7ffe0d1376e0 R11: 7f9218d2a900 R12: 0003
[31880.625505] R13: 7ffe0d137a38 R14: fd01 R15: 0002

[31880.625688] Mem-Info:
[31880.625762] active_anon:36671 inactive_anon:36711 isolated_anon:88
active_file:1399 inactive_file:1410 isolated_file:0
unevictable:0 dirty:5 writeback:15 unstable:0
slab_reclaimable:3099 slab_unreclaimable:3558
mapped:2037 shmem:3 pagetables:3340 bounce:0
free:2972 free_pcp:102 free_cma:0
[31880.627334] Node 0 active_anon:146684kB inactive_anon:146816kB 
active_file:5596kB inactive_file:5572kB unevictable:0kB 
isolated(anon):368kB isolated(file):0kB mapped:8044kB dirty:20kB 
writeback:136kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 
12kB writeback_tmp:0kB unstable:0kB pages_scanned:82 all_unreclaimable? no
[31880.627606] Node 0 DMA free:1816kB min:440kB low:548kB high:656kB 
active_anon:5636kB inactive_anon:6844kB active_file:132kB 
inactive_file:148kB unevictable:0kB writepending:4kB present:15992kB 
managed:15908kB mlocked:0kB slab_reclaimable:284kB 
slab_unreclaimable:532kB kernel_stack:0kB pagetables:188kB bounce:0kB 
free_pcp:0kB local_pcp:0kB free_cma:0kB

[31880.627883] lowmem_reserve[]: 0 327 327 327 327
[31880.627959] Node 0 DMA32 free:10072kB min:9796kB low:12244kB 
high:14692kB active_anon:141048kB inactive_anon:14kB 
active_file:5432kB inactive_file:5444kB unevictable:0kB 
writepending:152kB present:376688kB managed:353760kB mlocked:0kB 
slab_reclaimable:12112kB slab_unreclaimable:13700kB kernel_stack:2464kB 
pagetables:13172kB bounce:0kB free_pcp:504kB local_pcp:272kB free_cma:0kB

[31880.628334] lowmem_reserve[]: 0 0 0 0 0
[31880.629882] Node 0 DMA: 33*4kB (UME) 24*8kB (UM) 26*16kB (UME) 4*32kB 
(UME) 5*64kB (UME) 1*128kB (E) 2*256kB (M) 0*512kB 0*1024kB 0*2048kB 
0*4096kB = 1828kB
[31880.632255] Node 0 DMA32: 174*4kB (UMEH) 107*8kB (UMEH) 96*16kB 
(UMEH) 59*32kB (UME) 30*64kB (UMEH) 8*128kB (UEH) 8*256kB (UMEH) 1*512kB 
(E) 0*1024kB 0*2048kB 0*4096kB = 10480kB
[31880.634344] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB

[31880.634346] 7276 total pagecache pages
[31880.635277] 4367 pages in swap cache
[31880.636206] Swap cache stats: add 563, delete 5635551, find 
6573228/8496821

[31880.637145] Free swap  = 973736kB
[31880.638038] Total swap = 2064380kB
[31880.638988] 98170 pages RAM
[31880.640309] 0 pages HighMem/MovableOnly
[31880.641791] 5753 pages reserved
[31880.642908] 0 pages cma reserved
[31880.643978] 0 pages hwpoisoned

Will try your suggestion

Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-23 Thread Tetsuo Handa
On 2017/03/23 17:38, Mike Galbraith wrote:
> On Thu, 2017-03-23 at 08:16 +0100, Gerhard Wiesinger wrote:
>> On 21.03.2017 08:13, Mike Galbraith wrote:
>>> On Tue, 2017-03-21 at 06:59 +0100, Gerhard Wiesinger wrote:
>>>
 Is this the correct information?
>>> Incomplete, but enough to reiterate cgroup_disable=memory
>>> suggestion.
>>>
>>
>> How to collect complete information?
> 
> If Michal wants specifics, I suspect he'll ask.  I posted only to pass
> along a speck of information, and offer a test suggestion.. twice.
> 
>   -Mike

Isn't the information Mike asked for something like the output from the command below

  for i in `find /sys/fs/cgroup/memory/ -type f`; do echo == $i ==; cat $i | xargs echo; done

so you can check which cgroups the stalling tasks belong to? Also, Mike suggested
reproducing your problem with the cgroup_disable=memory kernel command line option
in order to bisect whether memory cgroups are related to your problem.

I don't know whether Michal already knows specific information to collect.
I think memory allocation watchdog might give us some clue. It will give us
output like http://I-love.SAKURA.ne.jp/tmp/serial-20170321.txt.xz .

Can you afford to build kernels with the watchdog patch applied? The steps I
tried for building kernels are shown below. (If you can't afford to build them
but can try binary rpms, I can upload them.)


yum -y install yum-utils
wget https://dl.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/SRPMS/kernel-4.11.0-0.rc3.git0.1.fc27.src.rpm
yum-builddep -y kernel-4.11.0-0.rc3.git0.1.fc27.src.rpm
rpm -ivh kernel-4.11.0-0.rc3.git0.1.fc27.src.rpm
yum-builddep -y ~/rpmbuild/SPECS/kernel.spec
patch -p1 -d ~/rpmbuild/SPECS/ << "EOF"
--- a/kernel.spec
+++ b/kernel.spec
@@ -24,7 +24,7 @@
 %global zipsed -e 's/\.ko$/\.ko.xz/'
 %endif
 
-# define buildid .local
+%define buildid .kmallocwd
 
 # baserelease defines which build revision of this kernel version we're
 # building.  We used to call this fedora_build, but the magical name
@@ -1207,6 +1207,8 @@
 
 git am %{patches}
 
+patch -p1 < $RPM_SOURCE_DIR/kmallocwd.patch
+
 # END OF PATCH APPLICATIONS
 
 # Any further pre-build tree manipulations happen here.
@@ -1243,6 +1245,8 @@
 do
   cat $i > temp-$i
   mv $i .config
+  echo 'CONFIG_DETECT_MEMALLOC_STALL_TASK=y' >> .config
+  echo 'CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT=30' >> .config
   Arch=`head -1 .config | cut -b 3-`
   make ARCH=$Arch listnewconfig | grep -E '^CONFIG_' >.newoptions || true
 %if %{listnewconfig_fail}
EOF
wget -O ~/rpmbuild/SOURCES/kmallocwd.patch 'https://marc.info/?l=linux-mm&m=148957858821214&q=raw'
rpmbuild -bb ~/rpmbuild/SPECS/kernel.spec
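If the build completes, installing and booting the result is roughly the following (a sketch; the exact package file names under ~/rpmbuild/RPMS depend on the buildid and architecture, so treat the glob below as an assumption):

# install the freshly built kernel packages and reboot into them
rpm -ivh ~/rpmbuild/RPMS/x86_64/kernel-*kmallocwd*.rpm
reboot
# after reboot, confirm the patched kernel is running
uname -r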




Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-23 Thread Mike Galbraith
On Thu, 2017-03-23 at 08:16 +0100, Gerhard Wiesinger wrote:
> On 21.03.2017 08:13, Mike Galbraith wrote:
> > On Tue, 2017-03-21 at 06:59 +0100, Gerhard Wiesinger wrote:
> > 
> > > Is this the correct information?
> > Incomplete, but enough to reiterate cgroup_disable=memory
> > suggestion.
> > 
> 
> How to collect complete information?

If Michal wants specifics, I suspect he'll ask.  I posted only to pass
along a speck of information, and offer a test suggestion.. twice.

-Mike


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-23 Thread Gerhard Wiesinger

On 21.03.2017 08:13, Mike Galbraith wrote:

On Tue, 2017-03-21 at 06:59 +0100, Gerhard Wiesinger wrote:


Is this the correct information?

Incomplete, but enough to reiterate cgroup_disable=memory suggestion.



How to collect complete information?

Thnx.

Ciao,
Gerhard


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-21 Thread Mike Galbraith
On Tue, 2017-03-21 at 06:59 +0100, Gerhard Wiesinger wrote:

> Is this the correct information?

Incomplete, but enough to reiterate cgroup_disable=memory suggestion.

-Mike


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-20 Thread Gerhard Wiesinger

On 20.03.2017 04:05, Mike Galbraith wrote:

On Sun, 2017-03-19 at 17:02 +0100, Gerhard Wiesinger wrote:


mount | grep cgroup

Just because controllers are mounted doesn't mean they're populated. To
check that, you want to look for directories under the mount points
with a non-empty 'tasks'.  You will find some, but memory cgroup
assignments would likely be most interesting for this thread.  You can
eliminate any diddling there by booting with cgroup_disable=memory.
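As a concrete form of that check, something like the following lists the populated cgroups and any configured memory limits (a sketch, assuming cgroup v1 controllers mounted under /sys/fs/cgroup):

# print every cgroup directory whose 'tasks' file is non-empty
for t in $(find /sys/fs/cgroup -name tasks); do
    n=$(wc -l < "$t")
    [ "$n" -gt 0 ] && echo "$(dirname "$t"): $n tasks"
done
# show memory limits actually set anywhere in the memory hierarchy
find /sys/fs/cgroup/memory -name memory.limit_in_bytes -exec grep -H . {} +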



Is this the correct information?

mount | grep "type cgroup" | cut -f 3 -d ' ' | while read LINE; do echo ""; echo ${LINE}; ls -l ${LINE}; done


/sys/fs/cgroup/systemd
total 0
-rw-r--r--  1 root root 0 Mar 20 14:31 cgroup.clone_children
-rw-r--r--  1 root root 0 Mar 20 14:31 cgroup.procs
-r--r--r--  1 root root 0 Mar 20 14:31 cgroup.sane_behavior
drwxr-xr-x  2 root root 0 Mar 20 14:31 init.scope
-rw-r--r--  1 root root 0 Mar 20 14:31 notify_on_release
-rw-r--r--  1 root root 0 Mar 20 14:31 release_agent
drwxr-xr-x 60 root root 0 Mar 21 06:50 system.slice
-rw-r--r--  1 root root 0 Mar 20 14:31 tasks
drwxr-xr-x  4 root root 0 Mar 21 06:55 user.slice

/sys/fs/cgroup/net_cls,net_prio
total 0
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.clone_children
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.procs
-r--r--r-- 1 root root 0 Mar 20 14:31 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Mar 20 14:31 net_cls.classid
-rw-r--r-- 1 root root 0 Mar 20 14:31 net_prio.ifpriomap
-r--r--r-- 1 root root 0 Mar 20 14:31 net_prio.prioidx
-rw-r--r-- 1 root root 0 Mar 20 14:31 notify_on_release
-rw-r--r-- 1 root root 0 Mar 20 14:31 release_agent
-rw-r--r-- 1 root root 0 Mar 20 14:31 tasks

/sys/fs/cgroup/cpu,cpuacct
total 0
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.clone_children
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.procs
-r--r--r-- 1 root root 0 Mar 20 14:31 cgroup.sane_behavior
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.stat
-rw-r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_all
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_sys
-r--r--r-- 1 root root 0 Mar 20 14:31 cpuacct.usage_user
-rw-r--r-- 1 root root 0 Mar 20 14:31 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Mar 20 14:31 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Mar 20 14:31 cpu.shares
-r--r--r-- 1 root root 0 Mar 20 14:31 cpu.stat
drwxr-xr-x 2 root root 0 Mar 20 14:31 init.scope
-rw-r--r-- 1 root root 0 Mar 20 14:31 notify_on_release
-rw-r--r-- 1 root root 0 Mar 20 14:31 release_agent
drwxr-xr-x 2 root root 0 Mar 20 14:31 system.slice
-rw-r--r-- 1 root root 0 Mar 20 14:31 tasks
drwxr-xr-x 4 root root 0 Mar 21 06:55 user.slice

/sys/fs/cgroup/devices
total 0
-rw-r--r--  1 root root 0 Mar 20 14:31 cgroup.clone_children
-rw-r--r--  1 root root 0 Mar 20 14:31 cgroup.procs
-r--r--r--  1 root root 0 Mar 20 14:31 cgroup.sane_behavior
--w---  1 root root 0 Mar 20 14:31 devices.allow
--w---  1 root root 0 Mar 20 14:31 devices.deny
-r--r--r--  1 root root 0 Mar 20 14:31 devices.list
drwxr-xr-x  2 root root 0 Mar 20 14:31 init.scope
-rw-r--r--  1 root root 0 Mar 20 14:31 notify_on_release
-rw-r--r--  1 root root 0 Mar 20 14:31 release_agent
drwxr-xr-x 60 root root 0 Mar 21 06:50 system.slice
-rw-r--r--  1 root root 0 Mar 20 14:31 tasks
drwxr-xr-x  4 root root 0 Mar 21 06:55 user.slice

/sys/fs/cgroup/freezer
total 0
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.clone_children
-rw-r--r-- 1 root root 0 Mar 20 14:31 cgroup.procs
-r--r--r-- 1 root root 0 Mar 20 14:31 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Mar 20 14:31 notify_on_release
-rw-r--r-- 1 root root 0 Mar 20 14:31 release_agent
-rw-r--r-- 1 root root 0 Mar 20 14:31 tasks

/sys/fs/cgroup/perf_ev

Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-19 Thread Mike Galbraith
On Sun, 2017-03-19 at 17:02 +0100, Gerhard Wiesinger wrote:

> mount | grep cgroup

Just because controllers are mounted doesn't mean they're populated. To
check that, you want to look for directories under the mount points
with a non-empty 'tasks'.  You will find some, but memory cgroup
assignments would likely be most interesting for this thread.  You can
eliminate any diddling there by booting with cgroup_disable=memory.

-Mike


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-19 Thread Tetsuo Handa
On 2017/03/19 17:17, Gerhard Wiesinger wrote:
> On 17.03.2017 21:08, Gerhard Wiesinger wrote:
>> On 17.03.2017 18:13, Michal Hocko wrote:
>>> On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
>>> [...] 
> 
> 4.11.0-0.rc2.git4.1.fc27.x86_64
> 
> There are also lockups after some runtime hours to 1 day:
> Message from syslogd@myserver Mar 19 08:22:33 ...
>  kernel:BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 
> 18717s!
> 
> Message from syslogd@myserver at Mar 19 08:22:33 ...
>  kernel:BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 
> 18078s!
> 
> repeated a lot of times 
> 
> Ciao,
> Gerhard

"kernel:BUG: workqueue lockup" lines alone do not help. It does not tell what 
work is
stalling. Maybe stalling due to constant swapping while doing memory allocation 
when
processing some work, but relevant lines are needed in order to know what is 
happening.
You can try SysRq-t to dump what workqueue threads are doing when you encounter 
such lines.
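For reference, triggering that dump from a shell looks roughly like this (assuming /proc/sys/kernel/sysrq permits it on your install):

# allow all SysRq functions, then dump every task's state to the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
# the backtraces end up in the kernel log
dmesg | tail -n 300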

You might want to try kmallocwd at
http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
 .



Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-19 Thread Gerhard Wiesinger

On 19.03.2017 16:18, Michal Hocko wrote:

On Fri 17-03-17 21:08:31, Gerhard Wiesinger wrote:

On 17.03.2017 18:13, Michal Hocko wrote:

On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
[...]

Why does the kernel prefer to swapin/out and not use

a.) the free memory?

It will use all the free memory up to min watermark which is set up
based on min_free_kbytes.

Makes sense, how is /proc/sys/vm/min_free_kbytes default value calculated?

See init_per_zone_wmark_min


b.) the buffer/cache?

the memory reclaim is strongly biased towards page cache and we try to
avoid swapout as much as possible (see get_scan_count).

If I understand it correctly, swapping is preferred over dropping the
cache, right. Can this behaviour be changed to prefer dropping the
cache to some minimum amount?  Is this also configurable in a way?

No, we enforce swapping if the amount of free + file pages is below the
cumulative high watermark.


(As far as I remember e.g. kernel 2.4 dropped the caches well).


There is ~100M memory available but kernel swaps all the time ...

Any ideas?

Kernel: 4.9.14-200.fc25.x86_64

top - 17:33:43 up 28 min,  3 users,  load average: 3.58, 1.67, 0.89
Tasks: 145 total,   4 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.1 us, 56.2 sy,  0.0 ni,  4.3 id, 13.4 wa, 2.0 hi,  0.3 si,  4.7
st
KiB Mem :   230076 total,61508 free,   123472 used,45096 buff/cache

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  5 303916  60372    328  43864 27828  200 41420   236 6984 11138 11 47  6 23 14

I am really surprised to see any reclaim at all. 26% of free memory
doesn't sound as if we should do a reclaim at all. Do you have an
unusual configuration of /proc/sys/vm/min_free_kbytes ? Or is there
anything running inside a memory cgroup with a small limit?

nothing special is set regarding /proc/sys/vm/min_free_kbytes (default values),
detailed config below. Regarding cgroups, none that I know of. How do I check (I
guess nothing is set because the cg* commands are not available)?

be careful because systemd started to use some controllers. You can
easily check cgroup mount points.


See below.




/proc/sys/vm/min_free_kbytes
45056

So at least 45M will be kept reserved for the system. Your data
indicated you had more memory. What does /proc/zoneinfo look like?
Btw. you seem to be using a Fedora kernel; are there any patches applied on
top of the Linus tree? Could you try to retest a vanilla kernel?



The system looks normal now, FYI (i.e. no permanent swapping at the moment)


free
              total        used        free      shared  buff/cache   available
Mem:         349076      154112       41560         184      153404      148716
Swap:       2064380      831844     1232536

cat /proc/zoneinfo

Node 0, zone  DMA
  per-node stats
  nr_inactive_anon 9543
  nr_active_anon 22105
  nr_inactive_file 9877
  nr_active_file 13416
  nr_unevictable 0
  nr_isolated_anon 0
  nr_isolated_file 0
  nr_pages_scanned 0
  workingset_refault 1926013
  workingset_activate 707166
  workingset_nodereclaim 187276
  nr_anon_pages 11429
  nr_mapped6852
  nr_file_pages 46772
  nr_dirty 1
  nr_writeback 0
  nr_writeback_temp 0
  nr_shmem 46
  nr_shmem_hugepages 0
  nr_shmem_pmdmapped 0
  nr_anon_transparent_hugepages 0
  nr_unstable  0
  nr_vmscan_write 3319047
  nr_vmscan_immediate_reclaim 32363
  nr_dirtied   222115
  nr_written   3537529
  pages free 3110
min  27
low  33
high 39
   node_scanned  0
spanned  4095
present  3998
managed  3977
  nr_free_pages 3110
  nr_zone_inactive_anon 18
  nr_zone_active_anon 3
  nr_zone_inactive_file 51
  nr_zone_active_file 75
  nr_zone_unevictable 0
  nr_zone_write_pending 0
  nr_mlock 0
  nr_slab_reclaimable 214
  nr_slab_unreclaimable 289
  nr_page_table_pages 185
  nr_kernel_stack 16
  nr_bounce0
  nr_zspages   0
  numa_hit 1214071
  numa_miss0
  numa_foreign 0
  numa_interleave 0
  numa_local   1214071
  numa_other   0
  nr_free_cma  0
protection: (0, 306, 306, 306, 306)
  pagesets
cpu: 0
  count: 0
  high:  0
  batch: 1
  vm stats threshold: 4
cpu: 1
  count: 0
  high:  0
  batch: 1
  vm stats threshold: 4
  node_unreclaimable:  0
  start_pfn:   1
  node_inactive_ratio: 0
Node 0, zoneDMA32
  pages free 7921
min  546
low  682
high 818
   node_scanned  0
spanned  94172
present  94172
managed  83292
  nr_free_pages 7921
  nr_zone_inactive_anon 9525
  nr_zone_active_anon 22102
  nr_zone_inactive_file 9826
  nr_zone_active_file 13341
  nr_zone_unevi

Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-19 Thread Michal Hocko
On Fri 17-03-17 21:08:31, Gerhard Wiesinger wrote:
> On 17.03.2017 18:13, Michal Hocko wrote:
> >On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
> >[...]
> >>Why does the kernel prefer to swapin/out and not use
> >>
> >>a.) the free memory?
> >It will use all the free memory up to min watermark which is set up
> >based on min_free_kbytes.
> 
> Makes sense, how is /proc/sys/vm/min_free_kbytes default value calculated?

See init_per_zone_wmark_min
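For reference, init_per_zone_wmark_min in these kernels sets the default to roughly 4*sqrt(lowmem_kbytes), clamped to the range 128..65536 kB. A userspace sketch of that calculation, assuming MemTotal is a fair stand-in for lowmem on such small VMs (a distro or admin setting will of course override the computed default):

# rough re-derivation of the min_free_kbytes default
awk '/^MemTotal:/ {
    min = int(sqrt(16 * $2))           # int_sqrt(lowmem_kbytes * 16)
    if (min < 128)   min = 128         # lower clamp
    if (min > 65536) min = 65536       # upper clamp
    print "estimated default min_free_kbytes:", min
}' /proc/meminfo
cat /proc/sys/vm/min_free_kbytes       # actual value for comparison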

> >>b.) the buffer/cache?
> >the memory reclaim is strongly biased towards page cache and we try to
> >avoid swapout as much as possible (see get_scan_count).
> 
> If I understand it correctly, swapping is preferred over dropping the
> cache, right. Can this behaviour be changed to prefer dropping the
> cache to some minimum amount?  Is this also configurable in a way?

No, we enforce swapping if the amount of free + file pages is below the
cumulative high watermark.
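A rough way to check that condition from userspace, assuming the node-wide counters in /proc/vmstat and the per-zone high watermarks in /proc/zoneinfo (all values are in pages):

# free + file pages across the system
free_file=$(awk '/^nr_free_pages|^nr_inactive_file|^nr_active_file/ { sum += $2 } END { print sum }' /proc/vmstat)
# cumulative high watermark over all zones
high=$(awk '$1 == "high" { sum += $2 } END { print sum }' /proc/zoneinfo)
echo "free+file: $free_file pages, cumulative high watermark: $high pages"
# if free+file <= watermark, reclaim is biased towards anonymous memory (swap)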

> (As far as I remember e.g. kernel 2.4 dropped the caches well).
> 
> >>There is ~100M memory available but kernel swaps all the time ...
> >>
> >>Any ideas?
> >>
> >>Kernel: 4.9.14-200.fc25.x86_64
> >>
> >>top - 17:33:43 up 28 min,  3 users,  load average: 3.58, 1.67, 0.89
> >>Tasks: 145 total,   4 running, 141 sleeping,   0 stopped,   0 zombie
> >>%Cpu(s): 19.1 us, 56.2 sy,  0.0 ni,  4.3 id, 13.4 wa, 2.0 hi,  0.3 si,  4.7
> >>st
> >>KiB Mem :   230076 total,61508 free,   123472 used,45096 buff/cache
> >>
> >>procs ---memory-- ---swap-- -io -system--
> >>--cpu-
> >>  r  b   swpd   free   buff  cache   si   sobibo in   cs us sy id 
> >> wa st
> >>  3  5 303916  60372328  43864 27828  200 41420   236 6984 11138 11 47  
> >> 6 23 14
> >I am really surprised to see any reclaim at all. 26% of free memory
> >doesn't sound as if we should do a reclaim at all. Do you have an
> >unusual configuration of /proc/sys/vm/min_free_kbytes ? Or is there
> >anything running inside a memory cgroup with a small limit?
> 
> nothing special set regarding /proc/sys/vm/min_free_kbytes (default values),
> detailed config below. Regarding cgroups, none of I know. How to check (I
> guess nothing is set because cg* commands are not available)?

be careful because systemd started to use some controllers. You can
easily check cgroup mount points.

> /proc/sys/vm/min_free_kbytes
> 45056

So at least 45M will be kept reserved for the system. Your data
indicated you had more memory. What does /proc/zoneinfo look like?
Btw. you seem to be using a Fedora kernel; are there any patches applied on
top of the Linus tree? Could you try to retest a vanilla kernel?
-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-19 Thread Gerhard Wiesinger

On 17.03.2017 21:08, Gerhard Wiesinger wrote:

On 17.03.2017 18:13, Michal Hocko wrote:

On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
[...] 


4.11.0-0.rc2.git4.1.fc27.x86_64

There are also lockups after some runtime hours to 1 day:
Message from syslogd@myserver Mar 19 08:22:33 ...
 kernel:BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 18717s!


Message from syslogd@myserver at Mar 19 08:22:33 ...
 kernel:BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 
stuck for 18078s!


repeated a lot of times 

Ciao,
Gerhard



Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-17 Thread Gerhard Wiesinger

On 17.03.2017 18:13, Michal Hocko wrote:

On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
[...]

Why does the kernel prefer to swapin/out and not use

a.) the free memory?

It will use all the free memory up to min watermark which is set up
based on min_free_kbytes.


Makes sense, how is /proc/sys/vm/min_free_kbytes default value calculated?




b.) the buffer/cache?

the memory reclaim is strongly biased towards page cache and we try to
avoid swapout as much as possible (see get_scan_count).


If I understand it correctly, swapping is preferred over dropping the 
cache, right. Can this behaviour be changed to prefer dropping the cache 
to some minimum amount?

Is this also configurable in a way?
(As far as I remember e.g. kernel 2.4 dropped the caches well).

  

There is ~100M memory available but kernel swaps all the time ...

Any ideas?

Kernel: 4.9.14-200.fc25.x86_64

top - 17:33:43 up 28 min,  3 users,  load average: 3.58, 1.67, 0.89
Tasks: 145 total,   4 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.1 us, 56.2 sy,  0.0 ni,  4.3 id, 13.4 wa, 2.0 hi,  0.3 si,  4.7
st
KiB Mem :   230076 total,61508 free,   123472 used,45096 buff/cache

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  5 303916  60372    328  43864 27828  200 41420   236 6984 11138 11 47  6 23 14

I am really surprised to see any reclaim at all. 26% of free memory
doesn't sound as if we should do a reclaim at all. Do you have an
unusual configuration of /proc/sys/vm/min_free_kbytes ? Or is there
anything running inside a memory cgroup with a small limit?


nothing special is set regarding /proc/sys/vm/min_free_kbytes (default
values), detailed config below. Regarding cgroups, none that I know of. How
do I check (I guess nothing is set because the cg* commands are not available)?


cat /etc/sysctl.d/* | grep "^vm"
vm.dirty_background_ratio = 3
vm.dirty_ratio = 15
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.swappiness=10

find /proc/sys/vm -type f -exec echo {} \; -exec cat {} \;
/proc/sys/vm/admin_reserve_kbytes
8192
/proc/sys/vm/block_dump
0
/proc/sys/vm/compact_memory
cat: /proc/sys/vm/compact_memory: Permission denied
/proc/sys/vm/compact_unevictable_allowed
1
/proc/sys/vm/dirty_background_bytes
0
/proc/sys/vm/dirty_background_ratio
3
/proc/sys/vm/dirty_bytes
0
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
15
/proc/sys/vm/dirty_writeback_centisecs
500
/proc/sys/vm/dirtytime_expire_seconds
43200
/proc/sys/vm/drop_caches
0
/proc/sys/vm/extfrag_threshold
500
/proc/sys/vm/hugepages_treat_as_movable
0
/proc/sys/vm/hugetlb_shm_group
0
/proc/sys/vm/laptop_mode
0
/proc/sys/vm/legacy_va_layout
0
/proc/sys/vm/lowmem_reserve_ratio
256 256 32  1
/proc/sys/vm/max_map_count
65530
/proc/sys/vm/memory_failure_early_kill
0
/proc/sys/vm/memory_failure_recovery
1
/proc/sys/vm/min_free_kbytes
45056
/proc/sys/vm/min_slab_ratio
5
/proc/sys/vm/min_unmapped_ratio
1
/proc/sys/vm/mmap_min_addr
65536
/proc/sys/vm/mmap_rnd_bits
28
/proc/sys/vm/mmap_rnd_compat_bits
8
/proc/sys/vm/nr_hugepages
0
/proc/sys/vm/nr_hugepages_mempolicy
0
/proc/sys/vm/nr_overcommit_hugepages
0
/proc/sys/vm/nr_pdflush_threads
0
/proc/sys/vm/numa_zonelist_order
default
/proc/sys/vm/oom_dump_tasks
1
/proc/sys/vm/oom_kill_allocating_task
0
/proc/sys/vm/overcommit_kbytes
0
/proc/sys/vm/overcommit_memory
2
/proc/sys/vm/overcommit_ratio
80
/proc/sys/vm/page-cluster
3
/proc/sys/vm/panic_on_oom
0
/proc/sys/vm/percpu_pagelist_fraction
0
/proc/sys/vm/stat_interval
1
/proc/sys/vm/stat_refresh
/proc/sys/vm/swappiness
10
/proc/sys/vm/user_reserve_kbytes
31036
/proc/sys/vm/vfs_cache_pressure
100
/proc/sys/vm/watermark_scale_factor
10
/proc/sys/vm/zone_reclaim_mode
0

Thnx.


Ciao,

Gerhard




Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-17 Thread Michal Hocko
On Fri 17-03-17 17:37:48, Gerhard Wiesinger wrote:
[...]
> Why does the kernel prefer to swapin/out and not use
> 
> a.) the free memory?

It will use all the free memory up to min watermark which is set up
based on min_free_kbytes.

> b.) the buffer/cache?

the memory reclaim is strongly biased towards page cache and we try to
avoid swapout as much as possible (see get_scan_count).
 
> There is ~100M memory available but kernel swaps all the time ...
> 
> Any ideas?
> 
> Kernel: 4.9.14-200.fc25.x86_64
> 
> top - 17:33:43 up 28 min,  3 users,  load average: 3.58, 1.67, 0.89
> Tasks: 145 total,   4 running, 141 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 19.1 us, 56.2 sy,  0.0 ni,  4.3 id, 13.4 wa, 2.0 hi,  0.3 si,  4.7
> st
> KiB Mem :   230076 total,61508 free,   123472 used,45096 buff/cache
> 
> procs ---memory-- ---swap-- -io -system--
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo in   cs us sy id wa st
>  3  5 303916  60372328  43864 27828  200 41420   236 6984 11138 11 47  6 
> 23 14

I am really surprised to see any reclaim at all. 26% of free memory
doesn't sound as if we should do a reclaim at all. Do you have an
unusual configuration of /proc/sys/vm/min_free_kbytes ? Or is there
anything running inside a memory cgroup with a small limit?
-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-17 Thread Gerhard Wiesinger

On 16.03.2017 10:39, Michal Hocko wrote:

On Thu 16-03-17 02:23:18, l...@pengaru.com wrote:

On Thu, Mar 16, 2017 at 10:08:44AM +0100, Michal Hocko wrote:

On Thu 16-03-17 01:47:33, l...@pengaru.com wrote:
[...]

While on the topic of understanding allocation stalls, Philip Freeman recently
mailed linux-kernel with a similar report, and in his case there are plenty of
page cache pages.  It was also a GFP_HIGHUSER_MOVABLE 0-order allocation.

care to point me to the report?

http://lkml.iu.edu/hypermail/linux/kernel/1703.1/06360.html

Thanks. It is gone from my lkml mailbox. Could you CC me (and linux-mm) please?
  
  

I'm no MM expert, but it appears a bit broken for such a low-order allocation
to stall on the order of 10 seconds when there's plenty of reclaimable pages,
in addition to mostly unused and abundant swap space on SSD.

yes this might indeed signal a problem.

Well maybe I missed something obvious that a better informed eye will catch.

Nothing really obvious. There is indeed a lot of anonymous memory to
swap out. Almost no pages on file LRU lists (active_file:759
inactive_file:749) but 158783 total pagecache pages so we have to have a
lot of pages in the swap cache. I would probably have to see more data
to make a full picture.



Why does the kernel prefer to swapin/out and not use

a.) the free memory?

b.) the buffer/cache?

There is ~100M memory available but kernel swaps all the time ...

Any ideas?

Kernel: 4.9.14-200.fc25.x86_64

top - 17:33:43 up 28 min,  3 users,  load average: 3.58, 1.67, 0.89
Tasks: 145 total,   4 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.1 us, 56.2 sy,  0.0 ni,  4.3 id, 13.4 wa, 2.0 hi,  0.3 si,  
4.7 st

KiB Mem :   230076 total,61508 free,   123472 used,45096 buff/cache

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd   free   buff  cache   si    so    bi    bo    in    cs us sy id wa st
 3  5 303916  60372    328  43864 27828   200 41420   236  6984 11138 11 47  6 23 14
 5  4 292852  52904    756  58584 19600   448 48780   540  8088 10528 18 61  1  7 13
 3  3 288792  49052   1152  65924  4856   576  9824  1100  4324  5720  7 18  2 64  8
 2  2 283676  54160    716  67604  6332   344 31740   964  3879  5055 12 34 10 37  7
 3  3 286852  66712    216  53136 28064  4832 56532  4920  9175 12625 10 55 12 14 10
 2  0 299680  62428    196  53316 36312 13164 54728 13212 16820 25283  7 56 18 12  7
 1  1 300756  63220    624  58160 17944  1260 24528  1304  5804  9302  3 22 38 34  3


Thnx.


Ciao,

Gerhard



Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-16 Thread Michal Hocko
On Thu 16-03-17 02:23:18, l...@pengaru.com wrote:
> On Thu, Mar 16, 2017 at 10:08:44AM +0100, Michal Hocko wrote:
> > On Thu 16-03-17 01:47:33, l...@pengaru.com wrote:
> > [...]
> > > While on the topic of understanding allocation stalls, Philip Freeman 
> > > recently
> > > mailed linux-kernel with a similar report, and in his case there are 
> > > plenty of
> > > page cache pages.  It was also a GFP_HIGHUSER_MOVABLE 0-order allocation.
> > 
> > care to point me to the report?
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1703.1/06360.html

Thanks. It is gone from my lkml mailbox. Could you CC me (and linux-mm) please?
 
> >  
> > > I'm no MM expert, but it appears a bit broken for such a low-order 
> > > allocation
> > > to stall on the order of 10 seconds when there's plenty of reclaimable 
> > > pages,
> > > in addition to mostly unused and abundant swap space on SSD.
> > 
> > yes this might indeed signal a problem.
> 
> Well maybe I missed something obvious that a better informed eye will catch.

Nothing really obvious. There is indeed a lot of anonymous memory to
swap out. Almost no pages on file LRU lists (active_file:759
inactive_file:749) but 158783 total pagecache pages so we have to have a
lot of pages in the swap cache. I would probably have to see more data
to make a full picture.

-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-16 Thread lkml
On Thu, Mar 16, 2017 at 10:08:44AM +0100, Michal Hocko wrote:
> On Thu 16-03-17 01:47:33, l...@pengaru.com wrote:
> [...]
> > While on the topic of understanding allocation stalls, Philip Freeman 
> > recently
> > mailed linux-kernel with a similar report, and in his case there are plenty 
> > of
> > page cache pages.  It was also a GFP_HIGHUSER_MOVABLE 0-order allocation.
> 
> care to point me to the report?

http://lkml.iu.edu/hypermail/linux/kernel/1703.1/06360.html

>  
> > I'm no MM expert, but it appears a bit broken for such a low-order 
> > allocation
> > to stall on the order of 10 seconds when there's plenty of reclaimable 
> > pages,
> > in addition to mostly unused and abundant swap space on SSD.
> 
> yes this might indeed signal a problem.

Well maybe I missed something obvious that a better informed eye will catch.

Regards,
Vito Caputo


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-16 Thread Michal Hocko
On Thu 16-03-17 01:47:33, l...@pengaru.com wrote:
[...]
> While on the topic of understanding allocation stalls, Philip Freeman recently
> mailed linux-kernel with a similar report, and in his case there are plenty of
> page cache pages.  It was also a GFP_HIGHUSER_MOVABLE 0-order allocation.

care to point me to the report?
 
> I'm no MM expert, but it appears a bit broken for such a low-order allocation
> to stall on the order of 10 seconds when there's plenty of reclaimable pages,
> in addition to mostly unused and abundant swap space on SSD.

yes this might indeed signal a problem.
-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-16 Thread lkml
On Thu, Mar 16, 2017 at 09:27:14AM +0100, Michal Hocko wrote:
> On Thu 16-03-17 07:38:08, Gerhard Wiesinger wrote:
> [...]
> > The following commit is included in that version:
> > commit 710531320af876192d76b2c1f68190a1df941b02
> > Author: Michal Hocko 
> > Date:   Wed Feb 22 15:45:58 2017 -0800
> > 
> > mm, vmscan: cleanup lru size claculations
> > 
> > commit fd538803731e50367b7c59ce4ad3454426a3d671 upstream.
> 
> This patch shouldn't make any difference. It is a cleanup patch.
> I guess you meant 71ab6cfe88dc ("mm, vmscan: consider eligible zones in
> get_scan_count") but even that one shouldn't make any difference for 64b
> systems.
> 
> > But still OOMs:
> > [157048.030760] clamscan: page allocation stalls for 19405ms, order:0, 
> > mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)
> 
> This is not OOM it is an allocation stall. The allocation request cannot
> simply make forward progress for more than 10s. This alone is bad but
> considering this is GFP_HIGHUSER_MOVABLE which has the full reclaim
> capabilities I would suspect your workload overcommits the available
> memory too much. You only have ~380MB of RAM with ~160MB sitting in the
> anonymous memory, almost nothing in the page cache so I am not wondering
> that you see a constant swap activity. There seems to be only 40M in the
> slab so we are still missing ~180MB which is neither on the LRU lists
> nor allocated by slab. This means that some kernel subsystem allocates
> from the page allocator directly.
> 
> That being said, I believe that what you are seeing is not a bug in the
> MM subsystem but rather some susbsytem using more memory than it used to
> before so your workload doesn't fit into the amount of memory you have
> anymore.
> 

While on the topic of understanding allocation stalls, Philip Freeman recently
mailed linux-kernel with a similar report, and in his case there are plenty of
page cache pages.  It was also a GFP_HIGHUSER_MOVABLE 0-order allocation.

I'm no MM expert, but it appears a bit broken for such a low-order allocation
to stall on the order of 10 seconds when there's plenty of reclaimable pages,
in addition to mostly unused and abundant swap space on SSD.

Regards,
Vito Caputo


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-16 Thread Michal Hocko
On Thu 16-03-17 07:38:08, Gerhard Wiesinger wrote:
[...]
> The following commit is included in that version:
> commit 710531320af876192d76b2c1f68190a1df941b02
> Author: Michal Hocko 
> Date:   Wed Feb 22 15:45:58 2017 -0800
> 
> mm, vmscan: cleanup lru size claculations
> 
> commit fd538803731e50367b7c59ce4ad3454426a3d671 upstream.

This patch shouldn't make any difference. It is a cleanup patch.
I guess you meant 71ab6cfe88dc ("mm, vmscan: consider eligible zones in
get_scan_count") but even that one shouldn't make any difference for 64b
systems.

> But still OOMs:
> [157048.030760] clamscan: page allocation stalls for 19405ms, order:0, 
> mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)

This is not an OOM, it is an allocation stall. The allocation request simply
cannot make forward progress for more than 10s. This alone is bad, but
considering this is GFP_HIGHUSER_MOVABLE, which has the full reclaim
capabilities, I would suspect your workload overcommits the available memory
too much. You only have ~380MB of RAM with ~160MB sitting in anonymous memory
and almost nothing in the page cache, so I am not surprised that you see
constant swap activity. There seems to be only 40M in the slab, so we are
still missing ~180MB which is neither on the LRU lists nor allocated by slab.
This means that some kernel subsystem allocates from the page allocator
directly.

That being said, I believe that what you are seeing is not a bug in the MM
subsystem but rather some subsystem using more memory than it used to before,
so your workload doesn't fit into the amount of memory you have anymore.
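One way to eyeball that unaccounted memory from userspace is to subtract everything /proc/meminfo can attribute from MemTotal; the remainder is roughly what was taken straight from the page allocator (a sketch, only as precise as meminfo itself):

awk '
/^MemTotal:/    { total = $2 }
/^MemFree:/     { acct += $2 }
/^Buffers:/     { acct += $2 }
/^Cached:/      { acct += $2 }
/^SwapCached:/  { acct += $2 }
/^AnonPages:/   { acct += $2 }
/^Slab:/        { acct += $2 }
/^KernelStack:/ { acct += $2 }
/^PageTables:/  { acct += $2 }
END { printf "unaccounted: %d kB of %d kB total\n", total - acct, total }' /proc/meminfo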

[...]
> [157048.081827] Mem-Info:
> [157048.083005] active_anon:19902 inactive_anon:19920 isolated_anon:383
>  active_file:816 inactive_file:529 isolated_file:0
>  unevictable:0 dirty:0 writeback:19 unstable:0
>  slab_reclaimable:4225 slab_unreclaimable:6483
>  mapped:942 shmem:3 pagetables:3553 bounce:0
>  free:944 free_pcp:87 free_cma:0
> [157048.089470] Node 0 active_anon:79552kB inactive_anon:79588kB
> active_file:3108kB inactive_file:2144kB unevictable:0kB
> isolated(anon):1624kB isolated(file):0kB mapped:3612kB dirty:0kB
> writeback:76kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12kB
> writeback_tmp:0kB unstable:0kB pages_scanned:247 all_unreclaimable? no
> [157048.092318] Node 0 DMA free:1408kB min:104kB low:128kB high:152kB
> active_anon:664kB inactive_anon:3124kB active_file:48kB inactive_file:40kB
> unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB
> slab_reclaimable:564kB slab_unreclaimable:2148kB kernel_stack:92kB
> pagetables:1328kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [157048.096008] lowmem_reserve[]: 0 327 327 327 327
> [157048.097234] Node 0 DMA32 free:2576kB min:2264kB low:2828kB high:3392kB
> active_anon:78844kB inactive_anon:76612kB active_file:2840kB
> inactive_file:1896kB unevictable:0kB writepending:76kB present:376688kB
> managed:353792kB mlocked:0kB slab_reclaimable:16336kB
> slab_unreclaimable:23784kB kernel_stack:2388kB pagetables:12884kB bounce:0kB
> free_pcp:644kB local_pcp:312kB free_cma:0kB
> [157048.101118] lowmem_reserve[]: 0 0 0 0 0
> [157048.102190] Node 0 DMA: 37*4kB (UEH) 12*8kB (H) 13*16kB (H) 10*32kB (H)
> 4*64kB (H) 3*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1412kB
> [157048.104989] Node 0 DMA32: 79*4kB (UMEH) 199*8kB (UMEH) 18*16kB (UMH)
> 5*32kB (H) 2*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 2484kB
> [157048.107789] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
> [157048.107790] 2027 total pagecache pages
> [157048.109125] 710 pages in swap cache
> [157048.115088] Swap cache stats: add 36179491, delete 36179123, find
> 86964755/101977142
> [157048.116934] Free swap  = 808064kB
> [157048.118466] Total swap = 2064380kB
> [157048.122828] 98170 pages RAM
> [157048.124039] 0 pages HighMem/MovableOnly
> [157048.125051] 5745 pages reserved
> [157048.125997] 0 pages cma reserved
> [157048.127008] 0 pages hwpoisoned
> 
> 
> Thnx.
> 
> Ciao,
> Gerhard

-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-15 Thread Gerhard Wiesinger

On 02.03.2017 08:17, Minchan Kim wrote:

Hi Michal,

On Tue, Feb 28, 2017 at 09:12:24AM +0100, Michal Hocko wrote:

On Tue 28-02-17 14:17:23, Minchan Kim wrote:

On Mon, Feb 27, 2017 at 10:44:49AM +0100, Michal Hocko wrote:

On Mon 27-02-17 18:02:36, Minchan Kim wrote:
[...]

>From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Mon, 27 Feb 2017 17:39:06 +0900
Subject: [PATCH] mm: use up highatomic before OOM kill

Not-Yet-Signed-off-by: Minchan Kim 
---
  mm/page_alloc.c | 14 --
  1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..e073cca4969e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3549,16 +3549,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
*no_progress_loops = 0;
else
(*no_progress_loops)++;
-
-   /*
-* Make sure we converge to OOM if we cannot make any progress
-* several times in the row.
-*/
-   if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
-   /* Before OOM, exhaust highatomic_reserve */
-   return unreserve_highatomic_pageblock(ac, true);
-   }
-
/*
 * Keep reclaiming pages while there is a chance this will lead
 * somewhere.  If none of the target zones can satisfy our allocation
@@ -3821,6 +3811,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
if (read_mems_allowed_retry(cpuset_mems_cookie))
goto retry_cpuset;
  
+	/* Before OOM, exhaust highatomic_reserve */

+   if (unreserve_highatomic_pageblock(ac, true))
+   goto retry;
+

OK, this can help for higher-order requests when we do not exhaust all the
retries and fail on compaction, but I fail to see how this can help for
order-0 requests, which was what happened in this case. I am not saying this
is wrong, though.

should_reclaim_retry can return false even though no_progress_loop is less than
MAX_RECLAIM_RETRIES, unless the eligible zones have enough reclaimable pages for
the progress_loop.

Yes, sorry I should have been more clear. I was talking about this
particular case where we had a lot of reclaimable pages (a lot of
anonymous with the swap available).

This report shows two problems: why we see OOM despite 1) enough *free* pages
and 2) enough *freeable* pages.

I just pointed out 1) and sent the patch to solve it.

About 2), one of my imaginary scenarios is that the inactive anon list is full
of pinned pages, so the VM can unmap them successfully in shrink_page_list but
fails to free them due to an elevated page refcount. In that case, the page is
added back to the inactive anonymous LRU list without being activated, so
inactive_list_is_low on the anonymous LRU is always false. IOW, there is no
deactivation from the active list.

It's just my picture, without any real clue. ;-)


With latest kernels (4.11.0-0.rc2.git0.2.fc26.x86_64) I'm having the 
issue that swapping is active all the time after some runtime (~1day).


top - 07:30:17 up 1 day, 19:42,  1 user,  load average: 13.71, 16.98, 15.36
Tasks: 130 total,   2 running, 128 sleeping,   0 stopped, 0 zombie
%Cpu(s): 15.8 us, 33.5 sy,  0.0 ni,  3.9 id, 34.5 wa,  4.9 hi,  1.0 si,  
6.4 st

KiB Mem :   369700 total, 5484 free,   311556 used, 52660 buff/cache
KiB Swap:  2064380 total,  1187684 free,   876696 used. 20340 avail Mem

[root@smtp ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu------
 r  b   swpd   free   buff  cache   si    so    bi    bo    in    cs us sy id wa st
 3  1 876280   7132  16536  64840  238   226  1027   258    80    97  2  3 83 11  1
 0  4 876140   3812  10520  64552 3676   168 11840  1100  2255  2582  7 13  8 70  3
 0  3 875372   3628   4024  56160 5424    64 10004   476  2157  2580  2 14  0 83  2
 0  4 875560  24056   2208  56296 9032  2180 39928  2388  4111  4549 10 32  0 55  3
 2  2 875660   7540   5256  58220 5536  1604 48756  1864  4505  4196 12 23  5 58  3
 0  3 875264   3664   2120  57596 2304   116 17904   560  2223  1825 15 15  0 67  3
 0  2 875564   3800    588  57856 1340  1068 14780  1184  1390  1364 12 10  0 77  3
 1  2 875724   3740    372  53988 3104   928 16884  1068  1560  1527  3 12  0 83  3
 0  3 881096   3708    532  52220 4604  5872 21004  6104  2752  2259  7 18  5 67  2


The following commit is included in that version:
commit 710531320af876192d76b2c1f68190a1df941b02
Author: Michal Hocko 
Date:   Wed Feb 22 15:45:58 2017 -0800

mm, vmscan: cleanup lru size claculations

commit fd538803731e50367b7c59ce4ad3454426a3d671 upstream.

But still OOMs:
[157048.030760] clamscan: page allocation stalls for 19405ms, order:0, 
mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)

[157048.031985] clamscan cpuset=/ mems_allowed=0
[157048.031993] CPU: 1 PID: 9597 Comm: clamscan Not tainted 
4.11.0-0.rc2.git0.2.fc26.x86_64 #1
[157048.033197] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.3 04/01/2014

[157048.034382] Call Trace:
[157048.035

Re: Still OOM problems with 4.9er/4.10er kernels

2017-03-02 Thread Minchan Kim
Hi Michal,

On Tue, Feb 28, 2017 at 09:12:24AM +0100, Michal Hocko wrote:
> On Tue 28-02-17 14:17:23, Minchan Kim wrote:
> > On Mon, Feb 27, 2017 at 10:44:49AM +0100, Michal Hocko wrote:
> > > On Mon 27-02-17 18:02:36, Minchan Kim wrote:
> > > [...]
> > > > >From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim 
> > > > Date: Mon, 27 Feb 2017 17:39:06 +0900
> > > > Subject: [PATCH] mm: use up highatomic before OOM kill
> > > > 
> > > > Not-Yet-Signed-off-by: Minchan Kim 
> > > > ---
> > > >  mm/page_alloc.c | 14 --
> > > >  1 file changed, 4 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 614cd0397ce3..e073cca4969e 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -3549,16 +3549,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned 
> > > > order,
> > > > *no_progress_loops = 0;
> > > > else
> > > > (*no_progress_loops)++;
> > > > -
> > > > -   /*
> > > > -* Make sure we converge to OOM if we cannot make any progress
> > > > -* several times in the row.
> > > > -*/
> > > > -   if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> > > > -   /* Before OOM, exhaust highatomic_reserve */
> > > > -   return unreserve_highatomic_pageblock(ac, true);
> > > > -   }
> > > > -
> > > > /*
> > > >  * Keep reclaiming pages while there is a chance this will lead
> > > >  * somewhere.  If none of the target zones can satisfy our 
> > > > allocation
> > > > @@ -3821,6 +3811,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned 
> > > > int order,
> > > > if (read_mems_allowed_retry(cpuset_mems_cookie))
> > > > goto retry_cpuset;
> > > >  
> > > > +   /* Before OOM, exhaust highatomic_reserve */
> > > > +   if (unreserve_highatomic_pageblock(ac, true))
> > > > +   goto retry;
> > > > +
> > > 
> > > OK, this can help for higher order requests when we do not exhaust all
> > > the retries and fail on compaction but I fail to see how can this help
> > > for order-0 requets which was what happened in this case. I am not
> > > saying this is wrong, though.
> > 
> > The should_reclaim_retry can return false although no_progress_loop is less
> > than MAX_RECLAIM_RETRIES unless eligible zones has enough reclaimable pages
> > by the progress_loop.
> 
> Yes, sorry I should have been more clear. I was talking about this
> particular case where we had a lot of reclaimable pages (a lot of
> anonymous with the swap available).

This report shows two problems: why we see OOM despite 1) enough *free* pages
and 2) enough *freeable* pages.

I just pointed out 1) and sent the patch to solve it.

About 2), one of my imaginary scenarios is that the inactive anon list is full
of pinned pages, so the VM can unmap them successfully in shrink_page_list but
fails to free them due to an elevated page refcount. In that case, the page is
added back to the inactive anonymous LRU list without being activated, so
inactive_list_is_low on the anonymous LRU is always false. IOW, there is no
deactivation from the active list.

It's just my picture, without any real clue. ;-)


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-28 Thread Michal Hocko
On Tue 28-02-17 14:17:23, Minchan Kim wrote:
> On Mon, Feb 27, 2017 at 10:44:49AM +0100, Michal Hocko wrote:
> > On Mon 27-02-17 18:02:36, Minchan Kim wrote:
> > [...]
> > > >From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim 
> > > Date: Mon, 27 Feb 2017 17:39:06 +0900
> > > Subject: [PATCH] mm: use up highatomic before OOM kill
> > > 
> > > Not-Yet-Signed-off-by: Minchan Kim 
> > > ---
> > >  mm/page_alloc.c | 14 --
> > >  1 file changed, 4 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 614cd0397ce3..e073cca4969e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3549,16 +3549,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned 
> > > order,
> > >   *no_progress_loops = 0;
> > >   else
> > >   (*no_progress_loops)++;
> > > -
> > > - /*
> > > -  * Make sure we converge to OOM if we cannot make any progress
> > > -  * several times in the row.
> > > -  */
> > > - if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> > > - /* Before OOM, exhaust highatomic_reserve */
> > > - return unreserve_highatomic_pageblock(ac, true);
> > > - }
> > > -
> > >   /*
> > >* Keep reclaiming pages while there is a chance this will lead
> > >* somewhere.  If none of the target zones can satisfy our allocation
> > > @@ -3821,6 +3811,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned 
> > > int order,
> > >   if (read_mems_allowed_retry(cpuset_mems_cookie))
> > >   goto retry_cpuset;
> > >  
> > > + /* Before OOM, exhaust highatomic_reserve */
> > > + if (unreserve_highatomic_pageblock(ac, true))
> > > + goto retry;
> > > +
> > 
> > OK, this can help for higher order requests when we do not exhaust all
> > the retries and fail on compaction but I fail to see how can this help
> > for order-0 requets which was what happened in this case. I am not
> > saying this is wrong, though.
> 
> The should_reclaim_retry can return false although no_progress_loop is less
> than MAX_RECLAIM_RETRIES unless eligible zones has enough reclaimable pages
> by the progress_loop.

Yes, sorry I should have been more clear. I was talking about this
particular case where we had a lot of reclaimable pages (a lot of
anonymous with the swap available).

-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-28 Thread Michal Hocko
On Tue 28-02-17 07:06:41, Gerhard Wiesinger wrote:
> On 27.02.2017 09:27, Michal Hocko wrote:
> >On Sun 26-02-17 09:40:42, Gerhard Wiesinger wrote:
> >>On 04.01.2017 10:11, Michal Hocko wrote:
> The VM stops working (e.g. not pingable) after around 8h (will be 
> restarted
> automatically), happened serveral times.
> 
> Had also further OOMs which I sent to Mincham.
> >>>Could you post them to the mailing list as well, please?
> >>Still OOMs on dnf update procedure with kernel 4.10: 4.10.0-1.fc26.x86_64 as
> >>well on 4.9.9-200.fc25.x86_64
> >>
> >>On 4.10er kernels:
> >[...]
> >>kernel: Node 0 DMA32 free:5012kB min:2264kB low:2828kB high:3392kB
> >>active_anon:143580kB inactive_anon:143300kB active_file:2576kB
> >>inactive_file:2560kB unevictable:0kB writepending:0kB present:376688kB
> >>managed:353968kB mlocked:0kB slab_reclaimable:13708kB
> >>slab_unreclaimable:18064kB kernel_stack:2352kB pagetables:12888kB bounce:0kB
> >>free_pcp:412kB local_pcp:88kB free_cma:0kB
> >[...]
> >
> >>On 4.9er kernels:
> >[...]
> >>kernel: Node 0 DMA32 free:3356kB min:2668kB low:3332kB high:3996kB
> >>active_anon:122148kB inactive_anon:112068kB active_file:81324kB
> >>inactive_file:101972kB unevictable:0kB writepending:4648kB present:507760kB
> >>managed:484384kB mlocked:0kB slab_reclaimable:17660kB
> >>slab_unreclaimable:21404kB kernel_stack:2432kB pagetables:10124kB bounce:0kB
> >>free_pcp:120kB local_pcp:0kB free_cma:0kB
> >In both cases the amount if free memory is above the min watermark, so
> >we shouldn't be hitting the oom. We might have somebody freeing memory
> >after the last attempt, though...
> >
> >[...]
> >>Should be very easy to reproduce with a low mem VM (e.g. 192MB) under KVM
> >>with ext4 and Fedora 25 and some memory load and updating the VM.
> >>
> >>Any further progress?
> >The linux-next (resp. mmotm tree) has new tracepoints which should help
> >to tell us more about what is going on here. Could you try to enable
> >oom/reclaim_retry_zone and vmscan/mm_vmscan_direct_reclaim_{begin,end}
> 
> Is this available in this version?
> 
> https://koji.fedoraproject.org/koji/buildinfo?buildID=862775
> 
> kernel-4.11.0-0.rc0.git5.1.fc26

no idea.

> 
> How to enable?

mount -t tracefs none /trace
echo 1 > /trace/events/oom/reclaim_retry_zone/enable
echo 1 > /trace/events/vmscan/mm_vmscan_direct_reclaim_begin/enable
echo 1 > /trace/events/vmscan/mm_vmscan_direct_reclaim_end/enable
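Once the events are enabled, the output can be read back from the same tracefs mount while the problem is reproduced; a minimal sketch:

# stream only the enabled MM events into a file until the stall happens
cat /trace/trace_pipe | grep -E 'reclaim_retry_zone|mm_vmscan_direct_reclaim' > /tmp/mm-trace.log
# or take a snapshot of the ring buffer afterwards
cat /trace/trace > /tmp/mm-trace-snapshot.txt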
-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-27 Thread Gerhard Wiesinger

On 27.02.2017 09:27, Michal Hocko wrote:

On Sun 26-02-17 09:40:42, Gerhard Wiesinger wrote:

On 04.01.2017 10:11, Michal Hocko wrote:

The VM stops working (e.g. not pingable) after around 8h (it will be restarted
automatically); this happened several times.

I also had further OOMs which I sent to Minchan.

Could you post them to the mailing list as well, please?

Still OOMs on dnf update procedure with kernel 4.10: 4.10.0-1.fc26.x86_64 as
well on 4.9.9-200.fc25.x86_64

On 4.10er kernels:

[...]

kernel: Node 0 DMA32 free:5012kB min:2264kB low:2828kB high:3392kB
active_anon:143580kB inactive_anon:143300kB active_file:2576kB
inactive_file:2560kB unevictable:0kB writepending:0kB present:376688kB
managed:353968kB mlocked:0kB slab_reclaimable:13708kB
slab_unreclaimable:18064kB kernel_stack:2352kB pagetables:12888kB bounce:0kB
free_pcp:412kB local_pcp:88kB free_cma:0kB

[...]


On 4.9er kernels:

[...]

kernel: Node 0 DMA32 free:3356kB min:2668kB low:3332kB high:3996kB
active_anon:122148kB inactive_anon:112068kB active_file:81324kB
inactive_file:101972kB unevictable:0kB writepending:4648kB present:507760kB
managed:484384kB mlocked:0kB slab_reclaimable:17660kB
slab_unreclaimable:21404kB kernel_stack:2432kB pagetables:10124kB bounce:0kB
free_pcp:120kB local_pcp:0kB free_cma:0kB

In both cases the amount of free memory is above the min watermark, so
we shouldn't be hitting the OOM. We might have somebody freeing memory
after the last attempt, though...

[...]

Should be very easy to reproduce with a low mem VM (e.g. 192MB) under KVM
with ext4 and Fedora 25 and some memory load and updating the VM.

Any further progress?

The linux-next (resp. mmotm tree) has new tracepoints which should help
to tell us more about what is going on here. Could you try to enable
oom/reclaim_retry_zone and vmscan/mm_vmscan_direct_reclaim_{begin,end}


Is this available in this version?

https://koji.fedoraproject.org/koji/buildinfo?buildID=862775

kernel-4.11.0-0.rc0.git5.1.fc26

How to enable?


Thnx.

Ciao,

gerhard



Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-27 Thread Minchan Kim
On Mon, Feb 27, 2017 at 10:44:49AM +0100, Michal Hocko wrote:
> On Mon 27-02-17 18:02:36, Minchan Kim wrote:
> [...]
> > >From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
> > From: Minchan Kim 
> > Date: Mon, 27 Feb 2017 17:39:06 +0900
> > Subject: [PATCH] mm: use up highatomic before OOM kill
> > 
> > Not-Yet-Signed-off-by: Minchan Kim 
> > ---
> >  mm/page_alloc.c | 14 --
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 614cd0397ce3..e073cca4969e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3549,16 +3549,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > *no_progress_loops = 0;
> > else
> > (*no_progress_loops)++;
> > -
> > -   /*
> > -* Make sure we converge to OOM if we cannot make any progress
> > -* several times in the row.
> > -*/
> > -   if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> > -   /* Before OOM, exhaust highatomic_reserve */
> > -   return unreserve_highatomic_pageblock(ac, true);
> > -   }
> > -
> > /*
> >  * Keep reclaiming pages while there is a chance this will lead
> >  * somewhere.  If none of the target zones can satisfy our allocation
> > @@ -3821,6 +3811,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> > order,
> > if (read_mems_allowed_retry(cpuset_mems_cookie))
> > goto retry_cpuset;
> >  
> > +   /* Before OOM, exhaust highatomic_reserve */
> > +   if (unreserve_highatomic_pageblock(ac, true))
> > +   goto retry;
> > +
> 
> OK, this can help for higher order requests when we do not exhaust all
> the retries and fail on compaction but I fail to see how can this help
> for order-0 requets which was what happened in this case. I am not
> saying this is wrong, though.

should_reclaim_retry can return false even though no_progress_loops is less
than MAX_RECLAIM_RETRIES, unless the eligible zones have enough reclaimable
pages (scaled down by the progress loop count). In that case
unreserve_highatomic_pageblock is never called, so the VM keeps a pageblock
(e.g., 2M) reserved for highatomic allocations. zone_watermark_ok then
subtracts the nr_reserved_highatomic pages for its pass/fail decision, which
is very conservative but the only choice for hot-path performance. With
that, an order-0 allocation can fail.
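
As a rough illustration with the 4.9 DMA32 numbers quoted elsewhere in this
thread (free:3356kB, min:2668kB), and assuming a single 2048kB pageblock is
held in the highatomic reserve:

  free seen by the watermark check ~ 3356kB - 2048kB = 1308kB < min 2668kB

so the order-0 allocation is rejected even though the raw free figure is
above the watermark.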

Thanks.


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-27 Thread Michal Hocko
On Mon 27-02-17 18:02:36, Minchan Kim wrote:
[...]
> >From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
> From: Minchan Kim 
> Date: Mon, 27 Feb 2017 17:39:06 +0900
> Subject: [PATCH] mm: use up highatomic before OOM kill
> 
> Not-Yet-Signed-off-by: Minchan Kim 
> ---
>  mm/page_alloc.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..e073cca4969e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3549,16 +3549,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>   *no_progress_loops = 0;
>   else
>   (*no_progress_loops)++;
> -
> - /*
> -  * Make sure we converge to OOM if we cannot make any progress
> -  * several times in the row.
> -  */
> - if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> - /* Before OOM, exhaust highatomic_reserve */
> - return unreserve_highatomic_pageblock(ac, true);
> - }
> -
>   /*
>* Keep reclaiming pages while there is a chance this will lead
>* somewhere.  If none of the target zones can satisfy our allocation
> @@ -3821,6 +3811,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   if (read_mems_allowed_retry(cpuset_mems_cookie))
>   goto retry_cpuset;
>  
> + /* Before OOM, exhaust highatomic_reserve */
> + if (unreserve_highatomic_pageblock(ac, true))
> + goto retry;
> +

OK, this can help for higher-order requests when we do not exhaust all
the retries and fail on compaction, but I fail to see how this can help
for order-0 requests, which is what happened in this case. I am not
saying this is wrong, though.

>   /* Reclaim has failed us, start killing things */
>   page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>   if (page)
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-27 Thread Minchan Kim
On Sun, Feb 26, 2017 at 09:40:42AM +0100, Gerhard Wiesinger wrote:
> On 04.01.2017 10:11, Michal Hocko wrote:
> >>The VM stops working (e.g. not pingable) after around 8h (will be restarted
> >>automatically), happened several times.
> >>
> >>Had also further OOMs which I sent to Minchan.
> >Could you post them to the mailing list as well, please?
> 
> Still OOMs on dnf update procedure with kernel 4.10 (4.10.0-1.fc26.x86_64) as
> well as on 4.9.9-200.fc25.x86_64
> 
> On 4.10er kernels:
> 
> Free swap  = 1137532kB
> 
> cat /etc/sysctl.d/* | grep ^vm
> vm.dirty_background_ratio = 3
> vm.dirty_ratio = 15
> vm.overcommit_memory = 2
> vm.overcommit_ratio = 80
> vm.swappiness=10
> 
> kernel: python invoked oom-killer:
> gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0,
> oom_score_adj=0
> kernel: python cpuset=/ mems_allowed=0
> kernel: CPU: 1 PID: 813 Comm: python Not tainted 4.10.0-1.fc26.x86_64 #1
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3
> 04/01/2014
> kernel: Call Trace:
> kernel:  dump_stack+0x63/0x84
> kernel:  dump_header+0x7b/0x1f6
> kernel:  ? do_try_to_free_pages+0x2c5/0x340
> kernel:  oom_kill_process+0x202/0x3d0
> kernel:  out_of_memory+0x2b7/0x4e0
> kernel:  __alloc_pages_slowpath+0x915/0xb80
> kernel:  __alloc_pages_nodemask+0x218/0x2d0
> kernel:  alloc_pages_current+0x93/0x150
> kernel:  __page_cache_alloc+0xcf/0x100
> kernel:  filemap_fault+0x39d/0x800
> kernel:  ? page_add_file_rmap+0xe5/0x200
> kernel:  ? filemap_map_pages+0x2e1/0x4e0
> kernel:  ext4_filemap_fault+0x36/0x50
> kernel:  __do_fault+0x21/0x110
> kernel:  handle_mm_fault+0xdd1/0x1410
> kernel:  ? swake_up+0x42/0x50
> kernel:  __do_page_fault+0x23f/0x4c0
> kernel:  trace_do_page_fault+0x41/0x120
> kernel:  do_async_page_fault+0x51/0xa0
> kernel:  async_page_fault+0x28/0x30
> kernel: RIP: 0033:0x7f0681ad6350
> kernel: RSP: 002b:7ffcbdd238d8 EFLAGS: 00010246
> kernel: RAX: 7f0681b0f960 RBX:  RCX: 7fff
> kernel: RDX:  RSI: 3ff0 RDI: 3ff0
> kernel: RBP: 7f067461ab40 R08:  R09: 3ff0
> kernel: R10: 556f1c6d8a80 R11: 0001 R12: 7f0676d1a8d0
> kernel: R13:  R14: 7f06746168bc R15: 7f0674385910
> kernel: Mem-Info:
> kernel: active_anon:37423 inactive_anon:37512 isolated_anon:0
>  active_file:462 inactive_file:603 isolated_file:0
>  unevictable:0 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:3538 slab_unreclaimable:4818
>  mapped:859 shmem:9 pagetables:3370 bounce:0
>  free:1650 free_pcp:103 free_cma:0
> kernel: Node 0 active_anon:149380kB inactive_anon:149704kB
> active_file:1848kB inactive_file:3660kB unevictable:0kB isolated(anon):128kB
> isolated(file):0kB mapped:4580kB dirty:0kB writeback:380kB shmem:0kB
> shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 36kB writeback_tmp:0kB
> unstable:0kB pages_scanned:352 all_unreclaimable? no
> kernel: Node 0 DMA free:1484kB min:104kB low:128kB high:152kB
> active_anon:5660kB inactive_anon:6156kB active_file:56kB inactive_file:64kB
> unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB
> slab_reclaimable:444kB slab_unreclaimable:1208kB kernel_stack:32kB
> pagetables:592kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> kernel: lowmem_reserve[]: 0 327 327 327 327
> kernel: Node 0 DMA32 free:5012kB min:2264kB low:2828kB high:3392kB
> active_anon:143580kB inactive_anon:143300kB active_file:2576kB
> inactive_file:2560kB unevictable:0kB writepending:0kB present:376688kB
> managed:353968kB mlocked:0kB slab_reclaimable:13708kB
> slab_unreclaimable:18064kB kernel_stack:2352kB pagetables:12888kB bounce:0kB
> free_pcp:412kB local_pcp:88kB free_cma:0kB
> kernel: lowmem_reserve[]: 0 0 0 0 0
> kernel: Node 0 DMA: 70*4kB (UMEH) 20*8kB (UMEH) 13*16kB (MH) 5*32kB (H)
> 4*64kB (H) 2*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 1576kB
> kernel: Node 0 DMA32: 1134*4kB (UMEH) 25*8kB (UMEH) 13*16kB (MH) 7*32kB (H)
> 3*64kB (H) 0*128kB 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5616kB
 
Although the DMA32 zone has enough free memory, that free memory includes
H (highatomic) pageblocks, which are reserved for high-order atomic
allocations. That might be why the allocation cannot pass the watermark check.
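
If it helps to confirm that on the affected guest, the number of pageblocks
currently held back as HighAtomic and the per-zone watermarks can be read
from procfs (no special tooling assumed; the HighAtomic column should show
how many 2MB blocks are reserved per zone):

cat /proc/pagetypeinfo   # pageblock counts per migratetype, incl. HighAtomic
cat /proc/zoneinfo       # per-zone free pages and min/low/high watermarks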

I tried to solve this during the 4.9 cycle by using up the reserved memory
before going OOM, and that change was merged into 4.10, but I think there is
still a hole, so could you apply this patch on top of your 4.10? (To be
clear, it cannot be applied to 4.9.)

>From 9779a1c5d32e2edb64da5cdfcd6f9737b94a247a Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Mon, 27 Feb 2017 17:39:06 +0900
Subject: [PATCH] mm: use up highatomic before OOM kill

Not-Yet-Signed-off-by: Minchan Kim 
---
 mm/page_alloc.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..e073cca4969e 100644
--- a/mm/page_alloc.

Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-27 Thread Michal Hocko
On Sun 26-02-17 09:40:42, Gerhard Wiesinger wrote:
> On 04.01.2017 10:11, Michal Hocko wrote:
> >>The VM stops working (e.g. not pingable) after around 8h (will be restarted
> >>automatically), happened several times.
> >>
> >>Had also further OOMs which I sent to Minchan.
> >Could you post them to the mailing list as well, please?
> 
> Still OOMs on dnf update procedure with kernel 4.10 (4.10.0-1.fc26.x86_64) as
> well as on 4.9.9-200.fc25.x86_64
> 
> On 4.10er kernels:
[...]
> kernel: Node 0 DMA32 free:5012kB min:2264kB low:2828kB high:3392kB
> active_anon:143580kB inactive_anon:143300kB active_file:2576kB
> inactive_file:2560kB unevictable:0kB writepending:0kB present:376688kB
> managed:353968kB mlocked:0kB slab_reclaimable:13708kB
> slab_unreclaimable:18064kB kernel_stack:2352kB pagetables:12888kB bounce:0kB
> free_pcp:412kB local_pcp:88kB free_cma:0kB
[...]

> On 4.9er kernels:
[...]
> kernel: Node 0 DMA32 free:3356kB min:2668kB low:3332kB high:3996kB
> active_anon:122148kB inactive_anon:112068kB active_file:81324kB
> inactive_file:101972kB unevictable:0kB writepending:4648kB present:507760kB
> managed:484384kB mlocked:0kB slab_reclaimable:17660kB
> slab_unreclaimable:21404kB kernel_stack:2432kB pagetables:10124kB bounce:0kB
> free_pcp:120kB local_pcp:0kB free_cma:0kB

In both cases the amount of free memory is above the min watermark, so
we shouldn't be hitting the oom. We might have somebody freeing memory
after the last attempt, though...
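
Spelling that out with the quoted numbers: on 4.10 the DMA32 zone has
free:5012kB against min:2264kB, and on 4.9 free:3356kB against min:2668kB
(only just above its low watermark of 3332kB), so the plain free-vs-min
comparison does pass in both cases.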

[...]
> This should be very easy to reproduce with a low-memory VM (e.g. 192MB) under KVM
> running Fedora 25 on ext4, with some memory load while updating the VM.
> 
> Any further progress?

The linux-next (resp. mmotm tree) has new tracepoints which should help
to tell us more about what is going on here. Could you try to enable
oom/reclaim_retry_zone and vmscan/mm_vmscan_direct_reclaim_{begin,end}?
-- 
Michal Hocko
SUSE Labs


Re: Still OOM problems with 4.9er/4.10er kernels

2017-02-26 Thread Gerhard Wiesinger

On 04.01.2017 10:11, Michal Hocko wrote:

The VM stops working (e.g. not pingable) after around 8h (will be restarted
automatically), happened several times.

Had also further OOMs which I sent to Minchan.

Could you post them to the mailing list as well, please?


Still OOMs on dnf update procedure with kernel 4.10
(4.10.0-1.fc26.x86_64) as well as on 4.9.9-200.fc25.x86_64


On 4.10er kernels:

Free swap  = 1137532kB

cat /etc/sysctl.d/* | grep ^vm
vm.dirty_background_ratio = 3
vm.dirty_ratio = 15
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.swappiness=10

kernel: python invoked oom-killer: 
gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, 
order=0, oom_score_adj=0

kernel: python cpuset=/ mems_allowed=0
kernel: CPU: 1 PID: 813 Comm: python Not tainted 4.10.0-1.fc26.x86_64 #1
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.3 04/01/2014

kernel: Call Trace:
kernel:  dump_stack+0x63/0x84
kernel:  dump_header+0x7b/0x1f6
kernel:  ? do_try_to_free_pages+0x2c5/0x340
kernel:  oom_kill_process+0x202/0x3d0
kernel:  out_of_memory+0x2b7/0x4e0
kernel:  __alloc_pages_slowpath+0x915/0xb80
kernel:  __alloc_pages_nodemask+0x218/0x2d0
kernel:  alloc_pages_current+0x93/0x150
kernel:  __page_cache_alloc+0xcf/0x100
kernel:  filemap_fault+0x39d/0x800
kernel:  ? page_add_file_rmap+0xe5/0x200
kernel:  ? filemap_map_pages+0x2e1/0x4e0
kernel:  ext4_filemap_fault+0x36/0x50
kernel:  __do_fault+0x21/0x110
kernel:  handle_mm_fault+0xdd1/0x1410
kernel:  ? swake_up+0x42/0x50
kernel:  __do_page_fault+0x23f/0x4c0
kernel:  trace_do_page_fault+0x41/0x120
kernel:  do_async_page_fault+0x51/0xa0
kernel:  async_page_fault+0x28/0x30
kernel: RIP: 0033:0x7f0681ad6350
kernel: RSP: 002b:7ffcbdd238d8 EFLAGS: 00010246
kernel: RAX: 7f0681b0f960 RBX:  RCX: 7fff
kernel: RDX:  RSI: 3ff0 RDI: 3ff0
kernel: RBP: 7f067461ab40 R08:  R09: 3ff0
kernel: R10: 556f1c6d8a80 R11: 0001 R12: 7f0676d1a8d0
kernel: R13:  R14: 7f06746168bc R15: 7f0674385910
kernel: Mem-Info:
kernel: active_anon:37423 inactive_anon:37512 isolated_anon:0
 active_file:462 inactive_file:603 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:3538 slab_unreclaimable:4818
 mapped:859 shmem:9 pagetables:3370 bounce:0
 free:1650 free_pcp:103 free_cma:0
kernel: Node 0 active_anon:149380kB inactive_anon:149704kB 
active_file:1848kB inactive_file:3660kB unevictable:0kB 
isolated(anon):128kB isolated(file):0kB mapped:4580kB dirty:0kB 
writeback:380kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 
36kB writeback_tmp:0kB unstable:0kB pages_scanned:352 all_unreclaimable? no
kernel: Node 0 DMA free:1484kB min:104kB low:128kB high:152kB 
active_anon:5660kB inactive_anon:6156kB active_file:56kB 
inactive_file:64kB unevictable:0kB writepending:0kB present:15992kB 
managed:15908kB mlocked:0kB slab_reclaimable:444kB 
slab_unreclaimable:1208kB kernel_stack:32kB pagetables:592kB bounce:0kB 
free_pcp:0kB local_pcp:0kB free_cma:0kB

kernel: lowmem_reserve[]: 0 327 327 327 327
kernel: Node 0 DMA32 free:5012kB min:2264kB low:2828kB high:3392kB 
active_anon:143580kB inactive_anon:143300kB active_file:2576kB 
inactive_file:2560kB unevictable:0kB writepending:0kB present:376688kB 
managed:353968kB mlocked:0kB slab_reclaimable:13708kB 
slab_unreclaimable:18064kB kernel_stack:2352kB pagetables:12888kB 
bounce:0kB free_pcp:412kB local_pcp:88kB free_cma:0kB

kernel: lowmem_reserve[]: 0 0 0 0 0
kernel: Node 0 DMA: 70*4kB (UMEH) 20*8kB (UMEH) 13*16kB (MH) 5*32kB (H) 
4*64kB (H) 2*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 
1576kB
kernel: Node 0 DMA32: 1134*4kB (UMEH) 25*8kB (UMEH) 13*16kB (MH) 7*32kB 
(H) 3*64kB (H) 0*128kB 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 
5616kB
kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
hugepages_size=2048kB

kernel: 6561 total pagecache pages
kernel: 5240 pages in swap cache
kernel: Swap cache stats: add 100078658, delete 100073419, find 
199458343/238460223

kernel: Free swap  = 1137532kB
kernel: Total swap = 2064380kB
kernel: 98170 pages RAM
kernel: 0 pages HighMem/MovableOnly
kernel: 5701 pages reserved
kernel: 0 pages cma reserved
kernel: 0 pages hwpoisoned
kernel: Out of memory: Kill process 11968 (clamscan) score 170 or 
sacrifice child
kernel: Killed process 11968 (clamscan) total-vm:538120kB, 
anon-rss:182220kB, file-rss:464kB, shmem-rss:0kB


On 4.9er kernels:

Free swap  = 1826688kB

cat /etc/sysctl.d/* | grep ^vm
vm.dirty_background_ratio=3
vm.dirty_ratio=15
vm.overcommit_memory=2
vm.overcommit_ratio=80
vm.swappiness=10

kernel: dnf invoked oom-killer: 
gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, 
order=0, oom_score_adj=0

kernel: dnf cpuset=/ mems_allowed=0
kernel: CPU: 0 PID: 20049 Comm: dnf Not tainted 4.9.9-200.fc25.x86_64 #1
kernel: Hardw