Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-25 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > > should provide a reasonable base for testing. You said you weren't able
> > > > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > > > still interested to see those oom reports and potential anomalies in the
> > > > > isolated counts before I send the patch for inclusion so your further
> > > > > testing would be more than appreciated. Also stalls > 10s without any
> > > > > previous occurrences would be interesting.
> > > > 
> > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > > linux-next.
> > > 
> > > So I assume that you didn't see the lockup with the patch applied and
> > > the OOM killer has resolved the situation by killing other tasks, right?
> > > Can I assume your Tested-by?
> > 
> > No. I tested linux-next-20170221, which does not include your patch.
> > I didn't test linux-next-20170221 with your patch applied. Your patch will
> > avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
> > But we need other people to test different workloads. Thus, I suggest you
> > send your patches to linux-next without my testing.
> 
> I will send the patch to Andrew later after merge window closes. It
> would be really helpful, though, to see how it handles your workload
> which is known to reproduce the oom starvation.

I tested http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz
on top of linux-next-20170221 with kmallocwd applied.

I did not hit the too_many_isolated() loop problem, but I hit an "unable to
invoke the OOM killer due to !__GFP_FS allocation" lockup problem, shown below.
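
(For reference, the reason a !__GFP_FS allocation cannot invoke the OOM killer
on its own is the early bail-out at the top of out_of_memory(). The snippet
below is a simplified sketch of that check as I understand it for kernels of
this era, not a verbatim copy of mm/oom_kill.c.)

/*
 * Simplified sketch, not verbatim kernel code: out_of_memory() refuses to
 * kill anything for IO-less (!__GFP_FS) requests and instead pretends
 * progress was made, so a GFP_NOFS allocation that cannot reclaim anything
 * keeps looping in the allocator and relies on some other __GFP_FS
 * allocation to trigger the OOM killer for it.
 */
bool out_of_memory(struct oom_control *oc)
{
	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * A zero gfp_mask (pagefault_out_of_memory) is excluded.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
		return true;	/* claim success, kill nothing */

	/* ... victim selection and killing happen below this point ... */
	return true;
}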

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170226.txt.xz .
--
[  444.281177] Killed process 9477 (a.out) total-vm:4168kB, anon-rss:84kB, 
file-rss:0kB, shmem-rss:0kB
[  444.287046] oom_reaper: reaped process 9477 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[  484.810225] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 
stuck for 38s!
[  484.812907] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 
stuck for 41s!
[  484.815546] Showing busy workqueues and worker pools:
[  484.817595] workqueue events: flags=0x0
[  484.819456]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256
[  484.821666] pending: vmpressure_work_fn, vmstat_shepherd, 
vmw_fb_dirty_flush [vmwgfx]
[  484.824356]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  484.826582] pending: drain_local_pages_wq BAR(9595), e1000_watchdog 
[e1000]
[  484.829091]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  484.831325] in-flight: 7418:rht_deferred_worker
[  484.86] pending: rht_deferred_worker
[  484.835346] workqueue events_long: flags=0x0
[  484.837343]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  484.839566] pending: gc_worker [nf_conntrack]
[  484.841691] workqueue events_power_efficient: flags=0x80
[  484.843873]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  484.846103] pending: fb_flashcursor
[  484.847928]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  484.850149] pending: neigh_periodic_work, neigh_periodic_work
[  484.852403] workqueue events_freezable_power_: flags=0x84
[  484.854534]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.85] in-flight: 27:disk_events_workfn
[  484.858621] workqueue writeback: flags=0x4e
[  484.860347]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  484.862415] in-flight: 8444:wb_workfn wb_workfn
[  484.864602] workqueue vmstat: flags=0xc
[  484.866291]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.868307] pending: vmstat_update
[  484.869876]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  484.871864] pending: vmstat_update
[  484.874058] workqueue mpt_poll_0: flags=0x8
[  484.875698]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.877602] pending: mpt_fault_reset_work [mptbase]
[  484.879502] workqueue xfs-buf/sda1: flags=0xc
[  484.881148]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[  484.883011] pending: xfs_buf_ioend_work [xfs]
[  484.884706] workqueue xfs-data/sda1: flags=0xc
[  484.886367]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY
[  484.888410] in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io 
[xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 
10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs]
[  484.893483] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io 
[xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], 
xfs_end_io [xfs], x

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-21 Thread Michal Hocko
On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > should provide a reasonable base for testing. You said you weren't able
> > > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > > still interested to see those oom reports and potential anomalies in the
> > > > isolated counts before I send the patch for inclusion so your further
> > > > testing would be more than appreciated. Also stalls > 10s without any
> > > > previous occurrences would be interesting.
> > > 
> > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > linux-next.
> > 
> > So I assume that you didn't see the lockup with the patch applied and
> > the OOM killer has resolved the situation by killing other tasks, right?
> > Can I assume your Tested-by?
> 
> No. I tested linux-next-20170221, which does not include your patch.
> I didn't test linux-next-20170221 with your patch applied. Your patch will
> avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
> But we need other people to test different workloads. Thus, I suggest you
> send your patches to linux-next without my testing.

I will send the patch to Andrew later after merge window closes. It
would be really helpful, though, to see how it handles your workload
which is known to reproduce the oom starvation.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-21 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, so it seems that all the distractions are handled now and linux-next
> > > should provide a reasonable base for testing. You said you weren't able
> > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > still interested to see those oom reports and potential anomalies in the
> > > isolated counts before I send the patch for inclusion so your further
> > > testing would be more than appreciated. Also stalls > 10s without any
> > > previous occurrences would be interesting.
> > 
> > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > the infinite too_many_isolated() loop problem. Please send your patches to
> > linux-next.
> 
> So I assume that you didn't see the lockup with the patch applied and
> the OOM killer has resolved the situation by killing other tasks, right?
> Can I assume your Tested-by?

No. I tested linux-next-20170221, which does not include your patch.
I didn't test linux-next-20170221 with your patch applied. Your patch will
avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
But we need other people to test different workloads. Thus, I suggest you
send your patches to linux-next without my testing.

> 
> Thanks for your testing!
> -- 
> Michal Hocko
> SUSE Labs
> 


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-21 Thread Michal Hocko
On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, so it seems that all the distractions are handled now and linux-next
> > should provide a reasonable base for testing. You said you weren't able
> > to reproduce the original long stalls on too_many_isolated(). I would be
> > still interested to see those oom reports and potential anomalies in the
> > isolated counts before I send the patch for inclusion so your further
> > testing would be more than appreciated. Also stalls > 10s without any
> > previous occurrences would be interesting.
> 
> I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> the infinite too_many_isolated() loop problem. Please send your patches to
> linux-next.

So I assume that you didn't see the lockup with the patch applied and
the OOM killer has resolved the situation by killing other tasks, right?
Can I assume your Tested-by?

Thanks for your testing!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-21 Thread Tetsuo Handa
Michal Hocko wrote:
> OK, so it seems that all the distractions are handled now and linux-next
> should provide a reasonable base for testing. You said you weren't able
> to reproduce the original long stalls on too_many_isolated(). I would be
> still interested to see those oom reports and potential anomalies in the
> isolated counts before I send the patch for inclusion so your further
> testing would be more than appreciated. Also stalls > 10s without any
> previous occurrences would be interesting.

I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
the infinite too_many_isolated() loop problem. Please send your patches to
linux-next.
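
For context, the loop in question is the too_many_isolated() wait at the top of
shrink_inactive_list(), visible in the systemd-journal backtrace further down
(congestion_wait() called from shrink_inactive_list()). Below is a simplified
sketch of that path, paraphrased from mm/vmscan.c of this era rather than
copied verbatim.

/*
 * Simplified sketch (not verbatim mm/vmscan.c): direct reclaimers wait here
 * until the number of isolated pages drops.  If every task that could make
 * progress is itself stuck in this loop (or stuck on a !__GFP_FS
 * allocation), NR_ISOLATED_* never decreases and the loop never terminates;
 * only tasks with a fatal signal pending are allowed to bail out.
 */
static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
		struct lruvec *lruvec, struct scan_control *sc,
		enum lru_list lru)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	int file = is_file_lru(lru);

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

	/* ... page isolation, shrink_page_list(), and putback follow here ... */
	return 0;
}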

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170221.txt.xz .

[ 1160.162013] Out of memory: Kill process 7523 (a.out) score 998 or sacrifice 
child
[ 1160.164422] Killed process 7523 (a.out) total-vm:4168kB, anon-rss:84kB, 
file-rss:0kB, shmem-rss:0kB
[ 1160.169699] oom_reaper: reaped process 7523 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 
oom_count=45896
[ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 
uninterruptible
[ 1209.799726] kswapd0 D1093667  2 0x
[ 1209.807326] Call Trace:
[ 1209.812581]  __schedule+0x336/0xe00
[ 1209.818599]  schedule+0x3d/0x90
[ 1209.823907]  schedule_timeout+0x26a/0x510
[ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
[ 1209.830535]  __down_common+0xfb/0x131
[ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.837372]  __down+0x1d/0x1f
[ 1209.840331]  down+0x41/0x50
[ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1209.857388]  ? free_debug_processing+0x27d/0x2af
[ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1209.874182]  ? save_stack_trace+0x1b/0x20
[ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
[ 1209.884841]  pageout.isra.54+0x1a4/0x460
[ 1209.886210]  shrink_page_list+0xa86/0xcf0
[ 1209.887441]  shrink_inactive_list+0x1c5/0x660
[ 1209.888682]  shrink_node_memcg+0x535/0x7f0
[ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.891197]  shrink_node+0xe1/0x310
[ 1209.892288]  kswapd+0x362/0x9b0
[ 1209.893308]  kthread+0x10f/0x150
[ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1209.895703]  ? kthread_create_on_node+0x70/0x70
[ 1209.896956]  ret_from_fork+0x31/0x40
[ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 
seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 
uninterruptible
[ 1209.902154] systemd-journal D11240   526  1 0x
[ 1209.903642] Call Trace:
[ 1209.904574]  __schedule+0x336/0xe00
[ 1209.905734]  schedule+0x3d/0x90
[ 1209.906817]  schedule_timeout+0x20d/0x510
[ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
[ 1209.909268]  ? lock_timer_base+0xa0/0xa0
[ 1209.910460]  io_schedule_timeout+0x1e/0x50
[ 1209.911681]  congestion_wait+0x86/0x260
[ 1209.912853]  ? remove_wait_queue+0x60/0x60
[ 1209.914115]  shrink_inactive_list+0x5b4/0x660
[ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.916768]  shrink_node_memcg+0x535/0x7f0
[ 1209.918173]  shrink_node+0xe1/0x310
[ 1209.919288]  do_try_to_free_pages+0xe1/0x300
[ 1209.920548]  try_to_free_pages+0x131/0x3f0
[ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.925790]  alloc_pages_current+0x97/0x1b0
[ 1209.927021]  ? find_get_entry+0x5/0x300
[ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
[ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
[ 1209.930716]  filemap_fault+0x4df/0x8b0
[ 1209.931867]  ? filemap_fault+0x373/0x8b0
[ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1209.935857]  ? down_read_nested+0x7b/0xc0
[ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1209.939691]  __do_fault+0x1e/0xa0
[ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
[ 1209.94

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-21 Thread Michal Hocko
On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoid the too_many_isolated() issue. I can't tell
> > > > whether it has any negative effect, but on the first trial I found that all
> > > > allocating threads were blocked on wait_for_completion() from flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > > workqueue context". There was no warn_alloc() stall warning message
> > > > afterwards.
> > > 
> > > That patch is buggy and there is a follow up [1] which is not sitting in 
> > > the
> > > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using WQ from the page
> > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> > > 
> > > [1] 
> > > http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net
> > 
> > Did you get chance to test with this follow up patch? It would be
> > interesting to see whether OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
> 
> So far I can't reproduce any problems except the two listed below (the
> cond_resched() trap in printk() and the IDLE priority trap are excluded
> from the list).

OK, so it seems that all the distractions are handled now and linux-next
should provide a reasonable base for testing. You said you weren't able
to reproduce the original long stalls on too_many_isolated(). I would be
still interested to see those oom reports and potential anomalies in the
isolated counts before I send the patch for inclusion so your further
testing would be more than appreciated. Also stalls > 10s without any
previous occurrences would be interesting.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-08 Thread Peter Zijlstra
On Tue, Feb 07, 2017 at 10:12:12PM +0100, Michal Hocko wrote:
> This is moot - 
> http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org

Thanks! I was just about to go stare at it in more detail.



Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-07 Thread Michal Hocko
On Mon 06-02-17 11:39:18, Michal Hocko wrote:
> On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > I got the same warning with ext4. Maybe we need to check carefully.
> > 
> > [  511.215743] =
> > [  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order 
> > detected
> > [  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
> > [  511.221689] -
> > [  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> > [  511.225533]  (cpu_hotplug.dep_map){++}, at: [] 
> > get_online_cpus+0x37/0x80
> > [  511.227795] 
> > [  511.227795] and this task is already holding:
> > [  511.230082]  (jbd2_handle){-.}, at: [] 
> > start_this_handle+0x1a7/0x590
> > [  511.232592] which would create a new lock dependency:
> > [  511.234192]  (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++}
> > [  511.235966] 
> > [  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> > [  511.238563]  (jbd2_handle){-.}
> > [  511.238564] 
> > [  511.238564] ... which became RECLAIM_FS-irq-safe at:
> > [  511.242078]   
> > [  511.242084] [] __lock_acquire+0x34b/0x1640
> > [  511.244495] [] lock_acquire+0xc9/0x250
> > [  511.246697] [] jbd2_log_wait_commit+0x55/0x1d0
> [...]
> > [  511.276216] to a RECLAIM_FS-irq-unsafe lock:
> > [  511.278128]  (cpu_hotplug.dep_map){++}
> > [  511.278130] 
> > [  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
> > [  511.281809] ...
> > [  511.281811]   
> > [  511.282598] [] mark_held_locks+0x71/0x90
> > [  511.284854] [] lockdep_trace_alloc+0x6f/0xd0
> > [  511.287218] [] kmem_cache_alloc_node_trace+0x48/0x3b0
> > [  511.289755] [] __smpboot_create_thread.part.2+0x35/0xf0
> > [  511.292329] [] smpboot_create_threads+0x66/0x90
> [...]
> > [  511.317867] other info that might help us debug this:
> > [  511.317867] 
> > [  511.320920]  Possible interrupt unsafe locking scenario:
> > [  511.320920] 
> > [  511.323218]CPU0CPU1
> > [  511.324622]
> > [  511.325973]   lock(cpu_hotplug.dep_map);
> > [  511.327246]local_irq_disable();
> > [  511.328870]lock(jbd2_handle);
> > [  511.330483]lock(cpu_hotplug.dep_map);
> > [  511.332259]   
> > [  511.333187] lock(jbd2_handle);
> 
> Peter, is there any way to tell lockdep that this is in fact
> reclaim safe? The direct reclaim only does the trylock and backs off so
> we cannot deadlock here.
> 
> Or am I misinterpreting the trace?

This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-07 Thread Brian Foster
On Tue, Feb 07, 2017 at 07:30:54PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > > The workload is to write to a single file on XFS from 10 processes 
> > > demonstrated at
> > > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB 
> > > RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion 
> > > failures.
> > > 
> > 
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> 
> It takes 10 to 15 minutes. Maybe some size threshold is involved?
> 
> > /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> > you still have the output file from a test that reproduced, could you
> > get the 'xfs_io -c "fiemap -v" ' output?
> 
> Here it is.
> 
> [  720.199748] 0 pages HighMem/MovableOnly
> [  720.199749] 150524 pages reserved
> [  720.199749] 0 pages cma reserved
> [  720.199750] 0 pages hwpoisoned
> [  722.187335] XFS: Assertion failed: oldlen > newlen, file: 
> fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [  722.201784] [ cut here ]
...
> 
> # ls -l /tmp/file
> -rw--- 1 kumaneko kumaneko 43426648064 Feb  7 19:25 /tmp/file
> # xfs_io -c "fiemap -v" /tmp/file
> /tmp/file:
>  EXT: FILE-OFFSET  BLOCK-RANGETOTAL FLAGS
>0: [0..262015]: 358739712..359001727  262016   0x0
...
>  187: [84810808..84901119]: 110211736..110302047   90312   0x1

Ok, from the size of the file I realized that I missed you were running
in a loop the first time around. I tried playing with it some more and
still haven't been able to reproduce.

Anyways, the patch intended to fix this has been reviewed[1] and queued
for the next release, so it's probably not a big deal since you've
already verified it. Thanks again.

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg04083.html


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-07 Thread Tetsuo Handa
Brian Foster wrote:
> > The workload is to write to a single file on XFS from 10 processes 
> > demonstrated at
> > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion 
> > failures.
> > 
> 
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.

It takes 10 to 15 minutes. Maybe some size threshold is involved?

> /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> you still have the output file from a test that reproduced, could you
> get the 'xfs_io -c "fiemap -v" ' output?

Here it is.

[  720.199748] 0 pages HighMem/MovableOnly
[  720.199749] 150524 pages reserved
[  720.199749] 0 pages cma reserved
[  720.199750] 0 pages hwpoisoned
[  722.187335] XFS: Assertion failed: oldlen > newlen, file: 
fs/xfs/libxfs/xfs_bmap.c, line: 2867
[  722.201784] [ cut here ]
[  722.205940] WARNING: CPU: 0 PID: 4877 at fs/xfs/xfs_message.c:105 
asswarn+0x33/0x40 [xfs]
[  722.212333] Modules linked in: nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge 
stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter 
ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul 
vmw_vsock_vmci_transport crc32_pclmul ghash_clmulni_intel vsock aesni_intel 
crypto_simd cryptd glue_helper ppdev vmw_balloon pcspkr sg parport_pc i2c_piix4 
shpchp vmw_vmci parport ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic 
pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[  722.243207]  sysimgblt fb_sys_fops mptspi scsi_transport_spi ata_piix ahci 
ttm mptscsih libahci drm libata mptbase e1000 i2c_core
[  722.247704] CPU: 0 PID: 4877 Comm: write Not tainted 
4.10.0-rc6-next-20170202 #498
[  722.250612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[  722.254089] Call Trace:
[  722.255751]  dump_stack+0x85/0xc9
[  722.257650]  __warn+0xd1/0xf0
[  722.259420]  warn_slowpath_null+0x1d/0x20
[  722.261434]  asswarn+0x33/0x40 [xfs]
[  722.263356]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[  722.265695]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[  722.267792]  ? xfs_ilock+0x1c9/0x360 [xfs]
[  722.269559]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[  722.271606]  ? iomap_write_end+0x80/0x80
[  722.273377]  iomap_apply+0x6c/0x130
[  722.274969]  iomap_file_buffered_write+0x68/0xa0
[  722.276702]  ? iomap_write_end+0x80/0x80
[  722.278311]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  722.280394]  ? _raw_spin_unlock+0x27/0x40
[  722.282247]  xfs_file_write_iter+0x90/0x130 [xfs]
[  722.284257]  __vfs_write+0xe5/0x140
[  722.285924]  vfs_write+0xc7/0x1f0
[  722.287536]  ? syscall_trace_enter+0x1d0/0x380
[  722.289490]  SyS_write+0x58/0xc0
[  722.291025]  do_int80_syscall_32+0x6c/0x1f0
[  722.292671]  entry_INT80_compat+0x38/0x50
[  722.294298] RIP: 0023:0x8048076
[  722.295684] RSP: 002b:ffedf840 EFLAGS: 0202 ORIG_RAX: 
0004
[  722.298075] RAX: ffda RBX: 0001 RCX: 08048000
[  722.300516] RDX: 1000 RSI:  RDI: 
[  722.302902] RBP:  R08:  R09: 
[  722.305278] R10:  R11:  R12: 
[  722.307567] R13:  R14:  R15: 
[  722.309792] ---[ end trace 5b7012eeb84093b7 ]---
[  732.650867] oom_reaper: reaped process 4876 (oom-write), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB

# ls -l /tmp/file
-rw--- 1 kumaneko kumaneko 43426648064 Feb  7 19:25 /tmp/file
# xfs_io -c "fiemap -v" /tmp/file
/tmp/file:
 EXT: FILE-OFFSET  BLOCK-RANGETOTAL FLAGS
   0: [0..262015]: 358739712..359001727  262016   0x0
   1: [262016..524159]:367651920..367914063  262144   0x0
   2: [524160..1048447]:   385063864..385588151  524288   0x0
   3: [1048448..1238031]:  463702512..463892095  189584   0x0
   4: [1238032..3335167]:  448234520..450331655 2097136   0x0
   5: [3335168..4769775]:  36165320..37599927   1434608   0x0
   6: [4769776..6897175]:  31677984..33805383   2127400   0x0
   7: [6897176..15285759]: 450331656..458720239 8388584   0x0
   8: [15285760..18520255]: 237497528..240732023 3234496   0x0
   9: [18520256..21063607]: 229750248..232293599 2543352   0x0
  10: [21063608..25257855]: 240732024..244926271 4194248   0x0
  11: [25257856..29452159]: 179523440..183717743 4194304   0x0
  12: [29452160..30380031]

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-06 Thread Brian Foster
On Mon, Feb 06, 2017 at 03:42:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 09:35:33, Brian Foster wrote:
> > On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > > Brian Foster wrote:
> > > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > > [Let's CC more xfs people]
> > > > > 
> > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > (1) I got an assertion failure.
> > > > > 
> > > > > I suspect this is a result of
> > > > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > > > > I have no idea what the assert means though.
> > > > > 
> > > > > > 
> > > > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> > > > > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now 
> > > > > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> > > > > > fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > > 
> > > > Indirect block reservation underrun on delayed allocation extent merge.
> > > > These extra blocks are used for the inode bmap btree when a delalloc
> > > > extent is converted to physical blocks. We're in a case where we expect
> > > > to only ever free excess blocks due to a merge of extents with
> > > > independent reservations, but a situation occurs where we actually need
> > > > blocks and hence the assert fails. This can occur if an extent is merged
> > > > with one that has a reservation less than the expected worst case
> > > > reservation for its size (due to previous extent splits due to hole
> > > > punches, for example). Therefore, I think the core expectation that
> > > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > > pre-reserved is invalid.
> > > > 
> > > > Can you describe the workload that reproduces this? FWIW, I think the
> > > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > > and have a couple patches to fix up indlen reservation that I haven't
> > > > posted yet. The diff that deals with this particular bit is appended.
> > > > Care to give that a try?
> > > 
> > > The workload is to write to a single file on XFS from 10 processes 
> > > demonstrated at
> > > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB 
> > > RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion 
> > > failures.
> > > 
> > 
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> > 
> > Normally I'd expect to see this kind of thing on a hole punching
> > workload or dealing with large, sparse files that make use of
> > speculative preallocation (post-eof blocks allocated in anticipation of
> > file extending writes). I'm wondering if what is happening here is that
> > the appending writes and file closes due to oom kills are generating
> > speculative preallocs and prealloc truncates, respectively, and that
> > causes prealloc extents at the eof boundary to be split up and then
> > re-merged by surviving appending writers.
> 
> Can those preallocs be affected by
> http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org ?
> 

Hmm, I wouldn't expect that to make much of a difference wrt the core
problem. The prealloc is created on a file extending write that requires
block allocation (we basically just tack on extra blocks to an extending
alloc based on some heuristics like the size of the file and the
previous extent). Whether that allocation occurs on one iomap iteration
or another due to a short write and retry, I wouldn't expect to matter
that much.

I suppose it could change the behavior of a specialized workload though.
E.g., if it caused a write() call to return quicker and thus led to a
file close(). We do use file release as an indication that the prealloc will
not be used, and we reclaim it at that point (presumably causing an
extent split with pre-eof blocks).

Brian

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-06 Thread Michal Hocko
On Mon 06-02-17 09:35:33, Brian Foster wrote:
> On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > Brian Foster wrote:
> > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > [Let's CC more xfs people]
> > > > 
> > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > [...]
> > > > > (1) I got an assertion failure.
> > > > 
> > > > I suspect this is a result of
> > > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > > > I have no idea what the assert means though.
> > > > 
> > > > > 
> > > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> > > > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now 
> > > > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> > > > > fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > 
> > > Indirect block reservation underrun on delayed allocation extent merge.
> > > These extra blocks are used for the inode bmap btree when a delalloc
> > > extent is converted to physical blocks. We're in a case where we expect
> > > to only ever free excess blocks due to a merge of extents with
> > > independent reservations, but a situation occurs where we actually need
> > > blocks and hence the assert fails. This can occur if an extent is merged
> > > with one that has a reservation less than the expected worst case
> > > reservation for its size (due to previous extent splits due to hole
> > > punches, for example). Therefore, I think the core expectation that
> > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > pre-reserved is invalid.
> > > 
> > > Can you describe the workload that reproduces this? FWIW, I think the
> > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > and have a couple patches to fix up indlen reservation that I haven't
> > > posted yet. The diff that deals with this particular bit is appended.
> > > Care to give that a try?
> > 
> > The workload is to write to a single file on XFS from 10 processes 
> > demonstrated at
> > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion 
> > failures.
> > 
> 
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.
> 
> Normally I'd expect to see this kind of thing on a hole punching
> workload or dealing with large, sparse files that make use of
> speculative preallocation (post-eof blocks allocated in anticipation of
> file extending writes). I'm wondering if what is happening here is that
> the appending writes and file closes due to oom kills are generating
> speculative preallocs and prealloc truncates, respectively, and that
> causes prealloc extents at the eof boundary to be split up and then
> re-merged by surviving appending writers.

Can those preallocs be affected by
http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org ?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-06 Thread Brian Foster
On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > [Let's CC more xfs people]
> > > 
> > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > [...]
> > > > (1) I got an assertion failure.
> > > 
> > > I suspect this is a result of
> > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > > I have no idea what the assert means though.
> > > 
> > > > 
> > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> > > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now 
> > > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> > > > fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > 
> > Indirect block reservation underrun on delayed allocation extent merge.
> > These extra blocks are used for the inode bmap btree when a delalloc
> > extent is converted to physical blocks. We're in a case where we expect
> > to only ever free excess blocks due to a merge of extents with
> > independent reservations, but a situation occurs where we actually need
> > blocks and hence the assert fails. This can occur if an extent is merged
> > with one that has a reservation less than the expected worst case
> > reservation for its size (due to previous extent splits due to hole
> > punches, for example). Therefore, I think the core expectation that
> > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > pre-reserved is invalid.
> > 
> > Can you describe the workload that reproduces this? FWIW, I think the
> > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > and have a couple patches to fix up indlen reservation that I haven't
> > posted yet. The diff that deals with this particular bit is appended.
> > Care to give that a try?
> 
> The workload is to write to a single file on XFS from 10 processes 
> demonstrated at
> http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion 
> failures.
> 

Thanks for testing. Well, that's an interesting workload. I couldn't
reproduce on a few quick tries in a similarly configured vm.

Normally I'd expect to see this kind of thing on a hole punching
workload or dealing with large, sparse files that make use of
speculative preallocation (post-eof blocks allocated in anticipation of
file extending writes). I'm wondering if what is happening here is that
the appending writes and file closes due to oom kills are generating
speculative preallocs and prealloc truncates, respectively, and that
causes prealloc extents at the eof boundary to be split up and then
re-merged by surviving appending writers.

/tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
you still have the output file from a test that reproduced, could you
get the 'xfs_io -c "fiemap -v" ' output?

I suppose another possibility is that prealloc occurs, write failures lead
to extent splits via unmapping the target range of the write, and then
surviving writers generate the warning on a delalloc extent merge.

Brian

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-06 Thread Michal Hocko
On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> I got the same warning with ext4. Maybe we need to check carefully.
> 
> [  511.215743] =
> [  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order 
> detected
> [  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
> [  511.221689] -
> [  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> [  511.225533]  (cpu_hotplug.dep_map){++}, at: [] 
> get_online_cpus+0x37/0x80
> [  511.227795] 
> [  511.227795] and this task is already holding:
> [  511.230082]  (jbd2_handle){-.}, at: [] 
> start_this_handle+0x1a7/0x590
> [  511.232592] which would create a new lock dependency:
> [  511.234192]  (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++}
> [  511.235966] 
> [  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> [  511.238563]  (jbd2_handle){-.}
> [  511.238564] 
> [  511.238564] ... which became RECLAIM_FS-irq-safe at:
> [  511.242078]   
> [  511.242084] [] __lock_acquire+0x34b/0x1640
> [  511.244495] [] lock_acquire+0xc9/0x250
> [  511.246697] [] jbd2_log_wait_commit+0x55/0x1d0
[...]
> [  511.276216] to a RECLAIM_FS-irq-unsafe lock:
> [  511.278128]  (cpu_hotplug.dep_map){++}
> [  511.278130] 
> [  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
> [  511.281809] ...
> [  511.281811]   
> [  511.282598] [] mark_held_locks+0x71/0x90
> [  511.284854] [] lockdep_trace_alloc+0x6f/0xd0
> [  511.287218] [] kmem_cache_alloc_node_trace+0x48/0x3b0
> [  511.289755] [] __smpboot_create_thread.part.2+0x35/0xf0
> [  511.292329] [] smpboot_create_threads+0x66/0x90
[...]
> [  511.317867] other info that might help us debug this:
> [  511.317867] 
> [  511.320920]  Possible interrupt unsafe locking scenario:
> [  511.320920] 
> [  511.323218]CPU0CPU1
> [  511.324622]
> [  511.325973]   lock(cpu_hotplug.dep_map);
> [  511.327246]local_irq_disable();
> [  511.328870]lock(jbd2_handle);
> [  511.330483]lock(cpu_hotplug.dep_map);
> [  511.332259]   
> [  511.333187] lock(jbd2_handle);

Peter, is there any way to tell lockdep that this is in fact
reclaim safe? The direct reclaim only does the trylock and backs off so
we cannot deadlock here.

Or am I misinterpreting the trace?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-06 Thread Michal Hocko
On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
[...]
> The one below is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT?

No, GFP_NOWAIT is just too generic to have this flag added to it.

> [  257.781715] Out of memory: Kill process 5171 (a.out) score 842 or 
> sacrifice child
> [  257.784726] Killed process 5171 (a.out) total-vm:2177096kB, 
> anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB
> [  257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 
> mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN)
> [  257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 
> 4.10.0-rc6-next-20170202+ #500
> [  257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  257.794700] Call Trace:
> [  257.795690]  dump_stack+0x85/0xc9
> [  257.797224]  __alloc_pages_slowpath+0xacb/0xe36
> [  257.798612]  __alloc_pages_nodemask+0x382/0x3d0
> [  257.799942]  alloc_pages_current+0x97/0x1b0
> [  257.801236]  __get_free_pages+0x14/0x50
> [  257.802546]  __tlb_remove_page_size+0x70/0xd0

This is bounded by MAX_GATHER_BATCH_COUNT, which shouldn't be a lot of
pages (20 or so). We could add __GFP_NOMEMALLOC to tlb_next_batch() but
I am not entirely convinced it is really necessary.
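
For reference, the allocation in question sits in tlb_next_batch() in
mm/memory.c. The sketch below is a simplified paraphrase of that function
(not a verbatim copy), with the discussed and unmerged __GFP_NOMEMALLOC
addition marked; only the gfp flags line differs from the stock code.

static bool tlb_next_batch(struct mmu_gather *tlb)
{
	struct mmu_gather_batch *batch;

	batch = tlb->active;
	if (batch->next) {
		tlb->active = batch->next;
		return true;
	}

	/* at most MAX_GATHER_BATCH_COUNT extra pages per munmap/exit */
	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
		return false;

	/* proposed, not merged here: also pass __GFP_NOMEMALLOC so a
	 * TIF_MEMDIE task does not dip into memory reserves for this */
	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN |
					 __GFP_NOMEMALLOC, 0);
	if (!batch)
		return false;

	tlb->batch_count++;
	batch->next = NULL;
	batch->nr   = 0;
	batch->max  = MAX_GATHER_BATCH;

	tlb->active->next = batch;
	tlb->active = batch;

	return true;
}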
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-05 Thread Tetsuo Handa
Brian Foster wrote:
> On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > [Let's CC more xfs people]
> > 
> > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > [...]
> > > (1) I got an assertion failure.
> > 
> > I suspect this is a result of
> > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > I have no idea what the assert means though.
> > 
> > > 
> > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now 
> > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> > > fs/xfs/libxfs/xfs_bmap.c, line: 2867
> 
> Indirect block reservation underrun on delayed allocation extent merge.
> These extra blocks are used for the inode bmap btree when a delalloc
> extent is converted to physical blocks. We're in a case where we expect
> to only ever free excess blocks due to a merge of extents with
> independent reservations, but a situation occurs where we actually need
> blocks and hence the assert fails. This can occur if an extent is merged
> with one that has a reservation less than the expected worst case
> reservation for its size (due to previous extent splits due to hole
> punches, for example). Therefore, I think the core expectation that
> xfs_bmap_add_extent_hole_delay() will always have enough blocks
> pre-reserved is invalid.
> 
> Can you describe the workload that reproduces this? FWIW, I think the
> way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> and have a couple patches to fix up indlen reservation that I haven't
> posted yet. The diff that deals with this particular bit is appended.
> Care to give that a try?

The workload is to write to a single file on XFS from 10 processes demonstrated 
at
http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-05 Thread Tetsuo Handa
Michal Hocko wrote:
> [CC Petr]
> 
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (2) I got a lockdep warning. (A new false positive?)
> 
> Yes, I suspect this is a false positive. I do not see how we can
> deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which
> means that a potential recursion to the page allocator during draining
> would just bail out on the trylock. Maybe I am misinterpreting the
> report though.
> 

I got the same warning with ext4. Maybe we need to check carefully.

[  511.215743] =
[  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
[  511.221689] -
[  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[  511.225533]  (cpu_hotplug.dep_map){++}, at: [] 
get_online_cpus+0x37/0x80
[  511.227795] 
[  511.227795] and this task is already holding:
[  511.230082]  (jbd2_handle){-.}, at: [] 
start_this_handle+0x1a7/0x590
[  511.232592] which would create a new lock dependency:
[  511.234192]  (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++}
[  511.235966] 
[  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[  511.238563]  (jbd2_handle){-.}
[  511.238564] 
[  511.238564] ... which became RECLAIM_FS-irq-safe at:
[  511.242078]   
[  511.242084] [] __lock_acquire+0x34b/0x1640
[  511.244492]   
[  511.244495] [] lock_acquire+0xc9/0x250
[  511.246694]   
[  511.246697] [] jbd2_log_wait_commit+0x55/0x1d0
[  511.249323]   
[  511.249328] [] jbd2_complete_transaction+0x71/0x90
[  511.252069]   
[  511.252074] [] ext4_evict_inode+0x356/0x760
[  511.254753]   
[  511.254757] [] evict+0xd1/0x1a0
[  511.257062]   
[  511.257065] [] dispose_list+0x4d/0x80
[  511.259531]   
[  511.259535] [] prune_icache_sb+0x5a/0x80
[  511.261953]   
[  511.261957] [] super_cache_scan+0x141/0x190
[  511.264540]   
[  511.264545] [] shrink_slab+0x29f/0x6d0
[  511.267165]   
[  511.267171] [] shrink_node+0x2fa/0x310
[  511.269455]   
[  511.269459] [] kswapd+0x362/0x9b0
[  511.271831]   
[  511.271834] [] kthread+0x10f/0x150
[  511.274031]   
[  511.274035] [] ret_from_fork+0x31/0x40
[  511.276216] 
[  511.276216] to a RECLAIM_FS-irq-unsafe lock:
[  511.278128]  (cpu_hotplug.dep_map){++}
[  511.278130] 
[  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
[  511.281809] ...
[  511.281811]   
[  511.282598] [] mark_held_locks+0x71/0x90
[  511.284852]   
[  511.284854] [] lockdep_trace_alloc+0x6f/0xd0
[  511.287215]   
[  511.287218] [] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.289751]   
[  511.289755] [] __smpboot_create_thread.part.2+0x35/0xf0
[  511.292326]   
[  511.292329] [] smpboot_create_threads+0x66/0x90
[  511.295025]   
[  511.295030] [] cpuhp_invoke_callback+0x229/0x9e0
[  511.299245]   
[  511.299253] [] cpuhp_up_callbacks+0x37/0xb0
[  511.301889]   
[  511.301894] [] _cpu_up+0x89/0xf0
[  511.304270]   
[  511.304275] [] do_cpu_up+0x85/0xb0
[  511.306428]   
[  511.306431] [] cpu_up+0x13/0x20
[  511.308533]   
[  511.308535] [] smp_init+0x6b/0xcc
[  511.310710]   
[  511.310713] [] kernel_init_freeable+0x17d/0x2ac
[  511.313232]   
[  511.313235] [] kernel_init+0xe/0x110
[  511.315616]   
[  511.315620] [] ret_from_fork+0x31/0x40
[  511.317867] 
[  511.317867] other info that might help us debug this:
[  511.317867] 
[  511.320920]  Possible interrupt unsafe locking scenario:
[  511.320920] 
[  511.323218]CPU0CPU1
[  511.324622]
[  511.325973]   lock(cpu_hotplug.dep_map);
[  511.327246]local_irq_disable();
[  511.328870]lock(jbd2_handle);
[  511.330483]lock(cpu_hotplug.dep_map);
[  511.332259]   
[  511.333187] lock(jbd2_handle);
[  511.334304] 
[  511.334304]  *** DEADLOCK ***
[  511.334304] 
[  511.336749] 4 locks held by a.out/49302:
[  511.338129]  #0:  (sb_writers#8){.+.+.+}, at: [] 
mnt_want_write+0x24/0x50
[  511.340768]  #1:  (&type->i_mutex_dir_key#3){++}, at: 
[] path_openat+0x60b/0xd50
[  511.343744]  #2:  (jbd2_handle){-.}, at: [] 
start_this_handle+0x1a7/0x590
[  511.345743]  #3:  (pcpu_drain_mutex){+.+...}, at: [] 
drain_all_pages.part.89+0x1f/0x2c0
[  511.348605] 
[  511.348605] the dependencies between RECLAIM_FS-irq-safe lock and the 
holding lock:
[  511.351336] -> (jbd2_handle){-.} ops: 203220 {
[  511.352768]HARDIRQ-ON-W at:
[  511.353827] 
[  511.353833] [] __lock_acquire+0x9de/0x1640
[  511.356489] 
[  511.356492] [] lock_acquire+0xc9/0x250
[  511.359063] 
[  511.359067] [] jbd2_log_wait_commit+0x55/0x1d0
[  511.361905] 
[  511.361908] [] jbd2_complete_transaction+0x71/0x90
[  511.364560] 
[  511.364563] [] ext4_sync_file+0x2

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-03 Thread Brian Foster
On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> [Let's CC more xfs people]
> 
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (1) I got an assertion failure.
> 
> I suspect this is a result of
> http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> I have no idea what the assert means though.
> 
> > 
> > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now 
> > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> > fs/xfs/libxfs/xfs_bmap.c, line: 2867

Indirect block reservation underrun on delayed allocation extent merge.
These extra blocks are used for the inode bmap btree when a delalloc
extent is converted to physical blocks. We're in a case where we expect
to only ever free excess blocks due to a merge of extents with
independent reservations, but a situation occurs where we actually need
blocks and hence the assert fails. This can occur if an extent is merged
with one that has a reservation less than the expected worst case
reservation for its size (due to previous extent splits due to hole
punches, for example). Therefore, I think the core expectation that
xfs_bmap_add_extent_hole_delay() will always have enough blocks
pre-reserved is invalid.
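
To illustrate the scenario with made-up numbers (purely illustrative, not
taken from the report or from kernel code): merging two delalloc extents is
normally expected to free part of the combined indlen reservation, but a
previously split extent can arrive under-reserved, so the merged worst case
exceeds what is actually held and the assert fires.

/*
 * Illustrative numbers only:
 *
 *   left extent:  100 blocks, indlen reservation 1  (under-reserved after
 *                 an earlier split, e.g. from a hole punch or prealloc trim)
 *   new extent:   100 blocks, indlen reservation 4  (fully reserved)
 *
 *   oldlen = 1 + 4 = 5                        blocks currently reserved
 *   newlen = worst_indlen(200 blocks) = e.g. 7  worst case for the merge
 *
 * The code assumes oldlen > newlen (a merge only ever frees reservation),
 * so with numbers like these the assertion fails; the fix clamps newlen to
 * oldlen via XFS_FILBLKS_MIN(), as in the diff appended below.
 */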

Can you describe the workload that reproduces this? FWIW, I think the
way xfs_bmap_add_extent_hole_delay() currently works is likely broken
and have a couple patches to fix up indlen reservation that I haven't
posted yet. The diff that deals with this particular bit is appended.
Care to give that a try?

Brian

> > [  972.125085] [ cut here ]
> > [  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 
> > asswarn+0x33/0x40 [xfs]
> > [  972.136146] Modules linked in: nf_conntrack_netbios_ns 
> > nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> > nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul 
> > ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd 
> > cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat 
> > nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle 
> > ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
> > nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables 
> > ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp 
> > i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c 
> > sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx 
> > drm_kms_helper syscopyarea sysfillrect
> > [  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi 
> > scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> > [  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 
> > 4.10.0-rc6-next-20170202 #498
> > [  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> > Desktop Reference Platform, BIOS 6.00 07/02/2015
> > [  972.178381] Call Trace:
...

---8<---

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bfc00de..d2e48ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2809,7 +2809,8 @@ xfs_bmap_add_extent_hole_delay(
oldlen = startblockval(left.br_startblock) +
startblockval(new->br_startblock) +
startblockval(right.br_startblock);
-   newlen = xfs_bmap_worst_indlen(ip, temp);
+   newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+oldlen);
xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
nullstartblock((int)newlen));
trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2830,7 +2831,8 @@ xfs_bmap_add_extent_hole_delay(
xfs_bmbt_set_blockcount(xfs_iext_get_ext(ifp, *idx), temp);
oldlen = startblockval(left.br_startblock) +
startblockval(new->br_startblock);
-   newlen = xfs_bmap_worst_indlen(ip, temp);
+   newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+oldlen);
xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
nullstartblock((int)newlen));
trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2846,7 +2848,8 @@ xfs_bmap_add_extent_hole_delay(
temp = new->br_blockcount + right.br_blockcount;
oldlen = startblockval(new->br_startblock) +
startblockval(right.br_startblock);
-   newlen = xfs_bmap_worst_indlen(ip, temp);
+   newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+oldlen);

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-03 Thread Michal Hocko
[CC Petr]

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
[...]
> (2) I got a lockdep warning. (A new false positive?)

Yes, I suspect this is a false positive. I do not see how we can
deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which
means that a potential recursion to the page allocator during draining
would just bail out on the trylock. Maybe I am misinterpreting the
report though.
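
For reference, the bail-out being relied on here is the trylock at the top of
drain_all_pages() as modified by the workqueue-drain patches. The sketch below
is a paraphrase of how I read that code, not a verbatim copy of
mm/page_alloc.c: direct reclaim passes zone == NULL, so when another drain
already holds pcpu_drain_mutex it simply returns instead of blocking.

void drain_all_pages(struct zone *zone)
{
	/*
	 * Do not wait if a drain is already in progress unless the caller
	 * asked for a specific zone (CMA, memory hotplug) and needs the
	 * drain to be complete when this returns.
	 */
	if (unlikely(!mutex_trylock(&pcpu_drain_mutex))) {
		if (!zone)
			return;			/* opportunistic drain, just give up */
		mutex_lock(&pcpu_drain_mutex);	/* zone-specific callers must wait */
	}

	/* ... queue and flush drain_local_pages_wq on each cpu with pcp pages ... */

	mutex_unlock(&pcpu_drain_mutex);
}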

> [  243.036975] =
> [  243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order 
> detected
> [  243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted
> [  243.054619] -
> [  243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> [  243.060310]  (cpu_hotplug.dep_map){++}, at: [] 
> get_online_cpus+0x32/0x80
> [  243.063462] 
> [  243.063462] and this task is already holding:
> [  243.066851]  (&xfs_dir_ilock_class){-.}, at: [] 
> xfs_ilock+0x114/0x290 [xfs]
> [  243.069949] which would create a new lock dependency:
> [  243.072143]  (&xfs_dir_ilock_class){-.} -> 
> (cpu_hotplug.dep_map){++}
> [  243.074789] 
> [  243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> [  243.078735]  (&xfs_dir_ilock_class){-.}
> [  243.078739] 
> [  243.078739] ... which became RECLAIM_FS-irq-safe at:
> [  243.084175]   
> [  243.084180] [] __lock_acquire+0x344/0x1bb0
> [  243.087257]   
> [  243.087261] [] lock_acquire+0xe0/0x2a0
> [  243.090027]   
> [  243.090033] [] down_write_nested+0x59/0xc0
> [  243.092838]   
> [  243.092888] [] xfs_ilock+0x14e/0x290 [xfs]
> [  243.095453]   
> [  243.095485] [] xfs_reclaim_inode+0x135/0x340 [xfs]
> [  243.098083]   
> [  243.098109] [] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
> [  243.100668]   
> [  243.100692] [] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
> [  243.103191]   
> [  243.103221] [] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
> [  243.105710]   
> [  243.105714] [] super_cache_scan+0x17c/0x190
> [  243.107947]   
> [  243.107950] [] shrink_slab+0x29a/0x710
> [  243.110133]   
> [  243.110135] [] shrink_node+0x23d/0x320
> [  243.112262]   
> [  243.112264] [] kswapd+0x354/0xa10
> [  243.114323]   
> [  243.114326] [] kthread+0x10a/0x140
> [  243.116448]   
> [  243.116452] [] ret_from_fork+0x31/0x40
> [  243.118692] 
> [  243.118692] to a RECLAIM_FS-irq-unsafe lock:
> [  243.120636]  (cpu_hotplug.dep_map){++}
> [  243.120638] 
> [  243.120638] ... which became RECLAIM_FS-irq-unsafe at:
> [  243.124021] ...
> [  243.124022]   
> [  243.124820] [] mark_held_locks+0x71/0x90
> [  243.127033]   
> [  243.127035] [] lockdep_trace_alloc+0xc5/0x110
> [  243.129228]   
> [  243.129231] [] kmem_cache_alloc_node_trace+0x4a/0x410
> [  243.131534]   
> [  243.131536] [] __smpboot_create_thread.part.3+0x30/0xf0
> [  243.133850]   
> [  243.133852] [] smpboot_create_threads+0x61/0x90
> [  243.136113]   
> [  243.136119] [] cpuhp_invoke_callback+0xbb/0xb70
> [  243.138319]   
> [  243.138320] [] cpuhp_up_callbacks+0x32/0xb0
> [  243.140479]   
> [  243.140480] [] _cpu_up+0x84/0xf0
> [  243.142484]   
> [  243.142485] [] do_cpu_up+0x84/0xd0
> [  243.144716]   
> [  243.144719] [] cpu_up+0xe/0x10
> [  243.146684]   
> [  243.146687] [] smp_init+0xd5/0x141
> [  243.148755]   
> [  243.148758] [] kernel_init_freeable+0x17d/0x2a7
> [  243.150932]   
> [  243.150936] [] kernel_init+0x9/0x100
> [  243.153088]   
> [  243.153092] [] ret_from_fork+0x31/0x40
> [  243.155135] 
> [  243.155135] other info that might help us debug this:
> [  243.155135] 
> [  243.157724]  Possible interrupt unsafe locking scenario:
> [  243.157724] 
> [  243.159877]CPU0CPU1
> [  243.161047]
> [  243.162210]   lock(cpu_hotplug.dep_map);
> [  243.163279]local_irq_disable();
> [  243.164669]lock(&xfs_dir_ilock_class);
> [  243.166148]lock(cpu_hotplug.dep_map);
> [  243.167653]   
> [  243.168594] lock(&xfs_dir_ilock_class);
> [  243.169694] 
> [  243.169694]  *** DEADLOCK ***
> [  243.169694] 
> [  243.171864] 3 locks held by awk/8767:
> [  243.172872]  #0:  (&type->i_mutex_dir_key#3){++}, at: 
> [] path_openat+0x53c/0xa90
> [  243.174791]  #1:  (&xfs_dir_ilock_class){-.}, at: [] 
> xfs_ilock+0x114/0x290 [xfs]
> [  243.176899]  #2:  (pcpu_drain_mutex){+.+...}, at: [] 
> drain_all_pages.part.80+0x1a/0x320
> [  243.178875] 
> [  243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the 
> holding lock:
> [  243.181262] -> (&xfs_dir_ilock_class){-.} ops: 17348 {
> [  243.182610]HARDIRQ-ON-W at:
> [  243.183603] 
> [  243.183606] [] __lock_acquire+0x794/0x1bb0
> [  243.186056] 
> [  243.186059] [] lock_acquire+0xe0/0x2a0
> [  243.188419] 
> [  243.188422] [] down_write_nested+0x59/0xc0
> [  243.190909]

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-03 Thread Michal Hocko
[Let's CC more xfs people]

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
[...]
> (1) I got an assertion failure.

I suspect this is a result of
http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
I have no idea what the assert means though.

> 
> [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
> anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, 
> file-rss:0kB, shmem-rss:0kB
> [  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
> fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [  972.125085] [ cut here ]
> [  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 
> asswarn+0x33/0x40 [xfs]
> [  972.136146] Modules linked in: nf_conntrack_netbios_ns 
> nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul 
> ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd 
> cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat 
> nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle 
> ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
> nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables 
> ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp 
> i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c 
> sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx 
> drm_kms_helper syscopyarea sysfillrect
> [  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi 
> scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> [  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 
> 4.10.0-rc6-next-20170202 #498
> [  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  972.178381] Call Trace:
> [  972.180003]  dump_stack+0x85/0xc9
> [  972.181682]  __warn+0xd1/0xf0
> [  972.183374]  warn_slowpath_null+0x1d/0x20
> [  972.185223]  asswarn+0x33/0x40 [xfs]
> [  972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
> [  972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
> [  972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
> [  972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
> [  972.195300]  ? iomap_write_end+0x80/0x80
> [  972.196980]  iomap_apply+0x6c/0x130
> [  972.198539]  iomap_file_buffered_write+0x68/0xa0
> [  972.200316]  ? iomap_write_end+0x80/0x80
> [  972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> [  972.203868]  ? _raw_spin_unlock+0x27/0x40
> [  972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
> [  972.207167]  __vfs_write+0xe5/0x140
> [  972.208752]  vfs_write+0xc7/0x1f0
> [  972.210233]  ? syscall_trace_enter+0x1d0/0x380
> [  972.211809]  SyS_write+0x58/0xc0
> [  972.213166]  do_int80_syscall_32+0x6c/0x1f0
> [  972.214676]  entry_INT80_compat+0x38/0x50
> [  972.216168] RIP: 0023:0x8048076
> [  972.217494] RSP: 002b:ff997020 EFLAGS: 0202 ORIG_RAX: 
> 0004
> [  972.219635] RAX: ffda RBX: 0001 RCX: 
> 08048000
> [  972.221679] RDX: 1000 RSI:  RDI: 
> 
> [  972.223774] RBP:  R08:  R09: 
> 
> [  972.225905] R10:  R11:  R12: 
> 
> [  972.227946] R13:  R14:  R15: 
> 
> [  972.230064] ---[ end trace d498098daec56c11 ]---
> [  984.210890] vmtoolsd invoked oom-killer: 
> gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  
> order=0, oom_score_adj=0
> [  984.224191] vmtoolsd cpuset=/ mems_allowed=0
> [  984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: GW   
> 4.10.0-rc6-next-20170202 #498
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-03 Thread Michal Hocko
On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > > tell whether it has any negative effect, but I got on the first trial 
> > > > that
> > > > all allocating threads are blocked on wait_for_completion() from 
> > > > flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages 
> > > > from
> > > > workqueue context". There was no warn_alloc() stall warning message 
> > > > afterwards.
> > > 
> > > That patch is buggy and there is a follow up [1] which is not sitting in 
> > > the
> > > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using WQ from the page
> > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> > > 
> > > [1] 
> > > http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net
> > 
> > Did you get chance to test with this follow up patch? It would be
> > interesting to see whether OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
> 
> So far I can't reproduce problems except two listed below (cond_resched() trap
> in printk() and IDLE priority trap are excluded from the list). But I agree 
> that
> the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible
> that an allocation request which can trigger the OOM killer waits for the
> system_wq while there is already a work which is in system_wq which is looping
> forever inside the page allocator without triggering the OOM killer.

Well, this shouldn't happen AFAICS because a new worker would be
requested and that would certainly require memory, and that allocation
would trigger the OOM killer. On the other hand I agree that it would be
safer not to depend on memory allocation from within the page allocator.

> Maybe the follow up patch can share the vmstat WQ?

Yes, this would be an option.
-- 
Michal Hocko
SUSE Labs
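For illustration, a minimal sketch of the kind of dedicated rescuer-backed
workqueue being discussed ("WQ_RECLAIM" above presumably means WQ_MEM_RECLAIM;
the name pcpu_drain_wq is hypothetical). WQ_MEM_RECLAIM guarantees a rescuer
thread, so queued drain work can make progress even when the allocator is
stuck and cannot fork new kworkers:
--
static struct workqueue_struct *pcpu_drain_wq;

static int __init pcpu_drain_wq_init(void)
{
	/* A rescuer thread is created up front for WQ_MEM_RECLAIM queues. */
	pcpu_drain_wq = alloc_workqueue("pcpu_drain_wq", WQ_MEM_RECLAIM, 0);
	BUG_ON(!pcpu_drain_wq);
	return 0;
}
early_initcall(pcpu_drain_wq_init);

/*
 * drain_all_pages() would then queue its per-cpu work here instead of on
 * the system_wq, e.g. queue_work_on(cpu, pcpu_drain_wq, work).
 */
--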


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-03 Thread Tetsuo Handa
Michal Hocko wrote:
> On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> [...]
> > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > tell whether it has any negative effect, but I got on the first trial that
> > > all allocating threads are blocked on wait_for_completion() from 
> > > flush_work()
> > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages 
> > > from
> > > workqueue context". There was no warn_alloc() stall warning message 
> > > afterwards.
> > 
> > That patch is buggy and there is a follow up [1] which is not sitting in the
> > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > I cannot say I would be too happy about using WQ from the page
> > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> > 
> > [1] 
> > http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net
> 
> Did you get chance to test with this follow up patch? It would be
> interesting to see whether OOM situation can still starve the waiter.
> The current linux-next should contain this patch.

So far I can't reproduce problems except the two listed below (the cond_resched()
trap in printk() and the IDLE priority trap are excluded from the list). But I
agree that the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically
possible that an allocation request which can trigger the OOM killer waits for
the system_wq while there is already a work item in the system_wq that is
looping forever inside the page allocator without triggering the OOM killer.
Maybe the follow up patch can share the vmstat WQ?

(1) I got an assertion failure.

[  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, 
anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
[  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[  972.114644] XFS: Assertion failed: oldlen > newlen, file: 
fs/xfs/libxfs/xfs_bmap.c, line: 2867
[  972.125085] [ cut here ]
[  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 
asswarn+0x33/0x40 [xfs]
[  972.136146] Modules linked in: nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul 
ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd 
ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle 
ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 
vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom 
ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper 
syscopyarea sysfillrect
[  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi 
scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
[  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 
4.10.0-rc6-next-20170202 #498
[  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[  972.178381] Call Trace:
[  972.180003]  dump_stack+0x85/0xc9
[  972.181682]  __warn+0xd1/0xf0
[  972.183374]  warn_slowpath_null+0x1d/0x20
[  972.185223]  asswarn+0x33/0x40 [xfs]
[  972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[  972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[  972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
[  972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[  972.195300]  ? iomap_write_end+0x80/0x80
[  972.196980]  iomap_apply+0x6c/0x130
[  972.198539]  iomap_file_buffered_write+0x68/0xa0
[  972.200316]  ? iomap_write_end+0x80/0x80
[  972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  972.203868]  ? _raw_spin_unlock+0x27/0x40
[  972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
[  972.207167]  __vfs_write+0xe5/0x140
[  972.208752]  vfs_write+0xc7/0x1f0
[  972.210233]  ? syscall_trace_enter+0x1d0/0x380
[  972.211809]  SyS_write+0x58/0xc0
[  972.213166]  do_int80_syscall_32+0x6c/0x1f0
[  972.214676]  entry_INT80_compat+0x38/0x50
[  972.216168] RIP: 0023:0x8048076
[  972.217494] RSP: 002b:ff997020 EFLAGS: 0202 ORIG_RAX: 
0004
[  972.219635] RAX: ffda RBX: 0001 RCX: 08048000
[  972.221679] RDX: 1000 RSI:  RDI: 
[  972.223774] RBP:  R08:  R09: 
[  972.225905] R10:  R11:  R12: 
[  972.227946] R13:  R14:  R15: 
[  972.230064] ---[ end trace d498098daec56c11 ]---
[  984.210890] vmtoolsd invoked oom-killer: 
gfp_mask=0x14201ca(GFP_HIGHUSER_

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-02-02 Thread Michal Hocko
On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
[...]
> > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > tell whether it has any negative effect, but I got on the first trial that
> > all allocating threads are blocked on wait_for_completion() from 
> > flush_work()
> > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > workqueue context". There was no warn_alloc() stall warning message 
> > afterwards.
> 
> That patch is buggy and there is a follow up [1] which is not sitting in the
> mmotm (and thus linux-next) yet. I didn't get to review it properly and
> I cannot say I would be too happy about using WQ from the page
> allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> 
> [1] 
> http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net

Did you get chance to test with this follow up patch? It would be
interesting to see whether OOM situation can still starve the waiter.
The current linux-next should contain this patch.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-31 Thread Michal Hocko
On Tue 31-01-17 13:51:40, Christoph Hellwig wrote:
> On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> > What do you think Christoph? I have an additional patch to handle
> > do_generic_file_read and a similar one to back off in
> > __vmalloc_area_node. I would like to post them all in one series but I
> > would like to know that this one is OK before I do that.
> 
> Well, that patch you posted is okay, but you probably need additional
> ones for the other interesting users of iomap_apply.

I have checked all of them I guess/hope. Which one do you have in mind?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-31 Thread Christoph Hellwig
On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> What do you think Christoph? I have an additional patch to handle
> do_generic_file_read and a similar one to back off in
> __vmalloc_area_node. I would like to post them all in one series but I
> would like to know that this one is OK before I do that.

Well, that patch you posted is okay, but you probably need additional
ones for the other interesting users of iomap_apply.


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-31 Thread Michal Hocko
On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume 
> > > > > the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > > 
> > > > Hi Michal,
> > > > 
> > > > this looks reasonable to me.  But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > > 
> > > I wasn't sure about the expected semantic of iomap_apply but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwrite seem do not follow the standard
> > > pattern to return the number of written pages or an error but it rather
> > > propagates the error out. From my limited understanding of those code
> > > paths that should just be ok. I was not all that sure about iomap_dio_rw
> > > that is just too convoluted for me. If that one is OK as well then
> > > the following patch should be indeed better.
> > 
> > Is "length" in
> > 
> >written = actor(inode, pos, length, data, &iomap);
> > 
> > call guaranteed to be small enough? If not guaranteed,
> > don't we need to check SIGKILL inside "actor" functions?
> 
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them which is also where we
> get the pagecache page so I guess this should be the "right" place to
> put the check in. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think Christoph?

What do you think Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all in one series but I
would like to know that this one is OK before I do that.

Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
>   __alloc_pages_nodemask+0x436/0x4d0
>   alloc_pages_current+0x97/0x1b0
>   __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
>   pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
>   grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
>   iomap_write_begin+0x50/0xd0 fs/iomap.c:118
>   iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
>   ? iomap_write_end+0x80/0x80 fs/iomap.c:150
>   iomap_apply+0xb3/0x130  fs/iomap.c:79
>   iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
>   ? iomap_write_end+0x80/0x80
>   xfs_file_buffered_aio_write+0x132/0x390 [xfs]
>   ? remove_wait_queue+0x59/0x60
>   xfs_file_write_iter+0x90/0x130 [xfs]
>   __vfs_write+0xe5/0x140
>   vfs_write+0xc7/0x1f0
>   ? syscall_trace_enter+0x1d0/0x380
>   SyS_write+0x58/0xc0
>   do_syscall_64+0x6c/0x200
>   entry_SYSCALL64_slow_path+0x25/0x25
> 
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa 
> Signed-off-by: Michal Hocko 
> ---
>  fs/dax.c   | 5 +
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-30 Thread Michal Hocko
On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo,
> > before we settle on the proper fix for this issue, could you give the
> > patch a try and try to reproduce the too_many_isolated() issue or
> > just see whether patch [1] has any negative effect on your oom stress
> > testing?
> > 
> > [1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz
> 
> I tested with both [1] and below patch applied on linux-next-20170125 and
> the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .
> 
> Regarding below patch, it helped avoiding complete memory depletion with
> large write() request. I don't know whether below patch helps avoiding
> complete memory depletion when reading large amount (in other words, I
> don't know whether this check is done for large read() request).

It's not AFAICS. do_generic_file_read doesn't do the
fatal_signal_pending check.

> But
> I believe that __GFP_KILLABLE (despite the limitation that there are
> unkillable waits in the reclaim path) is better solution compared to
> scattering around fatal_signal_pending() in the callers. The reason
> we check SIGKILL here is to avoid allocating memory more than needed.
> If we check SIGKILL in the entry point of __alloc_pages_nodemask() and
> retry: label in __alloc_pages_slowpath(), we waste 0 page. Regardless
> of whether the OOM killer is invoked, whether memory can be allocated
> without direct reclaim operation, not allocating memory unless needed
> (in other words, allow page allocator fail immediately if the caller
> can give up on SIGKILL and SIGKILL is pending) makes sense. It will
> reduce possibility of OOM livelock on CONFIG_MMU=n kernels where the
> OOM reaper is not available.

I am not really convinced this is a good idea. Putting aside the fuzzy
semantics of __GFP_KILLABLE, we would have to use this flag in all
potentially allocating places in the read/write paths, and then it is just
easier to do the explicit checks in the loops around those
allocations.
 
> > On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> > [...]
> > > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko 
> > > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal 
> > > signals
> > > 
> > > Tetsuo has noticed that an OOM stress test which performs large write
> > > requests can cause the full memory reserves depletion. He has tracked
> > > this down to the following path
> > >   __alloc_pages_nodemask+0x436/0x4d0
> > >   alloc_pages_current+0x97/0x1b0
> > >   __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
> > >   pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
> > >   grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > >   iomap_write_begin+0x50/0xd0 fs/iomap.c:118
> > >   iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
> > >   ? iomap_write_end+0x80/0x80 fs/iomap.c:150
> > >   iomap_apply+0xb3/0x130  fs/iomap.c:79
> > >   iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
> > >   ? iomap_write_end+0x80/0x80
> > >   xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > >   ? remove_wait_queue+0x59/0x60
> > >   xfs_file_write_iter+0x90/0x130 [xfs]
> > >   __vfs_write+0xe5/0x140
> > >   vfs_write+0xc7/0x1f0
> > >   ? syscall_trace_enter+0x1d0/0x380
> > >   SyS_write+0x58/0xc0
> > >   do_syscall_64+0x6c/0x200
> > >   entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > the oom victim has access to all memory reserves to make a forward
> > > progress to exit easier. But iomap_file_buffered_write and other callers
> > > of iomap_apply loop to complete the full request. We need to check for
> > > fatal signals and back off with a short write instead. As the
> > > iomap_apply delegates all the work down to the actor we have to hook
> > > into those. All callers that work with the page cache are calling
> > > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > > has to handle the situation explicitly because it copies data to the
> > > userspace directly. Other callers like iomap_page_mkwrite work on a
> > > single page or iomap_fiemap_actor do not allocate memory based on the
> > > given len.
> > > 
> > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > > Cc: stable # 4.8+
> > > Reported-by: Tetsuo Handa 
> > > Signed-off-by: Michal Hocko 
> > > ---
> > >  fs/dax.c   | 5 +
> > >  fs/iomap.c | 3 +++
> > >  2 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 413a91db9351..0e263dacf9cf 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> > > loff_t length, void *data,
> > >   struct blk_dax_ctl dax = { 0 };
> > >   ssize_t map_len;
> > >  
> > > + if (fatal_signal_pending(current)) {
> > > + ret = -EINTR;
> > > + break;
> > >

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-28 Thread Tetsuo Handa
Michal Hocko wrote:
> Tetsuo,
> before we settle on the proper fix for this issue, could you give the
> patch a try and try to reproduce the too_many_isolated() issue or
> just see whether patch [1] has any negative effect on your oom stress
> testing?
> 
> [1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz

I tested with both [1] and below patch applied on linux-next-20170125 and
the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .

Regarding the patch below, it helped avoid complete memory depletion with
large write() requests. I don't know whether it also helps avoid
complete memory depletion when reading large amounts (in other words, I
don't know whether this check is done for large read() requests). But
I believe that __GFP_KILLABLE (despite the limitation that there are
unkillable waits in the reclaim path) is a better solution than
scattering fatal_signal_pending() checks around the callers. The reason
we check SIGKILL here is to avoid allocating more memory than needed.
If we check SIGKILL at the entry point of __alloc_pages_nodemask() and at
the retry: label in __alloc_pages_slowpath(), we waste zero pages. Regardless
of whether the OOM killer is invoked or whether memory can be allocated
without a direct reclaim operation, not allocating memory unless needed
(in other words, allowing the page allocator to fail immediately if the caller
can give up on SIGKILL and SIGKILL is pending) makes sense. It will
reduce the possibility of OOM livelock on CONFIG_MMU=n kernels where the
OOM reaper is not available.
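To make the proposal above concrete: __GFP_KILLABLE does not exist in the
kernel, it is only being proposed in this thread, so the following is a purely
hypothetical sketch of where such a check could sit (flag value and helper
name invented for illustration):
--
/* Hypothetical flag, proposed in this thread; not part of the kernel. */
#define __GFP_KILLABLE	((__force gfp_t)0x4000000u)

/*
 * Would be called at the entry point of __alloc_pages_nodemask() and at the
 * retry: label of __alloc_pages_slowpath(): fail the allocation right away
 * instead of allocating pages a SIGKILLed caller will never use.
 */
static inline bool gfp_killable_should_fail(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current);
}
--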

> 
> On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> [...]
> > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> > 
> > Tetsuo has noticed that an OOM stress test which performs large write
> > requests can cause the full memory reserves depletion. He has tracked
> > this down to the following path
> > __alloc_pages_nodemask+0x436/0x4d0
> > alloc_pages_current+0x97/0x1b0
> > __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
> > pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
> > grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > iomap_write_begin+0x50/0xd0 fs/iomap.c:118
> > iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
> > ? iomap_write_end+0x80/0x80 fs/iomap.c:150
> > iomap_apply+0xb3/0x130  fs/iomap.c:79
> > iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
> > ? iomap_write_end+0x80/0x80
> > xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > ? remove_wait_queue+0x59/0x60
> > xfs_file_write_iter+0x90/0x130 [xfs]
> > __vfs_write+0xe5/0x140
> > vfs_write+0xc7/0x1f0
> > ? syscall_trace_enter+0x1d0/0x380
> > SyS_write+0x58/0xc0
> > do_syscall_64+0x6c/0x200
> > entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > the oom victim has access to all memory reserves to make a forward
> > progress to exit easier. But iomap_file_buffered_write and other callers
> > of iomap_apply loop to complete the full request. We need to check for
> > fatal signals and back off with a short write instead. As the
> > iomap_apply delegates all the work down to the actor we have to hook
> > into those. All callers that work with the page cache are calling
> > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > has to handle the situation explicitly because it copies data to the
> > userspace directly. Other callers like iomap_page_mkwrite work on a
> > single page or iomap_fiemap_actor do not allocate memory based on the
> > given len.
> > 
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Cc: stable # 4.8+
> > Reported-by: Tetsuo Handa 
> > Signed-off-by: Michal Hocko 
> > ---
> >  fs/dax.c   | 5 +
> >  fs/iomap.c | 3 +++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 413a91db9351..0e263dacf9cf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > struct blk_dax_ctl dax = { 0 };
> > ssize_t map_len;
> >  
> > +   if (fatal_signal_pending(current)) {
> > +   ret = -EINTR;
> > +   break;
> > +   }
> > +
> > dax.sector = dax_iomap_sector(iomap, pos);
> > dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> > map_len = dax_map_atomic(iomap->bdev, &dax);
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index e57b90b5ff37..691eada58b06 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> > unsigned len, unsigned flags,
> >  
> > BUG_ON(pos + len > iomap->offset + iomap->length);
> >  
> > +   if (fatal_signal_pendin

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-27 Thread Michal Hocko
Tetsuo,
before we settle on the proper fix for this issue, could you give the
patch a try and try to reproduce the too_many_isolated() issue or
just see whether patch [1] has any negative effect on your oom stress
testing?

[1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
[...]
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
>   __alloc_pages_nodemask+0x436/0x4d0
>   alloc_pages_current+0x97/0x1b0
>   __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
>   pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
>   grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
>   iomap_write_begin+0x50/0xd0 fs/iomap.c:118
>   iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
>   ? iomap_write_end+0x80/0x80 fs/iomap.c:150
>   iomap_apply+0xb3/0x130  fs/iomap.c:79
>   iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
>   ? iomap_write_end+0x80/0x80
>   xfs_file_buffered_aio_write+0x132/0x390 [xfs]
>   ? remove_wait_queue+0x59/0x60
>   xfs_file_write_iter+0x90/0x130 [xfs]
>   __vfs_write+0xe5/0x140
>   vfs_write+0xc7/0x1f0
>   ? syscall_trace_enter+0x1d0/0x380
>   SyS_write+0x58/0xc0
>   do_syscall_64+0x6c/0x200
>   entry_SYSCALL64_slow_path+0x25/0x25
> 
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa 
> Signed-off-by: Michal Hocko 
> ---
>  fs/dax.c   | 5 +
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   struct blk_dax_ctl dax = { 0 };
>   ssize_t map_len;
>  
> + if (fatal_signal_pending(current)) {
> + ret = -EINTR;
> + break;
> + }
> +
>   dax.sector = dax_iomap_sector(iomap, pos);
>   dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>   map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>  
>   BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> + if (fatal_signal_pending(current))
> + return -EINTR;
> +
>   page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>   if (!page)
>   return -ENOMEM;
> -- 
> 2.11.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 25-01-17 19:33:59, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> > 
> > Looks OK to me. I worried
> > 
> > #define AOP_FLAG_UNINTERRUPTIBLE0x0001 /* will not do a short write 
> > */
> > 
> > which forbids (!?) aborting the loop. But it seems that this flag is
> > no longer checked (i.e. set but not used). So, everybody should be ready
> > for short write, although I don't know whether exofs / hfs / hfsplus are
> > doing appropriate error handling.
> 
> Those were using generic implementation before and that handles this
> case AFAICS.

What I wanted to say is: "We can remove AOP_FLAG_UNINTERRUPTIBLE completely
because grep does not find that flag used in any condition check, can't we?".


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Michal Hocko
On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > I think we are missing a check for fatal_signal_pending in
> > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > full memory reserves. What do you think about the following? I haven't
> > > > tested this but it mimics generic_perform_write so I guess it should
> > > > work.
> > > 
> > > Hi Michal,
> > > 
> > > this looks reasonable to me.  But we have a few more such loops,
> > > maybe it makes sense to move the check into iomap_apply?
> > 
> > I wasn't sure about the expected semantic of iomap_apply but now that
> > I've actually checked all the callers I believe all of them should be
> > able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> > iomap_fiemap and iomap_page_mkwrite seem do not follow the standard
> > pattern to return the number of written pages or an error but it rather
> > propagates the error out. From my limited understanding of those code
> > paths that should just be ok. I was not all that sure about iomap_dio_rw
> > that is just too convoluted for me. If that one is OK as well then
> > the following patch should be indeed better.
> 
> Is "length" in
> 
>    written = actor(inode, pos, length, data, &iomap);
> 
> call guaranteed to be small enough? If not guaranteed,
> don't we need to check SIGKILL inside "actor" functions?

You are right! Checking for signals inside iomap_apply doesn't really
solve anything because basically all users do iov_iter_count(). Blee. So
we have loops around iomap_apply which itself loops inside the actor.
iomap_write_begin seems to be used by most of them which is also where we
get the pagecache page so I guess this should be the "right" place to
put the check in. Things like dax_iomap_actor will need an explicit check.
This is quite unfortunate but I do not see any better solution.
What do you think Christoph?
---
>From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
iomap_write_begin+0x50/0xd0 fs/iomap.c:118
iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
? iomap_write_end+0x80/0x80 fs/iomap.c:150
iomap_apply+0xb3/0x130  fs/iomap.c:79
iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
? iomap_write_end+0x80/0x80
xfs_file_buffered_aio_write+0x132/0x390 [xfs]
? remove_wait_queue+0x59/0x60
xfs_file_write_iter+0x90/0x130 [xfs]
__vfs_write+0xe5/0x140
vfs_write+0xc7/0x1f0
? syscall_trace_enter+0x1d0/0x380
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x200
entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other callers
of iomap_apply loop to complete the full request. We need to check for
fatal signals and back off with a short write instead. As the
iomap_apply delegates all the work down to the actor we have to hook
into those. All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly. Other callers like iomap_page_mkwrite work on a
single page or iomap_fiemap_actor do not allocate memory based on the
given len.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa 
Signed-off-by: Michal Hocko 
---
 fs/dax.c   | 5 +
 fs/iomap.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 413a91db9351..0e263dacf9cf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
struct blk_dax_ctl dax = { 0 };
ssize_t map_len;
 
+   if (fatal_signal_pending(current)) {
+   ret = -EINTR;
+   break;
+   }
+
dax.sector = dax_iomap_sector(iomap, pos);
dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
map_len = dax_map_atomic(iomap->bdev, &dax);
diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> > 
> > Hi Michal,
> > 
> > this looks reasonable to me.  But we have a few more such loops,
> > maybe it makes sense to move the check into iomap_apply?
> 
> I wasn't sure about the expected semantic of iomap_apply but now that
> I've actually checked all the callers I believe all of them should be
> able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> iomap_fiemap and iomap_page_mkwrite seem do not follow the standard
> pattern to return the number of written pages or an error but it rather
> propagates the error out. From my limited understanding of those code
> paths that should just be ok. I was not all that sure about iomap_dio_rw
> that is just too convoluted for me. If that one is OK as well then
> the following patch should be indeed better.

Is "length" in

   written = actor(inode, pos, length, data, &iomap);

call guaranteed to be small enough? If not guaranteed,
don't we need to check SIGKILL inside "actor" functions?


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Michal Hocko
On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > I think we are missing a check for fatal_signal_pending in
> > iomap_file_buffered_write. This means that an oom victim can consume the
> > full memory reserves. What do you think about the following? I haven't
> > tested this but it mimics generic_perform_write so I guess it should
> > work.
> 
> Hi Michal,
> 
> this looks reasonable to me.  But we have a few more such loops,
> maybe it makes sense to move the check into iomap_apply?

I wasn't sure about the expected semantics of iomap_apply but now that
I've actually checked all the callers I believe all of them should be
able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
iomap_fiemap and iomap_page_mkwrite do not seem to follow the standard
pattern of returning the number of written pages or an error but rather
propagate the error out. From my limited understanding of those code
paths that should just be ok. I was not all that sure about iomap_dio_rw,
which is just too convoluted for me. If that one is OK as well then
the following patch should indeed be better.
---
>From d99c9d4115bed69a5d71281f59c190b0b26627cf Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
iomap_write_begin+0x50/0xd0 fs/iomap.c:118
iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
? iomap_write_end+0x80/0x80 fs/iomap.c:150
iomap_apply+0xb3/0x130  fs/iomap.c:79
iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
? iomap_write_end+0x80/0x80
xfs_file_buffered_aio_write+0x132/0x390 [xfs]
? remove_wait_queue+0x59/0x60
xfs_file_write_iter+0x90/0x130 [xfs]
__vfs_write+0xe5/0x140
vfs_write+0xc7/0x1f0
? syscall_trace_enter+0x1d0/0x380
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x200
entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other callers
of iomap_apply loop to complete the full request. We need to check for
fatal signals and back off with a short write instead.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa 
Signed-off-by: Michal Hocko 
---
 fs/iomap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a58190f7a3e4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -46,6 +46,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
unsigned flags,
struct iomap iomap = { 0 };
loff_t written = 0, ret;
 
+   if (fatal_signal_pending(current))
+   return -EINTR;
+
/*
 * Need to map a range from start position for length bytes. This can
 * span multiple pages - it is only guaranteed to return a range of a
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Christoph Hellwig
On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> I think we are missing a check for fatal_signal_pending in
> iomap_file_buffered_write. This means that an oom victim can consume the
> full memory reserves. What do you think about the following? I haven't
> tested this but it mimics generic_perform_write so I guess it should
> work.

Hi Michal,

this looks reasonable to me.  But we have a few more such loops,
maybe it makes sense to move the check into iomap_apply?
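For reference, a condensed, paraphrased sketch of the check-and-break pattern
in generic_perform_write() that the quoted patch mimics (not the verbatim
mm/filemap.c source):
--
ssize_t generic_perform_write_sketch(struct file *file,
				     struct iov_iter *i, loff_t pos)
{
	ssize_t written = 0;
	long status = 0;

	do {
		/* Back off with a short write once the task has been killed. */
		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}

		/* ... ->write_begin(), copy from the iterator, ->write_end(),
		 *     advance pos and written ... */

	} while (iov_iter_count(i));

	return written ? written : status;
}
--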


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Michal Hocko
[Let's add Christoph]

The below insane^Wstress test should exercise the OOM killer behavior.

On Sat 21-01-17 16:42:42, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > And I think that there is a different problem if I tune a reproducer
> > like below (i.e. increased the buffer size to write()/fsync() from 4096).
> > 
> > --
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > 
> > int main(int argc, char *argv[])
> > {
> > static char buffer[10485760] = { }; /* or 1048576 */
> > char *buf = NULL;
> > unsigned long size;
> > unsigned long i;
> > for (i = 0; i < 1024; i++) {
> > if (fork() == 0) {
> > int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> > write(fd, "1000", 4);
> > close(fd);
> > sleep(1);
> > snprintf(buffer, sizeof(buffer), "/tmp/file.%u", 
> > getpid());
> > fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> > while (write(fd, buffer, sizeof(buffer)) == 
> > sizeof(buffer))
> > fsync(fd);
> > _exit(0);
> > }
> > }
> > for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> > char *cp = realloc(buf, size);
> > if (!cp) {
> > size >>= 1;
> > break;
> > }
> > buf = cp;
> > }
> > sleep(2);
> > /* Will cause OOM due to overcommit */
> > for (i = 0; i < size; i += 4096)
> > buf[i] = 0;
> > pause();
> > return 0;
> > }
> > --
> > 
> > Above reproducer sometimes kills all OOM killable processes and the system
> > finally panics. I guess that somebody is abusing TIF_MEMDIE for needless
> > allocations to the level where GFP_ATOMIC allocations start failing.
[...] 
> And I got flood of traces shown below. It seems to be consuming memory 
> reserves
> until the size passed to write() request is stored to the page cache even 
> after
> OOM-killed.
> 
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .
> 
> [  202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 
> mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
> [  202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 
> 4.10.0-rc4-next-20170120+ #492
> [  202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  202.315429] Call Trace:
> [  202.316902]  dump_stack+0x85/0xc9
> [  202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
> [  202.320697]  ? node_dirty_ok+0xef/0x130
> [  202.322454]  __alloc_pages_nodemask+0x436/0x4d0
> [  202.324506]  alloc_pages_current+0x97/0x1b0
> [  202.326397]  __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
> [  202.328209]  pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
> [  202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> [  202.331905]  iomap_write_begin+0x50/0xd0 fs/iomap.c:118
> [  202.333641]  iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
> [  202.335377]  ? iomap_write_end+0x80/0x80 fs/iomap.c:150
> [  202.337090]  iomap_apply+0xb3/0x130  fs/iomap.c:79
> [  202.338721]  iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
> [  202.340613]  ? iomap_write_end+0x80/0x80
> [  202.342471]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> [  202.344501]  ? remove_wait_queue+0x59/0x60
> [  202.346261]  xfs_file_write_iter+0x90/0x130 [xfs]
> [  202.348082]  __vfs_write+0xe5/0x140
> [  202.349743]  vfs_write+0xc7/0x1f0
> [  202.351214]  ? syscall_trace_enter+0x1d0/0x380
> [  202.353155]  SyS_write+0x58/0xc0
> [  202.354628]  do_syscall_64+0x6c/0x200
> [  202.356100]  entry_SYSCALL64_slow_path+0x25/0x25
> 
> 
> Do we need to allow access to memory reserves for this allocation?
> Or, should the caller check for SIGKILL rather than iterate the loop?

I think we are missing a check for fatal_signal_pending in
iomap_file_buffered_write. This means that an oom victim can consume the
full memory reserves. What do you think about the following? I haven't
tested this but it mimics generic_perform_write so I guess it should
work.
---
>From d56b54b708d403d1bf39fccb89750bab31c19032 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0   mm/

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-25 Thread Michal Hocko
On Fri 20-01-17 22:27:27, Tetsuo Handa wrote:
> Mel Gorman wrote:
> > On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > > So what do you think about the following? Tetsuo, would you be willing
> > > to run this patch through your torture testing please?
> > 
> > I'm fine with treating this as a starting point.
> 
> OK. So I tried to test this patch but I failed at the preparation step.
> There are too many pending mm patches and I'm not sure which patch on
> which linux-next snapshot I should try.

The current linux-next should be good to test. It contains all patches
sitting in the mmotm tree. If you want a more stable base then you can
use mmotm git tree
(git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git #since-4.9
or its #auto-latest alias)

> Also as another question,
> too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c
> but why this patch does not touch the loop in mm/compaction.c part?

I am not yet convinced that compaction suffers from the same problem.
Compaction backs off much sooner, so that path shouldn't get into a
pathological situation AFAICS. I might be wrong here but I think we
should start with the reclaim path first.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-20 Thread Tetsuo Handa
Tetsuo Handa wrote:
> And I think that there is a different problem if I tune a reproducer
> like below (i.e. increased the buffer size to write()/fsync() from 4096).
> 
> --
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> int main(int argc, char *argv[])
> {
>   static char buffer[10485760] = { }; /* or 1048576 */
>   char *buf = NULL;
>   unsigned long size;
>   unsigned long i;
>   for (i = 0; i < 1024; i++) {
>   if (fork() == 0) {
>   int fd = open("/proc/self/oom_score_adj", O_WRONLY);
>   write(fd, "1000", 4);
>   close(fd);
>   sleep(1);
>   snprintf(buffer, sizeof(buffer), "/tmp/file.%u", 
> getpid());
>   fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
>   while (write(fd, buffer, sizeof(buffer)) == 
> sizeof(buffer))
>   fsync(fd);
>   _exit(0);
>   }
>   }
>   for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
>   char *cp = realloc(buf, size);
>   if (!cp) {
>   size >>= 1;
>   break;
>   }
>   buf = cp;
>   }
>   sleep(2);
>   /* Will cause OOM due to overcommit */
>   for (i = 0; i < size; i += 4096)
>   buf[i] = 0;
>   pause();
>   return 0;
> }
> --
> 
> Above reproducer sometimes kills all OOM killable processes and the system
> finally panics. I guess that somebody is abusing TIF_MEMDIE for needless
> allocations to the level where GFP_ATOMIC allocations start failing.

I tracked down who is abusing TIF_MEMDIE using the patch below.


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea088e1..d9ac53d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3038,7 +3038,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
const char *fmt, ...)
static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
-   if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
+   if (1 || (gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
debug_guardpage_minorder() > 0)
return;
 
@@ -3573,6 +3573,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
int no_progress_loops = 0;
unsigned long alloc_start = jiffies;
unsigned int stall_timeout = 10 * HZ;
+   bool victim = false;
 
/*
 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3656,8 +3657,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, ac);
 
-   if (gfp_pfmemalloc_allowed(gfp_mask))
+   if (gfp_pfmemalloc_allowed(gfp_mask)) {
alloc_flags = ALLOC_NO_WATERMARKS;
+   victim = test_thread_flag(TIF_MEMDIE);
+   }
 
/*
 * Reset the zonelist iterators if memory policies can be ignored.
@@ -3790,6 +3793,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
 got_pg:
+   if (page && victim) {
+   pr_warn("%s(%u): TIF_MEMDIE allocation: order=%d 
mode=%#x(%pGg)\n",
+   current->comm, current->pid, order, gfp_mask, 
&gfp_mask);
+   dump_stack();
+   }
return page;
 }
 


And I got a flood of traces, shown below. It seems to be consuming memory reserves
until the full size passed to the write() request is stored in the page cache, even
after being OOM-killed.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .

[  202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 
mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 
4.10.0-rc4-next-20170120+ #492
[  202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[  202.315429] Call Trace:
[  202.316902]  dump_stack+0x85/0xc9
[  202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
[  202.320697]  ? node_dirty_ok+0xef/0x130
[  202.322454]  __alloc_pages_nodemask+0x436/0x4d0
[  202.324506]  alloc_pages_current+0x97/0x1b0
[  202.326397]  __page_cache_alloc+0x15d/0x1a0  mm/filemap.c:728
[  202.328209]  pagecache_get_page+0x5a/0x2b0   mm/filemap.c:1331
[  202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
[  202.331905]  iomap_write_begin+0x50/0xd0 fs/iomap.c:118
[  202.333641]  iomap_write_actor+0xb5/0x1a0fs/iomap.c:190
[  202.335377]  ? iomap_write_end+0x80/0x80 fs/

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-20 Thread Tetsuo Handa
Mel Gorman wrote:
> On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > So what do you think about the following? Tetsuo, would you be willing
> > to run this patch through your torture testing please?
> 
> I'm fine with treating this as a starting point.

OK. So I tried to test this patch but I failed at the preparation step.
There are too many pending mm patches and I'm not sure which patch on
which linux-next snapshot I should try. Also, as another question, a
too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c,
so why does this patch not touch the loop in the mm/compaction.c part?
Is there a guarantee that the problem can be avoided by tweaking only
the too_many_isolated() part?

Anyway, I tried the linux-next-20170119 snapshot in order to confirm that
my reproducer can still reproduce the problem before trying this patch.
But I was not able to reproduce the problem today, as the mm part is
changing rapidly and existing reproducers need tuning.

And I think that there is a different problem if I tune the reproducer as
below (i.e. increase the buffer size passed to write()/fsync() from 4096).

--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
static char buffer[10485760] = { }; /* or 1048576 */
char *buf = NULL;
unsigned long size;
unsigned long i;
for (i = 0; i < 1024; i++) {
if (fork() == 0) {
int fd = open("/proc/self/oom_score_adj", O_WRONLY);
write(fd, "1000", 4);
close(fd);
sleep(1);
snprintf(buffer, sizeof(buffer), "/tmp/file.%u", 
getpid());
fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
while (write(fd, buffer, sizeof(buffer)) == 
sizeof(buffer))
fsync(fd);
_exit(0);
}
}
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
sleep(2);
/* Will cause OOM due to overcommit */
for (i = 0; i < size; i += 4096)
buf[i] = 0;
pause();
return 0;
}
--

The above reproducer sometimes kills all OOM-killable processes and the system
finally panics. I guess that somebody is abusing TIF_MEMDIE for needless
allocations to the level where GFP_ATOMIC allocations start failing.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170120.txt.xz .
--
[  184.482761] a.out invoked oom-killer: 
gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, 
oom_score_adj=0
(...snipped...)
[  184.482955] Node 0 active_anon:1418748kB inactive_anon:13548kB 
active_file:11448kB inactive_file:26044kB unevictable:0kB isolated(anon):0kB 
isolated(file):132kB mapped:13744kB dirty:25872kB writeback:376kB shmem:0kB 
shmem_thp: 0kB sh\
mem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB 
pages_scanned:95127 all_unreclaimable? yes
[  184.482956] Node 0 DMA free:7660kB min:380kB low:472kB high:564kB 
active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB 
slab_reclaimable:40kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:28kB 
bounce:0kB 
free_pcp:0kB local_pcp:0kB free_cma:0kB
[  184.482959] lowmem_reserve[]: 0 1823 1823 1823
[  184.482963] Node 0 DMA32 free:44636kB min:44672kB low:55840kB high:67008kB 
active_anon:1410572kB inactive_anon:13548kB active_file:11448kB 
inactive_file:26044kB unevictable:0kB writepending:26248kB present:2080640kB 
managed:1866768kB mlocked:0kB slab_reclaimable:85544kB 
slab_unreclaimable:128876kB 
kernel_stack:20496kB pagetables:40712kB bounce:0kB free_pcp:1136kB 
local_pcp:656kB free_cma:0kB
[  184.482966] lowmem_reserve[]: 0 0 0 0
[  184.482970] Node 0 DMA: 9*4kB (UE) 5*8kB (E) 2*16kB (ME) 0*32kB 2*64kB (U) 
2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB = 
7660kB
[  184.482994] Node 0 DMA32: 3845*4kB (UME) 1809*8kB (UME) 600*16kB (UME) 
134*32kB (UME) 14*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 44636kB
(...snipped...)
[  187.477371] Node 0 active_anon:1415648kB inactive_anon:13548kB 
active_file:11452kB inactive_file:79120kB unevictable:0kB isolated(anon):0kB 
isolated(file):5220kB mapped:13748kB dirty:83484kB writeback:376kB shmem:0kB 
shmem_thp: 0kB shmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB 
unstable:0kB 
pages_scanned:16058 all_unreclaimable? no
[  187.477372] Node 0 DMA free:0kB min:380kB low:472kB high:564kB 
active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:6976kB 
unevictable:0kB writepending:7492kB present:15988kB 

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-20 Thread Mel Gorman
On Fri, Jan 20, 2017 at 02:42:24PM +0800, Hillf Danton wrote:
> > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
> >   * the LRU list will go small and be scanned faster than necessary, 
> > leading to
> >   * unnecessary swapping, thrashing and OOM.
> >   */
> > -static int too_many_isolated(struct pglist_data *pgdat, int file,
> > +static bool safe_to_isolate(struct pglist_data *pgdat, int file,
> > struct scan_control *sc)
> 
> I prefer the current function name.
> 

The restructure is to work with the waitqueue API.

> >  {
> > unsigned long inactive, isolated;
> > 
> > if (current_is_kswapd())
> > -   return 0;
> > +   return true;
> > 
> > -   if (!sane_reclaim(sc))
> > -   return 0;
> > +   if (sane_reclaim(sc))
> > +   return true;
> 
> We only need a one-line change.

It's bool so the conversion is made to bool while it's being changed
anyway.

> > 
> > if (file) {
> > inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data 
> > *pgdat, int file,
> > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
> > inactive >>= 3;
> > 
> > -   return isolated > inactive;
> > +   return isolated < inactive;
> >  }
> > 
> >  static noinline_for_stack void
> > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, 
> > struct lruvec *lruvec,
> > struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > 
> > -   while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > -   congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +   while (!safe_to_isolate(pgdat, file, sc)) {
> > +   long ret;
> > +
> > +   ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
> > +   safe_to_isolate(pgdat, file, sc), HZ/10);
> > 
> > /* We are about to die and free our memory. Return now. */
> > -   if (fatal_signal_pending(current))
> > -   return SWAP_CLUSTER_MAX;
> > +   if (fatal_signal_pending(current)) {
> > +   nr_reclaimed = SWAP_CLUSTER_MAX;
> > +   goto out;
> > +   }
> > +
> > +   /*
> > +* If we reached the timeout, this is direct reclaim, and
> > +* pages cannot be isolated then return. If the situation
> 
> Please add something noting that we would rather shrink slab than go 
> for another round of napping.
> 

That's not necessarily true or even a good idea. It could result in
excessive slab shrinking that is no longer in proportion to LRU scanning
and increased contention within shrinkers.

> > +* persists for a long time then it'll eventually reach
> > +* the no_progress limit in should_reclaim_retry and consider
> > +* going OOM. In this case, do not wake the isolated_wait
> > +* queue as the wakee will still not be able to make progress.
> > +*/
> > +   if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, 
> > file, sc))
> > +   return 0;
> > }
> > 
> > lru_add_drain();
> > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, 
> > struct lruvec *lruvec,
> > stat.nr_activate, stat.nr_ref_keep,
> > stat.nr_unmap_fail,
> > sc->priority, file);
> > +
> > +out:
> > +   if (waitqueue_active(&pgdat->isolated_wait))
> > +   wake_up(&pgdat->isolated_wait);
> > return nr_reclaimed;
> >  }
> > 
> Is it also needed to check whether isolated_wait is active before kswapd 
> takes a nap?
> 

No because this is where pages were isolated and there is no putback
event that would justify waking the queue. There is a race between
waitqueue_active() and going to sleep that we rely on the timeout to
recover from.
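
To make that race concrete, the waiter/waker pattern above reduces to roughly
the following (names as in the patch; simplified, not the literal code):
--
	/* Waiter (direct reclaimer): sleep until isolation drops or timeout. */
	ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
			safe_to_isolate(pgdat, file, sc), HZ/10);

	/* Waker (end of shrink_inactive_list or compact_zone): */
	if (waitqueue_active(&pgdat->isolated_wait))
		wake_up(&pgdat->isolated_wait);

	/*
	 * If the waker samples waitqueue_active() just before the waiter has
	 * queued itself, the wakeup is missed; the HZ/10 timeout bounds how
	 * long the waiter can oversleep because of that.
	 */
--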

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-19 Thread Hillf Danton

On Thursday, January 19, 2017 6:08 PM Mel Gorman wrote: 
> 
> If it's definitely required and is proven to fix the
> infinite-loop-without-oom workload then I'll back off and withdraw my
> objections. However, I'd at least like the following untested patch to
> be considered as an alternative. It has some weaknesses and would be
> slower to OOM than your patch but it avoids reintroducing zone counters
> 
> ---8<---
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
> 
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
> 
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
> 
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages to be perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
> 
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
> 
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and will instead go through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 91f69aa0d581..3dd617d0c8c4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -628,6 +628,7 @@ typedef struct pglist_data {
>   int node_id;
>   wait_queue_head_t kswapd_wait;
>   wait_queue_head_t pfmemalloc_wait;
> + wait_queue_head_t isolated_wait;
>   struct task_struct *kswapd; /* Protected by
>  mem_hotplug_begin/end() */
>   int kswapd_order;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 43a6cf1dc202..1b1ff6da7401 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone 
> *zone, struct compact_contro
>   count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
>   count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
> 
> + /* Page reclaim could have stalled due to isolated pages */
> + if (waitqueue_active(&zone->zone_pgdat->isolated_wait))
> + wake_up(&zone->zone_pgdat->isolated_wait);
> +
>   trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>   cc->free_pfn, end_pfn, sync, ret);
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8ff25883c172..d848c9f31bff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct 
> pglist_data *pgdat)
>  #endif
>   init_waitqueue_head(&pgdat->kswapd_wait);
>   init_waitqueue_head(&pgdat->pfmemalloc_wait);
> + init_waitqueue_head(&pgdat->isolated_wait);
>  #ifdef CONFIG_COMPACTION
>   init_waitqueue_head(&pgdat->kcompactd_wait);
>  #endif
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2281ad310d06..c93f299fbad7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading 
> to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct pglist_data *pgdat, int file,
> +static bool safe_to_isolate(struct pglist_data *pgdat, int file,
>   struct scan_control *sc)

I prefer the current function name.

>  {
>   unsigned long inactive, isolated;
> 
>   if (current_is_kswapd())
> - return 0;
> + return true;
> 
> - if (!sane_reclaim(sc))
> - return 0;
> + if (sane_reclaim(sc))
> + return true;

We only need a one-line change.
> 
>   if (file) {
>   inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, 
> int file,
>   if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-19 Thread Mel Gorman
On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> On Thu 19-01-17 10:07:55, Mel Gorman wrote:
> [...]
> > mm, vmscan: Wait on a waitqueue when too many pages are isolated
> > 
> > When too many pages are isolated, direct reclaim waits on congestion to 
> > clear
> > for up to a tenth of a second. There is no reason to believe that too many
> > pages are isolated due to dirty pages, reclaim efficiency or congestion.
> > It may simply be because an extremely large number of processes have entered
> > direct reclaim at the same time. However, it is possible for the situation
> > to persist forever and never reach OOM.
> > 
> > This patch queues processes on a waitqueue when too many pages are isolated.
> > When parallel reclaimers finish shrink_page_list, they wake the waiters
> > to recheck whether too many pages are isolated.
> > 
> > The wait on the queue has a timeout as not all sites that isolate pages
> > will do the wakeup. Depending on every isolation of LRU pages to be perfect
> > forever is potentially fragile. The specific wakeups occur for page reclaim
> > and compaction. If too many pages are isolated due to memory failure,
> > hotplug or directly calling migration from a syscall then the waiting
> > processes may wait the full timeout.
> > 
> > Note that the timeout allows the use of waitqueue_active() on the basis
> > that a race will cause the full timeout to be reached due to a missed
> > wakeup. This is relatively harmless and still a massive improvement over
> > unconditionally calling congestion_wait.
> > 
> > Direct reclaimers that cannot isolate pages within the timeout will consider
> > returning to the caller. This is somewhat clunky as it won't return immediately
> > and will instead go through the other priorities and slab shrinking. Eventually,
> > it'll go through a few iterations of should_reclaim_retry and reach the
> > MAX_RECLAIM_RETRIES limit and consider going OOM.
> 
> I cannot really say I would like this. It's just much more complex than
> necessary.

I guess it's a difference in opinion. Mixing per-zone and per-node
information is complex for me. I liked the waitqueue because it was an
example of waiting on a specific event instead of relying completely on
time.

> I definitely agree that congestion_wait while waiting for
> too_many_isolated is a crude hack. This patch doesn't really resolve
> my biggest worry, though, that we go OOM with too many pages isolated
> as your patch doesn't alter zone_reclaimable_pages to reflect those
> numbers.
> 

Indeed, but such cases are also caught by the no_progress_loop logic to
avoid a premature OOM.
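
(For context, that logic lives in the allocator slow path around
should_reclaim_retry() and boils down to roughly the following, heavily
simplified and not the literal control flow:)
--
	/* After each direct reclaim attempt in the allocator slow path: */
	if (did_some_progress)
		no_progress_loops = 0;
	else
		no_progress_loops++;

	/*
	 * Retry reclaim only a bounded number of times; once the counter
	 * passes MAX_RECLAIM_RETRIES the allocator stops retrying and
	 * considers invoking the OOM killer instead.
	 */
	if (no_progress_loops <= MAX_RECLAIM_RETRIES)
		goto retry;
--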

> Anyway, I think both of us are probably overcomplicating things a bit.
> Your waitqueue approach is definitely better semantically than the
> congestion_wait because we are waiting for a different event than the
> API is intended for. On the other hand a mere
> schedule_timeout_interruptible might work equally well in real life.
> On the other side I might really over emphasise the role of NR_ISOLATED*
> counts. It might really turn out that we can safely ignore them and it
> won't be the end of the world. So what do you think about the following
> as a starting point. If we ever see oom reports with high number of
> NR_ISOLATED* which are part of the oom report then we know we have to do
> something about that. Those changes would at least be driven by a real
> usecase rather than theoretical scenarios.
> 
> So what do you think about the following? Tetsuo, would you be willing
> to run this patch through your torture testing please?

I'm fine with treating this as a starting point.

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-19 Thread Michal Hocko
On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
> 
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
> 
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
> 
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages to be perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
> 
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
> 
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and will instead go through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.

I cannot really say I would like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve
my biggest worry, though, that we go OOM with too many pages isolated
as your patch doesn't alter zone_reclaimable_pages to reflect those
numbers.

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than the
congestion_wait because we are waiting for a different event than the
API is intended for. On the other hand a mere
schedule_timeout_interruptible might work equally well in real life.
On the other side I might really over emphasise the role of NR_ISOLATED*
counts. It might really turn out that we can safely ignore them and it
won't be the end of the world. So what do you think about the following
as a starting point. If we ever see oom reports with high number of
NR_ISOLATED* which are part of the oom report then we know we have to do
something about that. Those changes would at least be driven by a real
usecase rather than theoretical scenarios.

So what do you think about the following? Tetsuo, would you be willing
to run this patch through your torture testing please?
---
From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 19 Jan 2017 12:11:56 +0100
Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
too_many_isolated loop basically for ever because the last few pages on
the LRU lists are isolated by the kswapd which is stuck on fs locks when
doing the pageout. This in turn means that there is nobody to actually
trigger the oom killer and the system is basically unusable.

too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations, because back then a lack of reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and return
immediately when the situation hasn't been resolved after the first sleep.
Replace congestion_wait by a simple schedule_timeout_interruptible because
we are not really waiting on the IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for a reclaim
which makes progress only very slowly. This would be obvious from the oom
report as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in the evaluation
in one way or another.

[1] 
http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp
Signed-off-by: Michal Hocko 
---
 mm/vmscan.c | 8 +++-
 1 file changed, 7 insertions(+), 1 delet
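
(The diff itself is cut off in this archive. As a rough illustration of the
behaviour the changelog describes, not the actual patch, the throttle in
shrink_inactive_list() would become something like:)
--
	bool stalled = false;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* Only sleep once; should_reclaim_retry decides about OOM later. */
		if (stalled)
			return 0;

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;

		/* Not waiting on IO congestion here, just back off briefly. */
		schedule_timeout_interruptible(HZ/10);
		stalled = true;
	}
--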

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-19 Thread Mel Gorman
On Wed, Jan 18, 2017 at 06:29:46PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 17:00:10, Mel Gorman wrote:
> > > > You don't appear to directly use that information in patch 2.
> > > 
> > > It is used via zone_reclaimable_pages in should_reclaim_retry
> > > 
> > 
> > Which is still not directly required to avoid the infinite loop. There
> > even is a small inherent risk if the too_isolated_condition no longer
> > applies at the time should_reclaim_retry is attempted.
> 
> Not really because, if those pages are no longer isolated then they
> either have been reclaimed - and NR_FREE_PAGES will increase - or they
> have been put back to LRU in which case we will see them in regular LRU
> counters. I need to catch the case where there are still too many pages
> isolated which would skew should_reclaim_retry watermark check.
>  

We can also rely on the no_progress_loops counter to trigger OOM. It'll
take longer but has a lower risk of premature OOM.

> > > > The primary
> > > > breakout is returning after stalling at least once. You could also avoid
> > > > an infinite loop by using a waitqueue that sleeps on too many isolated.
> > > 
> > > That would be tricky on its own. Just consider the report from Tetsuo.
> > > Basically all the direct reclaimers are looping on too_many_isolated
> > > while the kswapd is not making any progress because it is blocked on FS
> > > locks which are held by flushers which are making dead slow progress.
> > > Some of those direct reclaimers could have gone oom instead and released
> > > some memory if we decide so, which we cannot because we are deep down in
> > > the reclaim path. Waiting on the reclaimer to increase the ISOLATED
> > > counter wouldn't help in this situation.
> > > 
> > 
> > If it's a waitqueue waking one process at a time, the progress may be
> > slow but it'll still exit the loop, attempt the reclaim and then
> > potentially OOM if no progress is made. The key is using the waitqueue
> > to have a fair queue of processes making progress instead of a
> > potentially infinite loop that never meets the exit conditions.
> 
> It is not clear to me who would wake waiters on the queue. You cannot
> rely on kswapd to do that as already mentioned.
> 

We can use timeouts to guard against an infinite wait. Besides, updating
every single place where pages are put back on the LRU would be fragile
and too easy to break.

> > > > That would both avoid the clunky congestion_wait() and guarantee forward
> > > > progress. If the primary motivation is to avoid an infinite loop with
> > > > too_many_isolated then there are ways of handling that without 
> > > > reintroducing
> > > > zone-based counters.
> > > > 
> > > > > > Heavy memory pressure on highmem should be spread across the whole 
> > > > > > node as
> > > > > > we no longer are applying the fair zone allocation policy. The 
> > > > > > processes
> > > > > > with highmem requirements will be reclaiming from all zones and 
> > > > > > when it
> > > > > > finishes, it's possible that a lowmem-specific request will be 
> > > > > > clear to make
> > > > > > progress. It's all the same LRU so if there are too many pages 
> > > > > > isolated,
> > > > > > it makes sense to wait regardless of the allocation request.
> > > > > 
> > > > > This is true but I am not sure how it is related to the patch.
> > > > 
> > > > Because heavy pressure that is enough to trigger too many isolated pages
> > > > is unlikely to be specifically targetting a lower zone.
> > > 
> > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and
> > > going OOM on lowmem is not all that unrealistic scenario on 32b systems.
> > > 
> > 
> > If the sole source of pressure is from GFP_KERNEL allocations then the
> > isolated counter will also be specific to the lower zones and there is no
> > benefit from the patch.
> 
> I believe you are wrong here. Just consider that you have isolated
> basically all lowmem pages. too_many_isolated will still happily tell
> you to not throttle or back off because NR_INACTIVE_* are way bigger
> than all lowmem pages altogether. Or am I still missing your point?
> 

This is a potential risk. It could be accounted for by including the node
isolated counters in the calculation but it'll be inherently fuzzy and
may stall a lowmem direct reclaimer unnecessarily in the presence of
highmem reclaim.

> > If there is a combination of highmem and lowmem pressure then the highmem
> > reclaimers will also reclaim lowmem memory.
> > 
> > > > There is general
> > > > pressure with multiple direct reclaimers being applied. If the system is
> > > > under enough pressure with parallel reclaimers to trigger 
> > > > too_many_isolated
> > > > checks then the system is grinding already and making little progress. 
> > > > Adding
> > > > multiple counters to allow a lowmem reclaimer to potentially make faster
> > > > progress is going to be marginal at best.
> > > 
> > > OK, I agree that the situation where highmem blocks

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Michal Hocko
On Wed 18-01-17 17:00:10, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote:
> > On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > > > From: Michal Hocko 
> > > > > > 
> > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > > > especially for systems with high/lowmem because a heavy memory 
> > > > > > pressure
> > > > > > on the highmem zone might block lowmem requests from making 
> > > > > > progress. Or
> > > > > > we might allow to reclaim lowmem zone even though there are too many
> > > > > > pages already isolated from the eligible zones just because highmem
> > > > > > pages will easily bias too_many_isolated to say no.
> > > > > > 
> > > > > > Fix these potential issues by moving isolated stats back to zones 
> > > > > > and
> > > > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > > > isolation counters are a bit tricky with the node reclaim because
> > > > > > we have to track each page separately.
> > > > > > 
> > > > > 
> > > > > I'm quite unhappy with this. Each move back increases the cache 
> > > > > footprint
> > > > > because of the counters
> > > > 
> > > > Why would per zone counters cause an increased cache footprint?
> > > > 
> > > 
> > > Because there are multiple counters, each of which need to be updated.
> > 
> > How does this differ from per node counter though.
> 
> A per-node counter is 2 * nr_online_nodes
> A per-zone counter is 2 * nr_populated_zones
> 
> > We would need to do
> > the accounting anyway. Moreover none of the accounting is done in a hot
> > path.
> > 
> > > > > but it's not clear at all this patch actually helps anything.
> > > > 
> > > > Yes, I cannot prove any real issue so far. The main motivation was the
> > > > patch 2 which needs per-zone accounting to use it in the retry logic
> > > > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > > > way.
> > > > 
> > > 
> > > You don't appear to directly use that information in patch 2.
> > 
> > It is used via zone_reclaimable_pages in should_reclaim_retry
> > 
> 
> Which is still not directly required to avoid the infinite loop. There
> even is a small inherent risk if the too_isolated_condition no longer
> applies at the time should_reclaim_retry is attempted.

Not really because, if those pages are no longer isolated then they
either have been reclaimed - and NR_FREE_PAGES will increase - or they
have been put back to LRU in which case we will see them in regular LRU
counters. I need to catch the case where there are still too many pages
isolated which would skew should_reclaim_retry watermark check.
 
> > > The primary
> > > breakout is returning after stalling at least once. You could also avoid
> > > an infinite loop by using a waitqueue that sleeps on too many isolated.
> > 
> > That would be tricky on its own. Just consider the report from Tetsuo.
> > Basically all the direct reclaimers are looping on too_many_isolated
> > while the kswapd is not making any progress because it is blocked on FS
> > locks which are held by flushers which are making dead slow progress.
> > Some of those direct reclaimers could have gone oom instead and released
> > some memory if we decide so, which we cannot because we are deep down in
> > the reclaim path. Waiting on the reclaimer to increase the ISOLATED
> > counter wouldn't help in this situation.
> > 
> 
> If it's a waitqueue waking one process at a time, the progress may be
> slow but it'll still exit the loop, attempt the reclaim and then
> potentially OOM if no progress is made. The key is using the waitqueue
> to have a fair queue of processes making progress instead of a
> potentially infinite loop that never meets the exit conditions.

It is not clear to me who would wake waiters on the queue. You cannot
rely on kswapd to do that as already mentioned.

> > > That would both avoid the clunky congestion_wait() and guarantee forward
> > > progress. If the primary motivation is to avoid an infinite loop with
> > > too_many_isolated then there are ways of handling that without 
> > > reintroducing
> > > zone-based counters.
> > > 
> > > > > Heavy memory pressure on highmem should be spread across the whole 
> > > > > node as
> > > > > we no longer are applying the fair zone allocation policy. The 
> > > > > processes
> > > > > with highmem requirements will be reclaiming from all zones and when 
> > > > > it
> > > > > finishes, it's possible that a lowmem-specific request will be clear 
> > > > > to make
> > > > > progress. It's all the same LRU so if there are too many pages 
> > > > > isolated,
> > > > > it makes sense to wait regardless of the allocation request.
> > > > 
> > > > This is tru

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Mel Gorman
On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > > From: Michal Hocko 
> > > > > 
> > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > > especially for systems with high/lowmem because a heavy memory 
> > > > > pressure
> > > > > on the highmem zone might block lowmem requests from making progress. 
> > > > > Or
> > > > > we might allow to reclaim lowmem zone even though there are too many
> > > > > pages already isolated from the eligible zones just because highmem
> > > > > pages will easily bias too_many_isolated to say no.
> > > > > 
> > > > > Fix these potential issues by moving isolated stats back to zones and
> > > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > > isolation counters are a bit tricky with the node reclaim because
> > > > > we have to track each page separately.
> > > > > 
> > > > 
> > > > I'm quite unhappy with this. Each move back increases the cache 
> > > > footprint
> > > > because of the counters
> > > 
> > > Why would per zone counters cause an increased cache footprint?
> > > 
> > 
> > Because there are multiple counters, each of which need to be updated.
> 
> How does this differ from per node counter though.

A per-node counter is 2 * nr_online_nodes
A per-zone counter is 2 * nr_populated_zones

> We would need to do
> the accounting anyway. Moreover none of the accounting is done in a hot
> path.
> 
> > > > but it's not clear at all this patch actually helps anything.
> > > 
> > > Yes, I cannot prove any real issue so far. The main motivation was the
> > > patch 2 which needs per-zone accounting to use it in the retry logic
> > > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > > way.
> > > 
> > 
> > You don't appear to directly use that information in patch 2.
> 
> It is used via zone_reclaimable_pages in should_reclaim_retry
> 

Which is still not directly required to avoid the infinite loop. There
even is a small inherent risk if the too_isolated_condition no longer
applies at the time should_reclaim_retry is attempted.

> > The primary
> > breakout is returning after stalling at least once. You could also avoid
> > an infinite loop by using a waitqueue that sleeps on too many isolated.
> 
> That would be tricky on its own. Just consider the report from Tetsuo.
> Basically all the direct reclaimers are looping on too_many_isolated
> while the kswapd is not making any progress because it is blocked on FS
> locks which are held by flushers which are making dead slow progress.
> Some of those direct reclaimers could have gone oom instead and released
> some memory if we decide so, which we cannot because we are deep down in
> the reclaim path. Waiting on the reclaimer to increase the ISOLATED
> counter wouldn't help in this situation.
> 

If it's a waitqueue waking one process at a time, the progress may be
slow but it'll still exit the loop, attempt the reclaim and then
potentially OOM if no progress is made. The key is using the waitqueue
to have a fair queue of processes making progress instead of a
potentially infinite loop that never meets the exit conditions.

> > That would both avoid the clunky congestion_wait() and guarantee forward
> > progress. If the primary motivation is to avoid an infinite loop with
> > too_many_isolated then there are ways of handling that without reintroducing
> > zone-based counters.
> > 
> > > > Heavy memory pressure on highmem should be spread across the whole node 
> > > > as
> > > > we no longer are applying the fair zone allocation policy. The processes
> > > > with highmem requirements will be reclaiming from all zones and when it
> > > > finishes, it's possible that a lowmem-specific request will be clear to 
> > > > make
> > > > progress. It's all the same LRU so if there are too many pages isolated,
> > > > it makes sense to wait regardless of the allocation request.
> > > 
> > > This is true but I am not sure how it is related to the patch.
> > 
> > Because heavy pressure that is enough to trigger too many isolated pages
> > is unlikely to be specifically targetting a lower zone.
> 
> Why? Basically any GFP_KERNEL allocation will make lowmem pressure and
> going OOM on lowmem is not all that unrealistic scenario on 32b systems.
> 

If the sole source of pressure is from GFP_KERNEL allocations then the
isolated counter will also be specific to the lower zones and there is no
benefit from the patch.

If there is a combination of highmem and lowmem pressure then the highmem
reclaimers will also reclaim lowmem memory.

> > There is general
> > pressure with multiple direct reclaimers being applied. If the system is
> > under 

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Michal Hocko
On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko 
> > > > 
> > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > especially for systems with high/lowmem because a heavy memory pressure
> > > > on the highmem zone might block lowmem requests from making progress. Or
> > > > we might allow to reclaim lowmem zone even though there are too many
> > > > pages already isolated from the eligible zones just because highmem
> > > > pages will easily bias too_many_isolated to say no.
> > > > 
> > > > Fix these potential issues by moving isolated stats back to zones and
> > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > isolation counters are a bit tricky with the node reclaim because
> > > > we have to track each page separately.
> > > > 
> > > 
> > > I'm quite unhappy with this. Each move back increases the cache footprint
> > > because of the counters
> > 
> > Why would per zone counters cause an increased cache footprint?
> > 
> 
> Because there are multiple counters, each of which need to be updated.

How does this differ from per node counter though. We would need to do
the accounting anyway. Moreover none of the accounting is done in a hot
path.

> > > but it's not clear at all this patch actually helps anything.
> > 
> > Yes, I cannot prove any real issue so far. The main motivation was the
> > patch 2 which needs per-zone accounting to use it in the retry logic
> > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > way.
> > 
> 
> You don't appear to directly use that information in patch 2.

It is used via zone_reclaimable_pages in should_reclaim_retry

> The primary
> breakout is returning after stalling at least once. You could also avoid
> an infinite loop by using a waitqueue that sleeps on too many isolated.

That would be tricky on its own. Just consider the report from Tetsuo.
Basically all the direct reclaimers are looping on too_many_isolated
while the kswapd is not making any progress because it is blocked on FS
locks which are held by flushers which are making dead slow progress.
Some of those direct reclaimers could have gone oom instead and released
some memory if we decide so, which we cannot because we are deep down in
the reclaim path. Waiting on the reclaimer to increase the ISOLATED
counter wouldn't help in this situation.

> That would both avoid the clunky congestion_wait() and guarantee forward
> progress. If the primary motivation is to avoid an infinite loop with
> too_many_isolated then there are ways of handling that without reintroducing
> zone-based counters.
> 
> > > Heavy memory pressure on highmem should be spread across the whole node as
> > > we no longer are applying the fair zone allocation policy. The processes
> > > with highmem requirements will be reclaiming from all zones and when it
> > > finishes, it's possible that a lowmem-specific request will be clear to 
> > > make
> > > progress. It's all the same LRU so if there are too many pages isolated,
> > > it makes sense to wait regardless of the allocation request.
> > 
> > This is true but I am not sure how it is related to the patch.
> 
> Because heavy pressure that is enough to trigger too many isolated pages
> is unlikely to be specifically targetting a lower zone.

Why? Basically any GFP_KERNEL allocation will make lowmem pressure and
going OOM on lowmem is not all that unrealistic scenario on 32b systems.

> There is general
> pressure with multiple direct reclaimers being applied. If the system is
> under enough pressure with parallel reclaimers to trigger too_many_isolated
> checks then the system is grinding already and making little progress. Adding
> multiple counters to allow a lowmem reclaimer to potentially make faster
> progress is going to be marginal at best.

OK, I agree that the situation where highmem blocks lowmem from making
progress is much less likely than the other situation described in the
changelog when lowmem doesn't get throttled ever. Which is the one I am
interested more about.

> > Also consider that lowmem throttling in too_many_isolated has only small
> > chance to ever work with the node counters because highmem >> lowmem in
> > many/most configurations.
> > 
> 
> While true, it's also not that important.
> 
> > > More importantly, this patch may make things worse and delay reclaim. If
> > > this patch allowed a lowmem request to make progress that would have
> > > previously stalled, it's going to spend time skipping pages in the LRU
> > > instead of letting kswapd and the highmem pressured processes make 
> > > progress.
> > 
> > I am not sure I understand this part. Say that we have highmem pressure
> > which would isolated too m

Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Mel Gorman
On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > especially for systems with high/lowmem because a heavy memory pressure
> > > on the highmem zone might block lowmem requests from making progress. Or
> > > we might allow to reclaim lowmem zone even though there are too many
> > > pages already isolated from the eligible zones just because highmem
> > > pages will easily bias too_many_isolated to say no.
> > > 
> > > Fix these potential issues by moving isolated stats back to zones and
> > > teach too_many_isolated to consider only eligible zones. Per zone
> > > isolation counters are a bit tricky with the node reclaim because
> > > we have to track each page separately.
> > > 
> > 
> > I'm quite unhappy with this. Each move back increases the cache footprint
> > because of the counters
> 
> Why would per zone counters cause an increased cache footprint?
> 

Because there are multiple counters, each of which need to be updated.

> > but it's not clear at all this patch actually helps anything.
> 
> Yes, I cannot prove any real issue so far. The main motivation was the
> patch 2 which needs per-zone accounting to use it in the retry logic
> (should_reclaim_retry). I've spotted too_many_isolated issues on the
> way.
> 

You don't appear to directly use that information in patch 2. The primary
breakout is returning after stalling at least once. You could also avoid
an infinite loop by using a waitqueue that sleeps on too many isolated.
That would both avoid the clunky congestion_wait() and guarantee forward
progress. If the primary motivation is to avoid an infinite loop with
too_many_isolated then there are ways of handling that without reintroducing
zone-based counters.

> > Heavy memory pressure on highmem should be spread across the whole node as
> > we no longer are applying the fair zone allocation policy. The processes
> > with highmem requirements will be reclaiming from all zones and when it
> > finishes, it's possible that a lowmem-specific request will be clear to make
> > progress. It's all the same LRU so if there are too many pages isolated,
> > it makes sense to wait regardless of the allocation request.
> 
> This is true but I am not sure how it is related to the patch.

Because heavy pressure that is enough to trigger too many isolated pages
is unlikely to be specifically targetting a lower zone. There is general
pressure with multiple direct reclaimers being applied. If the system is
under enough pressure with parallel reclaimers to trigger too_many_isolated
checks then the system is grinding already and making little progress. Adding
multiple counters to allow a lowmem reclaimer to potentially make faster
progress is going to be marginal at best.

> Also consider that lowmem throttling in too_many_isolated has only small
> chance to ever work with the node counters because highmem >> lowmem in
> many/most configurations.
> 

While true, it's also not that important.

> > More importantly, this patch may make things worse and delay reclaim. If
> > this patch allowed a lowmem request to make progress that would have
> > previously stalled, it's going to spend time skipping pages in the LRU
> > instead of letting kswapd and the highmem pressured processes make progress.
> 
> I am not sure I understand this part. Say that we have highmem pressure
> which would isolate too many pages from the LRU.

Which requires multiple direct reclaimers or tiny inactive lists. In the
event there is such highmem pressure, it also means the lower zones are
depleted.

> lowmem request would
> stall previously regardless of where those pages came from. With this
> patch it would stall only when we isolated too many pages from the
> eligible zones.

And when it makes progress, it's going to compete with the other direct
reclaimers except the lowmem reclaim is skipping some pages and
recycling them through the LRU. It chews up CPU that would probably have
been better spent letting kswapd and the other direct reclaimers do
their work.

> So let's assume that lowmem is not under pressure,

It has to be or the highmem request would have used memory from the
lower zones.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Michal Hocko
On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > especially for systems with high/lowmem because a heavy memory pressure
> > on the highmem zone might block lowmem requests from making progress. Or
> > we might allow to reclaim lowmem zone even though there are too many
> > pages already isolated from the eligible zones just because highmem
> > pages will easily bias too_many_isolated to say no.
> > 
> > Fix these potential issues by moving isolated stats back to zones and
> > teach too_many_isolated to consider only eligible zones. Per zone
> > isolation counters are a bit tricky with the node reclaim because
> > we have to track each page separately.
> > 
> 
> I'm quite unhappy with this. Each move back increases the cache footprint
> because of the counters

Why would per zone counters cause an increased cache footprint?

> but it's not clear at all this patch actually helps anything.

Yes, I cannot prove any real issue so far. The main motivation was the
patch 2 which needs per-zone accounting to use it in the retry logic
(should_reclaim_retry). I've spotted too_many_isolated issues on the
way.

> Heavy memory pressure on highmem should be spread across the whole node as
> we no longer are applying the fair zone allocation policy. The processes
> with highmem requirements will be reclaiming from all zones and when it
> finishes, it's possible that a lowmem-specific request will be clear to make
> progress. It's all the same LRU so if there are too many pages isolated,
> it makes sense to wait regardless of the allocation request.

This is true but I am not sure how it is related to the patch. If we
have heavy highmem memory pressure then we will throttle based on
pages isolated from the respective zones. So if there is lowmem
pressure at the same time then we throttle it only when we need to.

Also consider that lowmem throttling in too_many_isolated has only small
chance to ever work with the node counters because highmem >> lowmem in
many/most configurations.

> More importantly, this patch may make things worse and delay reclaim. If
> this patch allowed a lowmem request to make progress that would have
> previously stalled, it's going to spend time skipping pages in the LRU
> instead of letting kswapd and the highmem pressured processes make progress.

I am not sure I understand this part. Say that we have highmem pressure
which would isolate too many pages from the LRU. A lowmem request would
stall previously regardless of where those pages came from. With this
patch it would stall only when we isolated too many pages from the
eligible zones. So let's assume that lowmem is not under pressure, why
should we stall? And why would it delay reclaim? Whoever wants to make
progress on that zone has to iterate and potentially skip many pages.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Mel Gorman
On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> NR_ISOLATED* counters from zones to nodes. This is not the best fit
> especially for systems with high/lowmem because a heavy memory pressure
> on the highmem zone might block lowmem requests from making progress. Or
> we might allow to reclaim lowmem zone even though there are too many
> pages already isolated from the eligible zones just because highmem
> pages will easily bias too_many_isolated to say no.
> 
> Fix these potential issues by moving isolated stats back to zones and
> teach too_many_isolated to consider only eligible zones. Per zone
> isolation counters are a bit tricky with the node reclaim because
> we have to track each page separately.
> 

I'm quite unhappy with this. Each move back increases the cache footprint
because of the counters but it's not clear at all this patch actually
helps anything.

Heavy memory pressure on highmem should be spread across the whole node as
we no longer are applying the fair zone allocation policy. The processes
with highmem requirements will be reclaiming from all zones and when it
finishes, it's possible that a lowmem-specific request will be clear to make
progress. It's all the same LRU so if there are too many pages isolated,
it makes sense to wait regardless of the allocation request.

More importantly, this patch may make things worse and delay reclaim. If
this patch allowed a lowmem request to make progress that would have
previously stalled, it's going to spend time skipping pages in the LRU
instead of letting kswapd and the highmem pressured processes make progress.

-- 
Mel Gorman
SUSE Labs


[RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

2017-01-18 Thread Michal Hocko
From: Michal Hocko 

599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
NR_ISOLATED* counters from zones to nodes. This is not the best fit
especially for systems with high/lowmem because a heavy memory pressure
on the highmem zone might block lowmem requests from making progress. Or
we might allow to reclaim lowmem zone even though there are too many
pages already isolated from the eligible zones just because highmem
pages will easily bias too_many_isolated to say no.

Fix these potential issues by moving isolated stats back to zones and
teach too_many_isolated to consider only eligible zones. Per zone
isolation counters are a bit tricky with the node reclaim because
we have to track each page separately.

Signed-off-by: Michal Hocko 
---
 include/linux/mmzone.h |  4 ++--
 mm/compaction.c         | 16 +++---
 mm/khugepaged.c         |  4 ++--
 mm/memory_hotplug.c     |  2 +-
 mm/migrate.c            |  4 ++--
 mm/page_alloc.c         | 14 ++--
 mm/vmscan.c             | 58 --
 mm/vmstat.c             |  4 ++--
 8 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..100e7f37b7dc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,6 +119,8 @@ enum zone_stat_item {
NR_ZONE_INACTIVE_FILE,
NR_ZONE_ACTIVE_FILE,
NR_ZONE_UNEVICTABLE,
+   NR_ZONE_ISOLATED_ANON,  /* Temporary isolated pages from anon lru */
+   NR_ZONE_ISOLATED_FILE,  /* Temporary isolated pages from file lru */
NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages */
NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
NR_SLAB_RECLAIMABLE,
@@ -148,8 +150,6 @@ enum node_stat_item {
NR_INACTIVE_FILE,   /*  " " "   "   " */
NR_ACTIVE_FILE, /*  " " "   "   " */
NR_UNEVICTABLE, /*  " " "   "   " */
-   NR_ISOLATED_ANON,   /* Temporary isolated pages from anon lru */
-   NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
NR_PAGES_SCANNED,   /* pages scanned since last reclaim */
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..f84104217887 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -639,12 +639,12 @@ static bool too_many_isolated(struct zone *zone)
 {
unsigned long active, inactive, isolated;
 
-   inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-   active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-   isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+   inactive = zone_page_state(zone, NR_ZONE_INACTIVE_FILE) +
+   zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
+   active = zone_page_state(zone, NR_ZONE_ACTIVE_FILE) +
+   zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
+   isolated = zone_page_state(zone, NR_ZONE_ISOLATED_FILE) +
+   zone_page_state(zone, NR_ZONE_ISOLATED_ANON);
 
return isolated > (inactive + active) / 2;
 }
@@ -857,8 +857,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/* Successfully isolated */
del_page_from_lru_list(page, lruvec, page_lru(page));
-   inc_node_page_state(page,
-   NR_ISOLATED_ANON + page_is_file_cache(page));
+   inc_zone_page_state(page,
+   NR_ZONE_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c308e3..8e692b683cac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -482,7 +482,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
/* 0 stands for page_is_file_cache(page) == false */
-   dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+   dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
unlock_page(page);
putback_lru_page(page);
 }
@@ -578,7 +578,7 @@ static int __collapse_huge_page_isolate(struct 
vm_area_struct *vma,
goto out;
}
/* 0 stands for page_is_file_cache(page) == false */
-   inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+   inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memory_hotpl