Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > > should provide a reasonable base for testing. You said you weren't able
> > > > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > > > still interested to see those oom reports and potential anomalies in the
> > > > > isolated counts before I send the patch for inclusion so your further
> > > > > testing would be more than appreciated. Also stalls > 10s without any
> > > > > previous occurrences would be interesting.
> > > >
> > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > > linux-next.
> > >
> > > So I assume that you didn't see the lockup with the patch applied and
> > > the OOM killer has resolved the situation by killing other tasks, right?
> > > Can I assume your Tested-by?
> >
> > No. I tested linux-next-20170221, which does not include your patch.
> > I didn't test linux-next-20170221 with your patch applied. Your patch will
> > avoid the infinite too_many_isolated() loop problem in shrink_inactive_list(),
> > but we need other people to test different workloads. Thus, I suggest you
> > send your patches to linux-next without my testing.
>
> I will send the patch to Andrew after the merge window closes. It would
> be really helpful, though, to see how it handles your workload, which is
> known to reproduce the oom starvation.

I tested http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz
on top of linux-next-20170221 with kmallocwd applied. I did not hit the
too_many_isolated() loop problem, but I did hit an "unable to invoke the
OOM killer due to a !__GFP_FS allocation" lockup, shown below.
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170226.txt.xz .
--
[ 444.281177] Killed process 9477 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 444.287046] oom_reaper: reaped process 9477 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 484.810225] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 38s!
[ 484.812907] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 41s!
[ 484.815546] Showing busy workqueues and worker pools:
[ 484.817595] workqueue events: flags=0x0
[ 484.819456]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256
[ 484.821666]     pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx]
[ 484.824356]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[ 484.826582]     pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000]
[ 484.829091]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[ 484.831325]     in-flight: 7418:rht_deferred_worker
[ 484.86]     pending: rht_deferred_worker
[ 484.835346] workqueue events_long: flags=0x0
[ 484.837343]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 484.839566]     pending: gc_worker [nf_conntrack]
[ 484.841691] workqueue events_power_efficient: flags=0x80
[ 484.843873]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 484.846103]     pending: fb_flashcursor
[ 484.847928]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[ 484.850149]     pending: neigh_periodic_work, neigh_periodic_work
[ 484.852403] workqueue events_freezable_power_: flags=0x84
[ 484.854534]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 484.85]     in-flight: 27:disk_events_workfn
[ 484.858621] workqueue writeback: flags=0x4e
[ 484.860347]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[ 484.862415]     in-flight: 8444:wb_workfn wb_workfn
[ 484.864602] workqueue vmstat: flags=0xc
[ 484.866291]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 484.868307]     pending: vmstat_update
[ 484.869876]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 484.871864]     pending: vmstat_update
[ 484.874058] workqueue mpt_poll_0: flags=0x8
[ 484.875698]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 484.877602]     pending: mpt_fault_reset_work [mptbase]
[ 484.879502] workqueue xfs-buf/sda1: flags=0xc
[ 484.881148]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[ 484.883011]     pending: xfs_buf_ioend_work [xfs]
[ 484.884706] workqueue xfs-data/sda1: flags=0xc
[ 484.886367]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY
[ 484.888410]     in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs]
[ 484.893483]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], x
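A rough userspace model of why this lockup mode cannot resolve itself: in the allocator slowpath, allocations that lack __GFP_FS (they may hold filesystem locks) are not allowed to invoke the OOM killer, yet they keep retrying reclaim indefinitely. This is only a sketch of the behavior described in the thread, not kernel code; the flag values and function names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy gfp flag bits -- values are illustrative, not the kernel's. */
#define __GFP_FS             (1u << 0)
#define __GFP_DIRECT_RECLAIM (1u << 1)
#define GFP_KERNEL           (__GFP_FS | __GFP_DIRECT_RECLAIM)
#define GFP_NOFS             (__GFP_DIRECT_RECLAIM)

/* Mirrors the bail-out described for __alloc_pages_may_oom(): allocations
 * that may hold filesystem locks must not call the OOM killer. */
static bool may_invoke_oom(unsigned gfp)
{
    return (gfp & __GFP_FS) != 0;
}

/* Caricature of the slowpath when reclaim makes no progress, so the only
 * way forward is the OOM killer. Returns the retry round on which the
 * OOM killer was invoked, or -1 if the cap was hit without invoking it
 * (the modeled "unable to invoke the OOM killer" lockup). */
static int slowpath(unsigned gfp, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        /* direct reclaim runs here and frees nothing ... */
        if (may_invoke_oom(gfp))
            return round;   /* OOM killer frees memory; allocation succeeds */
        /* !__GFP_FS: silently retry reclaim forever */
    }
    return -1;
}
```

In this model a GFP_KERNEL allocation escapes on the first round, while a GFP_NOFS one spins until the artificial cap, matching the stall reported above.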
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > should provide a reasonable base for testing. You said you weren't able
> > > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > > still interested to see those oom reports and potential anomalies in the
> > > > isolated counts before I send the patch for inclusion so your further
> > > > testing would be more than appreciated. Also stalls > 10s without any
> > > > previous occurrences would be interesting.
> > >
> > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > linux-next.
> >
> > So I assume that you didn't see the lockup with the patch applied and
> > the OOM killer has resolved the situation by killing other tasks, right?
> > Can I assume your Tested-by?
>
> No. I tested linux-next-20170221, which does not include your patch.
> I didn't test linux-next-20170221 with your patch applied. Your patch will
> avoid the infinite too_many_isolated() loop problem in shrink_inactive_list(),
> but we need other people to test different workloads. Thus, I suggest you
> send your patches to linux-next without my testing.

I will send the patch to Andrew after the merge window closes. It would
be really helpful, though, to see how it handles your workload, which is
known to reproduce the oom starvation.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, so it seems that all the distractions are handled now and linux-next
> > > should provide a reasonable base for testing. You said you weren't able
> > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > still interested to see those oom reports and potential anomalies in the
> > > isolated counts before I send the patch for inclusion so your further
> > > testing would be more than appreciated. Also stalls > 10s without any
> > > previous occurrences would be interesting.
> >
> > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > the infinite too_many_isolated() loop problem. Please send your patches to
> > linux-next.
>
> So I assume that you didn't see the lockup with the patch applied and
> the OOM killer has resolved the situation by killing other tasks, right?
> Can I assume your Tested-by?

No. I tested linux-next-20170221, which does not include your patch.
I didn't test linux-next-20170221 with your patch applied. Your patch will
avoid the infinite too_many_isolated() loop problem in shrink_inactive_list(),
but we need other people to test different workloads. Thus, I suggest you
send your patches to linux-next without my testing.

> Thanks for your testing!
> -- 
> Michal Hocko
> SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, so it seems that all the distractions are handled now and linux-next
> > should provide a reasonable base for testing. You said you weren't able
> > to reproduce the original long stalls on too_many_isolated(). I would be
> > still interested to see those oom reports and potential anomalies in the
> > isolated counts before I send the patch for inclusion so your further
> > testing would be more than appreciated. Also stalls > 10s without any
> > previous occurrences would be interesting.
>
> I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> the infinite too_many_isolated() loop problem. Please send your patches to
> linux-next.

So I assume that you didn't see the lockup with the patch applied and
the OOM killer has resolved the situation by killing other tasks, right?
Can I assume your Tested-by?

Thanks for your testing!
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> OK, so it seems that all the distractions are handled now and linux-next
> should provide a reasonable base for testing. You said you weren't able
> to reproduce the original long stalls on too_many_isolated(). I would be
> still interested to see those oom reports and potential anomalies in the
> isolated counts before I send the patch for inclusion so your further
> testing would be more than appreciated. Also stalls > 10s without any
> previous occurrences would be interesting.

I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
the infinite too_many_isolated() loop problem. Please send your patches to
linux-next.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170221.txt.xz .

[ 1160.162013] Out of memory: Kill process 7523 (a.out) score 998 or sacrifice child
[ 1160.164422] Killed process 7523 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 1160.169699] oom_reaper: reaped process 7523 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
[ 1209.799726] kswapd0 D1093667 2 0x
[ 1209.807326] Call Trace:
[ 1209.812581]  __schedule+0x336/0xe00
[ 1209.818599]  schedule+0x3d/0x90
[ 1209.823907]  schedule_timeout+0x26a/0x510
[ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
[ 1209.830535]  __down_common+0xfb/0x131
[ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.837372]  __down+0x1d/0x1f
[ 1209.840331]  down+0x41/0x50
[ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1209.857388]  ? free_debug_processing+0x27d/0x2af
[ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1209.874182]  ? save_stack_trace+0x1b/0x20
[ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
[ 1209.884841]  pageout.isra.54+0x1a4/0x460
[ 1209.886210]  shrink_page_list+0xa86/0xcf0
[ 1209.887441]  shrink_inactive_list+0x1c5/0x660
[ 1209.888682]  shrink_node_memcg+0x535/0x7f0
[ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.891197]  shrink_node+0xe1/0x310
[ 1209.892288]  kswapd+0x362/0x9b0
[ 1209.893308]  kthread+0x10f/0x150
[ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1209.895703]  ? kthread_create_on_node+0x70/0x70
[ 1209.896956]  ret_from_fork+0x31/0x40
[ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
[ 1209.902154] systemd-journal D11240 526 1 0x
[ 1209.903642] Call Trace:
[ 1209.904574]  __schedule+0x336/0xe00
[ 1209.905734]  schedule+0x3d/0x90
[ 1209.906817]  schedule_timeout+0x20d/0x510
[ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
[ 1209.909268]  ? lock_timer_base+0xa0/0xa0
[ 1209.910460]  io_schedule_timeout+0x1e/0x50
[ 1209.911681]  congestion_wait+0x86/0x260
[ 1209.912853]  ? remove_wait_queue+0x60/0x60
[ 1209.914115]  shrink_inactive_list+0x5b4/0x660
[ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.916768]  shrink_node_memcg+0x535/0x7f0
[ 1209.918173]  shrink_node+0xe1/0x310
[ 1209.919288]  do_try_to_free_pages+0xe1/0x300
[ 1209.920548]  try_to_free_pages+0x131/0x3f0
[ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.925790]  alloc_pages_current+0x97/0x1b0
[ 1209.927021]  ? find_get_entry+0x5/0x300
[ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
[ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
[ 1209.930716]  filemap_fault+0x4df/0x8b0
[ 1209.931867]  ? filemap_fault+0x373/0x8b0
[ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1209.935857]  ? down_read_nested+0x7b/0xc0
[ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1209.939691]  __do_fault+0x1e/0xa0
[ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
[ 1209.94
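The stall in the second trace sits in the loop this whole thread is about. As a reference for readers, here is a minimal userspace model of the too_many_isolated() throttling in shrink_inactive_list() as described in the thread (a sketch, not the kernel code): a direct reclaimer waits while isolated pages outnumber inactive pages, so if nothing ever drains the isolated counter the wait never terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-node counters standing in for NR_INACTIVE_* / NR_ISOLATED_*. */
struct node_stat {
    long nr_inactive;
    long nr_isolated;
};

/* Roughly the condition too_many_isolated() checks for a direct reclaimer. */
static bool too_many_isolated(const struct node_stat *st)
{
    return st->nr_isolated > st->nr_inactive;
}

/* Caricature of the entry to shrink_inactive_list(): wait until the
 * isolated count drains, bounded here only so the model terminates.
 * Returns the number of congestion_wait()-style sleep periods taken. */
static int throttle_direct_reclaim(struct node_stat *st, int max_iters)
{
    int stalls = 0;
    while (too_many_isolated(st) && stalls < max_iters) {
        stalls++;   /* kernel: congestion_wait(BLK_RW_ASYNC, HZ/10) */
        /* In the livelock scenario nothing decrements nr_isolated here,
         * because the tasks that would do so are themselves blocked. */
    }
    return stalls;
}
```

The patch under discussion changes what feeds these counters (accounting isolated pages per zone/node) so the condition can make progress; the model only illustrates why a stuck counter turns the wait into an infinite loop.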
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > > tell whether it has any negative effect, but on the first trial I got
> > > > all allocating threads blocked on wait_for_completion() from flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages
> > > > from workqueue context". There was no warn_alloc() stall warning message
> > > > afterwards.
> > >
> > > That patch is buggy and there is a follow up [1] which is not sitting in
> > > the mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using WQ from the page
> > > allocator. I believe even the follow up needs to have a WQ_RECLAIM WQ.
> > >
> > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net
> >
> > Did you get a chance to test with this follow-up patch? It would be
> > interesting to see whether the OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
>
> So far I can't reproduce problems except the two listed below (the
> cond_resched() trap in printk() and the IDLE priority trap are excluded
> from the list).

OK, so it seems that all the distractions are handled now and linux-next
should provide a reasonable base for testing. You said you weren't able
to reproduce the original long stalls on too_many_isolated(). I would
still be interested to see those oom reports and potential anomalies in
the isolated counts before I send the patch for inclusion, so your further
testing would be more than appreciated. Also stalls > 10s without any
previous occurrences would be interesting.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Tue, Feb 07, 2017 at 10:12:12PM +0100, Michal Hocko wrote:
> This is moot -
> http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org

Thanks! I was just about to go stare at it in more detail.
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Mon 06-02-17 11:39:18, Michal Hocko wrote:
> On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > I got the same warning with ext4. Maybe we need to check carefully.
> >
> > [ 511.215743] =====================================================
> > [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
> > [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
> > [ 511.221689] -----------------------------------------------------
> > [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> > [ 511.225533]  (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x80
> > [ 511.227795]
> > [ 511.227795] and this task is already holding:
> > [ 511.230082]  (jbd2_handle){-.}, at: [] start_this_handle+0x1a7/0x590
> > [ 511.232592] which would create a new lock dependency:
> > [ 511.234192]  (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++}
> > [ 511.235966]
> > [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> > [ 511.238563]  (jbd2_handle){-.}
> > [ 511.238564]
> > [ 511.238564] ... which became RECLAIM_FS-irq-safe at:
> > [ 511.242078]
> > [ 511.242084]   [] __lock_acquire+0x34b/0x1640
> > [ 511.244495]   [] lock_acquire+0xc9/0x250
> > [ 511.246697]   [] jbd2_log_wait_commit+0x55/0x1d0
> [...]
> > [ 511.276216] to a RECLAIM_FS-irq-unsafe lock:
> > [ 511.278128]  (cpu_hotplug.dep_map){++}
> > [ 511.278130]
> > [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at:
> > [ 511.281809] ...
> > [ 511.281811]
> > [ 511.282598]   [] mark_held_locks+0x71/0x90
> > [ 511.284854]   [] lockdep_trace_alloc+0x6f/0xd0
> > [ 511.287218]   [] kmem_cache_alloc_node_trace+0x48/0x3b0
> > [ 511.289755]   [] __smpboot_create_thread.part.2+0x35/0xf0
> > [ 511.292329]   [] smpboot_create_threads+0x66/0x90
> [...]
> > [ 511.317867] other info that might help us debug this:
> > [ 511.317867]
> > [ 511.320920]  Possible interrupt unsafe locking scenario:
> > [ 511.320920]
> > [ 511.323218]        CPU0                    CPU1
> > [ 511.324622]        ----                    ----
> > [ 511.325973]   lock(cpu_hotplug.dep_map);
> > [ 511.327246]                                local_irq_disable();
> > [ 511.328870]                                lock(jbd2_handle);
> > [ 511.330483]                                lock(cpu_hotplug.dep_map);
> > [ 511.332259]
> > [ 511.333187]   lock(jbd2_handle);
>
> Peter, is there any way how to tell the lockdep that this is in fact
> reclaim safe? The direct reclaim only does the trylock and backs off so
> we cannot deadlock here.
>
> Or am I misinterpreting the trace?

This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Tue, Feb 07, 2017 at 07:30:54PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > > The workload is to write to a single file on XFS from 10 processes
> > > demonstrated at
> > > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > > using a "while :; do ./oom-write; done" loop on a VM with 4 CPUs / 2048MB RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion
> > > failures.
> >
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
>
> It takes 10 to 15 minutes. Maybe some size threshold is involved?
>
> > /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> > you still have the output file from a test that reproduced, could you
> > get the 'xfs_io -c "fiemap -v" ' output?
>
> Here it is.
>
> [ 720.199748] 0 pages HighMem/MovableOnly
> [ 720.199749] 150524 pages reserved
> [ 720.199749] 0 pages cma reserved
> [ 720.199750] 0 pages hwpoisoned
> [ 722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [ 722.201784] [ cut here ]
...
>
> # ls -l /tmp/file
> -rw------- 1 kumaneko kumaneko 43426648064 Feb 7 19:25 /tmp/file
> # xfs_io -c "fiemap -v" /tmp/file
> /tmp/file:
>  EXT: FILE-OFFSET            BLOCK-RANGE            TOTAL FLAGS
>    0: [0..262015]:           358739712..359001727  262016   0x0
...
>  187: [84810808..84901119]:  110211736..110302047   90312   0x1

Ok, from the size of the file I realized that I missed that you were
running it in a loop the first time around. I tried playing with it some
more and still haven't been able to reproduce. Anyways, the patch intended
to fix this has been reviewed[1] and queued for the next release, so it's
probably not a big deal since you've already verified it. Thanks again.

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg04083.html
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Brian Foster wrote:
> > The workload is to write to a single file on XFS from 10 processes
> > demonstrated at
> > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > using a "while :; do ./oom-write; done" loop on a VM with 4 CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion
> > failures.
>
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.

It takes 10 to 15 minutes. Maybe some size threshold is involved?

> /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> you still have the output file from a test that reproduced, could you
> get the 'xfs_io -c "fiemap -v" ' output?

Here it is.

[ 720.199748] 0 pages HighMem/MovableOnly
[ 720.199749] 150524 pages reserved
[ 720.199749] 0 pages cma reserved
[ 720.199750] 0 pages hwpoisoned
[ 722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
[ 722.201784] [ cut here ]
[ 722.205940] WARNING: CPU: 0 PID: 4877 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
[ 722.212333] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul vmw_vsock_vmci_transport crc32_pclmul ghash_clmulni_intel vsock aesni_intel crypto_simd cryptd glue_helper ppdev vmw_balloon pcspkr sg parport_pc i2c_piix4 shpchp vmw_vmci parport ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[ 722.243207]  sysimgblt fb_sys_fops mptspi scsi_transport_spi ata_piix ahci ttm mptscsih libahci drm libata mptbase e1000 i2c_core
[ 722.247704] CPU: 0 PID: 4877 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
[ 722.250612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 722.254089] Call Trace:
[ 722.255751]  dump_stack+0x85/0xc9
[ 722.257650]  __warn+0xd1/0xf0
[ 722.259420]  warn_slowpath_null+0x1d/0x20
[ 722.261434]  asswarn+0x33/0x40 [xfs]
[ 722.263356]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[ 722.265695]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[ 722.267792]  ? xfs_ilock+0x1c9/0x360 [xfs]
[ 722.269559]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[ 722.271606]  ? iomap_write_end+0x80/0x80
[ 722.273377]  iomap_apply+0x6c/0x130
[ 722.274969]  iomap_file_buffered_write+0x68/0xa0
[ 722.276702]  ? iomap_write_end+0x80/0x80
[ 722.278311]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[ 722.280394]  ? _raw_spin_unlock+0x27/0x40
[ 722.282247]  xfs_file_write_iter+0x90/0x130 [xfs]
[ 722.284257]  __vfs_write+0xe5/0x140
[ 722.285924]  vfs_write+0xc7/0x1f0
[ 722.287536]  ? syscall_trace_enter+0x1d0/0x380
[ 722.289490]  SyS_write+0x58/0xc0
[ 722.291025]  do_int80_syscall_32+0x6c/0x1f0
[ 722.292671]  entry_INT80_compat+0x38/0x50
[ 722.294298] RIP: 0023:0x8048076
[ 722.295684] RSP: 002b:ffedf840 EFLAGS: 0202 ORIG_RAX: 0004
[ 722.298075] RAX: ffda RBX: 0001 RCX: 08048000
[ 722.300516] RDX: 1000 RSI: RDI:
[ 722.302902] RBP: R08: R09:
[ 722.305278] R10: R11: R12:
[ 722.307567] R13: R14: R15:
[ 722.309792] ---[ end trace 5b7012eeb84093b7 ]---
[ 732.650867] oom_reaper: reaped process 4876 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

# ls -l /tmp/file
-rw------- 1 kumaneko kumaneko 43426648064 Feb 7 19:25 /tmp/file
# xfs_io -c "fiemap -v" /tmp/file
/tmp/file:
 EXT: FILE-OFFSET            BLOCK-RANGE              TOTAL FLAGS
   0: [0..262015]:           358739712..359001727    262016   0x0
   1: [262016..524159]:      367651920..367914063    262144   0x0
   2: [524160..1048447]:     385063864..385588151    524288   0x0
   3: [1048448..1238031]:    463702512..463892095    189584   0x0
   4: [1238032..3335167]:    448234520..450331655   2097136   0x0
   5: [3335168..4769775]:    36165320..37599927     1434608   0x0
   6: [4769776..6897175]:    31677984..33805383     2127400   0x0
   7: [6897176..15285759]:   450331656..458720239   8388584   0x0
   8: [15285760..18520255]:  237497528..240732023   3234496   0x0
   9: [18520256..21063607]:  229750248..232293599   2543352   0x0
  10: [21063608..25257855]:  240732024..244926271   4194248   0x0
  11: [25257856..29452159]:  179523440..183717743   4194304   0x0
  12: [29452160..30380031]
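As a side note for reading the fiemap listing above: xfs_io's "fiemap -v" reports FILE-OFFSET and TOTAL in 512-byte units, so the extent sizes can be cross-checked against the byte size that ls reports. A tiny sanity-check sketch (the 512-byte-unit assumption is taken from xfs_io's basic-block reporting, not from this thread):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a fiemap TOTAL column value (512-byte units) to bytes. */
static uint64_t units512_to_bytes(uint64_t n)
{
    return n * 512;
}

/* File size in bytes expressed in 512-byte units, rounding up,
 * i.e. where EOF falls in the FILE-OFFSET column. */
static uint64_t bytes_to_units512(uint64_t bytes)
{
    return (bytes + 511) / 512;
}
```

For example, extent 1 (TOTAL 262144) is exactly 128 MiB, and the 43426648064-byte file ends at unit 84817672, which lands inside the last listed extent (187, flagged 0x1 for "last").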
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Mon, Feb 06, 2017 at 03:42:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 09:35:33, Brian Foster wrote:
> > On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > > Brian Foster wrote:
> > > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > > [Let's CC more xfs people]
> > > > >
> > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > (1) I got an assertion failure.
> > > > >
> > > > > I suspect this is a result of
> > > > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > > > > I have no idea what the assert means though.
> > > > >
> > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > >
> > > > Indirect block reservation underrun on delayed allocation extent merge.
> > > > These are extra blocks used for the inode bmap btree when a delalloc
> > > > extent is converted to physical blocks. We're in a case where we expect
> > > > to only ever free excess blocks due to a merge of extents with
> > > > independent reservations, but a situation occurs where we actually need
> > > > blocks and hence the assert fails. This can occur if an extent is merged
> > > > with one that has a reservation less than the expected worst case
> > > > reservation for its size (due to previous extent splits caused by hole
> > > > punches, for example). Therefore, I think the core expectation that
> > > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > > pre-reserved is invalid.
> > > >
> > > > Can you describe the workload that reproduces this? FWIW, I think the
> > > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > > and have a couple patches to fix up indlen reservation that I haven't
> > > > posted yet. The diff that deals with this particular bit is appended.
> > > > Care to give that a try?
> > >
> > > The workload is to write to a single file on XFS from 10 processes
> > > demonstrated at
> > > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > > using a "while :; do ./oom-write; done" loop on a VM with 4 CPUs / 2048MB RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion
> > > failures.
> >
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> >
> > Normally I'd expect to see this kind of thing on a hole punching
> > workload or dealing with large, sparse files that make use of
> > speculative preallocation (post-eof blocks allocated in anticipation of
> > file extending writes). I'm wondering if what is happening here is that
> > the appending writes and file closes due to oom kills are generating
> > speculative preallocs and prealloc truncates, respectively, and that
> > causes prealloc extents at the eof boundary to be split up and then
> > re-merged by surviving appending writers.
>
> Can those preallocs be affected by
> http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org ?

Hmm, I wouldn't expect that to make much of a difference wrt the core
problem. The prealloc is created on a file-extending write that requires
block allocation (we basically just tack extra blocks onto an extending
alloc based on some heuristics like the size of the file and the previous
extent). Whether that allocation occurs on one iomap iteration or another
due to a short write and retry, I wouldn't expect to matter that much.

I suppose it could change the behavior of a specialized workload, though.
E.g., if it caused a write() call to return more quickly and thus lead to
a file close(). We do use file release as an indication that prealloc will
not be used and can reclaim it at that point (presumably causing an extent
split with pre-eof blocks).

Brian

> -- 
> Michal Hocko
> SUSE Labs
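To make the indlen underrun concrete, here is a toy model of the merge arithmetic under discussion. The worst_indlen() heuristic below is invented for illustration (the real calculation, xfs_bmap_worst_indlen(), depends on the inode's btree geometry), but the structural point matches the description above: after splits, an extent can carry less than its worst-case reservation, so the naive worst case for a merged extent can exceed the combined existing reservations (oldlen), tripping the "oldlen > newlen" assert; an XFS_FILBLKS_MIN()-style clamp caps the new reservation at oldlen so a merge only ever frees blocks.

```c
#include <assert.h>

/* Toy worst-case indirect-block reservation for an extent of `len`
 * filesystem blocks: one indirect block per 1024 blocks, plus one.
 * Purely illustrative; not the XFS formula. */
static long worst_indlen(long len)
{
    return (len + 1023) / 1024 + 1;
}

static long min_l(long a, long b)
{
    return a < b ? a : b;
}

/* Reservation for the merged extent: never more than what the two
 * extents already hold (oldlen), so the merge can only free blocks. */
static long merged_indlen(long res_left, long res_right, long merged_len)
{
    long oldlen = res_left + res_right;
    return min_l(worst_indlen(merged_len), oldlen);
}
```

With two 4096-block extents each holding their full worst-case reservation of 5, a merge frees a block (9 needed vs. 10 held). But if a prior split left one side with only 1 reserved block, the naive worst case of 9 exceeds the held 6, which is the underrun the assert catches; the clamp returns 6 instead.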
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Mon 06-02-17 09:35:33, Brian Foster wrote:
> On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > Brian Foster wrote:
> > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > [Let's CC more xfs people]
> > > >
> > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > [...]
> > > > > (1) I got an assertion failure.
> > > >
> > > > I suspect this is a result of
> > > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org
> > > > I have no idea what the assert means though.
> > > >
> > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > >
> > > Indirect block reservation underrun on delayed allocation extent merge.
> > > These are extra blocks used for the inode bmap btree when a delalloc
> > > extent is converted to physical blocks. We're in a case where we expect
> > > to only ever free excess blocks due to a merge of extents with
> > > independent reservations, but a situation occurs where we actually need
> > > blocks and hence the assert fails. This can occur if an extent is merged
> > > with one that has a reservation less than the expected worst case
> > > reservation for its size (due to previous extent splits caused by hole
> > > punches, for example). Therefore, I think the core expectation that
> > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > pre-reserved is invalid.
> > >
> > > Can you describe the workload that reproduces this? FWIW, I think the
> > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > and have a couple patches to fix up indlen reservation that I haven't
> > > posted yet. The diff that deals with this particular bit is appended.
> > > Care to give that a try?
> >
> > The workload is to write to a single file on XFS from 10 processes
> > demonstrated at
> > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp
> > using a "while :; do ./oom-write; done" loop on a VM with 4 CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion
> > failures.
>
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.
>
> Normally I'd expect to see this kind of thing on a hole punching
> workload or dealing with large, sparse files that make use of
> speculative preallocation (post-eof blocks allocated in anticipation of
> file extending writes). I'm wondering if what is happening here is that
> the appending writes and file closes due to oom kills are generating
> speculative preallocs and prealloc truncates, respectively, and that
> causes prealloc extents at the eof boundary to be split up and then
> re-merged by surviving appending writers.

Can those preallocs be affected by
http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org ?
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote: > Brian Foster wrote: > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote: > > > [Let's CC more xfs people] > > > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > > > [...] > > > > (1) I got an assertion failure. > > > > > > I suspect this is a result of > > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org > > > I have no idea what the assert means though. > > > > > > > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, > > > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now > > > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: > > > > fs/xfs/libxfs/xfs_bmap.c, line: 2867 > > > > Indirect block reservation underrun on delayed allocation extent merge. > > These are extra blocks are used for the inode bmap btree when a delalloc > > extent is converted to physical blocks. We're in a case where we expect > > to only ever free excess blocks due to a merge of extents with > > independent reservations, but a situation occurs where we actually need > > blocks and hence the assert fails. This can occur if an extent is merged > > with one that has a reservation less than the expected worst case > > reservation for its size (due to previous extent splits due to hole > > punches, for example). Therefore, I think the core expectation that > > xfs_bmap_add_extent_hole_delay() will always have enough blocks > > pre-reserved is invalid. > > > > Can you describe the workload that reproduces this? FWIW, I think the > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken > > and have a couple patches to fix up indlen reservation that I haven't > > posted yet. The diff that deals with this particular bit is appended. > > Care to give that a try? 
> > The workload is to write to a single file on XFS from 10 processes > demonstrated at > http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM. > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion > failures. > Thanks for testing. Well, that's an interesting workload. I couldn't reproduce it on a few quick tries in a similarly configured vm. Normally I'd expect to see this kind of thing on a hole-punching workload or one dealing with large, sparse files that make use of speculative preallocation (post-eof blocks allocated in anticipation of file-extending writes). I'm wondering if what is happening here is that the appending writes and file closes due to oom kills are generating speculative preallocs and prealloc truncates, respectively, and that causes prealloc extents at the eof boundary to be split up and then re-merged by surviving appending writers. /tmp/file _is_ on an XFS filesystem in your test, correct? If so, and if you still have the output file from a test that reproduced, could you get the 'xfs_io -c "fiemap -v" ' output? I suppose another possibility is that prealloc occurs, write failure(s) lead to extent splits via unmapping the target range of the write, and then surviving writers generate the warning on a delalloc extent merge. Brian
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Sun 05-02-17 19:43:07, Tetsuo Handa wrote: > Michal Hocko wrote: > I got same warning with ext4. Maybe we need to check carefully. > > [ 511.215743] = > [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order > detected > [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted > [ 511.221689] - > [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > [ 511.225533] (cpu_hotplug.dep_map){++}, at: [] > get_online_cpus+0x37/0x80 > [ 511.227795] > [ 511.227795] and this task is already holding: > [ 511.230082] (jbd2_handle){-.}, at: [] > start_this_handle+0x1a7/0x590 > [ 511.232592] which would create a new lock dependency: > [ 511.234192] (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++} > [ 511.235966] > [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock: > [ 511.238563] (jbd2_handle){-.} > [ 511.238564] > [ 511.238564] ... which became RECLAIM_FS-irq-safe at: > [ 511.242078] > [ 511.242084] [] __lock_acquire+0x34b/0x1640 > [ 511.244495] [] lock_acquire+0xc9/0x250 > [ 511.246697] [] jbd2_log_wait_commit+0x55/0x1d0 [...] > [ 511.276216] to a RECLAIM_FS-irq-unsafe lock: > [ 511.278128] (cpu_hotplug.dep_map){++} > [ 511.278130] > [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at: > [ 511.281809] ... > [ 511.281811] > [ 511.282598] [] mark_held_locks+0x71/0x90 > [ 511.284854] [] lockdep_trace_alloc+0x6f/0xd0 > [ 511.287218] [] kmem_cache_alloc_node_trace+0x48/0x3b0 > [ 511.289755] [] __smpboot_create_thread.part.2+0x35/0xf0 > [ 511.292329] [] smpboot_create_threads+0x66/0x90 [...] 
> [ 511.317867] other info that might help us debug this:
> [ 511.320920]  Possible interrupt unsafe locking scenario:
> [ 511.323218]        CPU0                    CPU1
> [ 511.324622]        ----                    ----
> [ 511.325973]   lock(cpu_hotplug.dep_map);
> [ 511.327246]                                local_irq_disable();
> [ 511.328870]                                lock(jbd2_handle);
> [ 511.330483]                                lock(cpu_hotplug.dep_map);
> [ 511.332259]   <Interrupt>
> [ 511.333187]     lock(jbd2_handle);

Peter, is there any way to tell lockdep that this is in fact reclaim safe? The direct reclaim only does a trylock and backs off, so we cannot deadlock here. Or am I misinterpreting the trace? -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Sun 05-02-17 19:43:07, Tetsuo Handa wrote: [...] > Below one is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT ? No, GFP_NOWAIT is just too generic to use this flag. > [ 257.781715] Out of memory: Kill process 5171 (a.out) score 842 or > sacrifice child > [ 257.784726] Killed process 5171 (a.out) total-vm:2177096kB, > anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB > [ 257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 > mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN) > [ 257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted > 4.10.0-rc6-next-20170202+ #500 > [ 257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX > Desktop Reference Platform, BIOS 6.00 07/02/2015 > [ 257.794700] Call Trace: > [ 257.795690] dump_stack+0x85/0xc9 > [ 257.797224] __alloc_pages_slowpath+0xacb/0xe36 > [ 257.798612] __alloc_pages_nodemask+0x382/0x3d0 > [ 257.799942] alloc_pages_current+0x97/0x1b0 > [ 257.801236] __get_free_pages+0x14/0x50 > [ 257.802546] __tlb_remove_page_size+0x70/0xd0 This is bound to MAX_GATHER_BATCH_COUNT which shouldn't be a lot of pages (20 or so). We could add __GFP_NOMEMALLOC into tlb_next_batch but I am not entirely convinced it is really necessary. -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Brian Foster wrote: > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote: > > [Let's CC more xfs people] > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > > [...] > > > (1) I got an assertion failure. > > > > I suspect this is a result of > > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org > > I have no idea what the assert means though. > > > > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, > > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now > > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: > > > fs/xfs/libxfs/xfs_bmap.c, line: 2867 > > Indirect block reservation underrun on delayed allocation extent merge. > These are extra blocks are used for the inode bmap btree when a delalloc > extent is converted to physical blocks. We're in a case where we expect > to only ever free excess blocks due to a merge of extents with > independent reservations, but a situation occurs where we actually need > blocks and hence the assert fails. This can occur if an extent is merged > with one that has a reservation less than the expected worst case > reservation for its size (due to previous extent splits due to hole > punches, for example). Therefore, I think the core expectation that > xfs_bmap_add_extent_hole_delay() will always have enough blocks > pre-reserved is invalid. > > Can you describe the workload that reproduces this? FWIW, I think the > way xfs_bmap_add_extent_hole_delay() currently works is likely broken > and have a couple patches to fix up indlen reservation that I haven't > posted yet. The diff that deals with this particular bit is appended. > Care to give that a try? 
The workload is to write to a single file on XFS from 10 processes demonstrated at http://lkml.kernel.org/r/201512052133.iae00551.lsoqftmffvo...@i-love.sakura.ne.jp using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM. With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote: > [CC Petr] > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > [...] > > (2) I got a lockdep warning. (A new false positive?) > > Yes, I suspect this is a false possitive. I do not see how we can > deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which > means that a potential recursion to the page allocator during draining > would just bail out on the trylock. Maybe I am misinterpreting the > report though. > I got same warning with ext4. Maybe we need to check carefully. [ 511.215743] = [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted [ 511.221689] - [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: [ 511.225533] (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x80 [ 511.227795] [ 511.227795] and this task is already holding: [ 511.230082] (jbd2_handle){-.}, at: [] start_this_handle+0x1a7/0x590 [ 511.232592] which would create a new lock dependency: [ 511.234192] (jbd2_handle){-.} -> (cpu_hotplug.dep_map){++} [ 511.235966] [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock: [ 511.238563] (jbd2_handle){-.} [ 511.238564] [ 511.238564] ... 
which became RECLAIM_FS-irq-safe at: [ 511.242078] [ 511.242084] [] __lock_acquire+0x34b/0x1640 [ 511.244492] [ 511.244495] [] lock_acquire+0xc9/0x250 [ 511.246694] [ 511.246697] [] jbd2_log_wait_commit+0x55/0x1d0 [ 511.249323] [ 511.249328] [] jbd2_complete_transaction+0x71/0x90 [ 511.252069] [ 511.252074] [] ext4_evict_inode+0x356/0x760 [ 511.254753] [ 511.254757] [] evict+0xd1/0x1a0 [ 511.257062] [ 511.257065] [] dispose_list+0x4d/0x80 [ 511.259531] [ 511.259535] [] prune_icache_sb+0x5a/0x80 [ 511.261953] [ 511.261957] [] super_cache_scan+0x141/0x190 [ 511.264540] [ 511.264545] [] shrink_slab+0x29f/0x6d0 [ 511.267165] [ 511.267171] [] shrink_node+0x2fa/0x310 [ 511.269455] [ 511.269459] [] kswapd+0x362/0x9b0 [ 511.271831] [ 511.271834] [] kthread+0x10f/0x150 [ 511.274031] [ 511.274035] [] ret_from_fork+0x31/0x40 [ 511.276216] [ 511.276216] to a RECLAIM_FS-irq-unsafe lock: [ 511.278128] (cpu_hotplug.dep_map){++} [ 511.278130] [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at: [ 511.281809] ... 
[ 511.281811] [ 511.282598] [] mark_held_locks+0x71/0x90 [ 511.284852] [ 511.284854] [] lockdep_trace_alloc+0x6f/0xd0 [ 511.287215] [ 511.287218] [] kmem_cache_alloc_node_trace+0x48/0x3b0 [ 511.289751] [ 511.289755] [] __smpboot_create_thread.part.2+0x35/0xf0 [ 511.292326] [ 511.292329] [] smpboot_create_threads+0x66/0x90 [ 511.295025] [ 511.295030] [] cpuhp_invoke_callback+0x229/0x9e0 [ 511.299245] [ 511.299253] [] cpuhp_up_callbacks+0x37/0xb0 [ 511.301889] [ 511.301894] [] _cpu_up+0x89/0xf0 [ 511.304270] [ 511.304275] [] do_cpu_up+0x85/0xb0 [ 511.306428] [ 511.306431] [] cpu_up+0x13/0x20 [ 511.308533] [ 511.308535] [] smp_init+0x6b/0xcc [ 511.310710] [ 511.310713] [] kernel_init_freeable+0x17d/0x2ac [ 511.313232] [ 511.313235] [] kernel_init+0xe/0x110 [ 511.315616] [ 511.315620] [] ret_from_fork+0x31/0x40 [ 511.317867] [ 511.317867] other info that might help us debug this: [ 511.317867] [ 511.320920] Possible interrupt unsafe locking scenario: [ 511.320920] [ 511.323218]CPU0CPU1 [ 511.324622] [ 511.325973] lock(cpu_hotplug.dep_map); [ 511.327246]local_irq_disable(); [ 511.328870]lock(jbd2_handle); [ 511.330483]lock(cpu_hotplug.dep_map); [ 511.332259] [ 511.333187] lock(jbd2_handle); [ 511.334304] [ 511.334304] *** DEADLOCK *** [ 511.334304] [ 511.336749] 4 locks held by a.out/49302: [ 511.338129] #0: (sb_writers#8){.+.+.+}, at: [] mnt_want_write+0x24/0x50 [ 511.340768] #1: (&type->i_mutex_dir_key#3){++}, at: [] path_openat+0x60b/0xd50 [ 511.343744] #2: (jbd2_handle){-.}, at: [] start_this_handle+0x1a7/0x590 [ 511.345743] #3: (pcpu_drain_mutex){+.+...}, at: [] drain_all_pages.part.89+0x1f/0x2c0 [ 511.348605] [ 511.348605] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: [ 511.351336] -> (jbd2_handle){-.} ops: 203220 { [ 511.352768]HARDIRQ-ON-W at: [ 511.353827] [ 511.353833] [] __lock_acquire+0x9de/0x1640 [ 511.356489] [ 511.356492] [] lock_acquire+0xc9/0x250 [ 511.359063] [ 511.359067] [] jbd2_log_wait_commit+0x55/0x1d0 [ 511.361905] [ 
511.361908] [] jbd2_complete_transaction+0x71/0x90 [ 511.364560] [ 511.364563] [] ext4_sync_file+0x2
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote: > [Let's CC more xfs people] > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > [...] > > (1) I got an assertion failure. > > I suspect this is a result of > http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org > I have no idea what the assert means though. > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, > > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now > > anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: > > fs/xfs/libxfs/xfs_bmap.c, line: 2867 Indirect block reservation underrun on delayed allocation extent merge. These are extra blocks that are used for the inode bmap btree when a delalloc extent is converted to physical blocks. We're in a case where we expect to only ever free excess blocks due to a merge of extents with independent reservations, but a situation occurs where we actually need blocks and hence the assert fails. This can occur if an extent is merged with one that has a reservation less than the expected worst-case reservation for its size (due, for example, to previous extent splits from hole punches). Therefore, I think the core expectation that xfs_bmap_add_extent_hole_delay() will always have enough blocks pre-reserved is invalid. Can you describe the workload that reproduces this? FWIW, I think the way xfs_bmap_add_extent_hole_delay() currently works is likely broken and I have a couple of patches to fix up indlen reservation that I haven't posted yet. The diff that deals with this particular bit is appended. Care to give that a try? 
Brian > > [ 972.125085] [ cut here ] > > [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 > > asswarn+0x33/0x40 [xfs] > > [ 972.136146] Modules linked in: nf_conntrack_netbios_ns > > nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT > > nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul > > ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd > > cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat > > nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle > > ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 > > nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables > > ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp > > i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c > > sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx > > drm_kms_helper syscopyarea sysfillrect > > [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi > > scsi_transport_spi mptscsih e1000 libata i2c_core mptbase > > [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted > > 4.10.0-rc6-next-20170202 #498 > > [ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX > > Desktop Reference Platform, BIOS 6.00 07/02/2015 > > [ 972.178381] Call Trace: ... 
---8<---
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bfc00de..d2e48ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2809,7 +2809,8 @@ xfs_bmap_add_extent_hole_delay(
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2830,7 +2831,8 @@ xfs_bmap_add_extent_hole_delay(
 		xfs_bmbt_set_blockcount(xfs_iext_get_ext(ifp, *idx), temp);
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2846,7 +2848,8 @@ xfs_bmap_add_extent_hole_delay(
 		temp = new->br_blockcount + right.br_blockcount;
 		oldlen = startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
[CC Petr] On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: [...] > (2) I got a lockdep warning. (A new false positive?) Yes, I suspect this is a false positive. I do not see how we can deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which means that a potential recursion into the page allocator during draining would just bail out on the trylock. Maybe I am misinterpreting the report though. > [ 243.036975] = > [ 243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order > detected > [ 243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted > [ 243.054619] - > [ 243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > [ 243.060310] (cpu_hotplug.dep_map){++}, at: [] > get_online_cpus+0x32/0x80 > [ 243.063462] > [ 243.063462] and this task is already holding: > [ 243.066851] (&xfs_dir_ilock_class){-.}, at: [] > xfs_ilock+0x114/0x290 [xfs] > [ 243.069949] which would create a new lock dependency: > [ 243.072143] (&xfs_dir_ilock_class){-.} -> > (cpu_hotplug.dep_map){++} > [ 243.074789] > [ 243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock: > [ 243.078735] (&xfs_dir_ilock_class){-.} > [ 243.078739] > [ 243.078739] ... 
which became RECLAIM_FS-irq-safe at: > [ 243.084175] > [ 243.084180] [] __lock_acquire+0x344/0x1bb0 > [ 243.087257] > [ 243.087261] [] lock_acquire+0xe0/0x2a0 > [ 243.090027] > [ 243.090033] [] down_write_nested+0x59/0xc0 > [ 243.092838] > [ 243.092888] [] xfs_ilock+0x14e/0x290 [xfs] > [ 243.095453] > [ 243.095485] [] xfs_reclaim_inode+0x135/0x340 [xfs] > [ 243.098083] > [ 243.098109] [] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] > [ 243.100668] > [ 243.100692] [] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] > [ 243.103191] > [ 243.103221] [] xfs_fs_free_cached_objects+0x14/0x20 [xfs] > [ 243.105710] > [ 243.105714] [] super_cache_scan+0x17c/0x190 > [ 243.107947] > [ 243.107950] [] shrink_slab+0x29a/0x710 > [ 243.110133] > [ 243.110135] [] shrink_node+0x23d/0x320 > [ 243.112262] > [ 243.112264] [] kswapd+0x354/0xa10 > [ 243.114323] > [ 243.114326] [] kthread+0x10a/0x140 > [ 243.116448] > [ 243.116452] [] ret_from_fork+0x31/0x40 > [ 243.118692] > [ 243.118692] to a RECLAIM_FS-irq-unsafe lock: > [ 243.120636] (cpu_hotplug.dep_map){++} > [ 243.120638] > [ 243.120638] ... which became RECLAIM_FS-irq-unsafe at: > [ 243.124021] ... 
> [ 243.124022] > [ 243.124820] [] mark_held_locks+0x71/0x90 > [ 243.127033] > [ 243.127035] [] lockdep_trace_alloc+0xc5/0x110 > [ 243.129228] > [ 243.129231] [] kmem_cache_alloc_node_trace+0x4a/0x410 > [ 243.131534] > [ 243.131536] [] __smpboot_create_thread.part.3+0x30/0xf0 > [ 243.133850] > [ 243.133852] [] smpboot_create_threads+0x61/0x90 > [ 243.136113] > [ 243.136119] [] cpuhp_invoke_callback+0xbb/0xb70 > [ 243.138319] > [ 243.138320] [] cpuhp_up_callbacks+0x32/0xb0 > [ 243.140479] > [ 243.140480] [] _cpu_up+0x84/0xf0 > [ 243.142484] > [ 243.142485] [] do_cpu_up+0x84/0xd0 > [ 243.144716] > [ 243.144719] [] cpu_up+0xe/0x10 > [ 243.146684] > [ 243.146687] [] smp_init+0xd5/0x141 > [ 243.148755] > [ 243.148758] [] kernel_init_freeable+0x17d/0x2a7 > [ 243.150932] > [ 243.150936] [] kernel_init+0x9/0x100 > [ 243.153088] > [ 243.153092] [] ret_from_fork+0x31/0x40 > [ 243.155135] > [ 243.155135] other info that might help us debug this: > [ 243.155135] > [ 243.157724] Possible interrupt unsafe locking scenario: > [ 243.157724] > [ 243.159877]CPU0CPU1 > [ 243.161047] > [ 243.162210] lock(cpu_hotplug.dep_map); > [ 243.163279]local_irq_disable(); > [ 243.164669]lock(&xfs_dir_ilock_class); > [ 243.166148]lock(cpu_hotplug.dep_map); > [ 243.167653] > [ 243.168594] lock(&xfs_dir_ilock_class); > [ 243.169694] > [ 243.169694] *** DEADLOCK *** > [ 243.169694] > [ 243.171864] 3 locks held by awk/8767: > [ 243.172872] #0: (&type->i_mutex_dir_key#3){++}, at: > [] path_openat+0x53c/0xa90 > [ 243.174791] #1: (&xfs_dir_ilock_class){-.}, at: [] > xfs_ilock+0x114/0x290 [xfs] > [ 243.176899] #2: (pcpu_drain_mutex){+.+...}, at: [] > drain_all_pages.part.80+0x1a/0x320 > [ 243.178875] > [ 243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the > holding lock: > [ 243.181262] -> (&xfs_dir_ilock_class){-.} ops: 17348 { > [ 243.182610]HARDIRQ-ON-W at: > [ 243.183603] > [ 243.183606] [] __lock_acquire+0x794/0x1bb0 > [ 243.186056] > [ 243.186059] [] lock_acquire+0xe0/0x2a0 > [ 
243.188419] > [ 243.188422] [] down_write_nested+0x59/0xc0 > [ 243.190909]
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
[Let's CC more xfs people] On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: [...] > (1) I got an assertion failure. I suspect this is a result of http://lkml.kernel.org/r/20170201092706.9966-2-mho...@kernel.org I have no idea what the assert means though. > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, > anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, > file-rss:0kB, shmem-rss:0kB > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: > fs/xfs/libxfs/xfs_bmap.c, line: 2867 > [ 972.125085] [ cut here ] > [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 > asswarn+0x33/0x40 [xfs] > [ 972.136146] Modules linked in: nf_conntrack_netbios_ns > nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT > nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul > ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd > cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat > nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle > ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat > nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables > ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp > i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c > sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx > drm_kms_helper syscopyarea sysfillrect > [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi > scsi_transport_spi mptscsih e1000 libata i2c_core mptbase > [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted > 4.10.0-rc6-next-20170202 #498 > [ 972.175126] Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX > Desktop Reference Platform, BIOS 6.00 07/02/2015 > [ 972.178381] Call Trace: > [ 972.180003] dump_stack+0x85/0xc9 > [ 972.181682] __warn+0xd1/0xf0 > [ 972.183374] warn_slowpath_null+0x1d/0x20 > [ 972.185223] asswarn+0x33/0x40 [xfs] > [ 972.186950] xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs] > [ 972.189055] xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs] > [ 972.191263] ? xfs_ilock+0x1c9/0x360 [xfs] > [ 972.193414] xfs_file_iomap_begin+0x880/0x1140 [xfs] > [ 972.195300] ? iomap_write_end+0x80/0x80 > [ 972.196980] iomap_apply+0x6c/0x130 > [ 972.198539] iomap_file_buffered_write+0x68/0xa0 > [ 972.200316] ? iomap_write_end+0x80/0x80 > [ 972.201950] xfs_file_buffered_aio_write+0x132/0x390 [xfs] > [ 972.203868] ? _raw_spin_unlock+0x27/0x40 > [ 972.205470] xfs_file_write_iter+0x90/0x130 [xfs] > [ 972.207167] __vfs_write+0xe5/0x140 > [ 972.208752] vfs_write+0xc7/0x1f0 > [ 972.210233] ? syscall_trace_enter+0x1d0/0x380 > [ 972.211809] SyS_write+0x58/0xc0 > [ 972.213166] do_int80_syscall_32+0x6c/0x1f0 > [ 972.214676] entry_INT80_compat+0x38/0x50 > [ 972.216168] RIP: 0023:0x8048076 > [ 972.217494] RSP: 002b:ff997020 EFLAGS: 0202 ORIG_RAX: > 0004 > [ 972.219635] RAX: ffda RBX: 0001 RCX: > 08048000 > [ 972.221679] RDX: 1000 RSI: RDI: > > [ 972.223774] RBP: R08: R09: > > [ 972.225905] R10: R11: R12: > > [ 972.227946] R13: R14: R15: > > [ 972.230064] ---[ end trace d498098daec56c11 ]--- > [ 984.210890] vmtoolsd invoked oom-killer: > gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), > order=0, oom_score_adj=0 > [ 984.224191] vmtoolsd cpuset=/ mems_allowed=0 > [ 984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: GW > 4.10.0-rc6-next-20170202 #498 -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Mon 30-01-17 09:55:46, Michal Hocko wrote: > > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > > [...] > > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > > > tell whether it has any negative effect, but I got on the first trial > > > > that > > > > all allocating threads are blocked on wait_for_completion() from > > > > flush_work() > > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages > > > > from > > > > workqueue context". There was no warn_alloc() stall warning message > > > > afterwards. > > > > > > That patch is buggy and there is a follow up [1] which is not sitting in > > > the > > > mmotm (and thus linux-next) yet. I didn't get to review it properly and > > > I cannot say I would be too happy about using WQ from the page > > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ. > > > > > > [1] > > > http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net > > > > Did you get a chance to test with this follow up patch? It would be > > interesting to see whether an OOM situation can still starve the waiter. > > The current linux-next should contain this patch. > > So far I can't reproduce problems except two listed below (cond_resched() trap > in printk() and IDLE priority trap are excluded from the list). But I agree > that > the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible > that an allocation request which can trigger the OOM killer waits for the > system_wq while there is already a work which is in system_wq which is looping > forever inside the page allocator without triggering the OOM killer. Well, this shouldn't happen AFAICS because a new worker would be requested and that would certainly require memory, and that allocation would trigger the OOM killer. 
On the other hand I agree that it would be safer to not depend on memory allocation from within the page allocator. > Maybe the follow up patch can share the vmstat WQ? Yes, this would be an option. -- Michal Hocko SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote: > On Mon 30-01-17 09:55:46, Michal Hocko wrote: > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > [...] > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > > tell whether it has any negative effect, but I got on the first trial that > > > all allocating threads are blocked on wait_for_completion() from > > > flush_work() > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages > > > from > > > workqueue context". There was no warn_alloc() stall warning message > > > afterwords. > > > > That patch is buggy and there is a follow up [1] which is not sitting in the > > mmotm (and thus linux-next) yet. I didn't get to review it properly and > > I cannot say I would be too happy about using WQ from the page > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ. > > > > [1] > > http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net > > Did you get chance to test with this follow up patch? It would be > interesting to see whether OOM situation can still starve the waiter. > The current linux-next should contain this patch. So far I can't reproduce problems except two listed below (cond_resched() trap in printk() and IDLE priority trap are excluded from the list). But I agree that the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible that an allocation request which can trigger the OOM killer waits for the system_wq while there is already a work which is in system_wq which is looping forever inside the page allocator without triggering the OOM killer. Maybe the follow up patch can share the vmstat WQ? (1) I got an assertion failure. 
[ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
[ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
[ 972.125085] ------------[ cut here ]------------
[ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
[ 972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[ 972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
[ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
[ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 972.178381] Call Trace:
[ 972.180003]  dump_stack+0x85/0xc9
[ 972.181682]  __warn+0xd1/0xf0
[ 972.183374]  warn_slowpath_null+0x1d/0x20
[ 972.185223]  asswarn+0x33/0x40 [xfs]
[ 972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[ 972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[ 972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
[ 972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[ 972.195300]  ? iomap_write_end+0x80/0x80
[ 972.196980]  iomap_apply+0x6c/0x130
[ 972.198539]  iomap_file_buffered_write+0x68/0xa0
[ 972.200316]  ? iomap_write_end+0x80/0x80
[ 972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[ 972.203868]  ? _raw_spin_unlock+0x27/0x40
[ 972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
[ 972.207167]  __vfs_write+0xe5/0x140
[ 972.208752]  vfs_write+0xc7/0x1f0
[ 972.210233]  ? syscall_trace_enter+0x1d0/0x380
[ 972.211809]  SyS_write+0x58/0xc0
[ 972.213166]  do_int80_syscall_32+0x6c/0x1f0
[ 972.214676]  entry_INT80_compat+0x38/0x50
[ 972.216168] RIP: 0023:0x8048076
[ 972.217494] RSP: 002b:ff997020 EFLAGS: 0202 ORIG_RAX: 0004
[ 972.219635] RAX: ffda RBX: 0001 RCX: 08048000
[ 972.221679] RDX: 1000 RSI: RDI:
[ 972.223774] RBP: R08: R09:
[ 972.225905] R10: R11: R12:
[ 972.227946] R13: R14: R15:
[ 972.230064] ---[ end trace d498098daec56c11 ]---
[ 984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
[...]
> > Regarding [1], it helped avoid the too_many_isolated() issue. I can't
> > tell whether it has any negative effect, but I got on the first trial that
> > all allocating threads are blocked on wait_for_completion() from flush_work()
> > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > workqueue context". There was no warn_alloc() stall warning message afterwards.
>
> That patch is buggy and there is a follow up [1] which is not sitting in the
> mmotm (and thus linux-next) yet. I didn't get to review it properly and
> I cannot say I would be too happy about using WQ from the page
> allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
>
> [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7...@techsingularity.net

Did you get a chance to test with this follow up patch? It would be
interesting to see whether OOM situation can still starve the waiter.
The current linux-next should contain this patch.

Thanks!
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Tue 31-01-17 13:51:40, Christoph Hellwig wrote:
> On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> > What do you think Christoph? I have an additional patch to handle
> > do_generic_file_read and a similar one to back off in
> > __vmalloc_area_node. I would like to post them all in one series but I
> > would like to know that this one is OK before I do that.
>
> Well, that patch you posted is okay, but you probably need additional
> ones for the other interesting users of iomap_apply.

I have checked all of them, I guess/hope. Which one do you have in mind?
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote: > What do you think Christoph? I have an additional patch to handle > do_generic_file_read and a similar one to back off in > __vmalloc_area_node. I would like to post them all in one series but I > would like to know that this one is OK before I do that. Well, that patch you posted is okay, but you probably need additional ones for the other interesting users of iomap_apply.
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > >
> > > > Hi Michal,
> > > >
> > > > this looks reasonable to me. But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > >
> > > I wasn't sure about the expected semantics of iomap_apply but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
> > > pattern to return the number of written pages or an error but rather
> > > propagate the error out. From my limited understanding of those code
> > > paths that should just be ok. I was not all that sure about iomap_dio_rw,
> > > which is just too convoluted for me. If that one is OK as well then
> > > the following patch should be indeed better.
> >
> > Is "length" in the
> >
> >	written = actor(inode, pos, length, data, &iomap);
> >
> > call guaranteed to be small enough? If not, don't we need to check SIGKILL
> > inside the "actor" functions?
>
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply, which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them, which is also where we
> get the pagecache page, so I guess this should be the "right" place to
> put the check in. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think Christoph?

What do you think Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all in one series but I
would like to know that this one is OK before I do that.

Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
>
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
>  __alloc_pages_nodemask+0x436/0x4d0
>  alloc_pages_current+0x97/0x1b0
>  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
>  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
>  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
>  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
>  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
>  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
>  iomap_apply+0xb3/0x130                  fs/iomap.c:79
>  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
>  ? iomap_write_end+0x80/0x80
>  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
>  ? remove_wait_queue+0x59/0x60
>  xfs_file_write_iter+0x90/0x130 [xfs]
>  __vfs_write+0xe5/0x140
>  vfs_write+0xc7/0x1f0
>  ? syscall_trace_enter+0x1d0/0x380
>  SyS_write+0x58/0xc0
>  do_syscall_64+0x6c/0x200
>  entry_SYSCALL64_slow_path+0x25/0x25
>
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
>
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa
> Signed-off-by: Michal Hocko
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo,
> > before we settle on the proper fix for this issue, could you give the
> > patch a try and try to reproduce the too_many_isolated() issue or
> > just see whether patch [1] has any negative effect on your oom stress
> > testing?
> >
> > [1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz
>
> I tested with both [1] and the below patch applied on linux-next-20170125 and
> the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .
>
> Regarding the below patch, it helped avoid complete memory depletion with
> large write() requests. I don't know whether the below patch helps avoid
> complete memory depletion when reading a large amount (in other words, I
> don't know whether this check is done for large read() requests).

It's not AFAICS. do_generic_file_read doesn't do the fatal_signal_pending
check.

> But I believe that __GFP_KILLABLE (despite the limitation that there are
> unkillable waits in the reclaim path) is a better solution than scattering
> fatal_signal_pending() around in the callers. The reason we check SIGKILL
> here is to avoid allocating more memory than needed. If we check SIGKILL
> in the entry point of __alloc_pages_nodemask() and at the retry: label in
> __alloc_pages_slowpath(), we waste 0 pages. Regardless of whether the OOM
> killer is invoked, or whether memory can be allocated without a direct
> reclaim operation, not allocating memory unless needed (in other words,
> allowing the page allocator to fail immediately if the caller can give up
> on SIGKILL and SIGKILL is pending) makes sense. It will reduce the
> possibility of OOM livelock on CONFIG_MMU=n kernels where the OOM reaper
> is not available.

I am not really convinced this is a good idea. Putting aside the fuzzy
semantics of __GFP_KILLABLE, we would have to use this flag in all
potentially allocating places in the read/write paths, and then it is just
easier to do the explicit checks in the loops around those allocations.

> > On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> > [...]
> > > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko
> > > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> > >
> > > Tetsuo has noticed that an OOM stress test which performs large write
> > > requests can cause the full memory reserves depletion. He has tracked
> > > this down to the following path
> > >  __alloc_pages_nodemask+0x436/0x4d0
> > >  alloc_pages_current+0x97/0x1b0
> > >  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> > >  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> > >  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > >  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> > >  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> > >  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
> > >  iomap_apply+0xb3/0x130                  fs/iomap.c:79
> > >  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> > >  ? iomap_write_end+0x80/0x80
> > >  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > >  ? remove_wait_queue+0x59/0x60
> > >  xfs_file_write_iter+0x90/0x130 [xfs]
> > >  __vfs_write+0xe5/0x140
> > >  vfs_write+0xc7/0x1f0
> > >  ? syscall_trace_enter+0x1d0/0x380
> > >  SyS_write+0x58/0xc0
> > >  do_syscall_64+0x6c/0x200
> > >  entry_SYSCALL64_slow_path+0x25/0x25
> > >
> > > the oom victim has access to all memory reserves to make a forward
> > > progress to exit easier. But iomap_file_buffered_write and other callers
> > > of iomap_apply loop to complete the full request. We need to check for
> > > fatal signals and back off with a short write instead. As the
> > > iomap_apply delegates all the work down to the actor we have to hook
> > > into those. All callers that work with the page cache are calling
> > > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > > has to handle the situation explicitly because it copies data to the
> > > userspace directly. Other callers like iomap_page_mkwrite work on a
> > > single page or iomap_fiemap_actor do not allocate memory based on the
> > > given len.
> > >
> > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > > Cc: stable # 4.8+
> > > Reported-by: Tetsuo Handa
> > > Signed-off-by: Michal Hocko
> > > ---
> > >  fs/dax.c   | 5 +++++
> > >  fs/iomap.c | 3 +++
> > >  2 files changed, 8 insertions(+)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 413a91db9351..0e263dacf9cf 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > 		struct blk_dax_ctl dax = { 0 };
> > > 		ssize_t map_len;
> > >
> > > +		if (fatal_signal_pending(current)) {
> > > +			ret = -EINTR;
> > > +			break;
> > >
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> Tetsuo,
> before we settle on the proper fix for this issue, could you give the
> patch a try and try to reproduce the too_many_isolated() issue or
> just see whether patch [1] has any negative effect on your oom stress
> testing?
>
> [1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz

I tested with both [1] and the below patch applied on linux-next-20170125 and
the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .

Regarding the below patch, it helped avoid complete memory depletion with
large write() requests. I don't know whether the below patch helps avoid
complete memory depletion when reading a large amount (in other words, I
don't know whether this check is done for large read() requests). But I
believe that __GFP_KILLABLE (despite the limitation that there are
unkillable waits in the reclaim path) is a better solution than scattering
fatal_signal_pending() around in the callers. The reason we check SIGKILL
here is to avoid allocating more memory than needed. If we check SIGKILL
in the entry point of __alloc_pages_nodemask() and at the retry: label in
__alloc_pages_slowpath(), we waste 0 pages. Regardless of whether the OOM
killer is invoked, or whether memory can be allocated without a direct
reclaim operation, not allocating memory unless needed (in other words,
allowing the page allocator to fail immediately if the caller can give up
on SIGKILL and SIGKILL is pending) makes sense. It will reduce the
possibility of OOM livelock on CONFIG_MMU=n kernels where the OOM reaper
is not available.

> On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> [...]
> > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko
> > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> >
> > Tetsuo has noticed that an OOM stress test which performs large write
> > requests can cause the full memory reserves depletion. He has tracked
> > this down to the following path
> >  __alloc_pages_nodemask+0x436/0x4d0
> >  alloc_pages_current+0x97/0x1b0
> >  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> >  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> >  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> >  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> >  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> >  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
> >  iomap_apply+0xb3/0x130                  fs/iomap.c:79
> >  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> >  ? iomap_write_end+0x80/0x80
> >  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> >  ? remove_wait_queue+0x59/0x60
> >  xfs_file_write_iter+0x90/0x130 [xfs]
> >  __vfs_write+0xe5/0x140
> >  vfs_write+0xc7/0x1f0
> >  ? syscall_trace_enter+0x1d0/0x380
> >  SyS_write+0x58/0xc0
> >  do_syscall_64+0x6c/0x200
> >  entry_SYSCALL64_slow_path+0x25/0x25
> >
> > the oom victim has access to all memory reserves to make a forward
> > progress to exit easier. But iomap_file_buffered_write and other callers
> > of iomap_apply loop to complete the full request. We need to check for
> > fatal signals and back off with a short write instead. As the
> > iomap_apply delegates all the work down to the actor we have to hook
> > into those. All callers that work with the page cache are calling
> > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > has to handle the situation explicitly because it copies data to the
> > userspace directly. Other callers like iomap_page_mkwrite work on a
> > single page or iomap_fiemap_actor do not allocate memory based on the
> > given len.
> >
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Cc: stable # 4.8+
> > Reported-by: Tetsuo Handa
> > Signed-off-by: Michal Hocko
> > ---
> >  fs/dax.c   | 5 +++++
> >  fs/iomap.c | 3 +++
> >  2 files changed, 8 insertions(+)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 413a91db9351..0e263dacf9cf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > 		struct blk_dax_ctl dax = { 0 };
> > 		ssize_t map_len;
> >
> > +		if (fatal_signal_pendin
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Tetsuo, before we settle on the proper fix for this issue, could you give the
patch a try and try to reproduce the too_many_isolated() issue or just see
whether patch [1] has any negative effect on your oom stress testing?

[1] http://lkml.kernel.org/r/20170119112336.gn30...@dhcp22.suse.cz

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
[...]
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
>
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
>  __alloc_pages_nodemask+0x436/0x4d0
>  alloc_pages_current+0x97/0x1b0
>  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
>  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
>  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
>  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
>  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
>  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
>  iomap_apply+0xb3/0x130                  fs/iomap.c:79
>  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
>  ? iomap_write_end+0x80/0x80
>  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
>  ? remove_wait_queue+0x59/0x60
>  xfs_file_write_iter+0x90/0x130 [xfs]
>  __vfs_write+0xe5/0x140
>  vfs_write+0xc7/0x1f0
>  ? syscall_trace_enter+0x1d0/0x380
>  SyS_write+0x58/0xc0
>  do_syscall_64+0x6c/0x200
>  entry_SYSCALL64_slow_path+0x25/0x25
>
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
>
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa
> Signed-off-by: Michal Hocko
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> 		struct blk_dax_ctl dax = { 0 };
> 		ssize_t map_len;
>
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> 		dax.sector = dax_iomap_sector(iomap, pos);
> 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> 		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>
> 	BUG_ON(pos + len > iomap->offset + iomap->length);
>
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
> 	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> 	if (!page)
> 		return -ENOMEM;
> --
> 2.11.0
>
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> On Wed 25-01-17 19:33:59, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> >
> > Looks OK to me. I worried
> >
> > #define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
> >
> > which forbids (!?) aborting the loop. But it seems that this flag is
> > no longer checked (i.e. set but not used). So, everybody should be ready
> > for a short write, although I don't know whether exofs / hfs / hfsplus are
> > doing appropriate error handling.
>
> Those were using the generic implementation before and that handles this
> case AFAICS.

What I wanted to say is: "We can remove AOP_FLAG_UNINTERRUPTIBLE completely
because grep does not find that flag used in any condition check, can't we?"
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > I think we are missing a check for fatal_signal_pending in
> > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > full memory reserves. What do you think about the following? I haven't
> > > > tested this but it mimics generic_perform_write so I guess it should
> > > > work.
> > >
> > > Hi Michal,
> > >
> > > this looks reasonable to me. But we have a few more such loops,
> > > maybe it makes sense to move the check into iomap_apply?
> >
> > I wasn't sure about the expected semantics of iomap_apply but now that
> > I've actually checked all the callers I believe all of them should be
> > able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> > iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
> > pattern to return the number of written pages or an error but rather
> > propagate the error out. From my limited understanding of those code
> > paths that should just be ok. I was not all that sure about iomap_dio_rw,
> > which is just too convoluted for me. If that one is OK as well then
> > the following patch should be indeed better.
>
> Is "length" in the
>
>	written = actor(inode, pos, length, data, &iomap);
>
> call guaranteed to be small enough? If not, don't we need to check SIGKILL
> inside the "actor" functions?

You are right! Checking for signals inside iomap_apply doesn't really
solve anything because basically all users do iov_iter_count(). Blee. So
we have loops around iomap_apply, which itself loops inside the actor.
iomap_write_begin seems to be used by most of them, which is also where we
get the pagecache page, so I guess this should be the "right" place to
put the check in. Things like dax_iomap_actor will need an explicit check.
This is quite unfortunate but I do not see any better solution.
What do you think Christoph?

---
From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
 __alloc_pages_nodemask+0x436/0x4d0
 alloc_pages_current+0x97/0x1b0
 __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
 pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
 grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
 iomap_write_begin+0x50/0xd0             fs/iomap.c:118
 iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
 ? iomap_write_end+0x80/0x80             fs/iomap.c:150
 iomap_apply+0xb3/0x130                  fs/iomap.c:79
 iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
 ? iomap_write_end+0x80/0x80
 xfs_file_buffered_aio_write+0x132/0x390 [xfs]
 ? remove_wait_queue+0x59/0x60
 xfs_file_write_iter+0x90/0x130 [xfs]
 __vfs_write+0xe5/0x140
 vfs_write+0xc7/0x1f0
 ? syscall_trace_enter+0x1d0/0x380
 SyS_write+0x58/0xc0
 do_syscall_64+0x6c/0x200
 entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other callers
of iomap_apply loop to complete the full request. We need to check for
fatal signals and back off with a short write instead. As the
iomap_apply delegates all the work down to the actor we have to hook
into those. All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly. Other callers like iomap_page_mkwrite work on a
single page or iomap_fiemap_actor do not allocate memory based on the
given len.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 fs/dax.c   | 5 +++++
 fs/iomap.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 413a91db9351..0e263dacf9cf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;

+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
 		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Michal Hocko wrote:
> On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> >
> > Hi Michal,
> >
> > this looks reasonable to me. But we have a few more such loops,
> > maybe it makes sense to move the check into iomap_apply?
>
> I wasn't sure about the expected semantics of iomap_apply but now that
> I've actually checked all the callers I believe all of them should be
> able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
> pattern to return the number of written pages or an error but rather
> propagate the error out. From my limited understanding of those code
> paths that should just be ok. I was not all that sure about iomap_dio_rw,
> which is just too convoluted for me. If that one is OK as well then
> the following patch should be indeed better.

Is "length" in the

	written = actor(inode, pos, length, data, &iomap);

call guaranteed to be small enough? If not, don't we need to check SIGKILL
inside the "actor" functions?
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > I think we are missing a check for fatal_signal_pending in
> > iomap_file_buffered_write. This means that an oom victim can consume the
> > full memory reserves. What do you think about the following? I haven't
> > tested this but it mimics generic_perform_write so I guess it should
> > work.
>
> Hi Michal,
>
> this looks reasonable to me. But we have a few more such loops,
> maybe it makes sense to move the check into iomap_apply?

I wasn't sure about the expected semantics of iomap_apply but now that
I've actually checked all the callers I believe all of them should be
able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
pattern to return the number of written pages or an error but rather
propagate the error out. From my limited understanding of those code
paths that should just be ok. I was not all that sure about iomap_dio_rw,
which is just too convoluted for me. If that one is OK as well then
the following patch should be indeed better.

---
From d99c9d4115bed69a5d71281f59c190b0b26627cf Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
 __alloc_pages_nodemask+0x436/0x4d0
 alloc_pages_current+0x97/0x1b0
 __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
 pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
 grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
 iomap_write_begin+0x50/0xd0             fs/iomap.c:118
 iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
 ? iomap_write_end+0x80/0x80             fs/iomap.c:150
 iomap_apply+0xb3/0x130                  fs/iomap.c:79
 iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
 ? iomap_write_end+0x80/0x80
 xfs_file_buffered_aio_write+0x132/0x390 [xfs]
 ? remove_wait_queue+0x59/0x60
 xfs_file_write_iter+0x90/0x130 [xfs]
 __vfs_write+0xe5/0x140
 vfs_write+0xc7/0x1f0
 ? syscall_trace_enter+0x1d0/0x380
 SyS_write+0x58/0xc0
 do_syscall_64+0x6c/0x200
 entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other callers
of iomap_apply loop to complete the full request. We need to check for
fatal signals and back off with a short write instead.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 fs/iomap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a58190f7a3e4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -46,6 +46,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	struct iomap iomap = { 0 };
 	loff_t written = 0, ret;

+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	/*
 	 * Need to map a range from start position for length bytes. This can
 	 * span multiple pages - it is only guaranteed to return a range of a
--
2.11.0

--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote: > I think we are missing a check for fatal_signal_pending in > iomap_file_buffered_write. This means that an oom victim can consume the > full memory reserves. What do you think about the following? I haven't > tested this but it mimics generic_perform_write so I guess it should > work. Hi Michal, this looks reasonable to me. But we have a few more such loops, maybe it makes sense to move the check into iomap_apply?
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
[Let's add Christoph] The below insane^Wstress test should exercise the OOM killer behavior. On Sat 21-01-17 16:42:42, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > And I think that there is a different problem if I tune a reproducer > > like below (i.e. increased the buffer size to write()/fsync() from 4096). > > > > -- > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > > > int main(int argc, char *argv[]) > > { > > static char buffer[10485760] = { }; /* or 1048576 */ > > char *buf = NULL; > > unsigned long size; > > unsigned long i; > > for (i = 0; i < 1024; i++) { > > if (fork() == 0) { > > int fd = open("/proc/self/oom_score_adj", O_WRONLY); > > write(fd, "1000", 4); > > close(fd); > > sleep(1); > > snprintf(buffer, sizeof(buffer), "/tmp/file.%u", > > getpid()); > > fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600); > > while (write(fd, buffer, sizeof(buffer)) == > > sizeof(buffer)) > > fsync(fd); > > _exit(0); > > } > > } > > for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { > > char *cp = realloc(buf, size); > > if (!cp) { > > size >>= 1; > > break; > > } > > buf = cp; > > } > > sleep(2); > > /* Will cause OOM due to overcommit */ > > for (i = 0; i < size; i += 4096) > > buf[i] = 0; > > pause(); > > return 0; > > } > > -- > > > > Above reproducer sometimes kills all OOM killable processes and the system > > finally panics. I guess that somebody is abusing TIF_MEMDIE for needless > > allocations to the level where GFP_ATOMIC allocations start failing. [...] > And I got flood of traces shown below. It seems to be consuming memory > reserves > until the size passed to write() request is stored to the page cache even > after > OOM-killed. > > Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz . 
> > [ 202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 > mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) > [ 202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted > 4.10.0-rc4-next-20170120+ #492 > [ 202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX > Desktop Reference Platform, BIOS 6.00 07/02/2015 > [ 202.315429] Call Trace: > [ 202.316902] dump_stack+0x85/0xc9 > [ 202.318810] __alloc_pages_slowpath+0xa99/0xd7c > [ 202.320697] ? node_dirty_ok+0xef/0x130 > [ 202.322454] __alloc_pages_nodemask+0x436/0x4d0 > [ 202.324506] alloc_pages_current+0x97/0x1b0 > [ 202.326397] __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 > [ 202.328209] pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 > [ 202.329989] grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 > [ 202.331905] iomap_write_begin+0x50/0xd0 fs/iomap.c:118 > [ 202.333641] iomap_write_actor+0xb5/0x1a0fs/iomap.c:190 > [ 202.335377] ? iomap_write_end+0x80/0x80 fs/iomap.c:150 > [ 202.337090] iomap_apply+0xb3/0x130 fs/iomap.c:79 > [ 202.338721] iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 > [ 202.340613] ? iomap_write_end+0x80/0x80 > [ 202.342471] xfs_file_buffered_aio_write+0x132/0x390 [xfs] > [ 202.344501] ? remove_wait_queue+0x59/0x60 > [ 202.346261] xfs_file_write_iter+0x90/0x130 [xfs] > [ 202.348082] __vfs_write+0xe5/0x140 > [ 202.349743] vfs_write+0xc7/0x1f0 > [ 202.351214] ? syscall_trace_enter+0x1d0/0x380 > [ 202.353155] SyS_write+0x58/0xc0 > [ 202.354628] do_syscall_64+0x6c/0x200 > [ 202.356100] entry_SYSCALL64_slow_path+0x25/0x25 > > > Do we need to allow access to memory reserves for this allocation? > Or, should the caller check for SIGKILL rather than iterate the loop? I think we are missing a check for fatal_signal_pending in iomap_file_buffered_write. This means that an oom victim can consume the full memory reserves. What do you think about the following? I haven't tested this but it mimics generic_perform_write so I guess it should work. 
--- >From d56b54b708d403d1bf39fccb89750bab31c19032 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 25 Jan 2017 11:06:37 +0100 Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals Tetsuo has noticed that an OOM stress test which performs large write requests can cause the full memory reserves depletion. He has tracked this down to the following path __alloc_pages_nodemask+0x436/0x4d0 alloc_pages_current+0x97/0x1b0 __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 pagecache_get_page+0x5a/0x2b0 mm/
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Fri 20-01-17 22:27:27, Tetsuo Handa wrote: > Mel Gorman wrote: > > On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote: > > > So what do you think about the following? Tetsuo, would you be willing > > > to run this patch through your torture testing please? > > > > I'm fine with treating this as a starting point. > > OK. So I tried to test this patch but I failed at preparation step. > There are too many pending mm patches and I'm not sure which patch on > which linux-next snapshot I should try. The current linux-next should be good to test. It contains all patches sitting in the mmotm tree. If you want a more stable base then you can use mmotm git tree (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git #since-4.9 or its #auto-latest alias) > Also as another question, > too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c > but why this patch does not touch the loop in mm/compaction.c part? I am not yet convinced the compaction suffers from the same problem. Compaction backs off much sooner so that path shouldn't get into pathological situation AFAICS. I might be wrong here but I think we should start with the reclaim path first. -- Michal Hocko SUSE Labs
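For reference, the throttling heuristic under discussion can be modeled in userspace. This is a paraphrase of the too_many_isolated() logic as quoted elsewhere in the thread, with made-up flag values and simplified arguments, not the kernel source:

```c
#include <stdbool.h>

/* Illustrative values only; the real __GFP_IO/__GFP_FS differ. */
#define GFP_IO 0x40u
#define GFP_FS 0x80u

/* Direct reclaim backs off once the number of isolated pages exceeds
 * the number of inactive pages still on the LRU.  Callers that may do
 * both IO and FS work get a threshold of inactive/8 so that the
 * remaining headroom is left for GFP_NOFS/GFP_NOIO reclaimers, which
 * must not be blocked behind them (that would risk a deadlock).
 * kswapd is never throttled here. */
static bool too_many_isolated(unsigned long inactive, unsigned long isolated,
			      unsigned int gfp_mask, bool is_kswapd)
{
	if (is_kswapd)
		return false;

	if ((gfp_mask & (GFP_IO | GFP_FS)) == (GFP_IO | GFP_FS))
		inactive >>= 3;

	return isolated > inactive;
}
```

The point of contention in the thread is what happens when this predicate stays true indefinitely because the isolated pages are pinned by a stuck kswapd.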
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Tetsuo Handa wrote: > And I think that there is a different problem if I tune a reproducer > like below (i.e. increased the buffer size to write()/fsync() from 4096). > > -- > #include > #include > #include > #include > #include > #include > #include > > int main(int argc, char *argv[]) > { > static char buffer[10485760] = { }; /* or 1048576 */ > char *buf = NULL; > unsigned long size; > unsigned long i; > for (i = 0; i < 1024; i++) { > if (fork() == 0) { > int fd = open("/proc/self/oom_score_adj", O_WRONLY); > write(fd, "1000", 4); > close(fd); > sleep(1); > snprintf(buffer, sizeof(buffer), "/tmp/file.%u", > getpid()); > fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600); > while (write(fd, buffer, sizeof(buffer)) == > sizeof(buffer)) > fsync(fd); > _exit(0); > } > } > for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { > char *cp = realloc(buf, size); > if (!cp) { > size >>= 1; > break; > } > buf = cp; > } > sleep(2); > /* Will cause OOM due to overcommit */ > for (i = 0; i < size; i += 4096) > buf[i] = 0; > pause(); > return 0; > } > -- > > Above reproducer sometimes kills all OOM killable processes and the system > finally panics. I guess that somebody is abusing TIF_MEMDIE for needless > allocations to the level where GFP_ATOMIC allocations start failing. I tracked who is abusing TIF_MEMDIE using below patch. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ea088e1..d9ac53d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3038,7 +3038,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) 
 	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);

-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
+	if (1 || (gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
 	    debug_guardpage_minorder() > 0)
 		return;
@@ -3573,6 +3573,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops = 0;
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+	bool victim = false;

 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3656,8 +3657,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);

-	if (gfp_pfmemalloc_allowed(gfp_mask))
+	if (gfp_pfmemalloc_allowed(gfp_mask)) {
 		alloc_flags = ALLOC_NO_WATERMARKS;
+		victim = test_thread_flag(TIF_MEMDIE);
+	}

 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
@@ -3790,6 +3793,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		warn_alloc(gfp_mask, ac->nodemask,
 			   "page allocation failure: order:%u", order);
 got_pg:
+	if (page && victim) {
+		pr_warn("%s(%u): TIF_MEMDIE allocation: order=%d mode=%#x(%pGg)\n",
+			current->comm, current->pid, order, gfp_mask, &gfp_mask);
+		dump_stack();
+	}
 	return page;
 }

And I got a flood of traces shown below. It seems that memory reserves keep
being consumed until the full size passed to the write() request is stored
to the page cache, even after the task was OOM-killed.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .

[  202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 4.10.0-rc4-next-20170120+ #492
[  202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  202.315429] Call Trace:
[  202.316902]  dump_stack+0x85/0xc9
[  202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
[  202.320697]  ?
node_dirty_ok+0xef/0x130
[  202.322454]  __alloc_pages_nodemask+0x436/0x4d0
[  202.324506]  alloc_pages_current+0x97/0x1b0
[  202.326397]  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
[  202.328209]  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
[  202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
[  202.331905]  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
[  202.333641]  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
[  202.335377]  ? iomap_write_end+0x80/0x80             fs/
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Mel Gorman wrote:
> On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > So what do you think about the following? Tetsuo, would you be willing
> > to run this patch through your torture testing please?
>
> I'm fine with treating this as a starting point.

OK. So I tried to test this patch but I failed at the preparation step.
There are too many pending mm patches and I'm not sure which patch on
which linux-next snapshot I should try.

Also, as another question, the too_many_isolated() loop exists in both
mm/vmscan.c and mm/compaction.c, so why does this patch not touch the loop
in the mm/compaction.c part? Is there a guarantee that the problem can be
avoided by tweaking only the too_many_isolated() part?

Anyway, I tried the linux-next-20170119 snapshot in order to confirm that
my reproducer can still reproduce the problem before trying this patch.
But I was not able to reproduce the problem today, because the mm part is
changing rapidly and existing reproducers need tuning.

And I think that there is a different problem if I tune a reproducer
like below (i.e. increased the buffer size to write()/fsync() from 4096).
--
#include
#include
#include
#include
#include
#include
#include

int main(int argc, char *argv[])
{
	static char buffer[10485760] = { }; /* or 1048576 */
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
--

Above reproducer sometimes kills all OOM killable processes and the system
finally panics. I guess that somebody is abusing TIF_MEMDIE for needless
allocations to the level where GFP_ATOMIC allocations start failing.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170120.txt.xz .
--
[  184.482761] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
(...snipped...)
[  184.482955] Node 0 active_anon:1418748kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB isolated(anon):0kB isolated(file):132kB mapped:13744kB dirty:25872kB writeback:376kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:95127 all_unreclaimable?
yes [ 184.482956] Node 0 DMA free:7660kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:40\ kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 184.482959] lowmem_reserve[]: 0 1823 1823 1823 [ 184.482963] Node 0 DMA32 free:44636kB min:44672kB low:55840kB high:67008kB active_anon:1410572kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB writepending:26248kB present:2080640kB managed:1866768kB\ mlocked:0kB slab_reclaimable:85544kB slab_unreclaimable:128876kB kernel_stack:20496kB pagetables:40712kB bounce:0kB free_pcp:1136kB local_pcp:656kB free_cma:0kB [ 184.482966] lowmem_reserve[]: 0 0 0 0 [ 184.482970] Node 0 DMA: 9*4kB (UE) 5*8kB (E) 2*16kB (ME) 0*32kB 2*64kB (U) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB = 7660kB [ 184.482994] Node 0 DMA32: 3845*4kB (UME) 1809*8kB (UME) 600*16kB (UME) 134*32kB (UME) 14*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44636kB (...snipped...) [ 187.477371] Node 0 active_anon:1415648kB inactive_anon:13548kB active_file:11452kB inactive_file:79120kB unevictable:0kB isolated(anon):0kB isolated(file):5220kB mapped:13748kB dirty:83484kB writeback:376kB shmem:0kB shmem_thp: 0kB s\ hmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:16058 all_unreclaimable? no [ 187.477372] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:6976kB unevictable:0kB writepending:7492kB present:15988kB
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Fri, Jan 20, 2017 at 02:42:24PM +0800, Hillf Danton wrote: > > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page) > > * the LRU list will go small and be scanned faster than necessary, > > leading to > > * unnecessary swapping, thrashing and OOM. > > */ > > -static int too_many_isolated(struct pglist_data *pgdat, int file, > > +static bool safe_to_isolate(struct pglist_data *pgdat, int file, > > struct scan_control *sc) > > I prefer the current function name. > The restructure is to work with the workqueue api. > > { > > unsigned long inactive, isolated; > > > > if (current_is_kswapd()) > > - return 0; > > + return true; > > > > - if (!sane_reclaim(sc)) > > - return 0; > > + if (sane_reclaim(sc)) > > + return true; > > We only need a one-line change. It's bool so the conversion is made to bool while it's being changed anyway. > > > > if (file) { > > inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data > > *pgdat, int file, > > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) > > inactive >>= 3; > > > > - return isolated > inactive; > > + return isolated < inactive; > > } > > > > static noinline_for_stack void > > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, > > struct lruvec *lruvec, > > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > > > - while (unlikely(too_many_isolated(pgdat, file, sc))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + while (!safe_to_isolate(pgdat, file, sc)) { > > + long ret; > > + > > + ret = wait_event_interruptible_timeout(pgdat->isolated_wait, > > + safe_to_isolate(pgdat, file, sc), HZ/10); > > > > /* We are about to die and free our memory. Return now. 
*/ > > - if (fatal_signal_pending(current)) > > - return SWAP_CLUSTER_MAX; > > + if (fatal_signal_pending(current)) { > > + nr_reclaimed = SWAP_CLUSTER_MAX; > > + goto out; > > + } > > + > > + /* > > +* If we reached the timeout, this is direct reclaim, and > > +* pages cannot be isolated then return. If the situation > > Please add something that we would rather shrink slab than go > another round of nap. > That's not necessarily true or even a good idea. It could result in excessive slab shrinking that is no longer in proportion to LRU scanning and increased contention within shrinkers. > > +* persists for a long time then it'll eventually reach > > +* the no_progress limit in should_reclaim_retry and consider > > +* going OOM. In this case, do not wake the isolated_wait > > +* queue as the wakee will still not be able to make progress. > > +*/ > > + if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, > > file, sc)) > > + return 0; > > } > > > > lru_add_drain(); > > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, > > struct lruvec *lruvec, > > stat.nr_activate, stat.nr_ref_keep, > > stat.nr_unmap_fail, > > sc->priority, file); > > + > > +out: > > + if (waitqueue_active(&pgdat->isolated_wait)) > > + wake_up(&pgdat->isolated_wait); > > return nr_reclaimed; > > } > > > Is it also needed to check isolated_wait active before kswapd > takes nap? > No because this is where pages were isolated and there is no putback event that would justify waking the queue. There is a race between waitqueue_active() and going to sleep that we rely on the timeout to recover from. -- Mel Gorman SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Thursday, January 19, 2017 6:08 PM Mel Gorman wrote: > > If it's definitely required and is proven to fix the > infinite-loop-without-oom workload then I'll back off and withdraw my > objections. However, I'd at least like the following untested patch to > be considered as an alternative. It has some weaknesses and would be > slower to OOM than your patch but it avoids reintroducing zone counters > > ---8<--- > mm, vmscan: Wait on a waitqueue when too many pages are isolated > > When too many pages are isolated, direct reclaim waits on congestion to clear > for up to a tenth of a second. There is no reason to believe that too many > pages are isolated due to dirty pages, reclaim efficiency or congestion. > It may simply be because an extremely large number of processes have entered > direct reclaim at the same time. However, it is possible for the situation > to persist forever and never reach OOM. > > This patch queues processes a waitqueue when too many pages are isolated. > When parallel reclaimers finish shrink_page_list, they wake the waiters > to recheck whether too many pages are isolated. > > The wait on the queue has a timeout as not all sites that isolate pages > will do the wakeup. Depending on every isolation of LRU pages to be perfect > forever is potentially fragile. The specific wakeups occur for page reclaim > and compaction. If too many pages are isolated due to memory failure, > hotplug or directly calling migration from a syscall then the waiting > processes may wait the full timeout. > > Note that the timeout allows the use of waitqueue_active() on the basis > that a race will cause the full timeout to be reached due to a missed > wakeup. This is relatively harmless and still a massive improvement over > unconditionally calling congestion_wait. > > Direct reclaimers that cannot isolate pages within the timeout will consider > return to the caller. 
This is somewhat clunky as it won't return immediately > and make go through the other priorities and slab shrinking. Eventually, > it'll go through a few iterations of should_reclaim_retry and reach the > MAX_RECLAIM_RETRIES limit and consider going OOM. > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 91f69aa0d581..3dd617d0c8c4 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -628,6 +628,7 @@ typedef struct pglist_data { > int node_id; > wait_queue_head_t kswapd_wait; > wait_queue_head_t pfmemalloc_wait; > + wait_queue_head_t isolated_wait; > struct task_struct *kswapd; /* Protected by > mem_hotplug_begin/end() */ > int kswapd_order; > diff --git a/mm/compaction.c b/mm/compaction.c > index 43a6cf1dc202..1b1ff6da7401 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone > *zone, struct compact_contro > count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned); > count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned); > > + /* Page reclaim could have stalled due to isolated pages */ > + if (waitqueue_active(&zone->zone_pgdat->isolated_wait)) > + wake_up(&zone->zone_pgdat->isolated_wait); > + > trace_mm_compaction_end(start_pfn, cc->migrate_pfn, > cc->free_pfn, end_pfn, sync, ret); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8ff25883c172..d848c9f31bff 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct > pglist_data *pgdat) > #endif > init_waitqueue_head(&pgdat->kswapd_wait); > init_waitqueue_head(&pgdat->pfmemalloc_wait); > + init_waitqueue_head(&pgdat->isolated_wait); > #ifdef CONFIG_COMPACTION > init_waitqueue_head(&pgdat->kcompactd_wait); > #endif > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2281ad310d06..c93f299fbad7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct 
page *page) > * the LRU list will go small and be scanned faster than necessary, leading > to > * unnecessary swapping, thrashing and OOM. > */ > -static int too_many_isolated(struct pglist_data *pgdat, int file, > +static bool safe_to_isolate(struct pglist_data *pgdat, int file, > struct scan_control *sc) I prefer the current function name. > { > unsigned long inactive, isolated; > > if (current_is_kswapd()) > - return 0; > + return true; > > - if (!sane_reclaim(sc)) > - return 0; > + if (sane_reclaim(sc)) > + return true; We only need a one-line change. > > if (file) { > inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, > int file, > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote: > On Thu 19-01-17 10:07:55, Mel Gorman wrote: > [...] > > mm, vmscan: Wait on a waitqueue when too many pages are isolated > > > > When too many pages are isolated, direct reclaim waits on congestion to > > clear > > for up to a tenth of a second. There is no reason to believe that too many > > pages are isolated due to dirty pages, reclaim efficiency or congestion. > > It may simply be because an extremely large number of processes have entered > > direct reclaim at the same time. However, it is possible for the situation > > to persist forever and never reach OOM. > > > > This patch queues processes a waitqueue when too many pages are isolated. > > When parallel reclaimers finish shrink_page_list, they wake the waiters > > to recheck whether too many pages are isolated. > > > > The wait on the queue has a timeout as not all sites that isolate pages > > will do the wakeup. Depending on every isolation of LRU pages to be perfect > > forever is potentially fragile. The specific wakeups occur for page reclaim > > and compaction. If too many pages are isolated due to memory failure, > > hotplug or directly calling migration from a syscall then the waiting > > processes may wait the full timeout. > > > > Note that the timeout allows the use of waitqueue_active() on the basis > > that a race will cause the full timeout to be reached due to a missed > > wakeup. This is relatively harmless and still a massive improvement over > > unconditionally calling congestion_wait. > > > > Direct reclaimers that cannot isolate pages within the timeout will consider > > return to the caller. This is somewhat clunky as it won't return immediately > > and make go through the other priorities and slab shrinking. Eventually, > > it'll go through a few iterations of should_reclaim_retry and reach the > > MAX_RECLAIM_RETRIES limit and consider going OOM. > > I cannot really say I would like this. 
It's just much more complex than
> necessary.

I guess it's a difference in opinion. Mixing per-zone and per-node
information is complex for me. I liked the waitqueue because it was an
example of waiting on a specific event instead of relying completely on
time.

> I definitely agree that congestion_wait while waiting for
> too_many_isolated is a crude hack. This patch doesn't really resolve
> my biggest worry, though, that we go OOM with too many pages isolated
> as your patch doesn't alter zone_reclaimable_pages to reflect those
> numbers.
>

Indeed, but such cases are also caught by the no_progress_loop logic to
avoid a premature OOM.

> Anyway, I think both of us are probably overcomplicating things a bit.
> Your waitqueue approach is definitely better semantically than the
> congestion_wait because we are waiting for a different event than the
> API is intended for. On the other hand a mere
> schedule_timeout_interruptible might work equally well in the real life.
> On the other side I might really over emphasise the role of NR_ISOLATED*
> counts. It might really turn out that we can safely ignore them and it
> won't be the end of the world. So what do you think about the following
> as a starting point. If we ever see oom reports with high number of
> NR_ISOLATED* which are part of the oom report then we know we have to do
> something about that. Those changes would at least be driven by a real
> usecase rather than theoretical scenarios.
>
> So what do you think about the following? Tetsuo, would you be willing
> to run this patch through your torture testing please?

I'm fine with treating this as a starting point.

Thanks.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Thu 19-01-17 10:07:55, Mel Gorman wrote: [...] > mm, vmscan: Wait on a waitqueue when too many pages are isolated > > When too many pages are isolated, direct reclaim waits on congestion to clear > for up to a tenth of a second. There is no reason to believe that too many > pages are isolated due to dirty pages, reclaim efficiency or congestion. > It may simply be because an extremely large number of processes have entered > direct reclaim at the same time. However, it is possible for the situation > to persist forever and never reach OOM. > > This patch queues processes a waitqueue when too many pages are isolated. > When parallel reclaimers finish shrink_page_list, they wake the waiters > to recheck whether too many pages are isolated. > > The wait on the queue has a timeout as not all sites that isolate pages > will do the wakeup. Depending on every isolation of LRU pages to be perfect > forever is potentially fragile. The specific wakeups occur for page reclaim > and compaction. If too many pages are isolated due to memory failure, > hotplug or directly calling migration from a syscall then the waiting > processes may wait the full timeout. > > Note that the timeout allows the use of waitqueue_active() on the basis > that a race will cause the full timeout to be reached due to a missed > wakeup. This is relatively harmless and still a massive improvement over > unconditionally calling congestion_wait. > > Direct reclaimers that cannot isolate pages within the timeout will consider > return to the caller. This is somewhat clunky as it won't return immediately > and make go through the other priorities and slab shrinking. Eventually, > it'll go through a few iterations of should_reclaim_retry and reach the > MAX_RECLAIM_RETRIES limit and consider going OOM. I cannot really say I would like this. It's just much more complex than necessary. I definitely agree that congestion_wait while waiting for too_many_isolated is a crude hack. 
This patch doesn't really resolve my biggest worry, though, that we go OOM with too many pages isolated as your patch doesn't alter zone_reclaimable_pages to reflect those numbers. Anyway, I think both of us are probably overcomplicating things a bit. Your waitqueue approach is definitely better semantically than the congestion_wait because we are waiting for a different event than the API is intended for. On the other hand a mere schedule_timeout_interruptible might work equally well in the real life. On the other side I might really over emphasise the role of NR_ISOLATED* counts. It might really turn out that we can safely ignore them and it won't be the end of the world. So what do you think about the following as a starting point. If we ever see oom reports with high number of NR_ISOLATED* which are part of the oom report then we know we have to do something about that. Those changes would at least be driven by a real usecase rather than theoretical scenarios. So what do you think about the following? Tetsuo, would you be willing to run this patch through your torture testing please? --- >From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 19 Jan 2017 12:11:56 +0100 Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever Tetsuo Handa has reported [1] that direct reclaimers might get stuck in too_many_isolated loop basically for ever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent from pre-mature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. 
But since the oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages and throttles the allocation at that layer so we can loosen the direct reclaim throttling. Make shrink_inactive_list loop over too_many_isolated bounded and returns immediately when the situation hasn't resolved after the first sleep. Replace congestion_wait by a simple schedule_timeout_interruptible because we are not really waiting on the IO congestion in this path. Please note that this patch can theoretically cause the OOM killer to trigger earlier while there are many pages isolated for the reclaim which makes progress only very slowly. This would be obvious from the oom report as the number of isolated pages are printed there. If we ever hit this should_reclaim_retry should consider those numbers in the evaluation in one way or another. [1] http://lkml.kernel.org/r/201602092349.acg81273.osvtmjqhlof...@i-love.sakura.ne.jp Signed-off-by: Michal Hocko --- mm/vmscan.c | 8 +++- 1 file changed, 7 insertions(+), 1 delet
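The control flow this changelog describes, sleeping at most once on too_many_isolated() and then giving up so that the allocator-level retry logic (should_reclaim_retry) decides about OOM, can be modeled in userspace. too_many_isolated_stub() and the omitted sleep are stand-ins for the real kernel functions, not the patch itself:

```c
#include <stdbool.h>

static int attempts_until_clear;	/* test knob: when the condition clears */

/* Stand-in for too_many_isolated(): stays true for a configurable
 * number of checks, then clears. */
static bool too_many_isolated_stub(void)
{
	return attempts_until_clear-- > 0;
}

/* Models the bounded loop in shrink_inactive_list(): sleep once if too
 * many pages are isolated, and if the condition still holds after that
 * single sleep, bail out instead of spinning forever.  Returns the
 * number of sleeps taken; *proceeded says whether reclaim went ahead. */
static int throttle_isolation(bool *proceeded)
{
	int sleeps = 0;
	bool stalled = false;

	while (too_many_isolated_stub()) {
		if (stalled) {		/* already slept once: give up */
			*proceeded = false;
			return sleeps;
		}
		/* schedule_timeout_interruptible(HZ/10) would go here */
		sleeps++;
		stalled = true;
	}
	*proceeded = true;
	return sleeps;
}
```

With this shape a reclaimer is delayed by at most one nap before either proceeding or returning to the caller, which is the bound the patch is after.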
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed, Jan 18, 2017 at 06:29:46PM +0100, Michal Hocko wrote: > On Wed 18-01-17 17:00:10, Mel Gorman wrote: > > > > You don't appear to directly use that information in patch 2. > > > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > > > > Which is still not directly required to avoid the infinite loop. There > > even is a small inherent risk if the too_isolated_condition no longer > > applies at the time should_reclaim_retry is attempted. > > Not really because, if those pages are no longer isolated then they > either have been reclaimed - and NR_FREE_PAGES will increase - or they > have been put back to LRU in which case we will see them in regular LRU > counters. I need to catch the case where there are still too many pages > isolated which would skew should_reclaim_retry watermark check. > We can also rely on the no_progress_loops counter to trigger OOM. It'll take longer but has a lower risk of premature OOM. > > > > The primary > > > > breakout is returning after stalling at least once. You could also avoid > > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > > > That would be tricky on its own. Just consider the report form Tetsuo. > > > Basically all the direct reclamers are looping on too_many_isolated > > > while the kswapd is not making any progres because it is blocked on FS > > > locks which are held by flushers which are making dead slow progress. > > > Some of those direct reclaimers could have gone oom instead and release > > > some memory if we decide so, which we cannot because we are deep down in > > > the reclaim path. Waiting for on the reclaimer to increase the ISOLATED > > > counter wouldn't help in this situation. > > > > > > > If it's a waitqueue waking one process at a time, the progress may be > > slow but it'll still exit the loop, attempt the reclaim and then > > potentially OOM if no progress is made. 
The key is using the waitqueue > > to have a fair queue of processes making progress instead of a > > potentially infinite loop that never meets the exit conditions. > > It is not clear to me who would wake waiters on the queue. You cannot > rely on kswapd to do that as already mentioned. > We can use timeouts to guard against an infinite wait. Besides, updating every single place where pages are put back on the LRU would be fragile and too easy to break. > > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > > progress. If the primary motivation is to avoid an infinite loop with > > > > too_many_isolated then there are ways of handling that without > > > > reintroducing > > > > zone-based counters. > > > > > > > > > > Heavy memory pressure on highmem should be spread across the whole > > > > > > node as > > > > > > we no longer are applying the fair zone allocation policy. The > > > > > > processes > > > > > > with highmem requirements will be reclaiming from all zones and > > > > > > when it > > > > > > finishes, it's possible that a lowmem-specific request will be > > > > > > clear to make > > > > > > progress. It's all the same LRU so if there are too many pages > > > > > > isolated, > > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > > > This is true but I am not sure how it is realated to the patch. > > > > > > > > Because heavy pressure that is enough to trigger too many isolated pages > > > > is unlikely to be specifically targetting a lower zone. > > > > > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and > > > going OOM on lowmem is not all that unrealistic scenario on 32b systems. > > > > > > > If the sole source of pressure is from GFP_KERNEL allocations then the > > isolated counter will also be specific to the lower zones and there is no > > benefit from the patch. > > I believe you are wrong here. 
Just consider that you have isolated > basically all lowmem pages. too_many_isolated will still happily tell > you to not throttle or back off because NR_INACTIVE_* are way too bigger > than all low mem pages altogether. Or am I still missing your point? > This is a potential risk. It could be accounted for by including the node isolated counters in the calculation but it'll be inherently fuzzy and may stall a lowmem direct reclaimer unnecessarily in the presence of highmem reclaim. > > If there is a combination of highmem and lowmem pressure then the highmem > > reclaimers will also reclaim lowmem memory. > > > > > > There is general > > > > pressure with multiple direct reclaimers being applied. If the system is > > > > under enough pressure with parallel reclaimers to trigger > > > > too_many_isolated > > > > checks then the system is grinding already and making little progress. > > > > Adding > > > > multiple counters to allow a lowmem reclaimer to potentially make faster > > > > progress is going to be marginal at best. > > > > > > OK, I agree that the situation where highmem blocks
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 18-01-17 17:00:10, Mel Gorman wrote: > On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote: > > On Wed 18-01-17 15:54:30, Mel Gorman wrote: > > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > > > From: Michal Hocko > > > > > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > > > especially for systems with high/lowmem because a heavy memory > > > > > > pressure > > > > > > on the highmem zone might block lowmem requests from making > > > > > > progress. Or > > > > > > we might allow to reclaim lowmem zone even though there are too many > > > > > > pages already isolated from the eligible zones just because highmem > > > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > > > > > Fix these potential issues by moving isolated stats back to zones > > > > > > and > > > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > > > isolation counters are a bit tricky with the node reclaim because > > > > > > we have to track each page separatelly. > > > > > > > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache > > > > > footprint > > > > > because of the counters > > > > > > > > Why would per zone counters cause an increased cache footprint? > > > > > > > > > > Because there are multiple counters, each of which need to be updated. > > > > How does this differ from per node counter though. > > A per-node counter is 2 * nr_online_nodes > A per-zone counter is 2 * nr_populated_zones > > > We would need to do > > the accounting anyway. Moreover none of the accounting is done in a hot > > path. > > > > > > > but it's not clear at all this patch actually helps anything. > > > > > > > > Yes, I cannot prove any real issue so far. 
The main motivation was the > > > > patch 2 which needs per-zone accounting to use it in the retry logic > > > > (should_reclaim_retry). I've spotted too_many_isoalated issues on the > > > > way. > > > > > > > > > > You don't appear to directly use that information in patch 2. > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > Which is still not directly required to avoid the infinite loop. There > even is a small inherent risk if the too_isolated_condition no longer > applies at the time should_reclaim_retry is attempted. Not really because, if those pages are no longer isolated then they either have been reclaimed - and NR_FREE_PAGES will increase - or they have been put back to LRU in which case we will see them in regular LRU counters. I need to catch the case where there are still too many pages isolated which would skew should_reclaim_retry watermark check. > > > The primary > > > breakout is returning after stalling at least once. You could also avoid > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > That would be tricky on its own. Just consider the report form Tetsuo. > > Basically all the direct reclamers are looping on too_many_isolated > > while the kswapd is not making any progres because it is blocked on FS > > locks which are held by flushers which are making dead slow progress. > > Some of those direct reclaimers could have gone oom instead and release > > some memory if we decide so, which we cannot because we are deep down in > > the reclaim path. Waiting for on the reclaimer to increase the ISOLATED > > counter wouldn't help in this situation. > > > > If it's a waitqueue waking one process at a time, the progress may be > slow but it'll still exit the loop, attempt the reclaim and then > potentially OOM if no progress is made. 
The key is using the waitqueue > to have a fair queue of processes making progress instead of a > potentially infinite loop that never meets the exit conditions. It is not clear to me who would wake waiters on the queue. You cannot rely on kswapd to do that as already mentioned. > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > progress. If the primary motivation is to avoid an infinite loop with > > > too_many_isolated then there are ways of handling that without > > > reintroducing > > > zone-based counters. > > > > > > > > Heavy memory pressure on highmem should be spread across the whole > > > > > node as > > > > > we no longer are applying the fair zone allocation policy. The > > > > > processes > > > > > with highmem requirements will be reclaiming from all zones and when > > > > > it > > > > > finishes, it's possible that a lowmem-specific request will be clear > > > > > to make > > > > > progress. It's all the same LRU so if there are too many pages > > > > > isolated, > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > This is tru
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote: > On Wed 18-01-17 15:54:30, Mel Gorman wrote: > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > > From: Michal Hocko > > > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > > especially for systems with high/lowmem because a heavy memory > > > > > pressure > > > > > on the highmem zone might block lowmem requests from making progress. > > > > > Or > > > > > we might allow to reclaim lowmem zone even though there are too many > > > > > pages already isolated from the eligible zones just because highmem > > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > > isolation counters are a bit tricky with the node reclaim because > > > > > we have to track each page separatelly. > > > > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache > > > > footprint > > > > because of the counters > > > > > > Why would per zone counters cause an increased cache footprint? > > > > > > > Because there are multiple counters, each of which need to be updated. > > How does this differ from per node counter though. A per-node counter is 2 * nr_online_nodes A per-zone counter is 2 * nr_populated_zones > We would need to do > the accounting anyway. Moreover none of the accounting is done in a hot > path. > > > > > but it's not clear at all this patch actually helps anything. > > > > > > Yes, I cannot prove any real issue so far. The main motivation was the > > > patch 2 which needs per-zone accounting to use it in the retry logic > > > (should_reclaim_retry). 
I've spotted too_many_isoalated issues on the > > > way. > > > > > > > You don't appear to directly use that information in patch 2. > > It is used via zone_reclaimable_pages in should_reclaim_retry > Which is still not directly required to avoid the infinite loop. There even is a small inherent risk if the too_isolated_condition no longer applies at the time should_reclaim_retry is attempted. > > The primary > > breakout is returning after stalling at least once. You could also avoid > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > That would be tricky on its own. Just consider the report form Tetsuo. > Basically all the direct reclamers are looping on too_many_isolated > while the kswapd is not making any progres because it is blocked on FS > locks which are held by flushers which are making dead slow progress. > Some of those direct reclaimers could have gone oom instead and release > some memory if we decide so, which we cannot because we are deep down in > the reclaim path. Waiting for on the reclaimer to increase the ISOLATED > counter wouldn't help in this situation. > If it's a waitqueue waking one process at a time, the progress may be slow but it'll still exit the loop, attempt the reclaim and then potentially OOM if no progress is made. The key is using the waitqueue to have a fair queue of processes making progress instead of a potentially infinite loop that never meets the exit conditions. > > That would both avoid the clunky congestion_wait() and guarantee forward > > progress. If the primary motivation is to avoid an infinite loop with > > too_many_isolated then there are ways of handling that without reintroducing > > zone-based counters. > > > > > > Heavy memory pressure on highmem should be spread across the whole node > > > > as > > > > we no longer are applying the fair zone allocation policy. 
The processes > > > > with highmem requirements will be reclaiming from all zones and when it > > > > finishes, it's possible that a lowmem-specific request will be clear to > > > > make > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > it makes sense to wait regardless of the allocation request. > > > > > > This is true but I am not sure how it is realated to the patch. > > > > Because heavy pressure that is enough to trigger too many isolated pages > > is unlikely to be specifically targetting a lower zone. > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and > going OOM on lowmem is not all that unrealistic scenario on 32b systems. > If the sole source of pressure is from GFP_KERNEL allocations then the isolated counter will also be specific to the lower zones and there is no benefit from the patch. If there is a combination of highmem and lowmem pressure then the highmem reclaimers will also reclaim lowmem memory. > > There is general > > pressure with multiple direct reclaimers being applied. If the system is > > under
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 18-01-17 15:54:30, Mel Gorman wrote: > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > From: Michal Hocko > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > especially for systems with high/lowmem because a heavy memory pressure > > > > on the highmem zone might block lowmem requests from making progress. Or > > > > we might allow to reclaim lowmem zone even though there are too many > > > > pages already isolated from the eligible zones just because highmem > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > isolation counters are a bit tricky with the node reclaim because > > > > we have to track each page separatelly. > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > > because of the counters > > > > Why would per zone counters cause an increased cache footprint? > > > > Because there are multiple counters, each of which need to be updated. How does this differ from per node counter though. We would need to do the accounting anyway. Moreover none of the accounting is done in a hot path. > > > but it's not clear at all this patch actually helps anything. > > > > Yes, I cannot prove any real issue so far. The main motivation was the > > patch 2 which needs per-zone accounting to use it in the retry logic > > (should_reclaim_retry). I've spotted too_many_isoalated issues on the > > way. > > > > You don't appear to directly use that information in patch 2. It is used via zone_reclaimable_pages in should_reclaim_retry > The primary > breakout is returning after stalling at least once. 
You could also avoid > an infinite loop by using a waitqueue that sleeps on too many isolated. That would be tricky on its own. Just consider the report form Tetsuo. Basically all the direct reclamers are looping on too_many_isolated while the kswapd is not making any progres because it is blocked on FS locks which are held by flushers which are making dead slow progress. Some of those direct reclaimers could have gone oom instead and release some memory if we decide so, which we cannot because we are deep down in the reclaim path. Waiting for on the reclaimer to increase the ISOLATED counter wouldn't help in this situation. > That would both avoid the clunky congestion_wait() and guarantee forward > progress. If the primary motivation is to avoid an infinite loop with > too_many_isolated then there are ways of handling that without reintroducing > zone-based counters. > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > we no longer are applying the fair zone allocation policy. The processes > > > with highmem requirements will be reclaiming from all zones and when it > > > finishes, it's possible that a lowmem-specific request will be clear to > > > make > > > progress. It's all the same LRU so if there are too many pages isolated, > > > it makes sense to wait regardless of the allocation request. > > > > This is true but I am not sure how it is realated to the patch. > > Because heavy pressure that is enough to trigger too many isolated pages > is unlikely to be specifically targetting a lower zone. Why? Basically any GFP_KERNEL allocation will make lowmem pressure and going OOM on lowmem is not all that unrealistic scenario on 32b systems. > There is general > pressure with multiple direct reclaimers being applied. If the system is > under enough pressure with parallel reclaimers to trigger too_many_isolated > checks then the system is grinding already and making little progress. 
Adding > multiple counters to allow a lowmem reclaimer to potentially make faster > progress is going to be marginal at best. OK, I agree that the situation where highmem blocks lowmem from making progress is much less likely than the other situation described in the changelog when lowmem doesn't get throttled ever. Which is the one I am interested more about. > > Also consider that lowmem throttling in too_many_isolated has only small > > chance to ever work with the node counters because highmem >> lowmem in > > many/most configurations. > > > > While true, it's also not that important. > > > > More importantly, this patch may make things worse and delay reclaim. If > > > this patch allowed a lowmem request to make progress that would have > > > previously stalled, it's going to spend time skipping pages in the LRU > > > instead of letting kswapd and the highmem pressured processes make > > > progress. > > > > I am not sure I understand this part. Say that we have highmem pressure > > which would isolated too m
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > From: Michal Hocko > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > especially for systems with high/lowmem because a heavy memory pressure > > > on the highmem zone might block lowmem requests from making progress. Or > > > we might allow to reclaim lowmem zone even though there are too many > > > pages already isolated from the eligible zones just because highmem > > > pages will easily bias too_many_isolated to say no. > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > teach too_many_isolated to consider only eligible zones. Per zone > > > isolation counters are a bit tricky with the node reclaim because > > > we have to track each page separatelly. > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > because of the counters > > Why would per zone counters cause an increased cache footprint? > Because there are multiple counters, each of which need to be updated. > > but it's not clear at all this patch actually helps anything. > > Yes, I cannot prove any real issue so far. The main motivation was the > patch 2 which needs per-zone accounting to use it in the retry logic > (should_reclaim_retry). I've spotted too_many_isoalated issues on the > way. > You don't appear to directly use that information in patch 2. The primary breakout is returning after stalling at least once. You could also avoid an infinite loop by using a waitqueue that sleeps on too many isolated. That would both avoid the clunky congestion_wait() and guarantee forward progress. If the primary motivation is to avoid an infinite loop with too_many_isolated then there are ways of handling that without reintroducing zone-based counters. 
> > Heavy memory pressure on highmem should be spread across the whole node as > > we no longer are applying the fair zone allocation policy. The processes > > with highmem requirements will be reclaiming from all zones and when it > > finishes, it's possible that a lowmem-specific request will be clear to make > > progress. It's all the same LRU so if there are too many pages isolated, > > it makes sense to wait regardless of the allocation request. > > This is true but I am not sure how it is realated to the patch. Because heavy pressure that is enough to trigger too many isolated pages is unlikely to be specifically targetting a lower zone. There is general pressure with multiple direct reclaimers being applied. If the system is under enough pressure with parallel reclaimers to trigger too_many_isolated checks then the system is grinding already and making little progress. Adding multiple counters to allow a lowmem reclaimer to potentially make faster progress is going to be marginal at best. > Also consider that lowmem throttling in too_many_isolated has only small > chance to ever work with the node counters because highmem >> lowmem in > many/most configurations. > While true, it's also not that important. > > More importantly, this patch may make things worse and delay reclaim. If > > this patch allowed a lowmem request to make progress that would have > > previously stalled, it's going to spend time skipping pages in the LRU > > instead of letting kswapd and the highmem pressured processes make progress. > > I am not sure I understand this part. Say that we have highmem pressure > which would isolated too many pages from the LRU. Which requires multiple direct reclaimers or tiny inactive lists. In the event there is such highmem pressure, it also means the lower zones are depleted. > lowmem request would > stall previously regardless of where those pages came from. 
With this
> patch it would stall only when we isolated too many pages from the
> eligible zones.

And when it makes progress, it's going to compete with the other direct
reclaimers, except the lowmem reclaim is skipping some pages and recycling
them through the LRU. It chews up CPU that would probably have been better
spent letting kswapd and the other direct reclaimers do their work.

> So let's assume that lowmem is not under pressure,

It has to be, or the highmem request would have used memory from the lower
zones.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > From: Michal Hocko
> >
> > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > especially for systems with high/lowmem because a heavy memory pressure
> > on the highmem zone might block lowmem requests from making progress. Or
> > we might allow to reclaim lowmem zone even though there are too many
> > pages already isolated from the eligible zones just because highmem
> > pages will easily bias too_many_isolated to say no.
> >
> > Fix these potential issues by moving isolated stats back to zones and
> > teach too_many_isolated to consider only eligible zones. Per zone
> > isolation counters are a bit tricky with the node reclaim because
> > we have to track each page separately.
> >
>
> I'm quite unhappy with this. Each move back increases the cache footprint
> because of the counters

Why would per zone counters cause an increased cache footprint?

> but it's not clear at all this patch actually helps anything.

Yes, I cannot prove any real issue so far. The main motivation was
patch 2, which needs per-zone accounting to use it in the retry logic
(should_reclaim_retry). I've spotted the too_many_isolated issues on
the way.

> Heavy memory pressure on highmem should be spread across the whole node as
> we no longer are applying the fair zone allocation policy. The processes
> with highmem requirements will be reclaiming from all zones and when it
> finishes, it's possible that a lowmem-specific request will be clear to make
> progress. It's all the same LRU so if there are too many pages isolated,
> it makes sense to wait regardless of the allocation request.

This is true but I am not sure how it is related to the patch. If we
have heavy highmem memory pressure then we will throttle based on pages
isolated from the respective zones. So if there is lowmem pressure at
the same time then we throttle it only when we need to. Also consider
that lowmem throttling in too_many_isolated has only a small chance to
ever work with the node counters because highmem >> lowmem in many/most
configurations.

> More importantly, this patch may make things worse and delay reclaim. If
> this patch allowed a lowmem request to make progress that would have
> previously stalled, it's going to spend time skipping pages in the LRU
> instead of letting kswapd and the highmem pressured processes make progress.

I am not sure I understand this part. Say that we have highmem pressure
which would isolate too many pages from the LRU. A lowmem request would
stall previously regardless of where those pages came from. With this
patch it would stall only when we have isolated too many pages from the
eligible zones. So let's assume that lowmem is not under pressure; why
should we stall? And why would it delay reclaim? Whoever wants to make
progress on that zone has to iterate and potentially skip many pages.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> From: Michal Hocko
>
> 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> NR_ISOLATED* counters from zones to nodes. This is not the best fit
> especially for systems with high/lowmem because a heavy memory pressure
> on the highmem zone might block lowmem requests from making progress. Or
> we might allow to reclaim lowmem zone even though there are too many
> pages already isolated from the eligible zones just because highmem
> pages will easily bias too_many_isolated to say no.
>
> Fix these potential issues by moving isolated stats back to zones and
> teach too_many_isolated to consider only eligible zones. Per zone
> isolation counters are a bit tricky with the node reclaim because
> we have to track each page separately.
>

I'm quite unhappy with this. Each move back increases the cache footprint
because of the counters but it's not clear at all this patch actually
helps anything.

Heavy memory pressure on highmem should be spread across the whole node as
we no longer are applying the fair zone allocation policy. The processes
with highmem requirements will be reclaiming from all zones and when it
finishes, it's possible that a lowmem-specific request will be clear to
make progress. It's all the same LRU so if there are too many pages
isolated, it makes sense to wait regardless of the allocation request.

More importantly, this patch may make things worse and delay reclaim. If
this patch allowed a lowmem request to make progress that would have
previously stalled, it's going to spend time skipping pages in the LRU
instead of letting kswapd and the highmem pressured processes make
progress.

--
Mel Gorman
SUSE Labs
[RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
From: Michal Hocko

599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
NR_ISOLATED* counters from zones to nodes. This is not the best fit
especially for systems with high/lowmem because a heavy memory pressure
on the highmem zone might block lowmem requests from making progress. Or
we might allow to reclaim lowmem zone even though there are too many
pages already isolated from the eligible zones just because highmem
pages will easily bias too_many_isolated to say no.

Fix these potential issues by moving isolated stats back to zones and
teach too_many_isolated to consider only eligible zones. Per zone
isolation counters are a bit tricky with the node reclaim because
we have to track each page separately.

Signed-off-by: Michal Hocko
---
 include/linux/mmzone.h |  4 ++--
 mm/compaction.c        | 16 +++---
 mm/khugepaged.c        |  4 ++--
 mm/memory_hotplug.c    |  2 +-
 mm/migrate.c           |  4 ++--
 mm/page_alloc.c        | 14 ++--
 mm/vmscan.c            | 58 --
 mm/vmstat.c            |  4 ++--
 8 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..100e7f37b7dc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,6 +119,8 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ZONE_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
@@ -148,8 +150,6 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..f84104217887 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -639,12 +639,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;

-	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+	inactive = zone_page_state(zone, NR_ZONE_INACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
+	active = zone_page_state(zone, NR_ZONE_ACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ZONE_ISOLATED_FILE) +
+			zone_page_state(zone, NR_ZONE_ISOLATED_ANON);

 	return isolated > (inactive + active) / 2;
 }
@@ -857,8 +857,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
+		inc_zone_page_state(page,
+				NR_ZONE_ISOLATED_ANON + page_is_file_cache(page));

isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c308e3..8e692b683cac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -482,7 +482,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -578,7 +578,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		goto out;
 	}
 	/* 0 stands for page_is_file_cache(page) == false */
-	inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+	inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);

 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
diff --git a/mm/memory_hotpl