Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
Tetsuo Handa wrote:
> Here is a demo patch. If you can join the analysis of why a memory allocation
> function cannot return for more than 15 minutes under severe memory pressure,
> I'll invite you to a private discussion in order to share the steps for
> reproducing such memory pressure. A quick test says that the memory reclaiming
> functions are too optimistic about reclaiming memory; they are needlessly
> called again and again and again on the assumption that some memory will be
> reclaimed within a few seconds. If I insert some delay, CPU usage during
> stalls can be reduced.

Here is a formal patch. This patch includes a test result from today's linux.git tree with https://lkml.org/lkml/2014/5/29/673 applied, in order to find which deadlock occurs next. The blocking delay on the mutex inside the ttm shrinker is gone, but a kernel worker thread trying to perform block I/O in GFP_NOIO context is blocked for more than 10 minutes. I think this is not good.

-- Start of patch --
From c5274057bd71832fcf0baef64d43a49c20f29dbf Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Wed, 2 Jul 2014 09:34:51 +0900
Subject: [PATCH] mm: Remember ongoing memory allocation status.

When a stall caused by a memory allocation problem occurs, printing how long a thread has been blocked in memory allocation is useful. This patch records how many jiffies have been spent in the ongoing __alloc_pages_nodemask() call, so that the value can be read by printing a backtrace or by analyzing a kdump. Two examples are shown below. You can see that the GFP flags passed to memory shrinker functions can be GFP_NOIO or GFP_NOFS. Therefore, when writing memory shrinker functions, please be careful about dependencies inside shrinker functions. For example, unconditional use of GFP_KERNEL may lead to deadlock. As another example, unconditional use of blocking lock operations (e.g. mutex_lock()) that are reached from multiple different GFP contexts may lead to deadlock.
kworker/2:2     R  running task        0   189      2 0x
 MemAlloc: 624869 jiffies on 0x10
Workqueue: events_freezable_power_ disk_events_workfn
 880036eacfe0 4486d7e5 88007fc83c48 81090a3f 880036eacfe0 88007fc83c80
 81090b35 880036ead210 4486d7e5 817bada0 0074
Call Trace:
 ? _raw_spin_lock+0x2f/0x50
 list_lru_count_node+0x19/0x60
 super_cache_count+0x50/0xd0
 shrink_slab_node+0x3a/0x1b0
 ? vmpressure+0x1c/0x80
 shrink_slab+0x83/0x150
 do_try_to_free_pages+0x2f9/0x530
 try_to_free_pages+0x98/0xd0
 __alloc_pages_nodemask+0x6e3/0xad0
 alloc_pages_current+0xa3/0x170
 bio_copy_user_iov+0x1c7/0x370
 bio_copy_kern+0x49/0xe0
 blk_rq_map_kern+0x6f/0x130
 ? blk_get_request+0x83/0x140
 scsi_execute+0x131/0x160
 scsi_execute_req_flags+0x84/0xf0
 sr_check_events+0xbc/0x2d0 [sr_mod]
 cdrom_check_events+0x13/0x30 [cdrom]
 sr_block_check_events+0x2d/0x30 [sr_mod]
 disk_check_events+0x55/0x1e0
 ? _cond_resched+0x35/0x60
 disk_events_workfn+0x11/0x20
 process_one_work+0x15f/0x3d0
 worker_thread+0x119/0x620
 ? rescuer_thread+0x440/0x440
 kthread+0xdc/0x100
 ? kthread_create_on_node+0x1a0/0x1a0
 ret_from_fork+0x7c/0xb0
 ? kthread_create_on_node+0x1a0/0x1a0

kworker/u16:2   R  running task        0 14009  13723 0x0080
 MemAlloc: 624951 jiffies on 0x250
 0100 28f5c28f5c28f5c3 1705 0060 0064 0064 880036dfea40 ff10 8158401a 0010
Call Trace:
 ? _raw_spin_lock+0x2a/0x50
 ? list_lru_count_node+0x19/0x60
 ? super_cache_count+0x50/0xd0
 ? shrink_slab_node+0x3a/0x1b0
 ? vmpressure+0x1c/0x80
 ? shrink_slab+0x83/0x150
 ? do_try_to_free_pages+0x2f9/0x530
 ? try_to_free_pages+0x98/0xd0
 ? __alloc_pages_nodemask+0x6e3/0xad0
 ? alloc_pages_current+0xa3/0x170
 ? xfs_buf_allocate_memory+0x168/0x245 [xfs]
 ? xfs_buf_get_map+0xd2/0x130 [xfs]
 ? xfs_buf_read_map+0x24/0xc0 [xfs]
 ? xfs_trans_read_buf_map+0xa9/0x330 [xfs]
 ? xfs_imap_to_bp+0x69/0xf0 [xfs]
 ? xfs_iread+0x79/0x410 [xfs]
 ? kmem_zone_alloc+0x6f/0xf0 [xfs]
 ? xfs_iget+0x1a3/0x510 [xfs]
 ? xfs_lookup+0xbe/0xf0 [xfs]
 ? xfs_vn_lookup+0x73/0xc0 [xfs]
 ? lookup_real+0x18/0x50
 ? do_last+0x8bd/0xe90
 ? link_path_walk+0x27e/0x8e0
 ? path_openat+0xc8/0x6a0
 ? select_task_rq_fair+0x3dc/0x7e0
 ? do_filp_open+0x48/0xb0
 ? kmem_cache_alloc+0x109/0x170
 ? security_prepare_creds+0x11/0x20
 ? do_open_exec+0x1d/0xe0
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
Tetsuo Handa wrote:
> We need some more changes. I'm thinking of a memory allocation watchdog
> thread: add an "unsigned long" field to "struct task_struct", set it to
> jiffies upon entry to a GFP_WAIT-able memory allocation attempt, and clear
> it upon return. A kernel thread periodically scans the task list, compares
> the field with jiffies, and (at least) prints warning messages (maybe
> optionally triggers the OOM killer or a kernel panic) if a single memory
> allocation attempt is taking too long (e.g. 60 seconds). What do you think?

Here is a demo patch. If you can join the analysis of why a memory allocation function cannot return for more than 15 minutes under severe memory pressure, I'll invite you to a private discussion in order to share the steps for reproducing such memory pressure. A quick test says that the memory reclaiming functions are too optimistic about reclaiming memory; they are needlessly called again and again and again on the assumption that some memory will be reclaimed within a few seconds. If I insert some delay, CPU usage during stalls can be reduced.

--
From 015fecd45761b2849974f37dc379edf3e86acfa6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Mon, 9 Jun 2014 15:00:47 +0900
Subject: [PATCH] mm: Add memory allocation watchdog kernel thread.

When a certain type of memory pressure is applied, the system may stall for many minutes trying to allocate memory pages. Stalling without any messages is annoying because, for example, timeouts will occur without any prior warning. This patch introduces a watchdog thread that periodically reports the longest-stalling thread if __alloc_pages_nodemask() has been running for more than 10 seconds. An example output from a VM with 4 CPUs / 2GB RAM (without swap) running a v3.15 kernel with this patch is shown below.
[ 5835.136868] INFO: task pcscd:14569 blocked for 11 seconds at memory allocation
[ 5845.137932] INFO: task pcscd:14569 blocked for 21 seconds at memory allocation
[ 5855.142985] INFO: task pcscd:14569 blocked for 31 seconds at memory allocation
(...snipped...)
[ 6710.227984] INFO: task pcscd:14569 blocked for 886 seconds at memory allocation
[ 6720.228058] INFO: task pcscd:14569 blocked for 896 seconds at memory allocation
[ 6730.231108] INFO: task pcscd:14569 blocked for 906 seconds at memory allocation
[ 6740.242185] INFO: task pcscd:14569 blocked for 916 seconds at memory allocation

Signed-off-by: Tetsuo Handa
---
 include/linux/sched.h |  1 +
 mm/page_alloc.c       | 61 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 221b2bd..befd496 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,7 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	unsigned long memory_allocation_start_jiffies;
 };

 /* Future-safe accessor for struct task_struct's cpus_allowed. */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..211b0b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/kthread.h>

 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2698,6 +2699,16 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	struct mem_cgroup *memcg = NULL;
+	bool memory_allocation_recursion = false;
+	unsigned long *stamp = &current->memory_allocation_start_jiffies;
+
+	if (likely(!*stamp)) {
+		*stamp = jiffies;
+		if (unlikely(!*stamp))
+			(*stamp)++;
+	} else {
+		memory_allocation_recursion = true;
+	}

 	gfp_mask &= gfp_allowed_mask;
@@ -2706,7 +2717,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);

 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;

 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2714,14 +2725,14 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;

 	/*
 	 * Will only have any effect when __GFP_KMEMCG is set. This is
 	 * verified in the (always inline) callee
 	 */
 	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
+		goto nopage;

 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
@@ -2784,10 +2795,54 @@ out:
 	memcg_kmem_commit_charge(page, memcg, order);

+nopage:
+	if (likely(!memory_allocation_recursion))
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Mon, 26 May 2014, Tetsuo Handa wrote:
> >
> > > In shrink_inactive_list(), we do not insert delay at
> > >
> > >   if (!sc->hibernation_mode && !current_is_kswapd())
> > >           wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > >
> > > if sc->hibernation_mode != 0.
> > > For the same reason, we should not insert delay at
> > >
> > >   while (unlikely(too_many_isolated(zone, file, sc))) {
> > >           congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >
> > >           /* We are about to die and free our memory. Return now. */
> > >           if (fatal_signal_pending(current))
> > >                   return SWAP_CLUSTER_MAX;
> > >   }
> > >
> > > if sc->hibernation_mode != 0.
> > >
> > > Signed-off-by: Tetsuo Handa
> > > ---
> > >  mm/vmscan.c | 3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 32c661d..89c42ca 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> > >   if (current_is_kswapd())
> > >           return 0;
> > >
> > > + if (sc->hibernation_mode)
> > > +         return 0;
> > > +
> > >   if (!global_reclaim(sc))
> > >           return 0;
> > >
> > This isn't the only too_many_isolated() function that does a delay; how is
> > the too_many_isolated() in mm/compaction.c different?
>
> I don't know. But today I realized that this patch is not sufficient.
>
> I'm trying to find out why __alloc_pages_slowpath() cannot return for many
> minutes when a certain type of memory pressure is applied on a RHEL7
> environment with 4 CPUs / 2GB RAM. Today I tried to use ftrace to examine
> the breakdown of time-consuming functions inside __alloc_pages_slowpath().
> But on the first run, all processes were trapped in this
> too_many_isolated()/congestion_wait() loop while kswapd was not running,
> stalling forever because nobody could perform the operations needed to make
> too_many_isolated() return 0.
>
> This means that, under rare circumstances, it is possible for all processes
> other than kswapd to be trapped in the too_many_isolated()/congestion_wait()
> loop while kswapd is sleeping, because this loop assumes that somebody else
> will wake up kswapd and that kswapd will perform the operations needed to
> make too_many_isolated() return 0. However, we can guarantee neither that
> kswapd is woken by somebody nor that kswapd is not blocked by blocking
> operations inside shrinker functions (e.g. mutex_lock()).

So what you are saying is that kswapd is having problems with getting blocked on locks held by processes in direct reclaim? What are the stack traces that demonstrate such a dependency loop?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
David Rientjes wrote:
> On Mon, 26 May 2014, Tetsuo Handa wrote:
>
> > In shrink_inactive_list(), we do not insert delay at
> >
> >   if (!sc->hibernation_mode && !current_is_kswapd())
> >           wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> >
> > if sc->hibernation_mode != 0.
> > For the same reason, we should not insert delay at
> >
> >   while (unlikely(too_many_isolated(zone, file, sc))) {
> >           congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
> >           /* We are about to die and free our memory. Return now. */
> >           if (fatal_signal_pending(current))
> >                   return SWAP_CLUSTER_MAX;
> >   }
> >
> > if sc->hibernation_mode != 0.
> >
> > Signed-off-by: Tetsuo Handa
> > ---
> >  mm/vmscan.c | 3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 32c661d..89c42ca 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> >   if (current_is_kswapd())
> >           return 0;
> >
> > + if (sc->hibernation_mode)
> > +         return 0;
> > +
> >   if (!global_reclaim(sc))
> >           return 0;
> >
> This isn't the only too_many_isolated() function that does a delay; how is
> the too_many_isolated() in mm/compaction.c different?

I don't know. But today I realized that this patch is not sufficient.

I'm trying to find out why __alloc_pages_slowpath() cannot return for many minutes when a certain type of memory pressure is applied on a RHEL7 environment with 4 CPUs / 2GB RAM. Today I tried to use ftrace to examine the breakdown of time-consuming functions inside __alloc_pages_slowpath(). But on the first run, all processes were trapped in this too_many_isolated()/congestion_wait() loop while kswapd was not running, stalling forever because nobody could perform the operations needed to make too_many_isolated() return 0.

This means that, under rare circumstances, it is possible for all processes other than kswapd to be trapped in the too_many_isolated()/congestion_wait() loop while kswapd is sleeping, because this loop assumes that somebody else will wake up kswapd and that kswapd will perform the operations needed to make too_many_isolated() return 0. However, we can guarantee neither that kswapd is woken by somebody nor that kswapd is not blocked by blocking operations inside shrinker functions (e.g. mutex_lock()).

We need some more changes. I'm thinking of a memory allocation watchdog thread: add an "unsigned long" field to "struct task_struct", set it to jiffies upon entry to a GFP_WAIT-able memory allocation attempt, and clear it upon return. A kernel thread periodically scans the task list, compares the field with jiffies, and (at least) prints warning messages (maybe optionally triggers the OOM killer or a kernel panic) if a single memory allocation attempt is taking too long (e.g. 60 seconds). What do you think?
RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> -----Original Message-----
> From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
> Sent: Tuesday, May 20, 2014 11:58 PM
> To: da...@fromorbit.com; r...@redhat.com
> Cc: Motohiro Kosaki JP; fengguang...@intel.com; kamezawa.hir...@jp.fujitsu.com; a...@linux-foundation.org; h...@infradead.org; linux-kernel@vger.kernel.org; x...@oss.sgi.com
> Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
>
> Today I discussed this issue with Kosaki-san at LinuxCon Japan 2014. He does
> not like the idea of adding a timeout to the throttle loop. As Dave posted a
> patch that fixes a bug in XFS delayed allocation, I updated my patch
> accordingly.
>
> Although the bug in XFS was fixed by Dave's patch, other kernel code may have
> bugs that fall into this infinite throttle loop. But to keep the possibility
> of triggering the OOM killer to a minimum, can we agree on this updated patch
> (and, in the future, on adding some warning mechanism like
> /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation
> stalls)?
>
> Dave, if you are OK with this updated patch, please let me know the commit ID
> of your patch.
>
> Regards.
> --
> From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa
> Date: Tue, 20 May 2014 23:34:34 +0900
> Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().
>
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too
> many pages are isolated already" causes a RHEL7 environment to stall with 0%
> CPU usage when a certain type of memory pressure is applied. This is because
> nobody can reclaim memory, due to the rules listed below.
>
> (a) XFS uses a kernel worker thread for delayed allocation
> (b) kswapd wakes up the kernel worker thread for delayed allocation
> (c) the kernel worker thread is throttled due to commit 35cd7815
>
> This patch and commit "xfs: block allocation work needs to be kswapd
> aware" will solve rule (c).
>
> Signed-off-by: Tetsuo Handa
> ---
>  mm/vmscan.c | 20 +++++++++++++------
>  1 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..5c6960e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -	while (unlikely(too_many_isolated(zone, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	/*
> +	 * Throttle only direct reclaimers. Allocations by kswapd (and
> +	 * allocation workqueue on behalf of kswapd) should not be
> +	 * throttled here; otherwise memory allocation will deadlock.
> +	 */
> +	if (!sc->hibernation_mode && !current_is_kswapd()) {
> +		while (unlikely(too_many_isolated(zone, file, sc))) {
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> -		/* We are about to die and free our memory. Return now. */
> -		if (fatal_signal_pending(current))
> -			return SWAP_CLUSTER_MAX;
> +			/*
> +			 * We are about to die and free our memory.
> +			 * Return now.
> +			 */
> +			if (fatal_signal_pending(current))
> +				return SWAP_CLUSTER_MAX;
> +		}
> 	}

Acked-by: KOSAKI Motohiro

Dave, I don't like Tetsuo's first patch, because this too_many_isolated() loop exists to prevent false OOM kills, and a simple timeout would resurrect them. Please let me know if you need further MM enhancements to solve the XFS issue. I'd like to join in and assist with this.

Thanks.
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
Today I discussed this issue with Kosaki-san at LinuxCon Japan 2014.
He does not like the idea of adding a timeout to the throttle loop. As
Dave posted a patch that fixes a bug in XFS delayed allocation, I
updated my patch accordingly.

Although the bug in XFS was fixed by Dave's patch, other kernel code may
have bugs which would fall into this infinite throttle loop. But to keep
the possibility of triggering the OOM killer to a minimum, can we agree
on this updated patch (and, in the future, on adding some warning
mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting
memory allocation stalls)?

Dave, if you are OK with this updated patch, please let me know the
commit ID of your patch.

Regards.
--
>From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Tue, 20 May 2014 23:34:34 +0900
Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().

I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
too many pages are isolated already" causes a RHEL7 environment to stall
with 0% CPU usage when a certain type of memory pressure is given.
This is because nobody can reclaim memory due to the rules listed below.

(a) XFS uses a kernel worker thread for delayed allocation
(b) kswapd wakes up the kernel worker thread for delayed allocation
(c) the kernel worker thread is throttled due to commit 35cd7815

This patch and commit "xfs: block allocation work needs to be kswapd
aware" will solve rule (c).

Signed-off-by: Tetsuo Handa
---
 mm/vmscan.c | 20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..5c6960e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

-	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	/*
+	 * Throttle only direct reclaimers. Allocations by kswapd (and
+	 * allocation workqueue on behalf of kswapd) should not be
+	 * throttled here; otherwise memory allocation will deadlock.
+	 */
+	if (!sc->hibernation_mode && !current_is_kswapd()) {
+		while (unlikely(too_many_isolated(zone, file, sc))) {
+			congestion_wait(BLK_RW_ASYNC, HZ/10);

-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+			/*
+			 * We are about to die and free our memory.
+			 * Return now.
+			 */
+			if (fatal_signal_pending(current))
+				return SWAP_CLUSTER_MAX;
+		}
 	}

 	lru_add_drain();
--
1.7.1
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> So current_is_kswapd() returns true for a thread which is not kswapd.
> That's a bit smelly.
>
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path. If it is called by
kswapd it should act as if it were kswapd for all purposes.
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote:
> On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton wrote:
>
> > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner wrote:
> >
> > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> > >  	struct xfs_bmalloca	*args = container_of(work,
> > >  						struct xfs_bmalloca, work);
> > >  	unsigned long		pflags;
> > > +	unsigned long		new_pflags = PF_FSTRANS;
> > >
> > > -	/* we are in a transaction context here */
> > > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > > +	/*
> > > +	 * we are in a transaction context here, but may also be doing work
> > > +	 * in kswapd context, and hence we may need to inherit that state
> > > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > > +	 * in any way.
> > > +	 */
> > > +	if (args->kswapd)
> > > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >
> > So current_is_kswapd() returns true for a thread which is not kswapd.
> > That's a bit smelly.
> >
> > Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> > PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> > only one(?) of which you actually want.
> >
> > It would be cleaner to create a new PF_ flag to select just that
> > behavior. That's a better model than telling the world "I am magic and
> > special".
>
> Or a new __GFP_FLAG.

Sure - and with that XFS will need another PF_ flag to tell the memory
allocator to set the new __GFP_FLAG on every allocation done in that
kworker task context, just like it uses PF_FSTRANS to ensure that
__GFP_NOFS is set for all the allocations in that kworker context.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner wrote:
>
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> >  	struct xfs_bmalloca	*args = container_of(work,
> >  						struct xfs_bmalloca, work);
> >  	unsigned long		pflags;
> > +	unsigned long		new_pflags = PF_FSTRANS;
> >
> > -	/* we are in a transaction context here */
> > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > +	/*
> > +	 * we are in a transaction context here, but may also be doing work
> > +	 * in kswapd context, and hence we may need to inherit that state
> > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > +	 * in any way.
> > +	 */
> > +	if (args->kswapd)
> > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>
> So current_is_kswapd() returns true for a thread which is not kswapd.
> That's a bit smelly.
>
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

It's doing work on behalf of kswapd under kswapd constraints, so it
should both behave and be accounted as if kswapd was executing the work
directly.

> It would be cleaner to create a new PF_ flag to select just that
> behavior. That's a better model than telling the world "I am magic and
> special".

However, it is magic and special because of who the work needs to be
done for.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.

I don't really care enough to argue over what mechanism should be used.
I'll push this patch through the XFS tree, and when a new flag or
generic mechanism for pushing task contexts to kworker threads is
provided, I'll change the XFS code to use that.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton wrote:
> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner wrote:
>
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> >  	struct xfs_bmalloca	*args = container_of(work,
> >  						struct xfs_bmalloca, work);
> >  	unsigned long		pflags;
> > +	unsigned long		new_pflags = PF_FSTRANS;
> >
> > -	/* we are in a transaction context here */
> > -	current_set_flags_nested(&pflags, PF_FSTRANS);
> > +	/*
> > +	 * we are in a transaction context here, but may also be doing work
> > +	 * in kswapd context, and hence we may need to inherit that state
> > +	 * temporarily to ensure that we don't block waiting for memory reclaim
> > +	 * in any way.
> > +	 */
> > +	if (args->kswapd)
> > +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>
> So current_is_kswapd() returns true for a thread which is not kswapd.
> That's a bit smelly.
>
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example? current_is_kswapd() does a range of things,
> only one(?) of which you actually want.
>
> It would be cleaner to create a new PF_ flag to select just that
> behavior. That's a better model than telling the world "I am magic and
> special".

Or a new __GFP_FLAG.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner wrote:

> @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
>  	struct xfs_bmalloca	*args = container_of(work,
>  						struct xfs_bmalloca, work);
>  	unsigned long		pflags;
> +	unsigned long		new_pflags = PF_FSTRANS;
>
> -	/* we are in a transaction context here */
> -	current_set_flags_nested(&pflags, PF_FSTRANS);
> +	/*
> +	 * we are in a transaction context here, but may also be doing work
> +	 * in kswapd context, and hence we may need to inherit that state
> +	 * temporarily to ensure that we don't block waiting for memory reclaim
> +	 * in any way.
> +	 */
> +	if (args->kswapd)
> +		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

So current_is_kswapd() returns true for a thread which is not kswapd.
That's a bit smelly.

Should this thread really be incrementing KSWAPD_INODESTEAL instead of
PGINODESTEAL, for example? current_is_kswapd() does a range of things,
only one(?) of which you actually want.

It would be cleaner to create a new PF_ flag to select just that
behavior. That's a better model than telling the world "I am magic and
special".

But we're awfully close to running out of PF_ space and I don't know if
this ugly justifies consuming a flag.
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > So, XFS should be passing kswapd context to the workqueue allocation
> > context. The patch below does this.
> >
> > Tetsuo-san, when it comes to problems involving XFS, you should
> > really CC x...@oss.sgi.com because very few people really know how
> > XFS works and even fewer still know how it is supposed to interact
> > with memory reclaim.
>
> Thank you for the patch, but ...
>
> #define PF_KSWAPD	0x00040000	/* I am kswapd */
>
> static inline int current_is_kswapd(void)
> {
> 	return current->flags & PF_KSWAPD;
> }
>
> I think ((char) (current->flags & 0x00040000)) == 0.
> Your patch wants
>
> -	args->kswapd = current_is_kswapd();
> +	args->kswapd = (current_is_kswapd() != 0);

Thanks for pointing that out, but I think:

-static inline int current_is_kswapd(void)
+static inline bool current_is_kswapd(void)

is a better solution. It can only be true or false. But regardless, I
need to change the boolean options in that XFS structure to be, well,
booleans.

Cheers,

dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
Dave Chinner wrote:
> So, XFS should be passing kswapd context to the workqueue allocation
> context. The patch below does this.
>
> Tetsuo-san, when it comes to problems involving XFS, you should
> really CC x...@oss.sgi.com because very few people really know how
> XFS works and even fewer still know how it is supposed to interact
> with memory reclaim.

Thank you for the patch, but ...

#define PF_KSWAPD	0x00040000	/* I am kswapd */

static inline int current_is_kswapd(void)
{
	return current->flags & PF_KSWAPD;
}

I think ((char) (current->flags & 0x00040000)) == 0.
Your patch wants

-	args->kswapd = current_is_kswapd();
+	args->kswapd = (current_is_kswapd() != 0);
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, May 19, 2014 at 10:40 PM, Tetsuo Handa wrote:
>
> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30 seconds timeout for userspace processes and a 5 seconds
> timeout for kernel processes.
>
> Signed-off-by: Tetsuo Handa
> ---
>  mm/vmscan.c | 7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..3eeeda6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1459,13 +1459,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	int i = 0;
>
> -	while (unlikely(too_many_isolated(zone, file, sc))) {
> +	/* Throttle with timeout. */
> +	while (unlikely(too_many_isolated(zone, file, sc)) && i++ < 300) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
>  			return SWAP_CLUSTER_MAX;
> +		/* Kernel threads should not be blocked for too long. */
> +		if (i == 50 && (current->flags & PF_KTHREAD))
> +			break;
>  	}
>
>  	lru_add_drain();

Hi, Tetsuo Handa

I think it would be good to use a macro for this magic number instead of
hard-coding it, from a long-term maintainability point of view, and
better still to document it appropriately.

Thanks,
Jianyu Zhan
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
[cc x...@oss.sgi.com]

On Mon, May 19, 2014 at 11:40:46PM +0900, Tetsuo Handa wrote:
> >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
>
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes a RHEL7 environment to stall
> with 0% CPU usage when a certain type of memory pressure is given.
>
> Upon memory pressure, kswapd calls xfs_vm_writepage() from
> shrink_page_list(). xfs_vm_writepage() eventually calls
> wait_for_completion(), which waits for xfs_bmapi_allocate_worker().
>
> Then, a kernel worker thread calls xfs_bmapi_allocate_worker(), and
> xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block().
> xfs_btree_lookup_get_block() eventually calls alloc_page().
> alloc_page() eventually calls shrink_inactive_list().
>
> The stack trace showed that the kernel worker thread which kswapd was
> waiting for was looping in the while loop in shrink_inactive_list().

[snip stack traces indicating XFS is deferring allocation work to the
allocation workqueue on behalf of kswapd]

> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30 second timeout for userspace processes and a 5 second
> timeout for kernel processes.

That's like trying to swat a fly with a runaway train. Eventually you'll
hit the fly with the train.

Q: What is missing from the workqueue context that results in the
workqueue being blocked in memory reclaim on congestion?

Hint: XFS is deferring allocation on behalf of kswapd to a workqueue.

A: PF_KSWAPD.

shrink_inactive_list() has this guard:

	/*
	 * Stall direct reclaim for IO completions if underlying BDIs or zone
	 * is congested.  Allow kswapd to continue until it starts encountering
	 * unqueued dirty pages or cycling through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd())
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

So, XFS should be passing kswapd context to the workqueue allocation
context. The patch below does this.

Tetsuo-san, when it comes to problems involving XFS, you should really CC
x...@oss.sgi.com, because very few people really know how XFS works and
even fewer still know how it is supposed to interact with memory reclaim.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: block allocation work needs to be kswapd aware

From: Dave Chinner

Upon memory pressure, kswapd calls xfs_vm_writepage() from
shrink_page_list(). This can result in delayed allocation occurring,
and that gets deferred to the allocation workqueue. The allocation then
runs outside kswapd context, which means if it needs memory (and it
does, to demand-page metadata from disk) it can block in
shrink_inactive_list() waiting for IO congestion. These blocking waits
are normally avoided in kswapd context, so under memory pressure
writeback from kswapd can be arbitrarily delayed by memory reclaim.

To avoid this, pass the kswapd context to the allocation being done by
the workqueue, so that memory reclaim understands correctly that the
work is being done for kswapd and therefore it is not blocked and does
not delay memory reclaim.

Reported-by: Tetsuo Handa
Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_bmap_util.c | 16 +++++++++++++---
 fs/xfs/xfs_bmap_util.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671..703b3ec 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
 	struct xfs_bmalloca	*args = container_of(work,
 						struct xfs_bmalloca, work);
 	unsigned long		pflags;
+	unsigned long		new_pflags = PF_FSTRANS;
 
-	/* we are in a transaction context here */
-	current_set_flags_nested(&pflags, PF_FSTRANS);
+	/*
+	 * we are in a transaction context here, but may also be doing work
+	 * in kswapd context, and hence we may need to inherit that state
+	 * temporarily to ensure that we don't block waiting for memory reclaim
+	 * in any way.
+	 */
+	if (args->kswapd)
+		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+	current_set_flags_nested(&pflags, new_pflags);
 
 	args->result = __xfs_bmapi_allocate(args);
 	complete(args->done);
 
-	current_restore_flags_nested(&pflags, PF_FSTRANS);
+	current_restore_flags_nested(&pflags, new_pflags);
 }
 
 /*
@@ -284,6 +293,7 @@ xfs_bmapi_allocate(
 	args->done = &done;
+	args->kswapd = current_is_kswapd();
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On 05/19/2014 10:40 AM, Tetsuo Handa wrote:
> >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
>
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes a RHEL7 environment to stall
> with 0% CPU usage when a certain type of memory pressure is given.
>
> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30 second timeout for userspace processes and a 5 second
> timeout for kernel processes.

I suspect the fundamental problem is that other processes can drive the
system back to the limit, and unlucky processes can end up starved
forever. This makes your fix (or something like it) necessary.

> Signed-off-by: Tetsuo Handa

Acked-by: Rik van Riel

-- 
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Mon, May 19, 2014 at 10:40 PM, Tetsuo Handa wrote:
> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30 second timeout for userspace processes and a 5 second
> timeout for kernel processes.
>
> Signed-off-by: Tetsuo Handa
> ---
>  mm/vmscan.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..3eeeda6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1459,13 +1459,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	int i = 0;
>
> -	while (unlikely(too_many_isolated(zone, file, sc))) {
> +	/* Throttle with timeout. */
> +	while (unlikely(too_many_isolated(zone, file, sc)) && i++ < 300) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
>  			return SWAP_CLUSTER_MAX;
> +
> +		/* Kernel threads should not be blocked for too long. */
> +		if (i == 50 && (current->flags & PF_KTHREAD))
> +			break;
>  	}
>
>  	lru_add_drain();

Hi, Tetsuo Handa

I think it would be good to use a macro for these magic numbers instead
of hard-coding them, for long-term maintainability, ideally with
appropriate documentation.

Thanks,
Jianyu Zhan
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
Dave Chinner wrote:
> So, XFS should be passing kswapd context to the workqueue allocation
> context. The patch below does this.
>
> Tetsuo-san, when it comes to problems involving XFS, you should really
> CC x...@oss.sgi.com, because very few people really know how XFS works
> and even fewer still know how it is supposed to interact with memory
> reclaim.

Thank you for the patch, but ...

  #define PF_KSWAPD	0x00040000	/* I am kswapd */

  static inline int current_is_kswapd(void)
  {
  	return current->flags & PF_KSWAPD;
  }

I think ((char) (current->flags & 0x00040000)) == 0. Your patch wants

  -	args->kswapd = current_is_kswapd();
  +	args->kswapd = (current_is_kswapd() != 0);
Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > So, XFS should be passing kswapd context to the workqueue allocation
> > context. The patch below does this.
> >
> > Tetsuo-san, when it comes to problems involving XFS, you should really
> > CC x...@oss.sgi.com, because very few people really know how XFS works
> > and even fewer still know how it is supposed to interact with memory
> > reclaim.
>
> Thank you for the patch, but ...
>
>   #define PF_KSWAPD	0x00040000	/* I am kswapd */
>
>   static inline int current_is_kswapd(void)
>   {
>   	return current->flags & PF_KSWAPD;
>   }
>
> I think ((char) (current->flags & 0x00040000)) == 0. Your patch wants
>
>   -	args->kswapd = current_is_kswapd();
>   +	args->kswapd = (current_is_kswapd() != 0);

Thanks for pointing that out, but I think:

  -static inline int current_is_kswapd(void)
  +static inline bool current_is_kswapd(void)

is a better solution. It can only be true or false. But regardless, I
need to change the boolean options in that XFS structure to be, well,
booleans.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com