Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-07-02 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Here is a demo patch. If you can join the analysis of why a memory allocation
> function cannot return for more than 15 minutes under severe memory pressure,
> I'll invite you to a private discussion in order to share the steps for
> reproducing such memory pressure. A quick test says that the memory reclaim
> functions are too optimistic about reclaiming memory; they are needlessly
> called again and again with the assumption that some memory will be reclaimed
> within a few seconds. If I insert some delay, CPU usage during stalls can be
> reduced.

Here is a formal patch. This patch includes a test result from today's
linux.git tree with https://lkml.org/lkml/2014/5/29/673 applied, in order to
find what deadlock occurs next. The blocking delay on the mutex inside the
ttm shrinker has gone, but a kernel worker thread trying to perform block I/O
in GFP_NOIO context is blocked for more than 10 minutes. I think this is not
good.

-- Start of patch --

From c5274057bd71832fcf0baef64d43a49c20f29dbf Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Wed, 2 Jul 2014 09:34:51 +0900
Subject: [PATCH] mm: Remember ongoing memory allocation status.

When a stall caused by a memory allocation problem occurs, it is useful
to print how long a thread has been blocked in memory allocation.

This patch remembers how many jiffies the ongoing
__alloc_pages_nodemask() call has spent, and makes the value readable
from a printed backtrace or by analyzing kdump.

Two examples are shown below. You can see that the GFP flags passed to
memory shrinker functions can be GFP_NOIO or GFP_NOFS. Therefore, when
writing memory shrinker functions, please be careful about dependencies
inside shrinker functions. For example, unconditional use of GFP_KERNEL
may lead to deadlock. For another example, unconditional use of
blocking lock operations (e.g. mutex_lock()) that are reached from
multiple different GFP contexts may lead to deadlock.
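
For illustration, here is a minimal sketch of a shrinker count callback
that honours the caller's GFP context instead of assuming GFP_KERNEL.
This is not part of the patch, and my_lock / my_nr_objects are
hypothetical names:

static unsigned long my_count_objects(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	unsigned long count;

	/* Reclaim was entered without __GFP_FS: touching filesystem
	 * state here could recurse into the deadlock described above. */
	if (!(sc->gfp_mask & __GFP_FS))
		return 0;

	/* Do not sleep unconditionally: my_lock may be held by a task
	 * that is itself waiting for memory. */
	if (!mutex_trylock(&my_lock))
		return 0;

	count = my_nr_objects;
	mutex_unlock(&my_lock);
	return count;
}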

 kworker/2:2 R  running task0   189  2 0x
 MemAlloc: 624869 jiffies on 0x10
 Workqueue: events_freezable_power_ disk_events_workfn
  880036eacfe0 4486d7e5 88007fc83c48 81090a3f
  880036eacfe0  88007fc83c80 81090b35
  880036ead210 4486d7e5 817bada0 0074
 Call Trace:
  [] ? _raw_spin_lock+0x2f/0x50
  [] list_lru_count_node+0x19/0x60
  [] super_cache_count+0x50/0xd0
  [] shrink_slab_node+0x3a/0x1b0
  [] ? vmpressure+0x1c/0x80
  [] shrink_slab+0x83/0x150
  [] do_try_to_free_pages+0x2f9/0x530
  [] try_to_free_pages+0x98/0xd0
  [] __alloc_pages_nodemask+0x6e3/0xad0
  [] alloc_pages_current+0xa3/0x170
  [] bio_copy_user_iov+0x1c7/0x370
  [] bio_copy_kern+0x49/0xe0
  [] blk_rq_map_kern+0x6f/0x130
  [] ? blk_get_request+0x83/0x140
  [] scsi_execute+0x131/0x160
  [] scsi_execute_req_flags+0x84/0xf0
  [] sr_check_events+0xbc/0x2d0 [sr_mod]
  [] cdrom_check_events+0x13/0x30 [cdrom]
  [] sr_block_check_events+0x2d/0x30 [sr_mod]
  [] disk_check_events+0x55/0x1e0
  [] ? _cond_resched+0x35/0x60
  [] disk_events_workfn+0x11/0x20
  [] process_one_work+0x15f/0x3d0
  [] worker_thread+0x119/0x620
  [] ? rescuer_thread+0x440/0x440
  [] kthread+0xdc/0x100
  [] ? kthread_create_on_node+0x1a0/0x1a0
  [] ret_from_fork+0x7c/0xb0
  [] ? kthread_create_on_node+0x1a0/0x1a0

 kworker/u16:2   R  running task0 14009  13723 0x0080
 MemAlloc: 624951 jiffies on 0x250
   0100  28f5c28f5c28f5c3
  1705 0060 0064 0064
  880036dfea40 ff10 8158401a 0010
 Call Trace:
  [] ? _raw_spin_lock+0x2a/0x50
  [] ? list_lru_count_node+0x19/0x60
  [] ? super_cache_count+0x50/0xd0
  [] ? shrink_slab_node+0x3a/0x1b0
  [] ? vmpressure+0x1c/0x80
  [] ? shrink_slab+0x83/0x150
  [] ? do_try_to_free_pages+0x2f9/0x530
  [] ? try_to_free_pages+0x98/0xd0
  [] ? __alloc_pages_nodemask+0x6e3/0xad0
  [] ? alloc_pages_current+0xa3/0x170
  [] ? xfs_buf_allocate_memory+0x168/0x245 [xfs]
  [] ? xfs_buf_get_map+0xd2/0x130 [xfs]
  [] ? xfs_buf_read_map+0x24/0xc0 [xfs]
  [] ? xfs_trans_read_buf_map+0xa9/0x330 [xfs]
  [] ? xfs_imap_to_bp+0x69/0xf0 [xfs]
  [] ? xfs_iread+0x79/0x410 [xfs]
  [] ? kmem_zone_alloc+0x6f/0xf0 [xfs]
  [] ? xfs_iget+0x1a3/0x510 [xfs]
  [] ? xfs_lookup+0xbe/0xf0 [xfs]
  [] ? xfs_vn_lookup+0x73/0xc0 [xfs]
  [] ? lookup_real+0x18/0x50
  [] ? do_last+0x8bd/0xe90
  [] ? link_path_walk+0x27e/0x8e0
  [] ? path_openat+0xc8/0x6a0
  [] ? select_task_rq_fair+0x3dc/0x7e0
  [] ? do_filp_open+0x48/0xb0
  [] ? kmem_cache_alloc+0x109/0x170
  [] ? security_prepare_creds+0x11/0x20
  [] ? do_open_exec+0x1d/0xe0
  [] ? 

Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-06-09 Thread Tetsuo Handa
Tetsuo Handa wrote:
> We need some more changes. I'm thinking of a memory allocation watchdog
> thread: add an "unsigned long" field to "struct task_struct", set jiffies to
> the field upon entry to GFP_WAIT-able memory allocation attempts, and clear
> the field upon returning from GFP_WAIT-able memory allocation attempts. A
> kernel thread periodically scans the task list and compares the field with
> jiffies, and (at least) prints warning messages (maybe optionally triggers
> the OOM killer or a kernel panic) if a single memory allocation attempt is
> taking too long (e.g. 60 seconds). What do you think?
> 
Here is a demo patch. If you can join the analysis of why a memory allocation
function cannot return for more than 15 minutes under severe memory pressure,
I'll invite you to a private discussion in order to share the steps for
reproducing such memory pressure. A quick test says that the memory reclaim
functions are too optimistic about reclaiming memory; they are needlessly
called again and again with the assumption that some memory will be reclaimed
within a few seconds. If I insert some delay, CPU usage during stalls can be
reduced.
--
From 015fecd45761b2849974f37dc379edf3e86acfa6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Mon, 9 Jun 2014 15:00:47 +0900
Subject: [PATCH] mm: Add memory allocation watchdog kernel thread.

When a certain type of memory pressure is applied, the system may stall
for many minutes trying to allocate memory pages. Stalling without any
messages is annoying because (for example) timeouts will fire without
any prior warning.

This patch introduces a watchdog thread which periodically reports the
longest-stalling thread whenever __alloc_pages_nodemask() has been taking
more than 10 seconds. An example output from a VM with 4 CPUs / 2GB RAM
(without swap) running a v3.15 kernel with this patch is shown below.

  [ 5835.136868] INFO: task pcscd:14569 blocked for 11 seconds at memory allocation
  [ 5845.137932] INFO: task pcscd:14569 blocked for 21 seconds at memory allocation
  [ 5855.142985] INFO: task pcscd:14569 blocked for 31 seconds at memory allocation
  (...snipped...)
  [ 6710.227984] INFO: task pcscd:14569 blocked for 886 seconds at memory allocation
  [ 6720.228058] INFO: task pcscd:14569 blocked for 896 seconds at memory allocation
  [ 6730.231108] INFO: task pcscd:14569 blocked for 906 seconds at memory allocation
  [ 6740.242185] INFO: task pcscd:14569 blocked for 916 seconds at memory allocation

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 include/linux/sched.h |  1 +
 mm/page_alloc.c   | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 221b2bd..befd496 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,7 @@ struct task_struct {
unsigned intsequential_io;
unsigned intsequential_io_avg;
 #endif
+   unsigned long memory_allocation_start_jiffies;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..211b0b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
+#include <linux/kthread.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2698,6 +2699,16 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
struct mem_cgroup *memcg = NULL;
+   bool memory_allocation_recursion = false;
+   unsigned long *stamp = &current->memory_allocation_start_jiffies;
+
+   if (likely(!*stamp)) {
+   *stamp = jiffies;
+   if (unlikely(!*stamp))
+   (*stamp)++;
+   } else {
+   memory_allocation_recursion = true;
+   }
 
gfp_mask &= gfp_allowed_mask;
 
@@ -2706,7 +2717,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
might_sleep_if(gfp_mask & __GFP_WAIT);
 
if (should_fail_alloc_page(gfp_mask, order))
-   return NULL;
+   goto nopage;
 
/*
 * Check the zones suitable for the gfp_mask contain at least one
@@ -2714,14 +2725,14 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 * of GFP_THISNODE and a memoryless node
 */
if (unlikely(!zonelist->_zonerefs->zone))
-   return NULL;
+   goto nopage;
 
/*
 * Will only have any effect when __GFP_KMEMCG is set.  This is
 * verified in the (always inline) callee
 */
 if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-   return NULL;
+   goto nopage;
 
 retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();
@@ -2784,10 +2795,54 @@ out:
 
memcg_kmem_commit_charge(page, memcg, order);
 
+nopage:
+   if (likely(!memory_allocation_recursion))
+   
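
The patch is truncated here in the archive. For reference, a minimal
sketch of the kind of watchdog scan loop the changelog describes follows.
This is not the author's code; malloc_watchdog() is a hypothetical name,
and unlike the description it reports every stalling task rather than
only the longest one:

static int malloc_watchdog(void *unused)
{
	struct task_struct *p;
	unsigned long now, start;

	while (!kthread_should_stop()) {
		now = jiffies;
		rcu_read_lock();
		for_each_process(p) {
			start = p->memory_allocation_start_jiffies;
			/* stamp set: this task is inside the allocator */
			if (start && time_after(now, start + 10 * HZ))
				printk(KERN_INFO "INFO: task %s:%d blocked"
				       " for %lu seconds at memory"
				       " allocation\n", p->comm, p->pid,
				       (now - start) / HZ);
		}
		rcu_read_unlock();
		schedule_timeout_interruptible(10 * HZ);
	}
	return 0;
}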

Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-06-05 Thread Dave Chinner
On Thu, Jun 05, 2014 at 09:45:26PM +0900, Tetsuo Handa wrote:
> David Rientjes wrote:
> > On Mon, 26 May 2014, Tetsuo Handa wrote:
> > 
> > > In shrink_inactive_list(), we do not insert delay at
> > > 
> > >   if (!sc->hibernation_mode && !current_is_kswapd())
> > > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > > 
> > > if sc->hibernation_mode != 0.
> > > Follow the same reason, we should not insert delay at
> > > 
> > >   while (unlikely(too_many_isolated(zone, file, sc))) {
> > > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > 
> > > /* We are about to die and free our memory. Return now. */
> > > if (fatal_signal_pending(current))
> > >   return SWAP_CLUSTER_MAX;
> > >   }
> > > 
> > > if sc->hibernation_mode != 0.
> > > 
> > > Signed-off-by: Tetsuo Handa 
> > > ---
> > >  mm/vmscan.c |3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 32c661d..89c42ca 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> > >   if (current_is_kswapd())
> > >   return 0;
> > >  
> > > + if (sc->hibernation_mode)
> > > + return 0;
> > > +
> > >   if (!global_reclaim(sc))
> > >   return 0;
> > >  
> > 
> > This isn't the only too_many_isolated() function that does a delay; how is
> > the too_many_isolated() in mm/compaction.c different?
> > 
> 
> I don't know. But today I realized that this patch is not sufficient.
> 
> I'm trying to find out why __alloc_pages_slowpath() cannot return for many
> minutes when a certain type of memory pressure is applied on a RHEL7
> environment with 4 CPU / 2GB RAM. Today I tried to use ftrace to examine the
> breakdown of time-consuming functions inside __alloc_pages_slowpath(). But
> on the first run, all processes were trapped in this
> too_many_isolated()/congestion_wait() loop while kswapd was not running,
> stalling forever because nobody could perform the operations needed to make
> too_many_isolated() return 0.
> 
> This means that, under rare circumstances, it is possible that all processes
> other than kswapd are trapped in this too_many_isolated()/congestion_wait()
> loop while kswapd is sleeping, because the loop assumes that somebody else
> will wake up kswapd and that kswapd will perform the operations needed to
> make too_many_isolated() return 0. However, we cannot guarantee that kswapd
> will be woken by somebody, nor that kswapd will not be blocked by blocking
> operations inside shrinker functions (e.g. mutex_lock()).

So what you are saying is that kswapd is having problems with
getting blocked on locks held by processes in direct reclaim?

What are the stack traces that demonstrate such a dependency loop?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-06-05 Thread Tetsuo Handa
David Rientjes wrote:
> On Mon, 26 May 2014, Tetsuo Handa wrote:
> 
> > In shrink_inactive_list(), we do not insert delay at
> > 
> >   if (!sc->hibernation_mode && !current_is_kswapd())
> > wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > 
> > if sc->hibernation_mode != 0.
> > Follow the same reason, we should not insert delay at
> > 
> >   while (unlikely(too_many_isolated(zone, file, sc))) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > 
> > /* We are about to die and free our memory. Return now. */
> > if (fatal_signal_pending(current))
> >   return SWAP_CLUSTER_MAX;
> >   }
> > 
> > if sc->hibernation_mode != 0.
> > 
> > Signed-off-by: Tetsuo Handa 
> > ---
> >  mm/vmscan.c |3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 32c661d..89c42ca 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1362,6 +1362,9 @@ static int too_many_isolated(struct zone *zone, int file,
> > if (current_is_kswapd())
> > return 0;
> >  
> > +   if (sc->hibernation_mode)
> > +   return 0;
> > +
> > if (!global_reclaim(sc))
> > return 0;
> >  
> 
> This isn't the only too_many_isolated() function that does a delay; how is
> the too_many_isolated() in mm/compaction.c different?
> 

I don't know. But today I realized that this patch is not sufficient.

I'm trying to find out why __alloc_pages_slowpath() cannot return for many
minutes when a certain type of memory pressure is applied on a RHEL7
environment with 4 CPU / 2GB RAM. Today I tried to use ftrace to examine the
breakdown of time-consuming functions inside __alloc_pages_slowpath(). But on
the first run, all processes were trapped in this
too_many_isolated()/congestion_wait() loop while kswapd was not running,
stalling forever because nobody could perform the operations needed to make
too_many_isolated() return 0.

This means that, under rare circumstances, it is possible that all processes
other than kswapd are trapped in this too_many_isolated()/congestion_wait()
loop while kswapd is sleeping, because the loop assumes that somebody else
will wake up kswapd and that kswapd will perform the operations needed to
make too_many_isolated() return 0. However, we cannot guarantee that kswapd
will be woken by somebody, nor that kswapd will not be blocked by blocking
operations inside shrinker functions (e.g. mutex_lock()).

We need some more changes. I'm thinking of a memory allocation watchdog
thread: add an "unsigned long" field to "struct task_struct", set jiffies to
the field upon entry to GFP_WAIT-able memory allocation attempts, and clear
the field upon returning from GFP_WAIT-able memory allocation attempts. A
kernel thread periodically scans the task list and compares the field with
jiffies, and (at least) prints warning messages (maybe optionally triggers
the OOM killer or a kernel panic) if a single memory allocation attempt is
taking too long (e.g. 60 seconds). What do you think?
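
For concreteness, a minimal sketch of the set/clear bookkeeping described
above; the helper names are hypothetical, nested allocations would need
the kind of recursion guard the demo patch elsewhere in this thread adds,
and that patch inlines the same logic into __alloc_pages_nodemask():

static inline void malloc_stamp_set(void)
{
	unsigned long now = jiffies;

	/* 0 means "not allocating", so store 1 if jiffies is 0. */
	if (!current->memory_allocation_start_jiffies)
		current->memory_allocation_start_jiffies = now ? now : 1;
}

static inline void malloc_stamp_clear(void)
{
	current->memory_allocation_start_jiffies = 0;
}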


RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Motohiro Kosaki


> -Original Message-
> From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
> Sent: Tuesday, May 20, 2014 11:58 PM
> To: da...@fromorbit.com; r...@redhat.com
> Cc: Motohiro Kosaki JP; fengguang...@intel.com; 
> kamezawa.hir...@jp.fujitsu.com; a...@linux-foundation.org;
> h...@infradead.org; linux-kernel@vger.kernel.org; x...@oss.sgi.com
> Subject: Re: [PATCH] mm/vmscan: Do not block forever at 
> shrink_inactive_list().
> 
> Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
> He does not like the idea of adding timeout to throttle loop. As Dave posted 
> a patch that fixes a bug in XFS delayed allocation, I
> updated my patch accordingly.
> 
> Although the bug in XFS was fixed by Dave's patch, other kernel code would 
> have bugs which would fall into this infinite throttle loop.
> But to keep the possibility of triggering OOM killer minimum, can we agree 
> with this updated patch (and in the future adding some
> warning mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting 
> memory allocation stall)?
> 
> Dave, if you are OK with this updated patch, please let me know commit ID of 
> your patch.
> 
> Regards.
> --
> >From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Tue, 20 May 2014 23:34:34 +0900
> Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when too 
> many pages are isolated already" causes RHEL7
> environment to stall with 0% CPU usage when a certain type of memory pressure 
> is given.
> This is because nobody can reclaim memory due to rules listed below.
> 
>   (a) XFS uses a kernel worker thread for delayed allocation
>   (b) kswapd wakes up the kernel worker thread for delayed allocation
>   (c) the kernel worker thread is throttled due to commit 35cd7815
> 
> This patch and commit  "xfs: block allocation work needs to be kswapd 
> aware" will solve rule (c).
> 
> Signed-off-by: Tetsuo Handa 
> ---
>  mm/vmscan.c |   20 +++-
>  1 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..5c6960e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>   struct zone *zone = lruvec_zone(lruvec);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> 
> - while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + /*
> +  * Throttle only direct reclaimers. Allocations by kswapd (and
> +  * allocation workqueue on behalf of kswapd) should not be
> +  * throttled here; otherwise memory allocation will deadlock.
> +  */
> + if (!sc->hibernation_mode && !current_is_kswapd()) {
> + while (unlikely(too_many_isolated(zone, file, sc))) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> - /* We are about to die and free our memory. Return now. */
> - if (fatal_signal_pending(current))
> - return SWAP_CLUSTER_MAX;
> + /*
> +  * We are about to die and free our memory.
> +  * Return now.
> +  */
> + if (fatal_signal_pending(current))
> + return SWAP_CLUSTER_MAX;
> + }
>   }


Acked-by: KOSAKI Motohiro 


Dave, I don't like Tetsuo's first patch, because too_many_isolated() exists
to prevent false OOM kills, and a simple timeout would resurrect them.
Please let me know if you need further MM enhancements to solve the XFS
issue. I'd like to join in and assist with this.

Thanks.


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Tetsuo Handa
Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
He does not like the idea of adding timeout to throttle loop. As Dave
posted a patch that fixes a bug in XFS delayed allocation, I updated
my patch accordingly.

Although the bug in XFS was fixed by Dave's patch, other kernel code may
have bugs which fall into this infinite throttle loop. But to keep the
possibility of triggering the OOM killer to a minimum, can we agree on
this updated patch (and, in the future, on adding some warning mechanism
like /proc/sys/kernel/hung_task_timeout_secs for detecting memory
allocation stalls)?

Dave, if you are OK with this updated patch, please let me know
commit ID of your patch.

Regards.
--
From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Tue, 20 May 2014 23:34:34 +0900
Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().

I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
too many pages are isolated already" causes a RHEL7 environment to stall
with 0% CPU usage when a certain type of memory pressure is applied.
This is because nobody can reclaim memory due to rules listed below.

  (a) XFS uses a kernel worker thread for delayed allocation
  (b) kswapd wakes up the kernel worker thread for delayed allocation
  (c) the kernel worker thread is throttled due to commit 35cd7815

This patch and commit  "xfs: block allocation work needs to be
kswapd aware" will solve rule (c).

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 mm/vmscan.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..5c6960e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
	struct zone *zone = lruvec_zone(lruvec);
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-   while (unlikely(too_many_isolated(zone, file, sc))) {
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
+   /*
+* Throttle only direct reclaimers. Allocations by kswapd (and
+* allocation workqueue on behalf of kswapd) should not be
+* throttled here; otherwise memory allocation will deadlock.
+*/
+   if (!sc->hibernation_mode && !current_is_kswapd()) {
+   while (unlikely(too_many_isolated(zone, file, sc))) {
+   congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-   /* We are about to die and free our memory. Return now. */
-   if (fatal_signal_pending(current))
-   return SWAP_CLUSTER_MAX;
+   /*
+* We are about to die and free our memory.
+* Return now.
+*/
+   if (fatal_signal_pending(current))
+   return SWAP_CLUSTER_MAX;
+   }
}
 
lru_add_drain();
-- 
1.7.1


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Christoph Hellwig
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path.  If it is called
by kswapd it should act as if it were kswapd for all purposes.



Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Dave Chinner
On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote:
> On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <a...@linux-foundation.org> wrote:
> 
> > On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <da...@fromorbit.com> wrote:
> > 
> > > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> > >   struct xfs_bmalloca *args = container_of(work,
> > >   struct xfs_bmalloca, work);
> > >   unsigned long   pflags;
> > > + unsigned long   new_pflags = PF_FSTRANS;
> > >  
> > > - /* we are in a transaction context here */
> > > - current_set_flags_nested(&pflags, PF_FSTRANS);
> > > + /*
> > > +  * we are in a transaction context here, but may also be doing work
> > > +  * in kswapd context, and hence we may need to inherit that state
> > > +  * temporarily to ensure that we don't block waiting for memory reclaim
> > > +  * in any way.
> > > +  */
> > > + if (args->kswapd)
> > > + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > 
> > So current_is_kswapd() returns true for a thread which is not kswapd. 
> > That's a bit smelly.
> > 
> > Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> > PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> > only one(?) of which you actually want.
> > 
> > It would be cleaner to create a new PF_ flag to select just that
> > behavior.  That's a better model than telling the world "I am magic and
> > special".
> 
> Or a new __GFP_FLAG.

Sure - and with that XFS will need another PF_ flag to tell the
memory allocator to set the new __GFP_FLAG on every allocation done
in that kworker task context, just like it uses PF_FSTRANS to ensure
that __GFP_NOFS is set for all the allocations in that kworker
context
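
As an aside, the pattern described here is the one XFS already relies on
for PF_FSTRANS: a per-task flag that makes every allocation performed in
that task context inherit a GFP modifier. A hedged sketch, where
task_gfp_context(), PF_KSWAPD_WORKER and __GFP_KSWAPD are hypothetical
names for the case being discussed:

static inline gfp_t task_gfp_context(gfp_t gfp)
{
	if (current->flags & PF_FSTRANS)	/* in an fs transaction */
		gfp &= ~__GFP_FS;
	if (current->flags & PF_KSWAPD_WORKER)	/* hypothetical */
		gfp |= __GFP_KSWAPD;		/* hypothetical */
	return gfp;
}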

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Dave Chinner
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <da...@fromorbit.com> wrote:
> 
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> > struct xfs_bmalloca *args = container_of(work,
> > struct xfs_bmalloca, work);
> > unsigned long   pflags;
> > +   unsigned long   new_pflags = PF_FSTRANS;
> >  
> > -   /* we are in a transaction context here */
> > -   current_set_flags_nested(&pflags, PF_FSTRANS);
> > +   /*
> > +* we are in a transaction context here, but may also be doing work
> > +* in kswapd context, and hence we may need to inherit that state
> > +* temporarily to ensure that we don't block waiting for memory reclaim
> > +* in any way.
> > +*/
> > +   if (args->kswapd)
> > +   new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> 
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.

It's doing work on behalf of kswapd under kswapd constraints, so it
should both behave and be accounted as if kswapd was executing the
work directly.
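
For reference, current_is_kswapd() is essentially just a PF_KSWAPD test
(see include/linux/swap.h), which is why inheriting the flag makes the
worker behave and be accounted as kswapd throughout reclaim:

static inline int current_is_kswapd(void)
{
	return current->flags & PF_KSWAPD;
}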

> It would be cleaner to create a new PF_ flag to select just that
> behavior.  That's a better model than telling the world "I am magic and
> special".

However, it is magic and special because of who the work needs to be
done for.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.

I don't really care enough argue over what mechanism should be used.
I'll push this patch through the XFS tree, and when a new flag or
generic mechanism for pushing task contexts to kworker threads
is provided, I'll change the XFS code to use that

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Andrew Morton
On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton <a...@linux-foundation.org> wrote:

> On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <da...@fromorbit.com> wrote:
> 
> > @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
> > struct xfs_bmalloca *args = container_of(work,
> > struct xfs_bmalloca, work);
> > unsigned long   pflags;
> > +   unsigned long   new_pflags = PF_FSTRANS;
> >  
> > -   /* we are in a transaction context here */
> > -   current_set_flags_nested(&pflags, PF_FSTRANS);
> > +   /*
> > +* we are in a transaction context here, but may also be doing work
> > +* in kswapd context, and hence we may need to inherit that state
> > +* temporarily to ensure that we don't block waiting for memory reclaim
> > +* in any way.
> > +*/
> > +   if (args->kswapd)
> > +   new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> 
> So current_is_kswapd() returns true for a thread which is not kswapd. 
> That's a bit smelly.
> 
> Should this thread really be incrementing KSWAPD_INODESTEAL instead of
> PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
> only one(?) of which you actually want.
> 
> It would be cleaner to create a new PF_ flag to select just that
> behavior.  That's a better model than telling the world "I am magic and
> special".

Or a new __GFP_FLAG.

> But we're awfully close to running out of PF_ space and I don't know if
> this ugly justifies consuming a flag.
> 


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Andrew Morton
On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner <da...@fromorbit.com> wrote:

> @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
>   struct xfs_bmalloca *args = container_of(work,
>   struct xfs_bmalloca, work);
>   unsigned long   pflags;
> + unsigned long   new_pflags = PF_FSTRANS;
>  
> - /* we are in a transaction context here */
> - current_set_flags_nested(&pflags, PF_FSTRANS);
> + /*
> +  * we are in a transaction context here, but may also be doing work
> +  * in kswapd context, and hence we may need to inherit that state
> +  * temporarily to ensure that we don't block waiting for memory reclaim
> +  * in any way.
> +  */
> + if (args->kswapd)
> + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

So current_is_kswapd() returns true for a thread which is not kswapd. 
That's a bit smelly.

Should this thread really be incrementing KSWAPD_INODESTEAL instead of
PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
only one(?) of which you actually want.

It would be cleaner to create a new PF_ flag to select just that
behavior.  That's a better model than telling the world "I am magic and
special".

But we're awfully close to running out of PF_ space and I don't know if
this ugly justifies consuming a flag.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Andrew Morton
On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner da...@fromorbit.com wrote:

 @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
   struct xfs_bmalloca *args = container_of(work,
   struct xfs_bmalloca, work);
   unsigned long   pflags;
 + unsigned long   new_pflags = PF_FSTRANS;
  
 - /* we are in a transaction context here */
 - current_set_flags_nested(pflags, PF_FSTRANS);
 + /*
 +  * we are in a transaction context here, but may also be doing work
 +  * in kswapd context, and hence we may need to inherit that state
 +  * temporarily to ensure that we don't block waiting for memory reclaim
 +  * in any way.
 +  */
 + if (args-kswapd)
 + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

So current_is_kswapd() returns true for a thread which is not kswapd. 
That's a bit smelly.

Should this thread really be incrementing KSWAPD_INODESTEAL instead of
PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
only one(?) of which you actually want.

It would be cleaner to create a new PF_ flag to select just that
behavior.  That's a better model than telling the world I am magic and
special.

But we're awfully close to running out of PF_ space and I don't know if
this ugly justifies consuming a flag.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Andrew Morton
On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton a...@linux-foundation.org 
wrote:

 On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner da...@fromorbit.com wrote:
 
  @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
  struct xfs_bmalloca *args = container_of(work,
  struct xfs_bmalloca, work);
  unsigned long   pflags;
  +   unsigned long   new_pflags = PF_FSTRANS;
   
  -   /* we are in a transaction context here */
  -   current_set_flags_nested(pflags, PF_FSTRANS);
  +   /*
  +* we are in a transaction context here, but may also be doing work
  +* in kswapd context, and hence we may need to inherit that state
  +* temporarily to ensure that we don't block waiting for memory reclaim
  +* in any way.
  +*/
  +   if (args-kswapd)
  +   new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 
 So current_is_kswapd() returns true for a thread which is not kswapd. 
 That's a bit smelly.
 
 Should this thread really be incrementing KSWAPD_INODESTEAL instead of
 PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
 only one(?) of which you actually want.
 
 It would be cleaner to create a new PF_ flag to select just that
 behavior.  That's a better model than telling the world I am magic and
 special.

Or a new __GFP_FLAG.

 But we're awfully close to running out of PF_ space and I don't know if
 this ugly justifies consuming a flag.
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Dave Chinner
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
 On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner da...@fromorbit.com wrote:
 
  @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
  struct xfs_bmalloca *args = container_of(work,
  struct xfs_bmalloca, work);
  unsigned long   pflags;
  +   unsigned long   new_pflags = PF_FSTRANS;
   
  -   /* we are in a transaction context here */
  -   current_set_flags_nested(pflags, PF_FSTRANS);
  +   /*
  +* we are in a transaction context here, but may also be doing work
  +* in kswapd context, and hence we may need to inherit that state
  +* temporarily to ensure that we don't block waiting for memory reclaim
  +* in any way.
  +*/
  +   if (args-kswapd)
  +   new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 
 So current_is_kswapd() returns true for a thread which is not kswapd. 
 That's a bit smelly.
 
 Should this thread really be incrementing KSWAPD_INODESTEAL instead of
 PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
 only one(?) of which you actually want.

It's doing work on behalf of kswapd under kswapd constraints, so it
should both behave and be accounted as if kswapd was executing the
work directly.

 It would be cleaner to create a new PF_ flag to select just that
 behavior.  That's a better model than telling the world I am magic and
 special.

However, it is magic and special because of who the work needs to be
done for.

 But we're awfully close to running out of PF_ space and I don't know if
 this ugly justifies consuming a flag.

I don't really care enough argue over what mechanism should be used.
I'll push this patch through the XFS tree, and when a new flag or
generic mechanism for pushing task contexts to kworker threads
is provided, I'll change the XFS code to use that

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Dave Chinner
On Mon, May 19, 2014 at 11:03:11PM -0700, Andrew Morton wrote:
 On Mon, 19 May 2014 22:59:15 -0700 Andrew Morton a...@linux-foundation.org 
 wrote:
 
  On Tue, 20 May 2014 10:44:49 +1000 Dave Chinner da...@fromorbit.com wrote:
  
   @@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
 struct xfs_bmalloca *args = container_of(work,
 struct xfs_bmalloca, work);
 unsigned long   pflags;
   + unsigned long   new_pflags = PF_FSTRANS;

   - /* we are in a transaction context here */
   - current_set_flags_nested(pflags, PF_FSTRANS);
   + /*
   +  * we are in a transaction context here, but may also be doing work
   +  * in kswapd context, and hence we may need to inherit that state
   +  * temporarily to ensure that we don't block waiting for memory reclaim
   +  * in any way.
   +  */
   + if (args-kswapd)
   + new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
  
  So current_is_kswapd() returns true for a thread which is not kswapd. 
  That's a bit smelly.
  
  Should this thread really be incrementing KSWAPD_INODESTEAL instead of
  PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
  only one(?) of which you actually want.
  
  It would be cleaner to create a new PF_ flag to select just that
  behavior.  That's a better model than telling the world I am magic and
  special.
 
 Or a new __GFP_FLAG.

Sure - and with that XFS will need another PF_ flag to tell the
memory allocator to set the new __GFP_FLAG on every allocation done
in that kworker task context, just like it uses PF_FSTRANS to ensure
that __GFP_NOFS is set for all the allocations in that kworker
context

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Christoph Hellwig
On Mon, May 19, 2014 at 10:59:15PM -0700, Andrew Morton wrote:
 So current_is_kswapd() returns true for a thread which is not kswapd. 
 That's a bit smelly.
 
 Should this thread really be incrementing KSWAPD_INODESTEAL instead of
 PGINODESTEAL, for example?  current_is_kswapd() does a range of things,
 only one(?) of which you actually want.

Actually we want all of them. The allocation workqueue is a workaround
for the incredible stack usage in the Linux I/O path.  If it is called
by kswapd it should act as if it were kswapd for all purposes.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Tetsuo Handa
Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
He does not like the idea of adding timeout to throttle loop. As Dave
posted a patch that fixes a bug in XFS delayed allocation, I updated
my patch accordingly.

Although the bug in XFS was fixed by Dave's patch, other kernel code
would have bugs which would fall into this infinite throttle loop.
But to keep the possibility of triggering OOM killer minimum, can we
agree with this updated patch (and in the future adding some warning
mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting
memory allocation stall)?

Dave, if you are OK with this updated patch, please let me know
commit ID of your patch.

Regards.
--
From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
Date: Tue, 20 May 2014 23:34:34 +0900
Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().

I can observe that commit 35cd7815 vmscan: throttle direct reclaim when
too many pages are isolated already causes RHEL7 environment to stall
with 0% CPU usage when a certain type of memory pressure is given.
This is because nobody can reclaim memory due to rules listed below.

  (a) XFS uses a kernel worker thread for delayed allocation
  (b) kswapd wakes up the kernel worker thread for delayed allocation
  (c) the kernel worker thread is throttled due to commit 35cd7815

This patch and commit  xfs: block allocation work needs to be
kswapd aware will solve rule (c).

Signed-off-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
---
 mm/vmscan.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d..5c6960e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
struct zone *zone = lruvec_zone(lruvec);
struct zone_reclaim_stat *reclaim_stat = lruvec-reclaim_stat;
 
-   while (unlikely(too_many_isolated(zone, file, sc))) {
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
+   /*
+* Throttle only direct reclaimers. Allocations by kswapd (and
+* allocation workqueue on behalf of kswapd) should not be
+* throttled here; otherwise memory allocation will deadlock.
+*/
+   if (!sc-hibernation_mode  !current_is_kswapd()) {
+   while (unlikely(too_many_isolated(zone, file, sc))) {
+   congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-   /* We are about to die and free our memory. Return now. */
-   if (fatal_signal_pending(current))
-   return SWAP_CLUSTER_MAX;
+   /*
+* We are about to die and free our memory.
+* Return now.
+*/
+   if (fatal_signal_pending(current))
+   return SWAP_CLUSTER_MAX;
+   }
}
 
lru_add_drain();
-- 
1.7.1


RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Motohiro Kosaki


 -Original Message-
 From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
 Sent: Tuesday, May 20, 2014 11:58 PM
 To: da...@fromorbit.com; r...@redhat.com
 Cc: Motohiro Kosaki JP; fengguang...@intel.com; 
 kamezawa.hir...@jp.fujitsu.com; a...@linux-foundation.org;
 h...@infradead.org; linux-kernel@vger.kernel.org; x...@oss.sgi.com
 Subject: Re: [PATCH] mm/vmscan: Do not block forever at 
 shrink_inactive_list().
 
 Today I discussed this issue with Kosaki-san at LinuxCon Japan 2014.
 He does not like the idea of adding a timeout to the throttle loop. As Dave
 posted a patch that fixes a bug in XFS delayed allocation, I updated my
 patch accordingly.
 
 Although the bug in XFS was fixed by Dave's patch, other kernel code may
 have bugs that would fall into this infinite throttle loop. But to keep
 the possibility of triggering the OOM killer to a minimum, can we agree
 on this updated patch (and, in the future, on adding some warning
 mechanism like /proc/sys/kernel/hung_task_timeout_secs for detecting
 memory allocation stalls)?
 
 Dave, if you are OK with this updated patch, please let me know the
 commit ID of your patch.
 
 Regards.
 --
 From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
 From: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
 Date: Tue, 20 May 2014 23:34:34 +0900
 Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().
 
 I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
 too many pages are isolated already" causes a RHEL7 environment to stall
 with 0% CPU usage when a certain type of memory pressure is applied.
 This is because nobody can reclaim memory due to the rules listed below.
 
   (a) XFS uses a kernel worker thread for delayed allocation
   (b) kswapd wakes up the kernel worker thread for delayed allocation
   (c) the kernel worker thread is throttled due to commit 35cd7815
 
 This patch and commit "xfs: block allocation work needs to be kswapd
 aware" will solve rule (c).
 
 Signed-off-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
 ---
  mm/vmscan.c |   20 +++-
  1 files changed, 15 insertions(+), 5 deletions(-)
 
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 32c661d..5c6960e 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 - while (unlikely(too_many_isolated(zone, file, sc))) {
 - congestion_wait(BLK_RW_ASYNC, HZ/10);
 + /*
 +  * Throttle only direct reclaimers. Allocations by kswapd (and
 +  * allocation workqueue on behalf of kswapd) should not be
 +  * throttled here; otherwise memory allocation will deadlock.
 +  */
 +	if (!sc->hibernation_mode && !current_is_kswapd()) {
 + while (unlikely(too_many_isolated(zone, file, sc))) {
 + congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 - /* We are about to die and free our memory. Return now. */
 - if (fatal_signal_pending(current))
 - return SWAP_CLUSTER_MAX;
 + /*
 +  * We are about to die and free our memory.
 +  * Return now.
 +  */
 + if (fatal_signal_pending(current))
 + return SWAP_CLUSTER_MAX;
 + }
   }


Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com


Dave, I don't like Tetsuo's first patch because the too_many_isolated check
exists to prevent false OOM kills, so a simple timeout would resurrect them.
Please let me know if you need further MM enhancements to solve the XFS
issue. I'd like to join in and assist.

Thanks.
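
For context, the check Kosaki-san refers to looked roughly like this in
mm/vmscan.c at the time (a simplified from-memory sketch; the real
function also grants some slack depending on the allocation's GFP flags):

static int too_many_isolated(struct zone *zone, int file,
			     struct scan_control *sc)
{
	unsigned long inactive, isolated;

	/* kswapd and memcg reclaim are never throttled here */
	if (current_is_kswapd())
		return 0;
	if (!global_reclaim(sc))
		return 0;

	if (file) {
		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
	} else {
		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
	}

	/*
	 * Throttle direct reclaim once more pages are isolated than
	 * remain on the inactive list; without this, a herd of direct
	 * reclaimers could empty the LRU and declare OOM even though
	 * reclaim is still making progress.
	 */
	return isolated > inactive;
}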


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-19 Thread Dave Chinner
On Tue, May 20, 2014 at 12:54:29PM +0900, Tetsuo Handa wrote:
> Dave Chinner wrote:
> > So, XFS should be passing kswapd context to the workqueue allocation
> > context. The patch below does this.
> > 
> > Tetsuo-san, when it comes to problems involving XFS, you should
> > really CC x...@oss.sgi.com because very few people really know how
> > XFS works and even fewer still know how it is supposed to interact
> > with memory reclaim
> 
> Thank you for the patch, but ...
> 
>   #define PF_KSWAPD   0x00040000  /* I am kswapd */
> 
>   static inline int current_is_kswapd(void)
>   {
>   return current->flags & PF_KSWAPD;
>   }

> I think ((char) (current->flags & 0x00040000)) == 0.
> Your patch wants
> 
>   -args->kswapd = current_is_kswapd();
>   +args->kswapd = (current_is_kswapd() != 0);

Thanks for pointing that out, but I think:

-static inline int current_is_kswapd(void)
+static inline bool current_is_kswapd(void)

is a better solution. It can only be true or false.

But regardless, I need to change the boolean options in that XFS
structure to be, well, booleans.

Cheers,

dave.

-- 
Dave Chinner
da...@fromorbit.com
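
To see why the bool return type is sufficient: conversion to bool
normalizes any nonzero value to true, whereas assignment into a
char-sized field keeps only the low byte. A minimal userspace
demonstration (the PF_KSWAPD value is taken from the kernel headers of
the time; the char variable stands in for the XFS structure member):

#include <stdbool.h>
#include <stdio.h>

#define PF_KSWAPD 0x00040000	/* include/linux/sched.h */

int main(void)
{
	unsigned int flags = PF_KSWAPD;		/* current->flags with the bit set */
	char as_char = flags & PF_KSWAPD;	/* low byte of 0x00040000 is 0 */
	bool as_bool = flags & PF_KSWAPD;	/* any nonzero value becomes 1 */

	/* prints "char: 0, bool: 1" */
	printf("char: %d, bool: %d\n", as_char, as_bool);
	return 0;
}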


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-19 Thread Tetsuo Handa
Dave Chinner wrote:
> So, XFS should be passing kswapd context to the workqueue allocation
> context. The patch below does this.
> 
> Tetsuo-san, when it comes to problems involving XFS, you should
> really CC x...@oss.sgi.com because very few people really know how
> XFS works and even fewer still know how it is supposed to interact
> with memory reclaim

Thank you for the patch, but ...

  #define PF_KSWAPD   0x00040000  /* I am kswapd */

  static inline int current_is_kswapd(void)
  {
return current->flags & PF_KSWAPD;
  }

I think ((char) (current->flags & 0x00040000)) == 0.
Your patch wants

  -args->kswapd = current_is_kswapd();
  +args->kswapd = (current_is_kswapd() != 0);


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-19 Thread Jianyu Zhan
On Mon, May 19, 2014 at 10:40 PM, Tetsuo Handa wrote:
>
> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30-second timeout for userspace processes and a 5-second
> timeout for kernel threads.
>
> Signed-off-by: Tetsuo Handa 
> ---
>  mm/vmscan.c |7 ++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..3eeeda6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1459,13 +1459,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> int file = is_file_lru(lru);
> struct zone *zone = lruvec_zone(lruvec);
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +   int i = 0;
>
> -   while (unlikely(too_many_isolated(zone, file, sc))) {
> +   /* Throttle with timeout. */
> +   while (unlikely(too_many_isolated(zone, file, sc)) && i++ < 300) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> return SWAP_CLUSTER_MAX;
> +   /* Kernel threads should not be blocked for too long. */
> +   if (i == 50 && (current->flags & PF_KTHREAD))
> +   break;
> }
>
> lru_add_drain();

Hi, Tetsuo Handa

I think it would be good to use a macro for this magic number instead of
hard-coding it, from a long-term maintainability point of view, along with
appropriate documentation.

Thanks,
Jianyu Zhan
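
For example (the macro names below are hypothetical, since the patch as
posted hard-codes 300 and 50; each congestion_wait() call sleeps for up
to HZ/10, i.e. roughly 100ms):

#define ISOLATION_THROTTLE_MAX_RETRIES		300	/* ~30 seconds */
#define ISOLATION_THROTTLE_KTHREAD_RETRIES	50	/* ~5 seconds */

	while (unlikely(too_many_isolated(zone, file, sc)) &&
	       i++ < ISOLATION_THROTTLE_MAX_RETRIES) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;

		/* Kernel threads should not be blocked for too long. */
		if (i == ISOLATION_THROTTLE_KTHREAD_RETRIES &&
		    (current->flags & PF_KTHREAD))
			break;
	}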


Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-19 Thread Dave Chinner
[cc x...@oss.sgi.com]

On Mon, May 19, 2014 at 11:40:46PM +0900, Tetsuo Handa wrote:
> >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes a RHEL7 environment to stall
> with 0% CPU usage when a certain type of memory pressure is applied.
> 
> Upon memory pressure, kswapd calls xfs_vm_writepage() from shrink_page_list().
> xfs_vm_writepage() eventually calls wait_for_completion() which waits for
> xfs_bmapi_allocate_worker().
> 
> Then, a kernel worker thread calls xfs_bmapi_allocate_worker() and
> xfs_bmapi_allocate_worker() eventually calls xfs_btree_lookup_get_block().
> xfs_btree_lookup_get_block() eventually calls alloc_page().
> alloc_page() eventually calls shrink_inactive_list().
> 
> The stack trace showed that the kernel worker thread which the kswapd is
> waiting for was looping at a while loop in shrink_inactive_list().

[snip stack traces indicating XFS is deferring allocation work to the
allocation workqueue on behalf of kswapd]

> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30-second timeout for userspace processes and a 5-second
> timeout for kernel threads.

That's like trying to swat a fly with a runaway train. Eventually
you'll hit the fly with the train

Q: What is missing from the workqueue context that results
in the workqueue being blocked in memory reclaim on congestion?

Hint: XFS is deferring allocation on behalf of kswapd to a
workqueue.

A: PF_KSWAPD.  shrink_inactive_list() has this guard:

/*
 * Stall direct reclaim for IO completions if underlying BDIs or zone
 * is congested. Allow kswapd to continue until it starts encountering
 * unqueued dirty pages or cycling through the LRU too quickly.
 */
if (!sc->hibernation_mode && !current_is_kswapd())
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);


So, XFS should be passing kswapd context to the workqueue allocation
context. The patch below does this.

Tetsuo-san, when it comes to problems involving XFS, you should
really CC x...@oss.sgi.com because very few people really know how
XFS works and even fewer still know how it is supposed to interact
with memory reclaim

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


xfs: block allocation work needs to be kswapd aware

From: Dave Chinner 

Upon memory pressure, kswapd calls xfs_vm_writepage() from
shrink_page_list(). This can result in delayed allocation occurring,
and that gets deferred to the allocation workqueue.

The allocation then runs outside kswapd context, which means if it
needs memory (and it does to demand page metadata from disk) it can
block in shrink_inactive_list() waiting for IO congestion. These
blocking waits are normally avoided in kswapd context, so under
memory pressure writeback from kswapd can be arbitrarily delayed by
memory reclaim.

To avoid this, pass the kswapd context to the allocation being done
by the workqueue, so that memory reclaim understands correctly that
the work is being done for kswapd and therefore it is not blocked
and does not delay memory reclaim.

Reported-by: Tetsuo Handa 
Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_bmap_util.c | 16 +---
 fs/xfs/xfs_bmap_util.h |  1 +
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 057f671..703b3ec 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -258,14 +258,23 @@ xfs_bmapi_allocate_worker(
struct xfs_bmalloca *args = container_of(work,
struct xfs_bmalloca, work);
unsigned long   pflags;
+   unsigned long   new_pflags = PF_FSTRANS;
 
-	/* we are in a transaction context here */
-	current_set_flags_nested(&pflags, PF_FSTRANS);
+   /*
+* we are in a transaction context here, but may also be doing work
+* in kswapd context, and hence we may need to inherit that state
+* temporarily to ensure that we don't block waiting for memory reclaim
+* in any way.
+*/
+   if (args->kswapd)
+   new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
+
+	current_set_flags_nested(&pflags, new_pflags);
 
args->result = __xfs_bmapi_allocate(args);
complete(args->done);
 
-	current_restore_flags_nested(&pflags, PF_FSTRANS);
+	current_restore_flags_nested(&pflags, new_pflags);
 }
 
 /*
@@ -284,6 +293,7 @@ xfs_bmapi_allocate(
 
 
	args->done = &done;
+   args->kswapd = current_is_kswapd();

Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-19 Thread Rik van Riel
On 05/19/2014 10:40 AM, Tetsuo Handa wrote:
> >From f016db5d7f84d6321132150b13c5888ef67d694f Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Mon, 19 May 2014 23:24:11 +0900
> Subject: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes a RHEL7 environment to stall
> with 0% CPU usage when a certain type of memory pressure is applied.

> Since the kernel worker thread needs to escape from the while loop so that
> alloc_page() can allocate memory (and eventually allow xfs_vm_writepage()
> to release memory), I think that we should not block forever. This patch
> introduces a 30-second timeout for userspace processes and a 5-second
> timeout for kernel threads.

I suspect the fundamental problem is that other processes can
drive the system back to the limit, and unlucky processes can
end up starved forever.

This makes your fix (or something like it) necessary.

> Signed-off-by: Tetsuo Handa 

Acked-by: Rik van Riel 

-- 
All rights reversed