Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-09 Thread Mel Gorman
On Tue, Mar 07, 2017 at 11:56:31AM -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> > On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > > order, int classzone_idx)
> > >* Raise priority if scanning rate is too low or there was no
> > >* progress in reclaiming pages
> > >*/
> > > - if (raise_priority || !sc.nr_reclaimed)
> > > + nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > > + if (raise_priority || !nr_reclaimed)
> > >   sc.priority--;
> > >   } while (sc.priority >= 1);
> > >  
> > 
> > I would rather not play with the sc state here. From a quick look at
> > least 
> > /*
> >  * Fragmentation may mean that the system cannot be rebalanced for
> >  * high-order allocations. If twice the allocation size has been
> >  * reclaimed then recheck watermarks only at order-0 to prevent
> >  * excessive reclaim. Assume that a process requested a high-order
> >  * can direct reclaim/compact.
> >  */
> > if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> > sc->order = 0;
> > 
> > does rely on the value. Wouldn't something like the following be safer?
> 
> Well, what behavior is correct, though? This check looks like an
> argument *against* resetting sc.nr_reclaimed.
> 
> If kswapd is woken up for a higher order, this check sets a reclaim
> cutoff beyond which it should give up on the order and balance for 0.
> 
> That threshold is scoped to the kswapd invocation as a whole. Applying it
> to the outcome of just the preceding priority seems like a mistake.
> 
> Mel? Vlastimil?

I cannot say which is definitely the correct behaviour. The current
behaviour is conservative due to historical concerns about kswapd
reclaiming the world. The hazard as I see it is that resetting it *may*
lead to more aggressive reclaim for high-order allocations. That may be a
welcome outcome to those who really want high-order pages and unwelcome
to others who prefer pages to remain resident.

However, in this case it's a tight window and problems would be tricky to
detect. THP allocations won't trigger the behaviour, and with vmalloc'd
stacks I'd expect that only SLUB-intensive workloads using high-order
pages would trigger any adverse behaviour. While I'm mildly concerned, I
would be a little surprised if it actually caused runaway reclaim.

-- 
Mel Gorman
SUSE Labs
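
For reference, the cutoff being discussed compares sc->nr_reclaimed against
compact_gap(order), which in the 4.10-era mm/internal.h is simply twice the
allocation size in pages (2UL << order). The small userspace model below is a
sketch with made-up per-priority numbers, not kernel code; it only shows why a
cumulative count reaches the order-0 fallback sooner than a per-priority count:

	#include <stdio.h>

	/* Mirrors the 4.10-era compact_gap(): twice the allocation size in pages. */
	static unsigned long compact_gap(unsigned int order)
	{
		return 2UL << order;
	}

	int main(void)
	{
		unsigned long reclaimed[] = { 300, 400, 500 };	/* per-priority results */
		unsigned int order = 9;				/* hugepage-sized request */
		unsigned long cumulative = 0;

		for (int i = 0; i < 3; i++) {
			cumulative += reclaimed[i];
			/*
			 * The cumulative count crosses the 1024-page gap (2 << 9)
			 * on the third priority and drops to order-0 rebalancing;
			 * with a per-priority reset, no single step reaches the gap.
			 */
			if (order && cumulative >= compact_gap(order))
				order = 0;
		}
		printf("final order: %u, order-9 gap: %lu pages\n", order, compact_gap(9));
		return 0;
	}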


Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-07 Thread Johannes Weiner
On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> > int classzone_idx)
> >  * Raise priority if scanning rate is too low or there was no
> >  * progress in reclaiming pages
> >  */
> > -   if (raise_priority || !sc.nr_reclaimed)
> > +   nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > +   if (raise_priority || !nr_reclaimed)
> > sc.priority--;
> > } while (sc.priority >= 1);
> >  
> 
> I would rather not play with the sc state here. From a quick look at
> least 
>   /*
>* Fragmentation may mean that the system cannot be rebalanced for
>* high-order allocations. If twice the allocation size has been
>* reclaimed then recheck watermarks only at order-0 to prevent
>* excessive reclaim. Assume that a process requested a high-order
>* can direct reclaim/compact.
>*/
>   if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
>   sc->order = 0;
> 
> does rely on the value. Wouldn't something like the following be safer?

Well, what behavior is correct, though? This check looks like an
argument *against* resetting sc.nr_reclaimed.

If kswapd is woken up for a higher order, this check sets a reclaim
cutoff beyond which it should give up on the order and balance for 0.

That threshold is scoped to the kswapd invocation as a whole. Applying it
to the outcome of just the preceding priority seems like a mistake.

Mel? Vlastimil?


Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-07 Thread Michal Hocko
On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
[...]
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim 
> Signed-off-by: Johannes Weiner 
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>   count_vm_event(PAGEOUTRUN);
>  
>   do {
> + unsigned long nr_reclaimed = sc.nr_reclaimed;
>   bool raise_priority = true;
>  
> - sc.nr_reclaimed = 0;
>   sc.reclaim_idx = classzone_idx;
>  
>   /*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>* Raise priority if scanning rate is too low or there was no
>* progress in reclaiming pages
>*/
> - if (raise_priority || !sc.nr_reclaimed)
> + nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> + if (raise_priority || !nr_reclaimed)
>   sc.priority--;
>   } while (sc.priority >= 1);
>  

I would rather not play with the sc state here. From a quick look at
least 
/*
 * Fragmentation may mean that the system cannot be rebalanced for
 * high-order allocations. If twice the allocation size has been
 * reclaimed then recheck watermarks only at order-0 to prevent
 * excessive reclaim. Assume that a process requested a high-order
 * can direct reclaim/compact.
 */
if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
sc->order = 0;

does rely on the value. Wouldn't something like the following be safer?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..b731f24fed12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3183,6 +3183,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
.may_unmap = 1,
.may_swap = 1,
};
+   bool reclaimable = false;
count_vm_event(PAGEOUTRUN);
 
do {
@@ -3274,6 +3275,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
if (try_to_freeze() || kthread_should_stop())
break;
 
+   if (sc.nr_reclaimed)
+   reclaimable = true;
+
/*
 * Raise priority if scanning rate is too low or there was no
 * progress in reclaiming pages
@@ -3282,7 +3286,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
sc.priority--;
} while (sc.priority >= 1);
 
-   if (!sc.nr_reclaimed)
+   if (!reclaimable)
pgdat->kswapd_failures++;
 
 out:
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-06 Thread Minchan Kim
On Mon, Mar 06, 2017 at 11:24:10AM -0500, Johannes Weiner wrote:
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > > > > order, int classzone_idx)
> > > > >   sc.priority--;
> > > > >   } while (sc.priority >= 1);
> > > > >  
> > > > > + if (!sc.nr_reclaimed)
> > > > > + pgdat->kswapd_failures++;
> > > > 
> > > > sc.nr_reclaimed is reset to zero at the beginning of the big loop
> > > > above, so most of the time pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim 
> Signed-off-by: Johannes Weiner 
Acked-by: Minchan Kim 




Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-06 Thread Hillf Danton

On March 07, 2017 12:24 AM Johannes Weiner wrote: 
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > > > > order, int classzone_idx)
> > > > >   sc.priority--;
> > > > >   } while (sc.priority >= 1);
> > > > >
> > > > > + if (!sc.nr_reclaimed)
> > > > > + pgdat->kswapd_failures++;
> > > >
> > > > sc.nr_reclaimed is reset to zero at the beginning of the big loop
> > > > above, so most of the time pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim 
> Signed-off-by: Johannes Weiner 
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>   count_vm_event(PAGEOUTRUN);
> 
>   do {
> + unsigned long nr_reclaimed = sc.nr_reclaimed;
>   bool raise_priority = true;
> 
> - sc.nr_reclaimed = 0;

This has another effect: we'll reclaim fewer pages than we currently
do when balancing for a high-order request. It looks worth including
that info in the log message as well.
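
To put rough numbers on it (assuming compact_gap() is still 2 << order): for
an order-9 request the cutoff is 1024 pages. With a cumulative count,
reclaiming e.g. 400 pages at each of three priorities (1200 total) drops
sc->order to 0 and lets kswapd settle for order-0 watermarks, whereas with
the per-priority reset the same run never hits the cutoff because no single
priority reclaims 1024 pages on its own -- hence the smaller total reclaim
for high-order requests noted above.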

>   sc.reclaim_idx = classzone_idx;
> 
>   /*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>* Raise priority if scanning rate is too low or there was no
>* progress in reclaiming pages
>*/
> - if (raise_priority || !sc.nr_reclaimed)
> + nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> + if (raise_priority || !nr_reclaimed)
>   sc.priority--;
>   } while (sc.priority >= 1);
> 
> --
> 2.11.1



Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-06 Thread Johannes Weiner
On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > > > order, int classzone_idx)
> > > > sc.priority--;
> > > > } while (sc.priority >= 1);
> > > >  
> > > > +   if (!sc.nr_reclaimed)
> > > > +   pgdat->kswapd_failures++;
> > > 
> > > sc.nr_reclaimed is reset to zero at the beginning of the big loop
> > > above, so most of the time pgdat->kswapd_failures is increased.

That wasn't intentional; I didn't see the sc.nr_reclaimed reset.

---

From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Mon, 6 Mar 2017 10:53:59 -0500
Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix

Check kswapd failure against the cumulative nr_reclaimed count, not
against the count from the lowest priority iteration.

Suggested-by: Minchan Kim 
Signed-off-by: Johannes Weiner 
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddcff8a11c1e..b834b2dd4e19 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
count_vm_event(PAGEOUTRUN);
 
do {
+   unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
 
-   sc.nr_reclaimed = 0;
sc.reclaim_idx = classzone_idx;
 
/*
@@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
 * Raise priority if scanning rate is too low or there was no
 * progress in reclaiming pages
 */
-   if (raise_priority || !sc.nr_reclaimed)
+   nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+   if (raise_priority || !nr_reclaimed)
sc.priority--;
} while (sc.priority >= 1);
 
-- 
2.11.1



Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-05 Thread Minchan Kim
Hi Michal,

On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > > 
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > > 
> > > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  
> > > 0.0 st
> > > KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> > > KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached 
> > > Mem
> > > 
> > >   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ 
> > > COMMAND
> > >76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 
> > > kswapd3
> > > 
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > > 
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > > 
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > > 
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> > > 
> > > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > > 
> > > Reported-by: Jia He 
> > > Signed-off-by: Johannes Weiner 
> > > Tested-by: Jia He 
> > > Acked-by: Michal Hocko 
> > > ---
> > >  include/linux/mmzone.h |  2 ++
> > >  mm/internal.h  |  6 ++
> > >  mm/page_alloc.c|  9 ++---
> > >  mm/vmscan.c| 27 ---
> > >  mm/vmstat.c|  2 +-
> > >  5 files changed, 31 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 8e02b3750fe0..d2c50ab6ae40 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> > >   int kswapd_order;
> > >   enum zone_type kswapd_classzone_idx;
> > >  
> > > + int kswapd_failures;/* Number of 'reclaimed == 0' runs */
> > > +
> > >  #ifdef CONFIG_COMPACTION
> > >   int kcompactd_max_order;
> > >   enum zone_type kcompactd_classzone_idx;
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ccfc2a2969f4..aae93e3fd984 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page 
> > > *page)
> > >  extern unsigned long highest_memmap_pfn;
> > >  
> > >  /*
> > > + * Maximum number of reclaim retries without progress before the OOM
> > > + * killer is consider the only way forward.
> > > + */
> > > +#define MAX_RECLAIM_RETRIES 16
> > > +
> > > +/*
> > >   * in mm/vmscan.c:
> > >   */
> > >  extern int isolate_lru_page(struct page *page);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 614cd0397ce3..f50e36e7b024 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  }
> > >  
> > >  /*
> > > - * Maximum number of reclaim retries without any progress before OOM 
> > > killer
> > > - * is consider as the only way to move forward.
> > > - */
> > > -#define MAX_RECLAIM_RETRIES 16
> > > -
> > > -/*
> > >   * Checks whether it makes sense to retry the reclaim to make a forward 
> > > progress
> > >   * for the given allocation request.
> > >   * The reclaim feedback represented by did_some_progress (any progress 
> > > during
> > > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, 
> > > nodemask_t *nodemask)
> > >   K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > >   K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> > >   node_page_state(pgdat, NR_PAGES_SCANNED),
> > > - !pgdat_reclaimable(pgdat) ? "yes" : "no");
> > > + pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > > +   

Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-03 Thread Michal Hocko
On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> Hi Johannes,
> 
> On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > Jia He reports a problem with kswapd spinning at 100% CPU when
> > requesting more hugepages than memory available in the system:
> > 
> > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > 
> > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  
> > 0.0 st
> > KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> > KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
> > 
> >   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
> >76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
> > 
> > At that time, there are no reclaimable pages left in the node, but as
> > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > 
> > Kswapd needs to back away from nodes that fail to balance. Up until
> > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > kswapd had such a mechanism. It considered zones whose theoretically
> > reclaimable pages it had reclaimed six times over as unreclaimable and
> > backed away from them. This guard was erroneously removed as the patch
> > changed the definition of a balanced node.
> > 
> > However, simply restoring this code wouldn't help in the case reported
> > here: there *are* no reclaimable pages that could be scanned until the
> > threshold is met. Kswapd would stay awake anyway.
> > 
> > Introduce a new and much simpler way of backing off. If kswapd runs
> > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > page, make it back off from the node. This is the same number of shots
> > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > that node until a direct reclaimer manages to reclaim some pages, thus
> > proving the node reclaimable again.
> > 
> > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > 
> > Reported-by: Jia He 
> > Signed-off-by: Johannes Weiner 
> > Tested-by: Jia He 
> > Acked-by: Michal Hocko 
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/internal.h  |  6 ++
> >  mm/page_alloc.c|  9 ++---
> >  mm/vmscan.c| 27 ---
> >  mm/vmstat.c|  2 +-
> >  5 files changed, 31 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 8e02b3750fe0..d2c50ab6ae40 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> > int kswapd_order;
> > enum zone_type kswapd_classzone_idx;
> >  
> > +   int kswapd_failures;/* Number of 'reclaimed == 0' runs */
> > +
> >  #ifdef CONFIG_COMPACTION
> > int kcompactd_max_order;
> > enum zone_type kcompactd_classzone_idx;
> > diff --git a/mm/internal.h b/mm/internal.h
> > index ccfc2a2969f4..aae93e3fd984 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> >  extern unsigned long highest_memmap_pfn;
> >  
> >  /*
> > + * Maximum number of reclaim retries without progress before the OOM
> > + * killer is consider the only way forward.
> > + */
> > +#define MAX_RECLAIM_RETRIES 16
> > +
> > +/*
> >   * in mm/vmscan.c:
> >   */
> >  extern int isolate_lru_page(struct page *page);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 614cd0397ce3..f50e36e7b024 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  }
> >  
> >  /*
> > - * Maximum number of reclaim retries without any progress before OOM killer
> > - * is consider as the only way to move forward.
> > - */
> > -#define MAX_RECLAIM_RETRIES 16
> > -
> > -/*
> >   * Checks whether it makes sense to retry the reclaim to make a forward 
> > progress
> >   * for the given allocation request.
> >   * The reclaim feedback represented by did_some_progress (any progress 
> > during
> > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t 
> > *nodemask)
> > K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> > node_page_state(pgdat, NR_PAGES_SCANNED),
> > -   !pgdat_reclaimable(pgdat) ? "yes" : "no");
> > +   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > +   "yes" : "no");
> > }
> >  
> > for_each_populated_zone(zone) {
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 26c3b405ef34..407b27831ff7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> > scan_control

Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-02 Thread Minchan Kim
Hi Johannes,

On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 
> st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
> 
>   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
>76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He 
> Signed-off-by: Johannes Weiner 
> Tested-by: Jia He 
> Acked-by: Michal Hocko 
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h  |  6 ++
>  mm/page_alloc.c|  9 ++---
>  mm/vmscan.c| 27 ---
>  mm/vmstat.c|  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>   int kswapd_order;
>   enum zone_type kswapd_classzone_idx;
>  
> + int kswapd_failures;/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>   int kcompactd_max_order;
>   enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>  
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>  
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward 
> progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>   K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>   node_page_state(pgdat, NR_PAGES_SCANNED),
> - !pgdat_reclaimable(pgdat) ? "yes" : "no");
> + pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> + "yes" : "no");
>   }
>  
>   for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>   } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>sc->nr_scanned - nr_scanned, sc));
>  
> + /*
> +  * Kswapd gives up on balancing particular nodes after too
> +  * many

Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-02 Thread Shakeel Butt
On Tue, Feb 28, 2017 at 1:39 PM, Johannes Weiner  wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
>
> $ echo 4000 >/proc/sys/vm/nr_hugepages
>
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 
> st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
>
>   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
>76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
>
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
>
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
>
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
>
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
>

Should the condition of wait_event_killable in throttle_direct_reclaim
be changed to (pfmemalloc_watermark_ok(pgdat) ||
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)?
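
(For context, the throttling path in question looks roughly like this at the
time -- a simplified paraphrase of mm/vmscan.c, not a verbatim quote:

	/* throttle_direct_reclaim(), simplified: sleep until kswapd
	 * manages to balance the node and wakes the waiters. */
	wait_event_killable(pgdat->pfmemalloc_wait,
			    pfmemalloc_watermark_ok(pgdat));

so once kswapd backs off a node it can never balance, anything sleeping here
relies on pfmemalloc_watermark_ok() -- or an additional kswapd_failures check
like the one suggested -- to make progress.)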

> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
>
> Reported-by: Jia He 
> Signed-off-by: Johannes Weiner 
> Tested-by: Jia He 
> Acked-by: Michal Hocko 
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h  |  6 ++
>  mm/page_alloc.c|  9 ++---
>  mm/vmscan.c| 27 ---
>  mm/vmstat.c|  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
> int kswapd_order;
> enum zone_type kswapd_classzone_idx;
>
> +   int kswapd_failures;/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
> int kcompactd_max_order;
> enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward 
> progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
> K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> node_page_state(pgdat, NR_PAGES_SCANNED),
> -   !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +   "yes" : "no");
> }
>
> for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
> } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - 
> nr_reclaimed,
>   

Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-01 Thread Hillf Danton

On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 
> st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
> 
>   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
>76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He 
> Signed-off-by: Johannes Weiner 
> Tested-by: Jia He 
> Acked-by: Michal Hocko 
> ---

Acked-by: Hillf Danton 



[PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-02-28 Thread Johannes Weiner
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem

  PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
   76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
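
In outline, the back-off works as follows (a sketch of the intent; the hunks
in the diff below are authoritative, and the sleep/wakeup side is
reconstructed from memory of the final version, so details may differ):

	/* balance_pgdat(): a full priority loop that reclaims nothing
	 * counts as one failed run. */
	if (!sc.nr_reclaimed)
		pgdat->kswapd_failures++;

	/* kswapd sleep/wakeup paths: after MAX_RECLAIM_RETRIES failed
	 * runs, treat the node as hopeless and let kswapd sleep. */
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;

	/* shrink_node(): any reclaim progress, whether from kswapd or
	 * from direct reclaim, revives the node. */
	if (reclaimable)
		pgdat->kswapd_failures = 0;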

v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)

Reported-by: Jia He 
Signed-off-by: Johannes Weiner 
Tested-by: Jia He 
Acked-by: Michal Hocko 
---
 include/linux/mmzone.h |  2 ++
 mm/internal.h  |  6 ++
 mm/page_alloc.c|  9 ++---
 mm/vmscan.c| 27 ---
 mm/vmstat.c|  2 +-
 5 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
int kswapd_order;
enum zone_type kswapd_classzone_idx;
 
+   int kswapd_failures;/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/mm/internal.h b/mm/internal.h
index ccfc2a2969f4..aae93e3fd984 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is consider the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..f50e36e7b024 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }
 
 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward 
progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
@@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t 
*nodemask)
K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
node_page_state(pgdat, NR_PAGES_SCANNED),
-   !pgdat_reclaimable(pgdat) ? "yes" : "no");
+   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+   "yes" : "no");
}
 
for_each_populated_zone(zone) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..407b27831ff7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 sc->nr_scanned - nr_scanned, sc));
 
+   /*
+* Kswapd gives up on balancing particular nodes after too
+* many failures to reclaim anything from them and goes to
+* sleep. On reclaim progress, reset the failure counter. A
+* successful direct reclaim run will revive a dormant kswapd.
+*/
+   if (reclaimable)
+   pgdat->kswapd_failures = 0;
+
retu