[patch 08/10] mm: thrash detection-based file cache sizing

2014-02-03 Thread Johannes Weiner
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By
maintaining a history of recently evicted file pages it can detect
frequently used pages with an arbitrarily small inactive list size,
and subsequently apply pressure on the active list based on actual
demand for cache, not just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
Reviewed-by: Bob Liu 
---
 include/linux/mmzone.h |   5 +
 include/linux/swap.h   |   5 +
 mm/Makefile|   2 +-
 mm/filemap.c   |  61 
 mm/swap.c  |   2 +
 mm/vmscan.c|  24 -
 mm/vmstat.c|   2 +
 mm/workingset.c| 253 +
 8 files changed, 331 insertions(+), 23 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f2052c83154..b4bdeb411a4d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
 #endif
+   WORKINGSET_REFAULT,
+   WORKINGSET_ACTIVATE,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
@@ -392,6 +394,9 @@ struct zone {
spinlock_t  lru_lock;
struct lruvec   lruvec;
 
+   /* Evictions & activations on the inactive file list */
+   atomic_long_t   inactive_age;
+
unsigned long   pages_scanned; /* since last reclaim */
unsigned long   flags; /* zone flags, see below */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..b83cf61403ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
int next;   /* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+bool workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a09264..cdd741519ee0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o 
fadvise.o \
   util.o mmzone.o vmstat.o backing-dev.o \
   mm_init.o mmu_context.o percpu.o slab_common.o \
   compaction.o balloon_compaction.o \
-  interval_tree.o list_lru.o $(mmu-y)
+  interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 18f80d418f83..33ceebf4d577 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page 
*new, gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
 static int 

[patch 08/10] mm: thrash detection-based file cache sizing

2014-02-03 Thread Johannes Weiner
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list inactive list and the frequently used list active list.

Currently, the VM aims for a 1:1 ratio between the lists, which is the
perfect trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By
maintaining a history of recently evicted file pages it can detect
frequently used pages with an arbitrarily small inactive list size,
and subsequently apply pressure on the active list based on actual
demand for cache, not just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner han...@cmpxchg.org
Reviewed-by: Rik van Riel r...@redhat.com
Reviewed-by: Minchan Kim minc...@kernel.org
Reviewed-by: Bob Liu bob@oracle.com
---
 include/linux/mmzone.h |   5 +
 include/linux/swap.h   |   5 +
 mm/Makefile|   2 +-
 mm/filemap.c   |  61 
 mm/swap.c  |   2 +
 mm/vmscan.c|  24 -
 mm/vmstat.c|   2 +
 mm/workingset.c| 253 +
 8 files changed, 331 insertions(+), 23 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f2052c83154..b4bdeb411a4d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
 #endif
+   WORKINGSET_REFAULT,
+   WORKINGSET_ACTIVATE,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
@@ -392,6 +394,9 @@ struct zone {
spinlock_t  lru_lock;
struct lruvec   lruvec;
 
+   /* Evictions  activations on the inactive file list */
+   atomic_long_t   inactive_age;
+
unsigned long   pages_scanned; /* since last reclaim */
unsigned long   flags; /* zone flags, see below */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..b83cf61403ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
int next;   /* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+bool workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a09264..cdd741519ee0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o 
fadvise.o \
   util.o mmzone.o vmstat.o backing-dev.o \
   mm_init.o mmu_context.o percpu.o slab_common.o \
   compaction.o balloon_compaction.o \
-  interval_tree.o list_lru.o $(mmu-y)
+  interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 18f80d418f83..33ceebf4d577 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page 
*new, gfp_t gfp_mask)