[patch 08/10] mm: thrash detection-based file cache sizing
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently used list "inactive list" and the frequently used list "active list". Currently, the VM aims for a 1:1 ratio between the lists, which is the "perfect" trade-off between the ability to *protect* frequently used pages and the ability to *detect* frequently used pages. This means that working set changes bigger than half of cache memory go undetected and thrash indefinitely, whereas working sets bigger than half of cache memory are unprotected against used-once streams that don't even need caching. Historically, every reclaim scan of the inactive list also took a smaller number of pages from the tail of the active list and moved them to the head of the inactive list. This model gave established working sets more gracetime in the face of temporary use-once streams, but ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Reviewed-by: Bob Liu --- include/linux/mmzone.h | 5 + include/linux/swap.h | 5 + mm/Makefile| 2 +- mm/filemap.c | 61 mm/swap.c | 2 + mm/vmscan.c| 24 - mm/vmstat.c| 2 + mm/workingset.c| 253 + 8 files changed, 331 insertions(+), 23 deletions(-) create mode 100644 mm/workingset.c diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5f2052c83154..b4bdeb411a4d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -142,6 +142,8 @@ enum zone_stat_item { NUMA_LOCAL, /* allocation from local node */ NUMA_OTHER, /* allocation from other node */ #endif + WORKINGSET_REFAULT, + WORKINGSET_ACTIVATE, NR_ANON_TRANSPARENT_HUGEPAGES, NR_FREE_CMA_PAGES, NR_VM_ZONE_STAT_ITEMS }; @@ -392,6 +394,9 @@ struct zone { spinlock_t lru_lock; struct lruvec lruvec; + /* Evictions & activations on the inactive file list */ + atomic_long_t inactive_age; + unsigned long pages_scanned; /* since last reclaim */ unsigned long flags; /* zone flags, see below */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..b83cf61403ed 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -260,6 +260,11 @@ struct swap_list_t { int next; /* swapfile to be used next */ }; +/* linux/mm/workingset.c */ +void *workingset_eviction(struct address_space *mapping, struct page *page); +bool workingset_refault(void *shadow); +void workingset_activation(struct page *page); + /* linux/mm/page_alloc.c */ extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; diff --git a/mm/Makefile b/mm/Makefile index 310c90a09264..cdd741519ee0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o mmu_context.o percpu.o slab_common.o \ compaction.o balloon_compaction.o \ - interval_tree.o list_lru.o $(mmu-y) + interval_tree.o list_lru.o workingset.o $(mmu-y) obj-y += init-mm.o diff --git a/mm/filemap.c b/mm/filemap.c index 18f80d418f83..33ceebf4d577 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) EXPORT_SYMBOL_GPL(replace_page_cache_page); static int
[patch 08/10] mm: thrash detection-based file cache sizing
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently used list inactive list and the frequently used list active list. Currently, the VM aims for a 1:1 ratio between the lists, which is the perfect trade-off between the ability to *protect* frequently used pages and the ability to *detect* frequently used pages. This means that working set changes bigger than half of cache memory go undetected and thrash indefinitely, whereas working sets bigger than half of cache memory are unprotected against used-once streams that don't even need caching. Historically, every reclaim scan of the inactive list also took a smaller number of pages from the tail of the active list and moved them to the head of the inactive list. This model gave established working sets more gracetime in the face of temporary use-once streams, but ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner han...@cmpxchg.org Reviewed-by: Rik van Riel r...@redhat.com Reviewed-by: Minchan Kim minc...@kernel.org Reviewed-by: Bob Liu bob@oracle.com --- include/linux/mmzone.h | 5 + include/linux/swap.h | 5 + mm/Makefile| 2 +- mm/filemap.c | 61 mm/swap.c | 2 + mm/vmscan.c| 24 - mm/vmstat.c| 2 + mm/workingset.c| 253 + 8 files changed, 331 insertions(+), 23 deletions(-) create mode 100644 mm/workingset.c diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5f2052c83154..b4bdeb411a4d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -142,6 +142,8 @@ enum zone_stat_item { NUMA_LOCAL, /* allocation from local node */ NUMA_OTHER, /* allocation from other node */ #endif + WORKINGSET_REFAULT, + WORKINGSET_ACTIVATE, NR_ANON_TRANSPARENT_HUGEPAGES, NR_FREE_CMA_PAGES, NR_VM_ZONE_STAT_ITEMS }; @@ -392,6 +394,9 @@ struct zone { spinlock_t lru_lock; struct lruvec lruvec; + /* Evictions activations on the inactive file list */ + atomic_long_t inactive_age; + unsigned long pages_scanned; /* since last reclaim */ unsigned long flags; /* zone flags, see below */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 46ba0c6c219f..b83cf61403ed 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -260,6 +260,11 @@ struct swap_list_t { int next; /* swapfile to be used next */ }; +/* linux/mm/workingset.c */ +void *workingset_eviction(struct address_space *mapping, struct page *page); +bool workingset_refault(void *shadow); +void workingset_activation(struct page *page); + /* linux/mm/page_alloc.c */ extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; diff --git a/mm/Makefile b/mm/Makefile index 310c90a09264..cdd741519ee0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o mmu_context.o percpu.o slab_common.o \ compaction.o balloon_compaction.o \ - interval_tree.o list_lru.o $(mmu-y) + interval_tree.o list_lru.o workingset.o $(mmu-y) obj-y += init-mm.o diff --git a/mm/filemap.c b/mm/filemap.c index 18f80d418f83..33ceebf4d577 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)