Hi Shaohua,

On Sun, Jan 29, 2017 at 09:51:17PM -0800, Shaohua Li wrote:
> Hi,
> 
> We are trying to use MADV_FREE in jemalloc. We found several issues; without
> solving them, jemalloc can't use the MADV_FREE feature.
> - It doesn't support systems without swap enabled. If swap is off, we can't
>   age anonymous pages, or at least can't do so efficiently. And since
>   MADV_FREE pages are mixed in with other anonymous pages, we can't reclaim
>   MADV_FREE pages. In the current implementation, MADV_FREE falls back to
>   MADV_DONTNEED when swap is not enabled. But in our environment a lot of
>   machines don't enable swap, which prevents our setup from using MADV_FREE.
> - It increases memory pressure. Page reclaim biases reclaim of file pages
>   over anonymous pages. This doesn't make sense for MADV_FREE pages, because
>   those pages can be freed easily and refilled later with very little
>   penalty. Even if page reclaim didn't favor file pages, there would still
>   be an issue, because MADV_FREE pages and other anonymous pages are mixed
>   together: to reclaim a MADV_FREE page, we probably must scan a lot of
>   other anonymous pages, which is inefficient. In our tests we usually see
>   OOMs with MADV_FREE enabled and none without it.
> - RSS accounting. MADV_FREE pages are accounted as normal anon pages and
>   reclaimed lazily, so the application's RSS becomes bigger. This confuses
>   our workloads: we have a monitoring daemon running, and if it finds that
>   an application's RSS has become abnormal, the daemon kills the
>   application, even though the kernel could reclaim the memory easily.
>   Currently we don't export separate RSS accounting for MADV_FREE pages,
>   which prevents our setup from using MADV_FREE too.
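
(To make the RSS problem concrete: a minimal userspace sketch, error handling
omitted; after MADV_FREE the pages stay charged to VmRSS until the kernel
actually reclaims them.)

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_FREE
	#define MADV_FREE 8			/* Linux >= 4.5 */
	#endif

	int main(void)
	{
		size_t len = 256UL << 20;	/* 256MB */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 1, len);		/* fault pages in: RSS grows */
		madvise(p, len, MADV_FREE);	/* now lazily freeable... */

		/* ...but VmRSS in /proc/self/status still reports ~256MB,
		 * so an RSS-watching daemon sees no improvement. */
		getchar();
		return 0;
	}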
> 
> For the first two issues, introducing a new LRU list for MADV_FREE pages
> could solve them. We can directly reclaim MADV_FREE pages without writing
> them out to swap, which fixes the first issue. And if only MADV_FREE pages
> are on the new list, page reclaim can easily reclaim such pages without
> interference from file or other anonymous pages, so the memory pressure
> issue disappears as well.
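
(For readers following along: conceptually, the series adds one more per-node
LRU list. A simplified sketch; the exact enum layout in the patches may
differ:)

	enum lru_list {
		LRU_INACTIVE_ANON,
		LRU_ACTIVE_ANON,
		LRU_INACTIVE_FILE,
		LRU_ACTIVE_FILE,
		LRU_LAZYFREE,	/* clean MADV_FREEd anon pages; can be
				 * discarded without any swap I/O */
		LRU_UNEVICTABLE,
		NR_LRU_LISTS,
	};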
> 
> Actually, Minchan posted patches to add such an LRU list before, but he
> didn't pursue them, so I picked them up; my patches are based on his
> previous ones. The main difference between my patches and Minchan's is the
> page reclaim policy: Minchan's patches introduce a knob to balance the
> reclaim of MADV_FREE pages against anon/file pages, while mine always
> reclaim MADV_FREE pages first if there are any. I described the reason in
> patch 5.

First of all, thanks for the effort to support MADV_FREE on swapless systems,
Shaohua!

CCing Daniel,

The reason I postponed them is the controversial part about balancing
used-once vs. MADV_FREEd pages. I thought it didn't make sense to reclaim
MADV_FREEd pages first even when there are lots of used-once pages.

Recently, Johannes posted patches for balancing file/anon reclaim based on a
cost model, IIRC. I wanted to build on top of that.

The idea is this: the VM reclaims file-backed pages, and when a refault
happens, we can measure the refault distance and the size of the LRU_LAZYFREE
list. If the refault distance is smaller than the lazyfree LRU list's size,
the file-backed page would have stayed in memory had we discarded lazyfree
pages instead, so the refault tells us to reclaim the lazyfree LRU list more
aggressively. For example, if a file page refaults at distance 1000 while the
lazyfree list holds 1500 pages, purging lazyfree pages first would have
avoided that refault.

I tested your patches with a simple MADV_FREE workload (just alloc, then
repeated touch/madv_free) running alongside a background stream-read process.
In that case, the MADV_FREE workload's performance regressed by half without
any gain for the stream-read process.
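
For reference, the MADV_FREE side of that test is roughly the following (a
reconstruction from the description above, not the exact test code; the size
is arbitrary and error handling is omitted):

	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_FREE
	#define MADV_FREE 8			/* Linux >= 4.5 */
	#endif

	int main(void)
	{
		size_t len = 128UL << 20;	/* 128MB */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (;;) {
			memset(p, 1, len);		/* touch the pages */
			madvise(p, len, MADV_FREE);	/* mark lazily freeable */
		}
	}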

I tested hacky code simulating the feedback loop I suggested, and it recovers
the performance regression. I'm not saying the hacky patch below should be
merged, but I think we need used-once reclaim feedback logic like this to
prevent unnecessary purging of MADV_FREEd pages.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 589a165..39d4bba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -703,6 +703,7 @@ typedef struct pglist_data {
        /* Per-node vmstats */
        struct per_cpu_nodestat __percpu *per_cpu_nodestats;
        atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
+       bool lazyfree;
 } pg_data_t;
 
 #define node_present_pages(nid)        (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f809f04..cf54b81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2364,22 +2364,25 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg
        struct blk_plug plug;
        bool scan_adjusted;
 
-       /* reclaim all lazyfree pages so don't apply priority  */
-       nr[LRU_LAZYFREE] = lruvec_lru_size(lruvec, LRU_LAZYFREE, sc->reclaim_idx);
-       while (nr[LRU_LAZYFREE]) {
-               nr_to_scan = min(nr[LRU_LAZYFREE], SWAP_CLUSTER_MAX);
-               nr[LRU_LAZYFREE] -= nr_to_scan;
-               nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec, sc,
-                       LRU_LAZYFREE);
-
-               if (nr_reclaimed >= nr_to_reclaim)
-                       break;
-               cond_resched();
-       }
+       if (pgdat->lazyfree) {
+               /* reclaim all lazyfree pages so don't apply priority  */
+               nr[LRU_LAZYFREE] = lruvec_lru_size(lruvec, LRU_LAZYFREE, sc->reclaim_idx);
+               while (nr[LRU_LAZYFREE]) {
+                       nr_to_scan = min(nr[LRU_LAZYFREE], SWAP_CLUSTER_MAX);
+                       nr[LRU_LAZYFREE] -= nr_to_scan;
+                       nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec, sc,
+                               LRU_LAZYFREE);
+
+                       if (nr_reclaimed >= nr_to_reclaim)
+                               break;
+                       cond_resched();
+               }
 
-       if (nr_reclaimed >= nr_to_reclaim) {
-               sc->nr_reclaimed += nr_reclaimed;
-               return;
+               if (nr_reclaimed >= nr_to_reclaim) {
+                       sc->nr_reclaimed += nr_reclaimed;
+                       pgdat->lazyfree = false;
+                       return;
+               }
        }
 
        get_scan_count(lruvec, memcg, sc, nr, lru_pages);
diff --git a/mm/workingset.c b/mm/workingset.c
index c573cb2..2a01b91 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -233,7 +233,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
        unsigned long refault_distance;
-       unsigned long active_file;
+       unsigned long active_file, lazyfree;
        struct mem_cgroup *memcg;
        unsigned long eviction;
        struct lruvec *lruvec;
@@ -268,6 +268,7 @@ bool workingset_refault(void *shadow)
        lruvec = mem_cgroup_lruvec(pgdat, memcg);
        refault = atomic_long_read(&lruvec->inactive_age);
        active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
+       lazyfree = lruvec_lru_size(lruvec, LRU_LAZYFREE, MAX_NR_ZONES);
        rcu_read_unlock();
 
        /*
@@ -290,6 +291,9 @@ bool workingset_refault(void *shadow)
 
        inc_node_state(pgdat, WORKINGSET_REFAULT);
 
+       if (refault_distance <= lazyfree)
+               pgdat->lazyfree = true;
+
        if (refault_distance <= active_file) {
                inc_node_state(pgdat, WORKINGSET_ACTIVATE);
                return true;
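
The hack keeps its state in a single pg_data_t boolean with no locking: a
refault whose distance fits within the size of the lazyfree list arms
lazyfree-first reclaim, and shrink_node_memcg() clears the flag again once a
reclaim pass meets its target. That is racy, but it is enough to demonstrate
the feedback loop.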

> 
> For the third issue, we can add a separate RSS count for MADV_FREE pages.
> The count would be increased in the madvise syscall and decreased in page
> reclaim (e.g., unmap). One problem is activate_page(): a MADV_FREE page can
> be promoted to an active page there, but there is no mm_struct context at
> that place, and iterating over the VMAs there sounds too silly. The patchset
> doesn't fix this issue yet. Hopefully somebody can share a hint on how to
> fix it.
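
(A sketch of what such accounting might look like; MM_LAZYFREEPAGES is a
made-up name here, not necessarily what the patches use:)

	/* one more per-mm RSS counter, alongside the existing ones */
	enum {
		MM_FILEPAGES,
		MM_ANONPAGES,
		MM_SWAPENTS,
		MM_SHMEMPAGES,
		MM_LAZYFREEPAGES,	/* MADV_FREEd, not yet reclaimed */
		NR_MM_COUNTERS
	};

	/* madvise(MADV_FREE) path: pages become lazily freeable */
	add_mm_counter(mm, MM_LAZYFREEPAGES, nr_pages);

	/* reclaim path (unmap/discard): pages actually go away */
	add_mm_counter(mm, MM_LAZYFREEPAGES, -nr_pages);

A monitoring daemon could then subtract such a counter from RSS before
judging a process.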
> 
> Thanks,
> Shaohua
> 
> Minchan's previous patches:
> http://marc.info/?l=linux-mm&m=144800657002763&w=2
> 
> Shaohua Li (6):
>   mm: add wrap for page accounting index
>   mm: add lazyfree page flag
>   mm: add LRU_LAZYFREE lru list
>   mm: move MADV_FREE pages into LRU_LAZYFREE list
>   mm: reclaim lazyfree pages
>   mm: enable MADV_FREE for swapless system
> 
>  drivers/base/node.c                       |  2 +
>  drivers/staging/android/lowmemorykiller.c |  3 +-
>  fs/proc/meminfo.c                         |  1 +
>  fs/proc/task_mmu.c                        |  8 ++-
>  include/linux/mm_inline.h                 | 41 +++++++++++++
>  include/linux/mmzone.h                    |  9 +++
>  include/linux/page-flags.h                |  6 ++
>  include/linux/swap.h                      |  2 +-
>  include/linux/vm_event_item.h             |  2 +-
>  include/trace/events/mmflags.h            |  1 +
>  include/trace/events/vmscan.h             | 31 +++++-----
>  kernel/power/snapshot.c                   |  1 +
>  mm/compaction.c                           | 11 ++--
>  mm/huge_memory.c                          |  6 +-
>  mm/khugepaged.c                           |  6 +-
>  mm/madvise.c                              | 11 +---
>  mm/memcontrol.c                           |  4 ++
>  mm/memory-failure.c                       |  3 +-
>  mm/memory_hotplug.c                       |  3 +-
>  mm/mempolicy.c                            |  3 +-
>  mm/migrate.c                              | 29 ++++------
>  mm/page_alloc.c                           | 10 ++++
>  mm/rmap.c                                 |  7 ++-
>  mm/swap.c                                 | 51 +++++++++-------
>  mm/vmscan.c                               | 96 +++++++++++++++++++++++--------
>  mm/vmstat.c                               |  4 ++
>  26 files changed, 242 insertions(+), 109 deletions(-)
> 
> -- 
> 2.9.3
> 
