Hello,

On Thu, Jan 12, 2017 at 10:10:17AM +0100, Michal Hocko wrote:
> On Thu 12-01-17 17:48:13, Minchan Kim wrote:
> > On Thu, Jan 12, 2017 at 09:15:54AM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 14:12:47, Minchan Kim wrote:
> > > > Hello,
> > > > 
> > > > On Wed, Jan 11, 2017 at 04:52:39PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-01-17 08:52:50, Minchan Kim wrote:
> > > > > [...]
> > > > > > > @@ -2055,8 +2055,8 @@ static bool inactive_list_is_low(struct
> > > > > > >   if (!file && !total_swap_pages)
> > > > > > >           return false;
> > > > > > >  
> > > > > > > - inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > - active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > > + total_inactive = inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
> > > > > > > + total_active = active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
> > > > > > >  
> > > > > > 
> > > > > > the decision to deactivate is based on the eligible zones' LRU size,
> > > > > > not the whole zone, so why do we need a trace of all zones' LRUs?
> > > > > 
> > > > > Strictly speaking, the total_ counters are not necessary for making
> > > > > the decision. I found reporting those numbers useful regardless,
> > > > > because they also tell us how large the eligible portion of the
> > > > > LRU list is. We do not have any other tracepoint which would report
> > > > > that.
> > > > 
> > > > The patch doesn't say anything about why it's useful. Could you explain
> > > > why it's useful and why inactive_list_is_low is the right place?
> > > > 
> > > > Don't get me wrong, please. I don't want to bother you.
> > > > I really don't want to add random stuff, even though it's a tracepoint
> > > > for debugging.
> > > 
> > > This doesn't sound random to me. We simply do not have a full picture
> > > on 32b systems without this information, especially when memcgs are
> > > involved and the global numbers are spread over different LRUs.
> > 
> > Could you elaborate?
> 
> The problem with 32b systems is that you can only consider a part of the
> LRU for lowmem requests. While we have global counters to see how much
> lowmem inactive/active memory we have, those pages get distributed to
> memcg LRUs, and that distribution is impossible to guess. So my thinking
> is that it can become a real head scratcher to work out why certain
> active LRUs are aged while others are not. This was the case when I was
> debugging the last issue which triggered all this. All of a sudden I
> saw many invocations where inactive and active were zero, which seemed
> weird, until I realized that those were memcg lruvecs, which is what
> the total numbers told me...

Hmm, it seems I am missing something. AFAIU, what you need is just a
memcg identifier, not the total LRU sizes. If that isn't the case,
please describe the use case for the total LRU sizes in that particular
tracepoint in more detail.

> 
> Later on I would like to add a memcg identifier to the vmscan
> tracepoints but I didn't get there yet.
>  
> > "
> > Currently we have tracepoints for both active and inactive LRU lists
> > reclaim but we do not have any which would tell us why we decided to
> > age the active list.  Without that it is quite hard to diagnose
> > active/inactive lists balancing.  Add mm_vmscan_inactive_list_is_low
> > tracepoint to tell us this information.
> > "
> > 
> > Your description says "why we decided to age the active list".
> > So, what more is needed?
> > 
> > >  
> > > > > [...]
> > > > > > > @@ -2223,7 +2228,7 @@ static void get_scan_count(struct lruvec
> > > > > > >    * lruvec even if it has plenty of old anonymous pages unless the
> > > > > > >    * system is under heavy pressure.
> > > > > > >    */
> > > > > > > - if (!inactive_list_is_low(lruvec, true, sc) &&
> > > > > > > + if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > > > > > 
> > > > > > Hmm, I was curious why you added the trace boolean argument and found
> > > > > > it here. Yes, this place is not directly related to deactivation, but
> > > > > > couldn't we trace it unconditionally anyway?
> > > > > 
> > > > > I had it like that when I was debugging the mentioned bug and found
> > > > > it a bit disturbing. It generated more output than I would like, and it
> > > > > wasn't really clear which code path it had been called from.
> > > > 
> > > > Indeed.
> > > > 
> > > > Personally, I want to move inactive_list_is_low into shrink_active_list
> > > > and have shrink_active_list call inactive_list_is_low(..., true)
> > > > unconditionally, which would make the code simpler/clearer, but that
> > > > still cannot remove the trace boolean variable, which is what I want.
> > > > So, it's okay if you love your version.
> > > 
> > > I am not sure I am following. Why is the additional parameter a problem?
> > 
> > Well, to me, it's not elegant. Is it? If we need such a boolean variable
> > to control whether the trace is shown, it means either it's not a good
> > place for the tracepoint or we should think about refactoring.
> 
> But even when you refactor the code, there will be other callers of
> inactive_list_is_low outside of shrink_active_list...

Yes, that's why I said "it's okay if you love your version". However,
we can refactor to remove "bool trace", and it even makes the code
more readable, I believe.
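
For reference, this is how I consume the output of this tracepoint while
testing; a minimal sketch in Python, assuming the usual debugfs mount
point and the key=value layout the event prints after this patch (both
are assumptions, your setup may differ):

    import re

    # enable the event first, e.g.:
    # echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_inactive_list_is_low/enable
    TRACE_PIPE = "/sys/kernel/debug/tracing/trace_pipe"

    # key=value pairs printed by mm_vmscan_inactive_list_is_low
    FIELDS = re.compile(
        r"nid=(\d+) reclaim_idx=(\d+) total_inactive=(\d+) inactive=(\d+)"
        r" total_active=(\d+) active=(\d+)")

    with open(TRACE_PIPE) as pipe:
        for line in pipe:
            m = FIELDS.search(line)
            if not m:
                continue
            nid, idx, t_ina, ina, t_act, act = map(int, m.groups())
            # comparing the eligible sizes against the totals shows how
            # much of each LRU a lowmem-restricted reclaim can actually see
            print("nid=%d reclaim_idx=%d inactive %d/%d active %d/%d"
                  % (nid, idx, ina, t_ina, act, t_act))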

From 06eb7201d781155a8dee7e72fbb8423ec8175223 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minc...@kernel.org>
Date: Fri, 13 Jan 2017 10:13:36 +0900
Subject: [PATCH] mm: refactor inactive_list_is_low

Recently, Michal Hocko added a tracepoint to inactive_list_is_low to
capture why the VM decided to age the active list, to help diagnose
active/inactive balancing problems. Unfortunately, that added a
"bool trace" parameter to inactive_list_is_low to suppress tracing
from some call sites. That is not elegant to me, so this patch tries
to clean it up.

Most callers of inactive_list_is_low use it to decide on active list
demotion, but one site (i.e., get_scan_count) uses it for another
purpose: forcing reclaim of the file LRU. The deactivation sites all
pair it with shrink_active_list, which means inactive_list_is_low can
be moved into shrink_active_list.

One more thing this patch does is remove "ratio" from the tracepoint,
because it can be recomputed from the remaining fields with simple
math in a post-processing script.
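
For example, it can be recovered like this (a sketch in Python;
PAGE_SHIFT is not part of the event, so 4K pages are an assumption
here):

    import math

    def inactive_ratio(inactive, active, page_shift=12):
        # mirror the kernel's math: combined LRU size in pages -> GB,
        # then int_sqrt(10 * gb), with a floor of 1
        gb = (inactive + active) >> (30 - page_shift)
        return math.isqrt(10 * gb) if gb else 1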

Signed-off-by: Minchan Kim <minc...@kernel.org>
---
 include/trace/events/vmscan.h |  9 +++-----
 mm/vmscan.c                   | 53 +++++++++++++++++++++++++++++------------------------
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 27e8a5c..406ea95 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -432,9 +432,9 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
        TP_PROTO(int nid, int reclaim_idx,
                unsigned long total_inactive, unsigned long inactive,
                unsigned long total_active, unsigned long active,
-               unsigned long ratio, int file),
+               int file),
 
-       TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
+       TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, file),
 
        TP_STRUCT__entry(
                __field(int, nid)
@@ -443,7 +443,6 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
                __field(unsigned long, inactive)
                __field(unsigned long, total_active)
                __field(unsigned long, active)
-               __field(unsigned long, ratio)
                __field(int, reclaim_flags)
        ),
 
@@ -454,16 +453,14 @@ TRACE_EVENT(mm_vmscan_inactive_list_is_low,
                __entry->inactive = inactive;
                __entry->total_active = total_active;
                __entry->active = active;
-               __entry->ratio = ratio;
                __entry->reclaim_flags = trace_shrink_flags(file) & RECLAIM_WB_LRU;
        ),
 
-       TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
+       TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld flags=%s",
                __entry->nid,
                __entry->reclaim_idx,
                __entry->total_inactive, __entry->inactive,
                __entry->total_active, __entry->active,
-               __entry->ratio,
                show_reclaim_flags(__entry->reclaim_flags))
 );
 #endif /* _TRACE_VMSCAN_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 75cdf68..6890c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -150,6 +150,7 @@ unsigned long vm_total_pages;
 
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
+static bool inactive_list_is_low(bool file, unsigned long, unsigned long);
 
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
@@ -1962,6 +1963,22 @@ static void shrink_active_list(unsigned long nr_to_scan,
        isolate_mode_t isolate_mode = 0;
        int file = is_file_lru(lru);
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+       unsigned long inactive, active;
+       enum lru_list inactive_lru = file * LRU_FILE;
+       enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
+       bool deactivate;
+
+       inactive = lruvec_lru_size_eligibe_zones(lruvec, inactive_lru,
+                                       sc->reclaim_idx);
+       active = lruvec_lru_size_eligibe_zones(lruvec, active_lru,
+                                       sc->reclaim_idx);
+       deactivate = inactive_list_is_low(file, inactive, active);
+       trace_mm_vmscan_inactive_list_is_low(pgdat->node_id,
+                       sc->reclaim_idx,
+                       lruvec_lru_size(lruvec, inactive_lru), inactive,
+                       lruvec_lru_size(lruvec, active_lru), active, file);
+       if (!deactivate)
+               return;
 
        lru_add_drain();
 
@@ -2073,13 +2090,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
-                                               struct scan_control *sc, bool trace)
+static bool inactive_list_is_low(bool file,
+                       unsigned long inactive, unsigned long active)
 {
        unsigned long inactive_ratio;
-       unsigned long inactive, active;
-       enum lru_list inactive_lru = file * LRU_FILE;
-       enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
        unsigned long gb;
 
        /*
@@ -2089,22 +2103,12 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
        if (!file && !total_swap_pages)
                return false;
 
-       inactive = lruvec_lru_size_eligibe_zones(lruvec, inactive_lru, sc->reclaim_idx);
-       active = lruvec_lru_size_eligibe_zones(lruvec, active_lru, sc->reclaim_idx);
-
        gb = (inactive + active) >> (30 - PAGE_SHIFT);
        if (gb)
                inactive_ratio = int_sqrt(10 * gb);
        else
                inactive_ratio = 1;
 
-       if (trace)
-               trace_mm_vmscan_inactive_list_is_low(lruvec_pgdat(lruvec)->node_id,
-                               sc->reclaim_idx,
-                               lruvec_lru_size(lruvec, inactive_lru), inactive,
-                               lruvec_lru_size(lruvec, active_lru), active,
-                               inactive_ratio, file);
-
        return inactive * inactive_ratio < active;
 }
 
@@ -2112,8 +2116,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                                 struct lruvec *lruvec, struct scan_control *sc)
 {
        if (is_active_lru(lru)) {
-               if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
-                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
+               shrink_active_list(nr_to_scan, lruvec, sc, lru);
                return 0;
        }
 
@@ -2153,6 +2156,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
        enum lru_list lru;
        bool some_scanned;
        int pass;
+       unsigned long inactive, active;
 
        /*
         * If the zone or memcg is small, nr[l] can be 0.  This
@@ -2243,7 +2247,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
         * lruvec even if it has plenty of old anonymous pages unless the
         * system is under heavy pressure.
         */
-       if (!inactive_list_is_low(lruvec, true, sc, false) &&
+       inactive = lruvec_lru_size_eligibe_zones(lruvec,
+                               LRU_INACTIVE_FILE, sc->reclaim_idx);
+       active = lruvec_lru_size_eligibe_zones(lruvec,
+                               LRU_ACTIVE_FILE, sc->reclaim_idx);
+       if (!inactive_list_is_low(true, inactive, active) &&
            lruvec_lru_size_eligibe_zones(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
                scan_balance = SCAN_FILE;
                goto out;
@@ -2468,9 +2476,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio.
         */
-       if (inactive_list_is_low(lruvec, false, sc, true))
-               shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-                                  sc, LRU_ACTIVE_ANON);
+       shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON);
 }
 
 /* Use reclaim/compaction for costly allocs or under memory pressure */
@@ -3118,8 +3124,7 @@ static void age_active_anon(struct pglist_data *pgdat,
        do {
                struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
-               if (inactive_list_is_low(lruvec, false, sc, true))
-                       shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-                                          sc, LRU_ACTIVE_ANON);
+               shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+                                  sc, LRU_ACTIVE_ANON);
 
                memcg = mem_cgroup_iter(NULL, memcg, NULL);
-- 
2.7.4
