Re: [PATCH v19 18/20] mm/lru: replace pgdat lru_lock with lruvec lock

2020-10-25 Thread Alex Shi



On 2020/10/26 5:51 AM, Hugh Dickins wrote:
> [PATCH] mm/lru: minimize workingset_age_nonresident() interference
> 
> 1. move_pages_to_lru() batch workingset_age_nonresident() (from Yu Zhao)
> 2. workingset_age_nonresident() after unlock of lru_lock (from Yu Zhao)
> 3. lru_note_cost_unlock_irq() to remove an unlock+lock of lru_lock
> 4. lru_note_cost_unlock_irq() include updates of nonresident_age
> 5. workingset_refault() use that OR workingset_age_nonresident()
> 
> Of those, I expect 1 & 2 (from Yu Zhao) to be the most significant,
> increasingly so with more cpus, increasingly so with more levels in
> the memcg hierarchy.
> 
> Signed-off-by: Hugh Dickins 
> Cc: Yu Zhao 
> Cc: Johannes Weiner 
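
(For reference, a minimal sketch of what items 1 and 2 amount to in
move_pages_to_lru(), assuming the v19 shape of that function; names and
details are simplified and this is not Hugh's actual patch:

static void move_pages_to_lru_sketch(struct lruvec *lruvec,
                                     struct list_head *list)
{
        unsigned long nr_aged = 0;
        struct page *page;

        spin_lock_irq(&lruvec->lru_lock);
        while (!list_empty(list)) {
                page = lru_to_page(list);
                list_del(&page->lru);
                SetPageLRU(page);
                add_page_to_lru_list(page, lruvec, page_lru(page));
                if (PageActive(page))
                        /* 1: accumulate rather than age page by page */
                        nr_aged += thp_nr_pages(page);
        }
        spin_unlock_irq(&lruvec->lru_lock);

        /* 2: one hierarchy walk for the whole batch, outside lru_lock */
        if (nr_aged)
                workingset_age_nonresident(lruvec, nr_aged);
}

The real function also handles pages that became unevictable or lost their
last reference while isolated; those paths are omitted here.)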

Hi Hugh,

Thanks a lot for checking and for the patch. I am traveling today, so for
now I have just picked it up and pushed it as the lruv21 branch on my github tree:
https://github.com/alexshi/linux.git

Hi Rong,

We would very much appreciate it if you could launch testing on that branch,
and, if you have time, it would be even better to change the test case
runtime to 1000 seconds as Hugh wanted. Many thanks!

Thanks
Alex


Re: [PATCH v19 18/20] mm/lru: replace pgdat lru_lock with lruvec lock

2020-10-25 Thread Hugh Dickins
On Wed, 20 Oct 2020, Alex Shi wrote:
> On 2020/9/24 11:28 AM, Alex Shi wrote:
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -273,6 +273,8 @@ enum lruvec_flags {
> >  };
> >  
> >  struct lruvec {
> > +   /* per lruvec lru_lock for memcg */
> > +   spinlock_t  lru_lock;
> > struct list_headlists[NR_LRU_LISTS];
> > /*
> >  * These track the cost of reclaiming one LRU - file or anon -
> 
> Hi All,
> 
> Intel's Rong Chen, from LKP, reported a big regression on this patch: about a
> 12 ~ 32% performance drop on fio.read_iops and the case-lru-file-mmap-read
> case on a wide Intel machine with the attached kernel config.

Many thanks for the stats that followed: but though I spent many hours
poring over them, I ended up (as all too usual for me) just drowning
in the numbers, less and less able to arrive at any conclusion.

Why is 141% such a common standard deviation? I presume an artifact
of trying to estimate a stddev from only 3 runs? I wanted to ask you
to increase the runs from 3 to 12: too many of the numbers looked
unsafe to rely upon. And perhaps to increase the runtime from 200s
to 1000s, since some of the entries (invalidate_mapping_pages) look
as if they come from test preparation rather than the test itself.
But don't bother unless someone else asks: I would only drown again!

I have come to a tentative conclusion below, I don't see those stats
contradicting it, but I cannot argue how they support it either.

> Hugh Dickins pointed out that it is a false sharing issue on the lru_lock.

Well, I did suggest that as a possibility at first, before finding
that the slab (or slob) cache is guaranteeing cacheline alignment to
the mem_cgroup_per_node, hence to the lruvec, hence to the lru_lock.
So, I could not then see any false sharing.

> And that could be fixed by moving the lru_lock out of the busy lists[]
> cacheline,

No, access to those lists[] is always under that lru_lock: they
belong together, and sharing a cacheline between them is very
efficient: that is good sharing, not false sharing.

> like the following patch:

Not really.

> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a75e6d0effcb..58b21bffef95 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -272,9 +272,9 @@ enum lruvec_flags {
>  };
> 
>  struct lruvec {
> +   struct list_headlists[NR_LRU_LISTS];
> /* per lruvec lru_lock for memcg */
> spinlock_t  lru_lock;
> -   struct list_headlists[NR_LRU_LISTS];
> /*
>  * These track the cost of reclaiming one LRU - file or anon -
>  * over the other. As the observed cost of reclaiming one LRU

Aha, that's a different patch than you showed me before, and it's
wonderfully simple, but rather surprising.  This patch really works
to fix the reported regression?  That's great, so long as it doesn't
turn out to introduce regressions elsewhere.  If it works, it's so
simple that I strongly approve of it, even if I don't understand it.

But (I see it in the lruv20 branch of your github tree) I would much
prefer you to remove my name from the commit message: it's your patch,
all I contributed was misdirection and bafflement.

> 
> Although the problem is fixed, I still have no idea of the reason
> and the root problem. Any comments on this?

Right, how to explain it?

Well (and I'm assuming a 64-byte cacheline - I do hope the machines
you've been testing on really have that, rather than some greater
size that the slab allocator is unaware of), with your patch in,
the first cacheline of the struct lruvec is:

inactive_anon.next inactive_anon.prev active_anon.next active_anon.prev
inactive_file.next inactive_file.prev active_file.next active_file.prev

And the interesting second cacheline of the struct lruvec is:

unevictable.next   unevictable.prev   lru_lock anon_cost
file_cost  nonresident_agerefaults[anon]   refaults[file]

(I hope I'm not missing some debug option in your config, which would
expand lru_lock from an int plus a pad int, to something much bigger.)
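
To make that concrete, here is a sketch of the reordered struct with the
64-byte cacheline boundaries marked, under those same assumptions (64-bit
kernel, no lock debugging, so spinlock_t is an int plus a pad int; offsets
approximate):

struct lruvec {
        /* cacheline 0 (bytes 0..63): inactive_anon, active_anon,
         * inactive_file, active_file - four 16-byte list_heads */
        /* cacheline 1 (bytes 64..127): unevictable plus the fields below */
        struct list_head        lists[NR_LRU_LISTS];    /* 5 * 16 = 80 bytes */
        spinlock_t              lru_lock;               /* 4 bytes + 4 pad */
        unsigned long           anon_cost;              /* 8 bytes */
        unsigned long           file_cost;              /* 8 bytes */
        atomic_long_t           nonresident_age;        /* 8 bytes */
        unsigned long           refaults[2];            /* anon, file */
        /* ... remainder of the struct unchanged ... */
};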

Of those fields, unevictable.next unevictable.prev anon_cost and
file_cost are accessed under lru_lock, so are good to share its
cacheline.  snapshot_refaults() does not take any lock when setting
refaults[2], and they are not atomic; shrink_node() does not take
any lock when reading them.  We could quite easily change the code
to require lru_lock to access them, but it looks to me as if all
that would do is add unnecessary overhead: I could be wrong, but
they don't look worrying to me.

The worrying one is atomic_long_t nonresident_age.
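
For reference, its mainline implementation (mm/workingset.c) is roughly an
atomic add on every lruvec up the memcg hierarchy, called from workingset
refault/activation paths that do not hold lru_lock:

void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
{
        /* every ancestor lruvec's counter is bumped as well */
        do {
                atomic_long_add(nr_pages, &lruvec->nonresident_age);
        } while ((lruvec = parent_lruvec(lruvec)));
}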

And the advice we would usually give is to keep that in a separate
cacheline from lru_lock and its domain.  But what your patch does is
the opposite: it moves lru_lock into the same cacheline as its "rival";
and testing (so far) has shown that to be a good choice.

That seems audacious and surprising to me, but not 

[PATCH v19 18/20] mm/lru: replace pgdat lru_lock with lruvec lock

2020-09-23 Thread Alex Shi
This patch moves the per-node lru_lock into the lruvec, thus bringing a
lru_lock to each memcg on each node. So on a large machine, memcgs no longer
have to suffer from contention on the per-node pgdat->lru_lock: they can go
fast with their own lru_lock.

After moving memcg charging to before LRU insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock.
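
(Illustration only, not part of the patch: the caller-side pattern this
enables looks roughly like the following, with the old node-wide locking
shown in comments; the actual converted call sites are in the diff below.)

static void isolate_one_page_sketch(struct page *page)
{
        struct lruvec *lruvec;

        /* before: spin_lock_irq(&page_pgdat(page)->lru_lock); */
        lruvec = lock_page_lruvec_irq(page);

        if (PageLRU(page)) {
                ClearPageLRU(page);
                del_page_from_lru_list(page, lruvec, page_lru(page));
        }

        unlock_page_lruvec_irq(lruvec);
        /* before: spin_unlock_irq(&page_pgdat(page)->lru_lock); */
}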

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function in the locking path which may give some clues
if something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice case
on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the page's
lruvec passes through a few extra pointers, which may increase lru_lock
holding time and cause a slight regression.

Hugh Dickins helped on patch polish, thanks!

Signed-off-by: Alex Shi 
Acked-by: Hugh Dickins 
Cc: Hugh Dickins 
Cc: Andrew Morton 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Yang Shi 
Cc: Matthew Wilcox 
Cc: Konstantin Khlebnikov 
Cc: Tejun Heo 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: cgro...@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +
 include/linux/mmzone.h |   3 +-
 mm/compaction.c|  56 +++-
 mm/huge_memory.c   |  11 ++---
 mm/memcontrol.c|  62 --
 mm/mlock.c |  22 +++---
 mm/mmzone.c|   1 +
 mm/page_alloc.c|   1 -
 mm/swap.c  | 105 +
 mm/vmscan.c|  55 +++-
 10 files changed, 249 insertions(+), 125 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..7b170e9028b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -494,6 +494,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+   unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1035,6 +1048,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+   struct pglist_data *pgdat = page_pgdat(page);
+
+   spin_lock(&pgdat->__lruvec.lru_lock);
+   return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+   struct pglist_data *pgdat = page_pgdat(page);
+
+   spin_lock_irq(&pgdat->__lruvec.lru_lock);
+   return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+   unsigned long *flagsp)
+{
+   struct pglist_data *pgdat = page_pgdat(page);
+
+   spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+   return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -1282,6 +1320,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1411,6 +1453,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+   spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+   spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+   unsigned long flags)
+{
+   spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8379432f4f2f..7727f4c373f7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -273,6 +273,8 @@ enum lruvec_flags {
 };
 
 struct lruvec {
+   /* per lruvec lru_lock for memcg */
+   spinlock_t  lru_lock;