Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-06 Thread Chengming Zhou
On 2023/12/7 03:47, Nhat Pham wrote:
> [...]
>>
>> Hmm so how should we proceed from here? How about this:
>>
>> a) I can send a fixlet to move the enablement check above the stats
>> flushing + use mem_cgroup_flush_stats
>> b) Then maybe, you can send a fixlet to update this new callsite?
>>
>> Does that sound reasonable?
> 
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

Yeah, also looks good to me. Thanks!

-- 
Best regards,
Chengming Zhou




Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-06 Thread Yosry Ahmed
On Wed, Dec 6, 2023 at 11:47 AM Nhat Pham  wrote:
>
> [...]
> >
> > Hmm so how should we proceed from here? How about this:
> >
> > a) I can send a fixlet to move the enablement check above the stats
> > flushing + use mem_cgroup_flush_stats
> > b) Then maybe, you can send a fixlet to update this new callsite?
> >
> > Does that sound reasonable?
>
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

The fixlet looks good, and Andrew already took care of (b) before I
could send a followup fixlet out :)



Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-06 Thread Nhat Pham
[...]
>
> Hmm so how should we proceed from here? How about this:
>
> a) I can send a fixlet to move the enablement check above the stats
> flushing + use mem_cgroup_flush_stats
> b) Then maybe, you can send a fixlet to update this new callsite?
>
> Does that sound reasonable?

I just sent out the fixlet. Yosry and Chengming, let me know if that
looks good. Thank you both for detecting this issue and proposing the
fix!



Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-06 Thread Nhat Pham
On Tue, Dec 5, 2023 at 10:00 PM Yosry Ahmed  wrote:
>
> [..]
> > > @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> > >   return entry;
> > >  }
> > >
> > > +/*
> > > +* shrinker functions
> > > +**/
> > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > > +spinlock_t *lock, void *arg);
> > > +
> > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > > + unsigned long shrink_ret, nr_protected, lru_size;
> > > + struct zswap_pool *pool = shrinker->private_data;
> > > + bool encountered_page_in_swapcache = false;
> > > +
> > > + nr_protected =
> > > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > > + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > > +
> > > + /*
> > > +  * Abort if the shrinker is disabled or if we are shrinking into the
> > > +  * protected region.
> > > +  *
> > > +  * This short-circuiting is necessary because if we have too many multiple
> > > +  * concurrent reclaimers getting the freeable zswap object counts at the
> > > +  * same time (before any of them made reasonable progress), the total
> > > +  * number of reclaimed objects might be more than the number of unprotected
> > > +  * objects (i.e the reclaimers will reclaim into the protected area of the
> > > +  * zswap LRU).
> > > +  */
> > > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > > + sc->nr_scanned = 0;
> > > + return SHRINK_STOP;
> > > + }
> > > +
> > > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > > + &encountered_page_in_swapcache);
> > > +
> > > + if (encountered_page_in_swapcache)
> > > + return SHRINK_STOP;
> > > +
> > > + return shrink_ret ? shrink_ret : SHRINK_STOP;
> > > +}
> > > +
> > > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + struct zswap_pool *pool = shrinker->private_data;
> > > + struct mem_cgroup *memcg = sc->memcg;
> > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> > > +
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > + cgroup_rstat_flush(memcg->css.cgroup);
> > > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > > +#else
> > > + /* use pool stats instead of memcg stats */
> > > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > > + nr_stored = atomic_read(&pool->nr_stored);
> > > +#endif
> > > +
> > > + if (!zswap_shrinker_enabled || !nr_stored)
> > When I tested with this series, with !zswap_shrinker_enabled in the default case,
> > I found the performance to be much worse than without this patch.
> >
> > Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
> >
> > The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> > lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> > check above the cgroup_rstat_flush(), the performance becomes much better.
> >
> > Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
>
> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> use mem_cgroup_flush_stats() here like other places unless accuracy is
> crucial, which I doubt given that reclaim uses
> mem_cgroup_flush_stats().

Ah, good points on both suggestions. We should not do extra work for
non-users. And this is a best-effort approximation of the memory
saving factor, so as long as it is not *too* far off I think it's
acceptable.
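
To make the "memory saving factor" above concrete: the freeable count reported to
the shrinker is scaled by the ratio of compressed backing pages to stored pages, so
the better zswap compresses, the fewer objects it advertises for eviction. Below is
a minimal user-space sketch of that arithmetic; the sample numbers and the
mult_frac_demo() helper are illustrative stand-ins, not the kernel's mult_frac() or
the patch itself.

    /*
     * Toy demonstration of scaling the freeable zswap object count by the
     * memory saving factor (compressed backing pages / stored pages).
     */
    #include <stdio.h>

    /* User-space stand-in for the kernel's mult_frac(x, numer, denom). */
    static unsigned long mult_frac_demo(unsigned long x, unsigned long numer,
                                        unsigned long denom)
    {
            unsigned long quot = x / denom;
            unsigned long rem = x % denom;

            return quot * numer + (rem * numer) / denom;
    }

    int main(void)
    {
            unsigned long nr_freeable = 10000; /* unprotected zswap LRU entries */
            unsigned long nr_backing = 2500;   /* compressed backing memory, in pages */
            unsigned long nr_stored = 10000;   /* pages stored in zswap */

            /* 4:1 compression => only ~2500 of the 10000 entries reported freeable. */
            printf("scaled freeable: %lu\n",
                   mult_frac_demo(nr_freeable, nr_backing, nr_stored));
            return 0;
    }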

>
> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amenable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.

Hmm so how should we proceed from here? How about this:

a) I can send a fixlet to move the enablement check above the stats
flushing + use mem_cgroup_flush_stats
b) Then maybe, you can send a fixlet to update this new callsite?

Does that sound reasonable?

>
> >
> > Thanks!
> >
> > > + return 0;
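
As a rough illustration of step (a) above, zswap_shrinker_count() could take the
following shape: bail out before touching the stats machinery when the shrinker is
disabled, and flush through mem_cgroup_flush_stats(memcg) (the signature Chengming
tested on linux-next) instead of cgroup_rstat_flush(). This is only a sketch of the
reordering being discussed, not the fixlet that was actually posted; the tail of the
function is reproduced from the code quoted above, with the final scaling spelled
out as mult_frac(nr_freeable, nr_backing, nr_stored) per the "memory saving factor"
comment.

    static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
                    struct shrink_control *sc)
    {
            struct zswap_pool *pool = shrinker->private_data;
            struct mem_cgroup *memcg = sc->memcg;
            struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
            unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;

            /* Do nothing (and flush nothing) for non-users of the shrinker. */
            if (!zswap_shrinker_enabled)
                    return 0;

    #ifdef CONFIG_MEMCG_KMEM
            /* Thresholded flush instead of an unconditional rstat flush. */
            mem_cgroup_flush_stats(memcg);
            nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
            nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
    #else
            /* use pool stats instead of memcg stats */
            nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
            nr_stored = atomic_read(&pool->nr_stored);
    #endif

            if (!nr_stored)
                    return 0;

            nr_protected =
                    atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
            nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
            /* Subtract an estimate of the number of pages that should be protected. */
            nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;

            /* Scale the freeable pages by the memory saving factor. */
            return mult_frac(nr_freeable, nr_backing, nr_stored);
    }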

Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-05 Thread Chengming Zhou
On 2023/12/6 15:36, Yosry Ahmed wrote:
> On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou  wrote:
>>
>> On 2023/12/6 13:59, Yosry Ahmed wrote:
>>> [..]
>>>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>>>>   return entry;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> +* shrinker functions
>>>>> +**/
>>>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>>>> +spinlock_t *lock, void *arg);
>>>>> +
>>>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>>>> + struct shrink_control *sc)
>>>>> +{
>>>>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>>>> + unsigned long shrink_ret, nr_protected, lru_size;
>>>>> + struct zswap_pool *pool = shrinker->private_data;
>>>>> + bool encountered_page_in_swapcache = false;
>>>>> +
>>>>> + nr_protected =
>>>>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>>>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>>>> +
>>>>> + /*
>>>>> +  * Abort if the shrinker is disabled or if we are shrinking into the
>>>>> +  * protected region.
>>>>> +  *
>>>>> +  * This short-circuiting is necessary because if we have too many multiple
>>>>> +  * concurrent reclaimers getting the freeable zswap object counts at the
>>>>> +  * same time (before any of them made reasonable progress), the total
>>>>> +  * number of reclaimed objects might be more than the number of unprotected
>>>>> +  * objects (i.e the reclaimers will reclaim into the protected area of the
>>>>> +  * zswap LRU).
>>>>> +  */
>>>>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>>>> + sc->nr_scanned = 0;
>>>>> + return SHRINK_STOP;
>>>>> + }
>>>>> +
>>>>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>>>> + &encountered_page_in_swapcache);
>>>>> +
>>>>> + if (encountered_page_in_swapcache)
>>>>> + return SHRINK_STOP;
>>>>> +
>>>>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>>>> + struct shrink_control *sc)
>>>>> +{
>>>>> + struct zswap_pool *pool = shrinker->private_data;
>>>>> + struct mem_cgroup *memcg = sc->memcg;
>>>>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>>>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>>>> +
>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>> + cgroup_rstat_flush(memcg->css.cgroup);
>>>>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>>>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>>>> +#else
>>>>> + /* use pool stats instead of memcg stats */
>>>>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>>>> + nr_stored = atomic_read(&pool->nr_stored);
>>>>> +#endif
>>>>> +
>>>>> + if (!zswap_shrinker_enabled || !nr_stored)
>>>> When I tested with this series, with !zswap_shrinker_enabled in the default case,
>>>> I found the performance to be much worse than without this patch.
>>>>
>>>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>>>>
>>>> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
>>>> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
>>>> check above the cgroup_rstat_flush(), the performance becomes much better.
>>>>
>>>> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
>>>
>>> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
>>> use mem_cgroup_flush_stats() here like other places unless accuracy is
>>> crucial, which I doubt given that reclaim uses
>>> mem_cgroup_flush_stats().
>>>
>>
>> Yes. After changing to use mem_cgroup_flush_stats() here, the performance
>> becomes much better.
>>
>>> mem_cgroup_flush_stats() has some thresholding to make sure we don't
>>> do flushes unnecessarily, and I have a pending series in mm-unstable
>>> that makes that thresholding per-memcg. Keep in mind that adding a
>>> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>>
>> My test branch is linux-next 20231205, and it's all good after changing
>> to use mem_cgroup_flush_stats(memcg).
> 
> Thanks for reporting back. We should still move the
> zswap_shrinker_enabled check ahead, no need to even call
> mem_cgroup_flush_stats() if we will do nothing anyway.
> 

Yes, agree!




Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-05 Thread Yosry Ahmed
On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou  wrote:
>
> On 2023/12/6 13:59, Yosry Ahmed wrote:
> > [..]
> >>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >>>   return entry;
> >>>  }
> >>>
> >>> +/*
> >>> +* shrinker functions
> >>> +**/
> >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >>> +spinlock_t *lock, void *arg);
> >>> +
> >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >>> + struct shrink_control *sc)
> >>> +{
> >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >>> + unsigned long shrink_ret, nr_protected, lru_size;
> >>> + struct zswap_pool *pool = shrinker->private_data;
> >>> + bool encountered_page_in_swapcache = false;
> >>> +
> >>> + nr_protected =
> >>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> >>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> >>> +
> >>> + /*
> >>> +  * Abort if the shrinker is disabled or if we are shrinking into the
> >>> +  * protected region.
> >>> +  *
> >>> +  * This short-circuiting is necessary because if we have too many multiple
> >>> +  * concurrent reclaimers getting the freeable zswap object counts at the
> >>> +  * same time (before any of them made reasonable progress), the total
> >>> +  * number of reclaimed objects might be more than the number of unprotected
> >>> +  * objects (i.e the reclaimers will reclaim into the protected area of the
> >>> +  * zswap LRU).
> >>> +  */
> >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> >>> + sc->nr_scanned = 0;
> >>> + return SHRINK_STOP;
> >>> + }
> >>> +
> >>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> >>> + &encountered_page_in_swapcache);
> >>> +
> >>> + if (encountered_page_in_swapcache)
> >>> + return SHRINK_STOP;
> >>> +
> >>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
> >>> +}
> >>> +
> >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >>> + struct shrink_control *sc)
> >>> +{
> >>> + struct zswap_pool *pool = shrinker->private_data;
> >>> + struct mem_cgroup *memcg = sc->memcg;
> >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> >>> +
> >>> +#ifdef CONFIG_MEMCG_KMEM
> >>> + cgroup_rstat_flush(memcg->css.cgroup);
> >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> >>> +#else
> >>> + /* use pool stats instead of memcg stats */
> >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> >>> + nr_stored = atomic_read(&pool->nr_stored);
> >>> +#endif
> >>> +
> >>> + if (!zswap_shrinker_enabled || !nr_stored)
> >> When I tested with this series, with !zswap_shrinker_enabled in the default case,
> >> I found the performance to be much worse than without this patch.
> >>
> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
> >>
> >> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> >> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> >> check above the cgroup_rstat_flush(), the performance becomes much better.
> >>
> >> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
> >
> > Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> > use mem_cgroup_flush_stats() here like other places unless accuracy is
> > crucial, which I doubt given that reclaim uses
> > mem_cgroup_flush_stats().
> >
>
> Yes. After changing to use mem_cgroup_flush_stats() here, the performance
> becomes much better.
>
> > mem_cgroup_flush_stats() has some thresholding to make sure we don't
> > do flushes unnecessarily, and I have a pending series in mm-unstable
> > that makes that thresholding per-memcg. Keep in mind that adding a
> > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>
> My test branch is linux-next 20231205, and it's all good after changing
> to use mem_cgroup_flush_stats(memcg).

Thanks for reporting back. We should still move the
zswap_shrinker_enabled check ahead, no need to even call
mem_cgroup_flush_stats() if we will do nothing anyway.

>
> > because the series there adds a memcg argument to
> > mem_cgroup_flush_stats(). That should be easily amenable though, I can
> > post a fixlet for my series to add the memcg argument there on top of
> > users if needed.
> >
>
> It's great. Thanks!
>



Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-05 Thread Chengming Zhou
On 2023/12/6 13:59, Yosry Ahmed wrote:
> [..]
>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>>   return entry;
>>>  }
>>>
>>> +/*
>>> +* shrinker functions
>>> +**/
>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>> +spinlock_t *lock, void *arg);
>>> +
>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>> + struct shrink_control *sc)
>>> +{
>>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>> + unsigned long shrink_ret, nr_protected, lru_size;
>>> + struct zswap_pool *pool = shrinker->private_data;
>>> + bool encountered_page_in_swapcache = false;
>>> +
>>> + nr_protected =
>>> + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>> +
>>> + /*
>>> +  * Abort if the shrinker is disabled or if we are shrinking into the
>>> +  * protected region.
>>> +  *
>>> +  * This short-circuiting is necessary because if we have too many multiple
>>> +  * concurrent reclaimers getting the freeable zswap object counts at the
>>> +  * same time (before any of them made reasonable progress), the total
>>> +  * number of reclaimed objects might be more than the number of unprotected
>>> +  * objects (i.e the reclaimers will reclaim into the protected area of the
>>> +  * zswap LRU).
>>> +  */
>>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>> + sc->nr_scanned = 0;
>>> + return SHRINK_STOP;
>>> + }
>>> +
>>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>> + &encountered_page_in_swapcache);
>>> +
>>> + if (encountered_page_in_swapcache)
>>> + return SHRINK_STOP;
>>> +
>>> + return shrink_ret ? shrink_ret : SHRINK_STOP;
>>> +}
>>> +
>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>> + struct shrink_control *sc)
>>> +{
>>> + struct zswap_pool *pool = shrinker->private_data;
>>> + struct mem_cgroup *memcg = sc->memcg;
>>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>> +
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> + cgroup_rstat_flush(memcg->css.cgroup);
>>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>> +#else
>>> + /* use pool stats instead of memcg stats */
>>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>> + nr_stored = atomic_read(&pool->nr_stored);
>>> +#endif
>>> +
>>> + if (!zswap_shrinker_enabled || !nr_stored)
>> When I tested with this series, with !zswap_shrinker_enabled in the default case,
>> I found the performance to be much worse than without this patch.
>>
>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>>
>> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
>> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
>> check above the cgroup_rstat_flush(), the performance becomes much better.
>>
>> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?
> 
> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> use mem_cgroup_flush_stats() here like other places unless accuracy is
> crucial, which I doubt given that reclaim uses
> mem_cgroup_flush_stats().
> 

Yes. After changing to use mem_cgroup_flush_stats() here, the performance
becomes much better.

> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,

My test branch is linux-next 20231205, and it's all good after changing
to use mem_cgroup_flush_stats(memcg).

> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amenable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.
> 

It's great. Thanks!




Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-05 Thread Yosry Ahmed
[..]
> > @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >   return entry;
> >  }
> >
> > +/*
> > +* shrinker functions
> > +**/
> > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > +spinlock_t *lock, void *arg);
> > +
> > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > + struct shrink_control *sc)
> > +{
> > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > + unsigned long shrink_ret, nr_protected, lru_size;
> > + struct zswap_pool *pool = shrinker->private_data;
> > + bool encountered_page_in_swapcache = false;
> > +
> > + nr_protected =
> > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > + lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > +
> > + /*
> > +  * Abort if the shrinker is disabled or if we are shrinking into the
> > +  * protected region.
> > +  *
> > +  * This short-circuiting is necessary because if we have too many multiple
> > +  * concurrent reclaimers getting the freeable zswap object counts at the
> > +  * same time (before any of them made reasonable progress), the total
> > +  * number of reclaimed objects might be more than the number of unprotected
> > +  * objects (i.e the reclaimers will reclaim into the protected area of the
> > +  * zswap LRU).
> > +  */
> > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > + sc->nr_scanned = 0;
> > + return SHRINK_STOP;
> > + }
> > +
> > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > + &encountered_page_in_swapcache);
> > +
> > + if (encountered_page_in_swapcache)
> > + return SHRINK_STOP;
> > +
> > + return shrink_ret ? shrink_ret : SHRINK_STOP;
> > +}
> > +
> > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > + struct shrink_control *sc)
> > +{
> > + struct zswap_pool *pool = shrinker->private_data;
> > + struct mem_cgroup *memcg = sc->memcg;
> > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > + cgroup_rstat_flush(memcg->css.cgroup);
> > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > +#else
> > + /* use pool stats instead of memcg stats */
> > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > + nr_stored = atomic_read(&pool->nr_stored);
> > +#endif
> > +
> > + if (!zswap_shrinker_enabled || !nr_stored)
> When I tested with this series, with !zswap_shrinker_enabled in the default case,
> I found the performance to be much worse than without this patch.
>
> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>
> The reason seems to be the above cgroup_rstat_flush(), which caused a lot of rstat
> lock contention on the zswap_store() path. And if I put the "zswap_shrinker_enabled"
> check above the cgroup_rstat_flush(), the performance becomes much better.
>
> Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()?

Yes, we should do nothing if !zswap_shrinker_enabled. We should also
use mem_cgroup_flush_stats() here like other places unless accuracy is
crucial, which I doubt given that reclaim uses
mem_cgroup_flush_stats().

mem_cgroup_flush_stats() has some thresholding to make sure we don't
do flushes unnecessarily, and I have a pending series in mm-unstable
that makes that thresholding per-memcg. Keep in mind that adding a
call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
because the series there adds a memcg argument to
mem_cgroup_flush_stats(). That should be easily amenable though, I can
post a fixlet for my series to add the memcg argument there on top of
users if needed.

>
> Thanks!
>
> > + return 0;
> > +
> > + nr_protected =
> > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > + nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > + /*
> > +  * Subtract the lru size by an estimate of the number of pages
> > +  * that should be protected.
> > +  */
> > + nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +
> > + /*
> > +  * Scale the number of freeable pages by the memory saving factor.
> > +  * This ensures that the better zswap compresses memory, the fewer
> > +  * pages we will evict to swap (as it will otherwise incur IO for
> > +  * relatively small memory saving).
> > +  */
> > + return mult_frac(nr_freeable, 

Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure

2023-12-05 Thread Chengming Zhou
On 2023/12/1 03:40, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
> 
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
> 
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
> 
> * Estimate the number of warm pages residing in zswap, and attempt to
>   protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
>   saving factor. The better zswap compresses the data, the fewer pages
>   we will evict to swap (as we will otherwise incur IO for relatively
>   small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
>   brought into memory, the shrinker will cautiously terminate its
>   shrinking action, as this is a sign that it is touching the warmer
>   region of the zswap LRU.
> 
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
> 
> Signed-off-by: Nhat Pham 
> Acked-by: Johannes Weiner 
> ---
>  Documentation/admin-guide/mm/zswap.rst |  10 ++
>  include/linux/mmzone.h |   2 +
>  include/linux/zswap.h  |  25 +++-
>  mm/Kconfig |  14 ++
>  mm/mmzone.c|   1 +
>  mm/swap_state.c|   2 +
>  mm/zswap.c | 185 -
>  7 files changed, 233 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 45b98390e938..62fc244ec702 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,16 @@ attribute, e. g.::
>  
>  Setting this parameter to 100 will disable the hysteresis.
>  
> +When there is a sizable amount of cold memory residing in the zswap pool, it
> +can be advantageous to proactively write these cold pages to swap and reclaim
> +the memory for other use cases. By default, the zswap shrinker is disabled.
> +User can enable it as follows:
> +
> +  echo Y > /sys/module/zswap/parameters/shrinker_enabled
> +
> +This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
> +selected.
> +
>  A debugfs interface is provided for various statistic about pool size, number
>  of pages stored, same-value filled pages and various counters for the reasons
>  pages are rejected.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7b1816450bfc..b23bc5390240 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /* Free memory management - zoned buddy allocator.  */
> @@ -641,6 +642,7 @@ struct lruvec {
>  #ifdef CONFIG_MEMCG
>   struct pglist_data *pgdat;
>  #endif
> + struct zswap_lruvec_state zswap_lruvec_state;
>  };
>  
>  /* Isolate for asynchronous migration */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index e571e393669b..08c240e16a01 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -5,20 +5,40 @@
>  #include 
>  #include 
>  
> +struct lruvec;
> +
>  extern u64 zswap_pool_total_size;
>  extern atomic_t zswap_stored_pages;
>  
>  #ifdef CONFIG_ZSWAP
>  
> +struct zswap_lruvec_state {
> + /*
> +  * Number of pages in zswap that should be protected from the shrinker.
> +  * This number is an estimate of the following counts:
> +  *
> +  * a) Recent page faults.
> +  * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> +  * as well as recent zswap LRU rotations.
> +  *
> +  * These pages are likely to be warm, and might incur IO if they are written
> +  * to swap.
> +  */
> + atomic_long_t nr_zswap_protected;
> +};
> +
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
>  void