Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/7 03:47, Nhat Pham wrote:
> [...]
>>
>> Hmm so how should we proceed from here? How about this:
>>
>> a) I can send a fixlet to move the enablement check above the stats
>> flushing + use mem_cgroup_flush_stats
>> b) Then maybe, you can send a fixlet to update this new callsite?
>>
>> Does that sound reasonable?
>
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

Yeah, also looks good to me. Thanks!

--
Best regards,
Chengming Zhou
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On Wed, Dec 6, 2023 at 11:47 AM Nhat Pham wrote:
>
> [...]
> >
> > Hmm so how should we proceed from here? How about this:
> >
> > a) I can send a fixlet to move the enablement check above the stats
> > flushing + use mem_cgroup_flush_stats
> > b) Then maybe, you can send a fixlet to update this new callsite?
> >
> > Does that sound reasonable?
>
> I just sent out the fixlet. Yosry and Chengming, let me know if that
> looks good. Thank you both for detecting this issue and proposing the
> fix!

The fixlet looks good, and Andrew already took care of (b) before I
could send a followup fixlet out :)
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
[...]
>
> Hmm so how should we proceed from here? How about this:
>
> a) I can send a fixlet to move the enablement check above the stats
> flushing + use mem_cgroup_flush_stats
> b) Then maybe, you can send a fixlet to update this new callsite?
>
> Does that sound reasonable?

I just sent out the fixlet. Yosry and Chengming, let me know if that
looks good. Thank you both for detecting this issue and proposing the
fix!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On Tue, Dec 5, 2023 at 10:00 PM Yosry Ahmed wrote: > > [..] > > > @@ -526,6 +582,102 @@ static struct zswap_entry > > > *zswap_entry_find_get(struct rb_root *root, > > > return entry; > > > } > > > > > > +/* > > > +* shrinker functions > > > +**/ > > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct > > > list_lru_one *l, > > > +spinlock_t *lock, void *arg); > > > + > > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > > > + struct shrink_control *sc) > > > +{ > > > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, > > > NODE_DATA(sc->nid)); > > > + unsigned long shrink_ret, nr_protected, lru_size; > > > + struct zswap_pool *pool = shrinker->private_data; > > > + bool encountered_page_in_swapcache = false; > > > + > > > + nr_protected = > > > + > > > atomic_long_read(>zswap_lruvec_state.nr_zswap_protected); > > > + lru_size = list_lru_shrink_count(>list_lru, sc); > > > + > > > + /* > > > + * Abort if the shrinker is disabled or if we are shrinking into the > > > + * protected region. > > > + * > > > + * This short-circuiting is necessary because if we have too many > > > multiple > > > + * concurrent reclaimers getting the freeable zswap object counts > > > at the > > > + * same time (before any of them made reasonable progress), the > > > total > > > + * number of reclaimed objects might be more than the number of > > > unprotected > > > + * objects (i.e the reclaimers will reclaim into the protected area > > > of the > > > + * zswap LRU). > > > + */ > > > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - > > > sc->nr_to_scan) { > > > + sc->nr_scanned = 0; > > > + return SHRINK_STOP; > > > + } > > > + > > > + shrink_ret = list_lru_shrink_walk(>list_lru, sc, > > > _memcg_cb, > > > + _page_in_swapcache); > > > + > > > + if (encountered_page_in_swapcache) > > > + return SHRINK_STOP; > > > + > > > + return shrink_ret ? shrink_ret : SHRINK_STOP; > > > +} > > > + > > > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > > > + struct shrink_control *sc) > > > +{ > > > + struct zswap_pool *pool = shrinker->private_data; > > > + struct mem_cgroup *memcg = sc->memcg; > > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, > > > NODE_DATA(sc->nid)); > > > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > > > + > > > +#ifdef CONFIG_MEMCG_KMEM > > > + cgroup_rstat_flush(memcg->css.cgroup); > > > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > > > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > > > +#else > > > + /* use pool stats instead of memcg stats */ > > > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; > > > + nr_stored = atomic_read(>nr_stored); > > > +#endif > > > + > > > + if (!zswap_shrinker_enabled || !nr_stored) > > When I tested with this series, with !zswap_shrinker_enabled in the default > > case, > > I found the performance is much worse than that without this patch. > > > > Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs > > directory. > > > > The reason seems the above cgroup_rstat_flush(), caused much rstat lock > > contention > > to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check > > above > > the cgroup_rstat_flush(), the performance become much better. > > > > Maybe we can put the "zswap_shrinker_enabled" check above > > cgroup_rstat_flush()? > > Yes, we should do nothing if !zswap_shrinker_enabled. 
> We should also
> use mem_cgroup_flush_stats() here like other places unless accuracy is
> crucial, which I doubt given that reclaim uses
> mem_cgroup_flush_stats().

Ah, good points on both suggestions. We should not do extra work for
non-users. And, this is a best-effort approximation of the memory saving
factor, so as long as it is not *too* far off I think it's acceptable.

>
> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amenable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.

Hmm so how should we proceed from here? How about this:

a) I can send a fixlet to move the enablement check above the stats
flushing + use mem_cgroup_flush_stats
b) Then maybe, you can send a fixlet to update this new callsite?

Does that sound reasonable?

> >
> > Thanks!
> >
> > > +               return 0;
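
For (a), roughly what I have in mind is the following - an untested
sketch against the v8 code, not the actual fixlet, and on mm-unstable
the flush would of course need your memcg argument:

static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
		struct shrink_control *sc)
{
	struct zswap_pool *pool = shrinker->private_data;
	struct mem_cgroup *memcg = sc->memcg;
	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;

	/* Do nothing (and flush nothing) if the shrinker is disabled. */
	if (!zswap_shrinker_enabled)
		return 0;

#ifdef CONFIG_MEMCG_KMEM
	/* Thresholded flush instead of an unconditional rstat flush. */
	mem_cgroup_flush_stats();
	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
#else
	/* use pool stats instead of memcg stats */
	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
	nr_stored = atomic_read(&pool->nr_stored);
#endif

	if (!nr_stored)
		return 0;

	/* ... protection estimate and memory-saving scaling unchanged ... */
}

The point being that neither cgroup_rstat_flush() nor
mem_cgroup_flush_stats() is reached at all when the shrinker is
disabled, which is what the kernel-build test above was hitting.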
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 15:36, Yosry Ahmed wrote: > On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou > wrote: >> >> On 2023/12/6 13:59, Yosry Ahmed wrote: >>> [..] > @@ -526,6 +582,102 @@ static struct zswap_entry > *zswap_entry_find_get(struct rb_root *root, > return entry; > } > > +/* > +* shrinker functions > +**/ > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct > list_lru_one *l, > +spinlock_t *lock, void *arg); > + > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > + struct shrink_control *sc) > +{ > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, > NODE_DATA(sc->nid)); > + unsigned long shrink_ret, nr_protected, lru_size; > + struct zswap_pool *pool = shrinker->private_data; > + bool encountered_page_in_swapcache = false; > + > + nr_protected = > + > atomic_long_read(>zswap_lruvec_state.nr_zswap_protected); > + lru_size = list_lru_shrink_count(>list_lru, sc); > + > + /* > + * Abort if the shrinker is disabled or if we are shrinking into the > + * protected region. > + * > + * This short-circuiting is necessary because if we have too many > multiple > + * concurrent reclaimers getting the freeable zswap object counts > at the > + * same time (before any of them made reasonable progress), the > total > + * number of reclaimed objects might be more than the number of > unprotected > + * objects (i.e the reclaimers will reclaim into the protected area > of the > + * zswap LRU). > + */ > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - > sc->nr_to_scan) { > + sc->nr_scanned = 0; > + return SHRINK_STOP; > + } > + > + shrink_ret = list_lru_shrink_walk(>list_lru, sc, > _memcg_cb, > + _page_in_swapcache); > + > + if (encountered_page_in_swapcache) > + return SHRINK_STOP; > + > + return shrink_ret ? shrink_ret : SHRINK_STOP; > +} > + > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > + struct shrink_control *sc) > +{ > + struct zswap_pool *pool = shrinker->private_data; > + struct mem_cgroup *memcg = sc->memcg; > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, > NODE_DATA(sc->nid)); > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > + > +#ifdef CONFIG_MEMCG_KMEM > + cgroup_rstat_flush(memcg->css.cgroup); > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > +#else > + /* use pool stats instead of memcg stats */ > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; > + nr_stored = atomic_read(>nr_stored); > +#endif > + > + if (!zswap_shrinker_enabled || !nr_stored) When I tested with this series, with !zswap_shrinker_enabled in the default case, I found the performance is much worse than that without this patch. Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory. The reason seems the above cgroup_rstat_flush(), caused much rstat lock contention to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check above the cgroup_rstat_flush(), the performance become much better. Maybe we can put the "zswap_shrinker_enabled" check above cgroup_rstat_flush()? >>> >>> Yes, we should do nothing if !zswap_shrinker_enabled. We should also >>> use mem_cgroup_flush_stats() here like other places unless accuracy is >>> crucial, which I doubt given that reclaim uses >>> mem_cgroup_flush_stats(). >>> >> >> Yes. After changing to use mem_cgroup_flush_stats() here, the performance >> become much better. 
>>
>>> mem_cgroup_flush_stats() has some thresholding to make sure we don't
>>> do flushes unnecessarily, and I have a pending series in mm-unstable
>>> that makes that thresholding per-memcg. Keep in mind that adding a
>>> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>>
>> My test branch is linux-next 20231205, and it's all good after changing
>> to use mem_cgroup_flush_stats(memcg).
>
> Thanks for reporting back. We should still move the
> zswap_shrinker_enabled check ahead, no need to even call
> mem_cgroup_flush_stats() if we will do nothing anyway.
>

Yes, agree!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou wrote: > > On 2023/12/6 13:59, Yosry Ahmed wrote: > > [..] > >>> @@ -526,6 +582,102 @@ static struct zswap_entry > >>> *zswap_entry_find_get(struct rb_root *root, > >>> return entry; > >>> } > >>> > >>> +/* > >>> +* shrinker functions > >>> +**/ > >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct > >>> list_lru_one *l, > >>> +spinlock_t *lock, void *arg); > >>> + > >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > >>> + struct shrink_control *sc) > >>> +{ > >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, > >>> NODE_DATA(sc->nid)); > >>> + unsigned long shrink_ret, nr_protected, lru_size; > >>> + struct zswap_pool *pool = shrinker->private_data; > >>> + bool encountered_page_in_swapcache = false; > >>> + > >>> + nr_protected = > >>> + > >>> atomic_long_read(>zswap_lruvec_state.nr_zswap_protected); > >>> + lru_size = list_lru_shrink_count(>list_lru, sc); > >>> + > >>> + /* > >>> + * Abort if the shrinker is disabled or if we are shrinking into the > >>> + * protected region. > >>> + * > >>> + * This short-circuiting is necessary because if we have too many > >>> multiple > >>> + * concurrent reclaimers getting the freeable zswap object counts > >>> at the > >>> + * same time (before any of them made reasonable progress), the > >>> total > >>> + * number of reclaimed objects might be more than the number of > >>> unprotected > >>> + * objects (i.e the reclaimers will reclaim into the protected area > >>> of the > >>> + * zswap LRU). > >>> + */ > >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - > >>> sc->nr_to_scan) { > >>> + sc->nr_scanned = 0; > >>> + return SHRINK_STOP; > >>> + } > >>> + > >>> + shrink_ret = list_lru_shrink_walk(>list_lru, sc, > >>> _memcg_cb, > >>> + _page_in_swapcache); > >>> + > >>> + if (encountered_page_in_swapcache) > >>> + return SHRINK_STOP; > >>> + > >>> + return shrink_ret ? shrink_ret : SHRINK_STOP; > >>> +} > >>> + > >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > >>> + struct shrink_control *sc) > >>> +{ > >>> + struct zswap_pool *pool = shrinker->private_data; > >>> + struct mem_cgroup *memcg = sc->memcg; > >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, > >>> NODE_DATA(sc->nid)); > >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > >>> + > >>> +#ifdef CONFIG_MEMCG_KMEM > >>> + cgroup_rstat_flush(memcg->css.cgroup); > >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > >>> +#else > >>> + /* use pool stats instead of memcg stats */ > >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; > >>> + nr_stored = atomic_read(>nr_stored); > >>> +#endif > >>> + > >>> + if (!zswap_shrinker_enabled || !nr_stored) > >> When I tested with this series, with !zswap_shrinker_enabled in the > >> default case, > >> I found the performance is much worse than that without this patch. > >> > >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs > >> directory. > >> > >> The reason seems the above cgroup_rstat_flush(), caused much rstat lock > >> contention > >> to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check > >> above > >> the cgroup_rstat_flush(), the performance become much better. > >> > >> Maybe we can put the "zswap_shrinker_enabled" check above > >> cgroup_rstat_flush()? > > > > Yes, we should do nothing if !zswap_shrinker_enabled. 
> > We should also
> > use mem_cgroup_flush_stats() here like other places unless accuracy is
> > crucial, which I doubt given that reclaim uses
> > mem_cgroup_flush_stats().
> >
>
> Yes. After changing to use mem_cgroup_flush_stats() here, the performance
> become much better.
>
> > mem_cgroup_flush_stats() has some thresholding to make sure we don't
> > do flushes unnecessarily, and I have a pending series in mm-unstable
> > that makes that thresholding per-memcg. Keep in mind that adding a
> > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>
> My test branch is linux-next 20231205, and it's all good after changing
> to use mem_cgroup_flush_stats(memcg).

Thanks for reporting back. We should still move the
zswap_shrinker_enabled check ahead, no need to even call
mem_cgroup_flush_stats() if we will do nothing anyway.

> > because the series there adds a memcg argument to
> > mem_cgroup_flush_stats(). That should be easily amenable though, I can
> > post a fixlet for my series to add the memcg argument there on top of
> > users if needed.
> >
>
> It's great. Thanks!
>
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 13:59, Yosry Ahmed wrote: > [..] >>> @@ -526,6 +582,102 @@ static struct zswap_entry >>> *zswap_entry_find_get(struct rb_root *root, >>> return entry; >>> } >>> >>> +/* >>> +* shrinker functions >>> +**/ >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct >>> list_lru_one *l, >>> +spinlock_t *lock, void *arg); >>> + >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, >>> NODE_DATA(sc->nid)); >>> + unsigned long shrink_ret, nr_protected, lru_size; >>> + struct zswap_pool *pool = shrinker->private_data; >>> + bool encountered_page_in_swapcache = false; >>> + >>> + nr_protected = >>> + >>> atomic_long_read(>zswap_lruvec_state.nr_zswap_protected); >>> + lru_size = list_lru_shrink_count(>list_lru, sc); >>> + >>> + /* >>> + * Abort if the shrinker is disabled or if we are shrinking into the >>> + * protected region. >>> + * >>> + * This short-circuiting is necessary because if we have too many >>> multiple >>> + * concurrent reclaimers getting the freeable zswap object counts at >>> the >>> + * same time (before any of them made reasonable progress), the total >>> + * number of reclaimed objects might be more than the number of >>> unprotected >>> + * objects (i.e the reclaimers will reclaim into the protected area >>> of the >>> + * zswap LRU). >>> + */ >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - >>> sc->nr_to_scan) { >>> + sc->nr_scanned = 0; >>> + return SHRINK_STOP; >>> + } >>> + >>> + shrink_ret = list_lru_shrink_walk(>list_lru, sc, >>> _memcg_cb, >>> + _page_in_swapcache); >>> + >>> + if (encountered_page_in_swapcache) >>> + return SHRINK_STOP; >>> + >>> + return shrink_ret ? shrink_ret : SHRINK_STOP; >>> +} >>> + >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct zswap_pool *pool = shrinker->private_data; >>> + struct mem_cgroup *memcg = sc->memcg; >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; >>> + >>> +#ifdef CONFIG_MEMCG_KMEM >>> + cgroup_rstat_flush(memcg->css.cgroup); >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); >>> +#else >>> + /* use pool stats instead of memcg stats */ >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; >>> + nr_stored = atomic_read(>nr_stored); >>> +#endif >>> + >>> + if (!zswap_shrinker_enabled || !nr_stored) >> When I tested with this series, with !zswap_shrinker_enabled in the default >> case, >> I found the performance is much worse than that without this patch. >> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs >> directory. >> >> The reason seems the above cgroup_rstat_flush(), caused much rstat lock >> contention >> to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check >> above >> the cgroup_rstat_flush(), the performance become much better. >> >> Maybe we can put the "zswap_shrinker_enabled" check above >> cgroup_rstat_flush()? > > Yes, we should do nothing if !zswap_shrinker_enabled. We should also > use mem_cgroup_flush_stats() here like other places unless accuracy is > crucial, which I doubt given that reclaim uses > mem_cgroup_flush_stats(). > Yes. After changing to use mem_cgroup_flush_stats() here, the performance become much better. 
> mem_cgroup_flush_stats() has some thresholding to make sure we don't
> do flushes unnecessarily, and I have a pending series in mm-unstable
> that makes that thresholding per-memcg. Keep in mind that adding a
> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,

My test branch is linux-next 20231205, and it's all good after changing
to use mem_cgroup_flush_stats(memcg).

> because the series there adds a memcg argument to
> mem_cgroup_flush_stats(). That should be easily amenable though, I can
> post a fixlet for my series to add the memcg argument there on top of
> users if needed.
>

It's great. Thanks!
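
For reference, the flush-site change I tried looks roughly like the
following (assuming the memcg-argument signature from your pending
series, so treat the exact signature as an assumption on my side):

#ifdef CONFIG_MEMCG_KMEM
	/* was: cgroup_rstat_flush(memcg->css.cgroup); */
	mem_cgroup_flush_stats(memcg);
	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
#else
	/* unchanged: fall back to pool stats */
	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
	nr_stored = atomic_read(&pool->nr_stored);
#endif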
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
[..]
> > @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >                 return entry;
> >  }
> >
> > +/*********************************
> > +* shrinker functions
> > +**********************************/
> > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > +                                      spinlock_t *lock, void *arg);
> > +
> > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > +               struct shrink_control *sc)
> > +{
> > +       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > +       unsigned long shrink_ret, nr_protected, lru_size;
> > +       struct zswap_pool *pool = shrinker->private_data;
> > +       bool encountered_page_in_swapcache = false;
> > +
> > +       nr_protected =
> > +               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +       lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > +
> > +       /*
> > +        * Abort if the shrinker is disabled or if we are shrinking into the
> > +        * protected region.
> > +        *
> > +        * This short-circuiting is necessary because if we have too many multiple
> > +        * concurrent reclaimers getting the freeable zswap object counts at the
> > +        * same time (before any of them made reasonable progress), the total
> > +        * number of reclaimed objects might be more than the number of unprotected
> > +        * objects (i.e the reclaimers will reclaim into the protected area of the
> > +        * zswap LRU).
> > +        */
> > +       if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > +               sc->nr_scanned = 0;
> > +               return SHRINK_STOP;
> > +       }
> > +
> > +       shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > +               &encountered_page_in_swapcache);
> > +
> > +       if (encountered_page_in_swapcache)
> > +               return SHRINK_STOP;
> > +
> > +       return shrink_ret ? shrink_ret : SHRINK_STOP;
> > +}
> > +
> > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > +               struct shrink_control *sc)
> > +{
> > +       struct zswap_pool *pool = shrinker->private_data;
> > +       struct mem_cgroup *memcg = sc->memcg;
> > +       struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > +       unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > +       cgroup_rstat_flush(memcg->css.cgroup);
> > +       nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > +       nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > +#else
> > +       /* use pool stats instead of memcg stats */
> > +       nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > +       nr_stored = atomic_read(&pool->nr_stored);
> > +#endif
> > +
> > +       if (!zswap_shrinker_enabled || !nr_stored)
>
> When I tested with this series, with !zswap_shrinker_enabled in the default case,
> I found the performance is much worse than that without this patch.
>
> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs directory.
>
> The reason seems the above cgroup_rstat_flush(), caused much rstat lock contention
> to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check above
> the cgroup_rstat_flush(), the performance become much better.
>
> Maybe we can put the "zswap_shrinker_enabled" check above
> cgroup_rstat_flush()?

Yes, we should do nothing if !zswap_shrinker_enabled. We should also
use mem_cgroup_flush_stats() here like other places unless accuracy is
crucial, which I doubt given that reclaim uses
mem_cgroup_flush_stats().

mem_cgroup_flush_stats() has some thresholding to make sure we don't
do flushes unnecessarily, and I have a pending series in mm-unstable
that makes that thresholding per-memcg. Keep in mind that adding a
call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
because the series there adds a memcg argument to
mem_cgroup_flush_stats(). That should be easily amenable though, I can
post a fixlet for my series to add the memcg argument there on top of
users if needed.

>
> Thanks!
>
> > +               return 0;
> > +
> > +       nr_protected =
> > +               atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +       nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > +       /*
> > +        * Subtract the lru size by an estimate of the number of pages
> > +        * that should be protected.
> > +        */
> > +       nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +
> > +       /*
> > +        * Scale the number of freeable pages by the memory saving factor.
> > +        * This ensures that the better zswap compresses memory, the fewer
> > +        * pages we will evict to swap (as it will otherwise incur IO for
> > +        * relatively small memory saving).
> > +        */
> > +       return mult_frac(nr_freeable,
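
To make the scaling above concrete (a stand-alone toy, not kernel code,
and only roughly what mult_frac(x, num, den) does, i.e. x * num / den):

/* toy illustration only; names mirror the quoted hunk */
static unsigned long scaled_freeable(unsigned long nr_freeable,
				     unsigned long nr_backing,
				     unsigned long nr_stored)
{
	/*
	 * Better compression means a smaller nr_backing/nr_stored ratio,
	 * so fewer freeable pages get reported to the shrinker core.
	 */
	return nr_freeable * nr_backing / nr_stored;
}

E.g. 1000 unprotected entries compressed 4:1 (250 backing pages for
1000 stored pages) would be reported as only 250 freeable pages, since
writing them back saves relatively little memory per unit of IO.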
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/1 03:40, Nhat Pham wrote: > Currently, we only shrink the zswap pool when the user-defined limit is > hit. This means that if we set the limit too high, cold data that are > unlikely to be used again will reside in the pool, wasting precious > memory. It is hard to predict how much zswap space will be needed ahead > of time, as this depends on the workload (specifically, on factors such > as memory access patterns and compressibility of the memory pages). > > This patch implements a memcg- and NUMA-aware shrinker for zswap, that > is initiated when there is memory pressure. The shrinker does not > have any parameter that must be tuned by the user, and can be opted in > or out on a per-memcg basis. > > Furthermore, to make it more robust for many workloads and prevent > overshrinking (i.e evicting warm pages that might be refaulted into > memory), we build in the following heuristics: > > * Estimate the number of warm pages residing in zswap, and attempt to > protect this region of the zswap LRU. > * Scale the number of freeable objects by an estimate of the memory > saving factor. The better zswap compresses the data, the fewer pages > we will evict to swap (as we will otherwise incur IO for relatively > small memory saving). > * During reclaim, if the shrinker encounters a page that is also being > brought into memory, the shrinker will cautiously terminate its > shrinking action, as this is a sign that it is touching the warmer > region of the zswap LRU. > > As a proof of concept, we ran the following synthetic benchmark: > build the linux kernel in a memory-limited cgroup, and allocate some > cold data in tmpfs to see if the shrinker could write them out and > improved the overall performance. Depending on the amount of cold data > generated, we observe from 14% to 35% reduction in kernel CPU time used > in the kernel builds. > > Signed-off-by: Nhat Pham > Acked-by: Johannes Weiner > --- > Documentation/admin-guide/mm/zswap.rst | 10 ++ > include/linux/mmzone.h | 2 + > include/linux/zswap.h | 25 +++- > mm/Kconfig | 14 ++ > mm/mmzone.c| 1 + > mm/swap_state.c| 2 + > mm/zswap.c | 185 - > 7 files changed, 233 insertions(+), 6 deletions(-) > > diff --git a/Documentation/admin-guide/mm/zswap.rst > b/Documentation/admin-guide/mm/zswap.rst > index 45b98390e938..62fc244ec702 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,16 @@ attribute, e. g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +When there is a sizable amount of cold memory residing in the zswap pool, it > +can be advantageous to proactively write these cold pages to swap and reclaim > +the memory for other use cases. By default, the zswap shrinker is disabled. > +User can enable it as follows: > + > + echo Y > /sys/module/zswap/parameters/shrinker_enabled > + > +This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` > is > +selected. > + > A debugfs interface is provided for various statistic about pool size, number > of pages stored, same-value filled pages and various counters for the reasons > pages are rejected. > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 7b1816450bfc..b23bc5390240 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -22,6 +22,7 @@ > #include > #include > #include > +#include > #include > > /* Free memory management - zoned buddy allocator. 
*/ > @@ -641,6 +642,7 @@ struct lruvec { > #ifdef CONFIG_MEMCG > struct pglist_data *pgdat; > #endif > + struct zswap_lruvec_state zswap_lruvec_state; > }; > > /* Isolate for asynchronous migration */ > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index e571e393669b..08c240e16a01 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -5,20 +5,40 @@ > #include > #include > > +struct lruvec; > + > extern u64 zswap_pool_total_size; > extern atomic_t zswap_stored_pages; > > #ifdef CONFIG_ZSWAP > > +struct zswap_lruvec_state { > + /* > + * Number of pages in zswap that should be protected from the shrinker. > + * This number is an estimate of the following counts: > + * > + * a) Recent page faults. > + * b) Recent insertion to the zswap LRU. This includes new zswap stores, > + *as well as recent zswap LRU rotations. > + * > + * These pages are likely to be warm, and might incur IO if the are > written > + * to swap. > + */ > + atomic_long_t nr_zswap_protected; > +}; > + > bool zswap_store(struct folio *folio); > bool zswap_load(struct folio *folio); > void zswap_invalidate(int type, pgoff_t offset); > void zswap_swapon(int type); > void zswap_swapoff(int type); > void