Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout

2024-06-21 Thread Chris Li
On Fri, Jun 21, 2024 at 12:47 AM Barry Song <21cn...@gmail.com> wrote:
>
> On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts  wrote:
> >
> > On 20/06/2024 12:34, David Hildenbrand wrote:
> > > On 20.06.24 11:04, Ryan Roberts wrote:
> > >> On 20/06/2024 01:26, Barry Song wrote:
> > >>> From: Barry Song 
> > >>>
> > >>> Both Ryan and Chris have been utilizing the small test program to aid
> > >>> in debugging and identifying issues with swap entry allocation. While
> > >>> a real or intricate workload might be more suitable for assessing the
> > >>> correctness and effectiveness of the swap allocation policy, a small
> > >>> test program presents a simpler means of understanding the problem and
> > >>> initially verifying the improvements being made.
> > >>>
> > >>> Let's endeavor to integrate it into the self-test suite. Although it
> > >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> > >>> expand its capabilities to support multiple sizes and simulate more
> > >>> complex systems in the future as required.
> > >>
> > >> I'll try to summarize the thread with Huang Ying by suggesting this test
> > >> program is "necessary but not sufficient" to exhaustively test the mTHP
> > >> swap-out path. I've certainly found it useful and think it would be a
> > >> valuable addition to the tree.
> > >>
> > >> That said, I'm not convinced it is a selftest; IMO a selftest should 
> > >> provide a
> > >> clear pass/fail result against some criteria and must be able to be run
> > >> automatically by (e.g.) a CI system.
> > >
> > > Likely we should then consider moving other such performance-related 
> > > thingies
> > > out of the selftests?
> >
> > Yes, that would get my vote. But of the 4 tests you mentioned that use
> > clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> > have a pass/fail result, so is probably the only candidate for moving.
> >
> > The others either use the time as a timeout and determine failure if the
> > action didn't occur within the timeout (e.g. ksm_tests.c), or use it to add
> > some supplemental performance information to an otherwise
> > functionality-oriented test.
>
> Thank you very much, Ryan. I think you've found a better home for this
> tool. I will send v2, relocating it to tools/mm and adding a function to
> swap in either whole mTHPs or a portion of mTHPs via "-a" (aligned swapin).
>
> So basically, we will have
>
> 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code
> under high exercise in a short time.
>
> 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heaps in
> freeing memory, as well as munmap, app exits, or OOM killer scenarios. This
> ensures new mTHPs are always generated, released or swapped out, similar to
> the behavior on a PC or Android phone where many applications are frequently
> started and terminated.

Will this cover the case where the ratio of order-0 and order-4 swap
requests changes during LMK, and the swapfile is almost full?

If not, please add that :-)

> 3. Swap in with or without the "-a" option to observe how fragments caused
> by swap-in and the incoming swap-in of large folios will impact swap-out
> fallback.
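
To make sure we are talking about the same exercise pattern, here is a rough,
untested sketch of the loop I read out of steps 1-3 above. The sizes follow
the 60MB/64KB defaults in the patch quoted below; this is an illustration
only, not the actual thp_swap_allocator_test source:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MEM_SIZE  (60 * 1024 * 1024)	/* matches MEMSIZE_MTHP in the patch */
#define MTHP_SIZE (64 * 1024)

int main(void)
{
	char *mem = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sink = 0;

	if (mem == MAP_FAILED)
		return 1;

	for (int loop = 0; loop < 100; loop++) {
		/* fault in fresh mTHPs */
		memset(mem, loop + 1, MEM_SIZE);
		/* step 1: rapid swap-out, hammering swap entry allocation */
		madvise(mem, MEM_SIZE, MADV_PAGEOUT);
		/* step 3: fault part of it back in ("-a" would keep this mTHP-aligned) */
		for (size_t i = 0; i < MEM_SIZE; i += MTHP_SIZE)
			sink += mem[i];
		/* step 2: free it the way libc/Java heaps, munmap or app exit would */
		madvise(mem, MEM_SIZE, MADV_DONTNEED);
	}
	return (int)(sink & 1);
}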
>
> And many thanks to Chris for the suggestion on improving it within the
> selftest suite, though I prefer to place it in tools/mm.

I am perfectly fine with that. Looking forward to your V2.

Chris



Re: [PATCH] selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout

2024-06-20 Thread Chris Li
Hi Barry,

Thanks for the wonderful test program.

I have also used other swap test programs. A lot of those tests are
harder to set up and run.

This test is very quick and simple to run. It can exercise some
hard-to-hit corner cases for me.

I am able to reproduce the warning and the kernel oops with this test program,
so I am using it as a functional test that my allocator does not produce a
crash. In that regard, it definitely provides value as a functional test.

Having a fallback percentage output is fine, as long as we don't fail the
test based on a performance number.

I am also fine with moving the test under tools/mm etc. I see good
value in including the test in the tree one way or the other.


On Wed, Jun 19, 2024 at 5:27 PM Barry Song <21cn...@gmail.com> wrote:
>
> From: Barry Song 
>
> Both Ryan and Chris have been utilizing the small test program to aid
> in debugging and identifying issues with swap entry allocation. While
> a real or intricate workload might be more suitable for assessing the
> correctness and effectiveness of the swap allocation policy, a small
> test program presents a simpler means of understanding the problem and
> initially verifying the improvements being made.
>
> Let's endeavor to integrate it into the self-test suite. Although it
> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> expand its capabilities to support multiple sizes and simulate more
> complex systems in the future as required.
>
> Signed-off-by: Barry Song 
> ---
>  tools/testing/selftests/mm/Makefile   |   1 +
>  .../selftests/mm/thp_swap_allocator_test.c| 192 ++
>  2 files changed, 193 insertions(+)

Assuming we want to keep it as a selftest: you did not add your test
to run_vmtests.sh.

You might need something like this:

--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -418,6 +418,14 @@ CATEGORY="thp" run_test ./khugepaged -s 2

 CATEGORY="thp" run_test ./transhuge-stress -d 20

+# config and swapon zram here.
+
+CATEGORY="thp" run_test ./thp_swap_allocator_test
+
+CATEGORY="thp" run_test ./thp_swap_allocator_test -s
+
+# swapoff zram here.
+
 # Try to create XFS if not provided
 if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
 if test_selected "thp"; then


You can use the following XFS test as an example of how to set up the zram
swap. XFS uses a file system mount; you would use swapon instead.

Also, you need to update the usage string in run_vmtests.sh.
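
For illustration only, the zram setup/teardown around the two run_test lines
could look roughly like the sketch below. It just wraps the zram commands from
the test program's own header comment (quoted further down), so treat the exact
knobs and sizes as assumptions:

# Sketch: fills in the "config and swapon zram" / "swapoff zram" placeholders,
# assuming zram is available and /dev/zram0 is free.
if test_selected "thp"; then
	echo lzo > /sys/block/zram0/comp_algorithm
	echo 64M > /sys/block/zram0/disksize
	echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
	mkswap /dev/zram0
	swapon /dev/zram0

	CATEGORY="thp" run_test ./thp_swap_allocator_test
	CATEGORY="thp" run_test ./thp_swap_allocator_test -s

	swapoff /dev/zram0
	echo 1 > /sys/block/zram0/reset
fi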

BTW, here is how I invoke the test runs:

kselftest_override_timeout=500 make -C tools/testing/selftests
TARGETS=mm run_tests

The timeout is not for this test; it is for some other test that runs before
thp_swap and would otherwise exit run_vmtests.sh before reaching it. I am
running in a VM, so it is slower than a native machine.

>  create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c
>
> diff --git a/tools/testing/selftests/mm/Makefile 
> b/tools/testing/selftests/mm/Makefile
> index e1aa09ddaa3d..64164ad66835 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test
>  TEST_GEN_FILES += seal_elf
>  TEST_GEN_FILES += on-fault-limit
>  TEST_GEN_FILES += pagemap_ioctl
> +TEST_GEN_FILES += thp_swap_allocator_test
>  TEST_GEN_FILES += thuge-gen
>  TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += uffd-stress
> diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c 
> b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> new file mode 100644
> index ..4443a906d0f8
> --- /dev/null
> +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * thp_swap_allocator_test
> + *
> + * The purpose of this test program is helping check if THP swpout
> + * can correctly get swap slots to swap out as a whole instead of
> + * being split. It randomly releases swap entries through madvise
> + * DONTNEED and do swapout on two memory areas: a memory area for
> + * 64KB THP and the other area for small folios. The second memory
> + * can be enabled by "-s".
> + * Before running the program, we need to setup a zRAM or similar
> + * swap device by:
> + *  echo lzo > /sys/block/zram0/comp_algorithm
> + *  echo 64M > /sys/block/zram0/disksize
> + *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> + *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> + *  mkswap /dev/zram0
> + *  swapon /dev/zram0

This setup needs to go into run_vmtests.sh as well.

Also tear it down after the test.

Chris

> + * The expected result should be 0% anon swpout fallback ratio w/ or
> + * w/o "-s".
> + *
> + * Author(s): Barry Song 
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define MEMSIZE_MTHP (60 * 1024 * 1024)
> +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)

Re: [PATCH v8 5/6] selftests: cgroup: update per-memcg zswap writeback selftest

2023-12-07 Thread Chris Li
Hi Nhat,

Thanks for the self test.

Acked-by: Chris Li  (Google)

Chris

On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham  wrote:
>
> From: Domenico Cerasuolo 
>
> The memcg-zswap self test is updated to adjust to the behavior change
> implemented by commit 87730b165089 ("zswap: make shrinking memcg-aware"),
> where zswap performs writeback for specific memcg.
>
> Signed-off-by: Domenico Cerasuolo 
> Signed-off-by: Nhat Pham 
> ---
>  tools/testing/selftests/cgroup/test_zswap.c | 74 ++---
>  1 file changed, 50 insertions(+), 24 deletions(-)
>
> diff --git a/tools/testing/selftests/cgroup/test_zswap.c 
> b/tools/testing/selftests/cgroup/test_zswap.c
> index c99d2adaca3f..47fdaa146443 100644
> --- a/tools/testing/selftests/cgroup/test_zswap.c
> +++ b/tools/testing/selftests/cgroup/test_zswap.c
> @@ -50,9 +50,9 @@ static int get_zswap_stored_pages(size_t *value)
> return read_int("/sys/kernel/debug/zswap/stored_pages", value);
>  }
>
> -static int get_zswap_written_back_pages(size_t *value)
> +static int get_cg_wb_count(const char *cg)
>  {
> -   return read_int("/sys/kernel/debug/zswap/written_back_pages", value);
> +   return cg_read_key_long(cg, "memory.stat", "zswp_wb");
>  }
>
>  static long get_zswpout(const char *cgroup)
> @@ -73,6 +73,24 @@ static int allocate_bytes(const char *cgroup, void *arg)
> return 0;
>  }
>
> +static char *setup_test_group_1M(const char *root, const char *name)
> +{
> +   char *group_name = cg_name(root, name);
> +
> +   if (!group_name)
> +   return NULL;
> +   if (cg_create(group_name))
> +   goto fail;
> +   if (cg_write(group_name, "memory.max", "1M")) {
> +   cg_destroy(group_name);
> +   goto fail;
> +   }
> +   return group_name;
> +fail:
> +   free(group_name);
> +   return NULL;
> +}
> +
>  /*
>   * Sanity test to check that pages are written into zswap.
>   */
> @@ -117,43 +135,51 @@ static int test_zswap_usage(const char *root)
>
>  /*
>   * When trying to store a memcg page in zswap, if the memcg hits its memory
> - * limit in zswap, writeback should not be triggered.
> - *
> - * This was fixed with commit 0bdf0efa180a("zswap: do not shrink if cgroup 
> may
> - * not zswap"). Needs to be revised when a per memcg writeback mechanism is
> - * implemented.
> + * limit in zswap, writeback should affect only the zswapped pages of that
> + * memcg.
>   */
>  static int test_no_invasive_cgroup_shrink(const char *root)
>  {
> -   size_t written_back_before, written_back_after;
> int ret = KSFT_FAIL;
> -   char *test_group;
> +   size_t control_allocation_size = MB(10);
> +   char *control_allocation, *wb_group = NULL, *control_group = NULL;
>
> /* Set up */
> -   test_group = cg_name(root, "no_shrink_test");
> -   if (!test_group)
> -   goto out;
> -   if (cg_create(test_group))
> +   wb_group = setup_test_group_1M(root, "per_memcg_wb_test1");
> +   if (!wb_group)
> +   return KSFT_FAIL;
> +   if (cg_write(wb_group, "memory.zswap.max", "10K"))
> goto out;
> -   if (cg_write(test_group, "memory.max", "1M"))
> +   control_group = setup_test_group_1M(root, "per_memcg_wb_test2");
> +   if (!control_group)
> goto out;
> -   if (cg_write(test_group, "memory.zswap.max", "10K"))
> +
> +   /* Push some test_group2 memory into zswap */
> +   if (cg_enter_current(control_group))
> goto out;
> -   if (get_zswap_written_back_pages(&written_back_before))
> +   control_allocation = malloc(control_allocation_size);
> +   for (int i = 0; i < control_allocation_size; i += 4095)
> +   control_allocation[i] = 'a';
> +   if (cg_read_key_long(control_group, "memory.stat", "zswapped") < 1)
> goto out;
>
> -   /* Allocate 10x memory.max to push memory into zswap */
> -   if (cg_run(test_group, allocate_bytes, (void *)MB(10)))
> +   /* Allocate 10x memory.max to push wb_group memory into zswap and 
> trigger wb */
> +   if (cg_run(wb_group, allocate_bytes, (void *)MB(10)))
> goto out;
>
> -   /* Verify that no writeback happened because of the memcg allocation 
> */
> -   if (get_zswap_written_back_pages(&written_back_after))
> -   goto out;
> -   if (written_back_after 

Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)

2023-12-07 Thread Chris Li
Acked-by: Chris Li  (Google)

Chris

On Tue, Dec 5, 2023 at 11:33 AM Nhat Pham  wrote:
>
> Rename ZSWP_WB to ZSWPWB to better match the existing counters naming
> scheme.
>
> Suggested-by: Johannes Weiner 
> Signed-off-by: Nhat Pham 
> ---
>  include/linux/vm_event_item.h | 2 +-
>  mm/memcontrol.c   | 2 +-
>  mm/vmstat.c   | 2 +-
>  mm/zswap.c| 4 ++--
>  4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f4569ad98edf..747943bc8cc2 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_ZSWAP
> ZSWPIN,
> ZSWPOUT,
> -   ZSWP_WB,
> +   ZSWPWB,
>  #endif
>  #ifdef CONFIG_X86
> DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21d79249c8b4..0286b7d38832 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> ZSWPIN,
> ZSWPOUT,
> -   ZSWP_WB,
> +   ZSWPWB,
>  #endif
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> THP_FAULT_ALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2249f85e4a87..cfd8d8256f8e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_ZSWAP
> "zswpin",
> "zswpout",
> -   "zswp_wb",
> +   "zswpwb",
>  #endif
>  #ifdef CONFIG_X86
> "direct_map_level2_splits",
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c65b8ccc6b72..0fb0945c0031 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head 
> *item, struct list_lru_o
> zswap_written_back_pages++;
>
> if (entry->objcg)
> -   count_objcg_event(entry->objcg, ZSWP_WB);
> +   count_objcg_event(entry->objcg, ZSWPWB);
>
> -   count_vm_event(ZSWP_WB);
> +   count_vm_event(ZSWPWB);
> /*
>  * Writeback started successfully, the page now belongs to the
>  * swapcache. Drop the entry from zswap - unless invalidate already
> --
> 2.34.1
>



Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()

2023-12-05 Thread Chris Li
On Mon, Dec 4, 2023 at 5:39 PM Nhat Pham  wrote:
>
> > > memcg as a candidate for the global limit reclaim.
> >
> > Very minor nitpick. This patch can fold with the later patch that uses
> > it. That makes the review easier, no need to cross reference different
> > patches. It will also make it harder to introduce API that nobody
> > uses.
>
> I don't have a strong preference one way or the other :) Probably not
> worth the churn tho.

Squashing a patch is very easy. If you are refreshing a new series, it
is worthwhile to do it. I notice on the other thread Yosry pointed out
that you did not use the function "mem_cgroup_tryget_online" in patch 3;
that is exactly the situation my suggestion is trying to prevent.

If you don't have a strong preference, it sounds like you should squash it.

Chris

>
> >
> > Chris
> >
> > >
> > > Signed-off-by: Nhat Pham 
> > > ---
> > >  include/linux/memcontrol.h | 10 ++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 7bdcf3020d7a..2bd7d14ace78 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct 
> > > mem_cgroup *memcg)
> > > return !memcg || css_tryget(&memcg->css);
> > >  }
> > >
> > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > +{
> > > +   return !memcg || css_tryget_online(&memcg->css);
> > > +}
> > > +
> > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > >  {
> > > if (memcg)
> > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct 
> > > mem_cgroup *memcg)
> > > return true;
> > >  }
> > >
> > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > +{
> > > +   return true;
> > > +}
> > > +
> > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > >  {
> > >  }
> > > --
> > > 2.34.1
> > >
>



Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware

2023-12-05 Thread Chris Li
Hi Nhat,

Still working my way through your patch series.

On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham  wrote:
>
> From: Domenico Cerasuolo 
>
> Currently, we only have a single global LRU for zswap. This makes it
> impossible to perform workload-specific shrinking - a memcg cannot
> determine which pages in the pool it owns, and often ends up writing
> pages from other memcgs. This issue has been previously observed in
> practice and mitigated by simply disabling memcg-initiated shrinking:
>
> https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
> This patch fully resolves the issue by replacing the global zswap LRU
> with memcg- and NUMA-specific LRUs, and modify the reclaim logic:
>
> a) When a store attempt hits an memcg limit, it now triggers a
>synchronous reclaim attempt that, if successful, allows the new
>hotter page to be accepted by zswap.
> b) If the store attempt instead hits the global zswap limit, it will
>trigger an asynchronous reclaim attempt, in which an memcg is
>selected for reclaim in a round-robin-like fashion.
>
> Signed-off-by: Domenico Cerasuolo 
> Co-developed-by: Nhat Pham 
> Signed-off-by: Nhat Pham 
> ---
>  include/linux/memcontrol.h |   5 +
>  include/linux/zswap.h  |   2 +
>  mm/memcontrol.c|   2 +
>  mm/swap.h  |   3 +-
>  mm/swap_state.c|  24 +++-
>  mm/zswap.c | 269 +
>  6 files changed, 245 insertions(+), 60 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2bd7d14ace78..a308c8eacf20 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup 
> *page_memcg_check(struct page *page)
> return NULL;
>  }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup 
> *objcg)
> +{
> +   return NULL;
> +}
> +
>  static inline bool folio_memcg_kmem(struct folio *folio)
>  {
> return false;
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 2a60ce39cfde..e571e393669b 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
> +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>
>  #else
>
> @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
>  static inline void zswap_invalidate(int type, pgoff_t offset) {}
>  static inline void zswap_swapon(int type) {}
>  static inline void zswap_swapoff(int type) {}
> +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>
>  #endif
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 470821d1ba1a..792ca21c5815 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct 
> cgroup_subsys_state *css)
> page_counter_set_min(&memcg->memory, 0);
> page_counter_set_low(&memcg->memory, 0);
>
> +   zswap_memcg_offline_cleanup(memcg);
> +
> memcg_offline_kmem(memcg);
> reparent_shrinker_deferred(memcg);
> wb_memcg_offline(memcg);
> diff --git a/mm/swap.h b/mm/swap.h
> index 73c332ee4d91..c0dc73e10e91 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t 
> gfp_mask,
>struct swap_iocb **plug);
>  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct mempolicy *mpol, pgoff_t ilx,
> -bool *new_page_allocated);
> +bool *new_page_allocated,
> +bool skip_if_exists);
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 85d9e5806a6a..6c84236382f3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct 
> address_space *mapping,
>
>  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct mempolicy *mpol, pgoff_t ilx,
> -bool *new_page_allocated)
> +bool *new_page_allocated,
> +bool skip_if_exists)

I think this skip_if_exists is problematic here. You might need to
redesign this.
First of all, with skip_if_exists as the argument name, the meaning to
the caller is not clear. When I saw this, I was wondering: what does
the function return when this condition is triggered? Unlike
"*new_page_allocated",

Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()

2023-12-04 Thread Chris Li
Hi Nhat,


On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham  wrote:
>
> This patch implements a helper function that try to get a reference to
> an memcg's css, as well as checking if it is online. This new function
> is almost exactly the same as the existing mem_cgroup_tryget(), except
> for the onlineness check. In the !CONFIG_MEMCG case, it always returns
> true, analogous to mem_cgroup_tryget(). This is useful for e.g to the
> new zswap writeback scheme, where we need to select the next online
> memcg as a candidate for the global limit reclaim.

Very minor nitpick. This patch can fold with the later patch that uses
it. That makes the review easier, no need to cross reference different
patches. It will also make it harder to introduce API that nobody
uses.

Chris

>
> Signed-off-by: Nhat Pham 
> ---
>  include/linux/memcontrol.h | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7bdcf3020d7a..2bd7d14ace78 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup 
> *memcg)
> return !memcg || css_tryget(&memcg->css);
>  }
>
> +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> +{
> +   return !memcg || css_tryget_online(&memcg->css);
> +}
> +
>  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
> if (memcg)
> @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup 
> *memcg)
> return true;
>  }
>
> +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> +{
> +   return true;
> +}
> +
>  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
> --
> 2.34.1
>



Re: [PATCH v8 1/6] list_lru: allows explicit memcg and NUMA node selection

2023-12-04 Thread Chris Li
On Thu, Nov 30, 2023 at 12:35 PM Johannes Weiner  wrote:
>
> On Thu, Nov 30, 2023 at 12:07:41PM -0800, Nhat Pham wrote:
> > On Thu, Nov 30, 2023 at 11:57 AM Matthew Wilcox  wrote:
> > >
> > > On Thu, Nov 30, 2023 at 11:40:18AM -0800, Nhat Pham wrote:
> > > > This patch changes list_lru interface so that the caller must explicitly
> > > > specify numa node and memcg when adding and removing objects. The old
> > > > list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
> > > > list_lru_del_obj(), respectively.
> > >
> > > Wouldn't it be better to add list_lru_add_memcg() and
> > > list_lru_del_memcg() and have:

That was my first thought as well. If we are going to have two different
flavors of LRU add, one with memcg and one without, then list_lru_add()
vs list_lru_add_memcg() is the common way to do it.
> > >
> > > +bool list_lru_del(struct list_lru *lru, struct list_head *item)
> > > +{
> > > +   int nid = page_to_nid(virt_to_page(item));
> > > +   struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ?
> > > +   mem_cgroup_from_slab_obj(item) : NULL;
> > > +
> > > +   return list_lru_del_memcg(lru, item, nid, memcg);
> > > +}
> > >
> > > Seems like _most_ callers will want the original versions and only
> > > a few will want the explicit memcg/nid versions.  No?
> > >
> >
> > I actually did something along that line in earlier iterations of this
> > patch series (albeit with poorer naming - __list_lru_add() instead of
> > list_lru_add_memcg()). The consensus after some back and forth was
> > that the original list_lru_add() was not a very good design (the
> > better one was this new version that allows for explicit numa/memcg
> > selection). So I agreed to fix it everywhere as a prep patch.
> >
> > I don't have strong opinions here to be completely honest, but I do
> > think this new API makes more sense (at the cost of quite a bit of
> > elbow grease to fix every callsites and extra reviewing).
>
> Maybe I can shed some light since I was pushing for doing it this way.
>
> The quiet assumption that 'struct list_head *item' is (embedded in) a
> slab object that is also charged to a cgroup is a bit much, given that
> nothing in the name or documentation of the function points to that.

We can add it to the documentation if that is desirable.

>
> It bit us in the THP shrinker where that list head is embedded in a
> tailpage (virt_to_page(page) is fun to debug). And it caused some
> confusion in this case as well, where the zswap entry is a slab object
> but not charged (the entry descriptor is not attractive for cgroup
> accounting, only the backing memory it points to.)
>
> Yes, for most users - at least right now - the current assumption is
> accurate. The thinking was just that if we do have to differentiate
> callers now anyway, we might as well make the interface a bit more
> self-documenting and harder to misuse going forward, even if it's a
> bit more churn now.

It comes down to whether we need to keep the non-memcg version of the API
going forward. If we don't, then changing the meaning of list_lru_add()
to perform the deed of list_lru_add_memcg() makes sense. My assumption
is that the non-memcg version of the API does have legitimate usage. In
that case, having the original list_lru_add() and list_lru_add_memcg()
as Matthew suggested feels more natural. What you really want is that
every caller of list_lru_add() should seriously consider switching to
list_lru_add_memcg() unless it has a very good reason to stay with the
non-memcg version. Renaming and changing the meaning of list_lru_add()
is a bit confusing and has a negative impact on outstanding patches that
use list_lru_add(). At the end of the day, some developers still need to
evaluate the call sites one by one; renaming the function is not going
to help that effort. Just make it more obvious.
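
To make the shape concrete, here is a rough sketch of the two flavors I have
in mind, mirroring Matthew's list_lru_del() example quoted above (untested,
signatures approximated from the existing list_lru API):

/* Explicit flavor: caller supplies the NUMA node and the memcg. */
bool list_lru_add_memcg(struct list_lru *lru, struct list_head *item,
			int nid, struct mem_cgroup *memcg);

/* Convenience flavor: keeps today's behavior of inferring both. */
static inline bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ?
			mem_cgroup_from_slab_obj(item) : NULL;

	return list_lru_add_memcg(lru, item, nid, memcg);
}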

Just my 2 cents; others, please chime in. To make it clear, this is my
preference, not a NACK.

Chris



Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

2023-11-19 Thread Chris Li
On Fri, Nov 17, 2023 at 8:23 AM Nhat Pham  wrote:
>
> On Thu, Nov 16, 2023 at 4:57 PM Chris Li  wrote:
> >
> > Hi Nhat,
> >
> > I want to share the high-level feedback we discussed here in the
> > mailing list as well.
> >
> > It is my observation that each memcg LRU list can't compare page time
> > order with other memcgs. It works great when a leaf-level memcg hits its
> > memory limit and you want to reclaim from that memcg. It works less well
> > under global memory pressure, where you need to reclaim from all memcgs:
> > you kind of have to scan all the child memcgs to find the best page to
> > shrink from, so it is less effective at getting to the most desirable
> > page quickly.
> >
> > This can benefit from a design similar to MGLRU. This idea was suggested
> > by Yu Zhao, so credit goes to him, not me. In other words, the current
> > patch is similar to the per-memcg page lists of the pre-MGLRU world. We
> > can have an MGLRU-like per-memcg zswap shrink list.
>
> I was gonna summarize the points myself :P But thanks for doing this.
> It's your idea so you're more qualified to explain this anyway ;)

The MGLRU-like shrinker was Yu Zhao's idea. I just observed the problem.

>
> I absolutely agree that having a generation-aware cgroup-aware
> NUMA-aware LRU is the future way to go. Currently, IIUC, the reclaim logic
> selects cgroups in a round-robin-ish manner. It's "fair" in this perspective,
> but I also think it's not ideal. As we have discussed, the current list_lru
> infrastructure only take into account intra-cgroup relative recency, not
> inter-cgroup relative recency. The recently proposed time-based zswap
> reclaim mechanism will provide us with a source of information, but the
> overhead of using this might be too high - and it's very zswap-specific.

I don't mind that it is zswap-specific, as long as it is effective.
The overhead is twofold:
1) memory overhead for storing a timestamp per compressed page;
2) cpu overhead for reading timestamps.
Using MGLRU likely has an advantage over timestamps on both memory and
cpu: the generation can use fewer bits and doesn't require reading the
time on every page.
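
As a back-of-the-envelope illustration (kernel-style pseudo-C; the field
sizes are made up for the comparison and not taken from any real patch):

/* Timestamp approach: a full word per compressed page, plus a clock
 * read on every access. */
struct zswap_entry_time_meta {
	u64 last_access_ns;
};

/* Generation approach: a few bits that could be folded into existing
 * flags and bumped per aging pass, with no clock read per access. */
struct zswap_entry_gen_meta {
	unsigned int gen : 3;
};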

> Maybe after this, we should improve zswap reclaim (and perhaps all
> list_lru users) by adding generations to list_lru then take generations
> into account in the vmscan code. This patch series could be merged

One high-level idea is that we can get the page's generation from
MGLRU before it gets into zswap, and just retain that generation in the
zpool LRU somehow.

> as-is, and once we make list_lru generation-aware, zswap shrinker
> will automagically be improved (along with all other list_lru/shrinker
> users).

I don't think it will automatically improve; you will need to rewrite
a lot of code in the shrinker as well to make the best use of an MGLRU zpool.

>
> I don't know enough about the current design of MGLRU to comment
> too much further, but let me know if this makes sense, and if you have
> objections/other ideas.

Taking the step by step approach is fine by me as long as we are
making steady progress towards the better end goal.

>
> And if you have other documentations for MGLRU than its code, could
> you please let me know? I'm struggling to find more details about this.

I would need to learn MGLRU myself. We can share and compare notes
when we get to it.

Chris



Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

2023-11-16 Thread Chris Li
Hi Nhat,

I want to share the high-level feedback we discussed here in the
mailing list as well.

It is my observation that each memcg LRU list can't compare page time
order with other memcgs. It works great when a leaf-level memcg hits its
memory limit and you want to reclaim from that memcg. It works less well
under global memory pressure, where you need to reclaim from all memcgs:
you kind of have to scan all the child memcgs to find the best page to
shrink from, so it is less effective at getting to the most desirable
page quickly.

This can benefit from a design similar to MGLRU. This idea was suggested
by Yu Zhao, so credit goes to him, not me. In other words, the current
patch is similar to the per-memcg page lists of the pre-MGLRU world. We
can have an MGLRU-like per-memcg zswap shrink list.


Chris

On Wed, Nov 8, 2023 at 6:10 PM Chris Li  wrote:
>
> On Wed, Nov 8, 2023 at 4:28 PM Nhat Pham  wrote:
> >
> > Hmm my guess is that I probably sent this out based on an outdated
> > mm-unstable. There has since been a new zswap selftest merged
> > to mm-unstable (written by no other than myself - oh the irony), so
> > maybe it does not apply cleanly anymore with git am.
>
> $ git am -3 patches/zswap-pool-lru/0005
> Applying: selftests: cgroup: update per-memcg zswap writeback selftest
> Using index info to reconstruct a base tree...
> M   tools/testing/selftests/cgroup/test_zswap.c
> Falling back to patching base and 3-way merge...
> Auto-merging tools/testing/selftests/cgroup/test_zswap.c
> $ git am -3 patches/zswap-pool-lru/0006
> Applying: zswap: shrinks zswap pool based on memory pressure
> error: sha1 information is lacking or useless (mm/zswap.c).
> error: could not build fake ancestor
> Patch failed at 0001 zswap: shrinks zswap pool based on memory pressure
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> I was able to resolve the conflict on patch 6 by hand though. So I am good 
> now.
>
> Thanks
>
> Chris



Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

2023-11-08 Thread Chris Li
On Wed, Nov 8, 2023 at 4:28 PM Nhat Pham  wrote:
>
> Hmm my guess is that I probably sent this out based on an outdated
> mm-unstable. There has since been a new zswap selftest merged
> to mm-unstable (written by no other than myself - oh the irony), so
> maybe it does not apply cleanly anymore with git am.

$ git am -3 patches/zswap-pool-lru/0005
Applying: selftests: cgroup: update per-memcg zswap writeback selftest
Using index info to reconstruct a base tree...
M   tools/testing/selftests/cgroup/test_zswap.c
Falling back to patching base and 3-way merge...
Auto-merging tools/testing/selftests/cgroup/test_zswap.c
$ git am -3 patches/zswap-pool-lru/0006
Applying: zswap: shrinks zswap pool based on memory pressure
error: sha1 information is lacking or useless (mm/zswap.c).
error: could not build fake ancestor
Patch failed at 0001 zswap: shrinks zswap pool based on memory pressure
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I was able to resolve the conflict on patch 6 by hand though. So I am good now.

Thanks

Chris


Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

2023-11-08 Thread Chris Li
Hi Nhat,

On Wed, Nov 8, 2023 at 1:15 PM Nhat Pham  wrote:
>
> Ah that was meant to be a fixlet - so that on top of the original
> "zswap: make shrinking memcg-aware" patch. The intention was
> to eventually squash it...
>
> But this is getting a bit annoyingly confusing, I admit. I just rebased to
> mm-unstable + squashed it all again, then sent one single replacement
> patch:
>
> [PATCH v5 3/6 REPLACE] zswap: make shrinking memcg-aware

Thank you for the quick response.

Yes, I am able to download your replacement version of patch 3.
Just FYI, I am using "git mailsplit" to split up the mbox into 6
separate patch files.
On mm-unstable, I am able to apply your replacement patch 3 cleanly.
I also need some help with patch 0005; it does not apply cleanly either.

$ git mailsplit -ozswap-pool-lru
v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx
$ git am patches/zswap-pool-lru/0001
Applying: list_lru: allows explicit memcg and NUMA node selection
$ git am patches/zswap-pool-lru/0002
Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness
$ git am patches/zswap-pool-lru/3.replace
Applying: zswap: make shrinking memcg-aware
$ git am patches/zswap-pool-lru/0004
Applying: mm: memcg: add per-memcg zswap writeback stat
$ git am patches/zswap-pool-lru/0005
Applying: selftests: cgroup: update per-memcg zswap writeback selftest
error: patch failed: tools/testing/selftests/cgroup/test_zswap.c:50
error: tools/testing/selftests/cgroup/test_zswap.c: patch does not apply
Patch failed at 0001 selftests: cgroup: update per-memcg zswap
writeback selftest
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

>
> Let me know if this still fails to apply. If not, I'll send the whole thing
> again as v6! My sincerest apologies for the troubles and confusion :(

No problem at all. Thanks for your help on patch 3.

Chris


Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

2023-11-08 Thread Chris Li
Hi Nhat,

Sorry for being late to the party. I want to take a look at your patch series.
However, I wasn't able to "git am" your patch series cleanly on current
mm-stable, mm-unstable, or linux tip.

$ git am 
patches/v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx
Applying: list_lru: allows explicit memcg and NUMA node selection
Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness
Applying: zswap: make shrinking memcg-aware (fix)
error: patch failed: mm/zswap.c:174
error: mm/zswap.c: patch does not apply
Patch failed at 0003 zswap: make shrinking memcg-aware (fix)

What is the base of your patches? A git hash or a branch I can pull
from would be nice.

Thanks

Chris

On Mon, Nov 6, 2023 at 10:32 AM Nhat Pham  wrote:
>
> Changelog:
> v5:
>* Replace reference getting with an rcu_read_lock() section for
>  zswap lru modifications (suggested by Yosry)
>* Add a new prep patch that allows mem_cgroup_iter() to return
>  online cgroup.
>* Add a callback that updates pool->next_shrink when the cgroup is
>  offlined (suggested by Yosry Ahmed, Johannes Weiner)
> v4:
>* Rename list_lru_add to list_lru_add_obj and __list_lru_add to
>  list_lru_add (patch 1) (suggested by Johannes Weiner and
>  Yosry Ahmed)
>* Some cleanups on the memcg aware LRU patch (patch 2)
>  (suggested by Yosry Ahmed)
>* Use event interface for the new per-cgroup writeback counters.
>  (patch 3) (suggested by Yosry Ahmed)
>* Abstract zswap's lruvec states and handling into
>  zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
> v3:
>* Add a patch to export per-cgroup zswap writeback counters
>* Add a patch to update zswap's kselftest
>* Separate the new list_lru functions into its own prep patch
>* Do not start from the top of the hierarchy when encounter a memcg
>  that is not online for the global limit zswap writeback (patch 2)
>  (suggested by Yosry Ahmed)
>* Do not remove the swap entry from list_lru in
>  __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
>* Removed a redundant zswap pool getting (patch 2)
>  (reported by Ryan Roberts)
>* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
>  (patch 5) (suggested by Yosry Ahmed)
>* Remove the per-cgroup zswap shrinker knob (patch 5)
>  (suggested by Yosry Ahmed)
> v2:
>* Fix loongarch compiler errors
>* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
>
> There are currently several issues with zswap writeback:
>
> 1. There is only a single global LRU for zswap, making it impossible to
>perform workload-specific shrinking - a memcg under memory pressure
>cannot determine which pages in the pool it owns, and often ends up
>writing pages from other memcgs. This issue has been previously
>observed in practice and mitigated by simply disabling
>memcg-initiated shrinking:
>
>https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
>But this solution leaves a lot to be desired, as we still do not
>have an avenue for an memcg to free up its own memory locked up in
>the zswap pool.
>
> 2. We only shrink the zswap pool when the user-defined limit is hit.
>This means that if we set the limit too high, cold data that are
>unlikely to be used again will reside in the pool, wasting precious
>memory. It is hard to predict how much zswap space will be needed
>ahead of time, as this depends on the workload (specifically, on
>factors such as memory access patterns and compressibility of the
>memory pages).
>
> This patch series solves these issues by separating the global zswap
> LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
> (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
> new shrinker does not have any parameter that must be tuned by the
> user, and can be opted in or out on a per-memcg basis.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Domenico Cerasuolo (3):
>   zswap: make shrinking memcg-aware
>   mm: memcg: add per-memcg zswap writeback stat
>   selftests: cgroup: update per-memcg zswap writeback selftest
>
> Nhat Pham (3):
>   list_lru: allows explicit memcg and NUMA node selection
>   memcontrol: allows mem_cgroup_iter() to check for onlineness
>   zswap: shrinks zswap pool based on memory pressure
>
>  Documentation/admin-guide/mm/zswap.rst  |   7 +
>  drivers/android/binder_alloc.c  |   5 +-
>  fs/dcache.c |   8 +-
>  fs/gfs2/quota.c |   6