Re: [PATCH v4 0/4] percpu: partial chunk depopulation
Hello Dennis,

On 20/04/21 8:09 pm, Dennis Zhou wrote:

On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:

On 20/04/21 4:27 am, Dennis Zhou wrote:

On Mon, Apr 19, 2021 at 10:50:43PM +, Dennis Zhou wrote:

Hello,

This series is a continuation of Roman's series in [1]. It aims to solve chunks holding onto free pages by adding a reclaim process to the percpu balance work item. The main difference is that nr_empty_pop_pages is now managed at the time of isolation instead of being intermixed. This helps with deciding which chunks to free instead of having to interleave returning chunks to active duty.

The allocation priority is as follows:
  1) appropriate chunk slot increasing until fit
  2) sidelined chunks
  3) full free chunks

The last slot for to_depopulate is never used for allocations.

A big thanks to Roman for initiating the work and being available for iterating on these ideas.

This patchset contains the following 4 patches:
  0001-percpu-factor-out-pcpu_check_block_hint.patch
  0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
  0003-percpu-implement-partial-chunk-depopulation.patch
  0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch

0001 and 0002 are clean-ups. 0003 implements partial chunk depopulation, initially from Roman. 0004 adds a reclaim threshold so we do not need to schedule for every page freed.

This series is on top of percpu$for-5.14 67c2669d69fb.

diffstats below:

Dennis Zhou (2):
  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  percpu: use reclaim threshold instead of running for every page

Roman Gushchin (2):
  percpu: factor out pcpu_check_block_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   5 +
 mm/percpu-km.c       |   5 +
 mm/percpu-stats.c    |  20 ++--
 mm/percpu-vm.c       |  30 ++
 mm/percpu.c          | 252 ++-
 5 files changed, 278 insertions(+), 34 deletions(-)

Thanks,
Dennis

Hello Pratik,

Do you mind testing this series again on POWER9?
The base is available here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14

Thanks,
Dennis

Hello Dennis,

I have tested this patchset on POWER9. I have tried variations of the percpu_test at the top level and with nested cgroup creation, as the test with 1000:10 didn't show any benefits.

This is most likely because the 1 in every 11 still pins every page while 1 in 50 does not. Can you try the patch below on top? I think it may show slightly better perf as well. If it doesn't, I'll just drop it.

I did try it out, although my test spanned only across varying the inner cgroup folders; it didn't seem to show any benefit over the previous test without the patch in terms of being able to spawn fewer memory cgroup folders and still see partial memory depopulation.

The following example shows more consistent benefits with the de-allocation strategy.

Outer: 1000 Inner: 50
# ./percpu_test.sh
Percpu:     6912 kB
Percpu:   532736 kB
Percpu:   278784 kB

I believe it could be a result of bulk freeing within "free_unref_page_commit", where pages are only freed if pcp->count >= pcp->high. As POWER has a larger page size, it would end up creating a smaller number of pages, but with the effects of fragmentation.

This is unrelated to per-cpu pages in slab/slub. Percpu is a separate memory allocator.

You're right, I was actually referencing incorrect code.

Having said that, the patchset and its behavior do look good to me.

Thanks, can I throw the following on the appropriate patches? In the future it's good to be explicit about this because some prefer to credit different emails.

Tested-by: Pratik Sampat

Sure thing, please feel free to add a Tested-by wherever you feel appropriate. I'll be more explicit about them in the future.

Thanks!
Thanks,
Dennis

The following may do a little better on power9:

---
From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
From: Dennis Zhou
Date: Tue, 20 Apr 2021 14:25:20 +
Subject: [PATCH] percpu: convert free page float to bytes

The percpu memory allocator keeps around a minimum number of free pages to ensure we can satisfy atomic allocations. However, we've always kept this number in terms of pages. On certain architectures like arm and powerpc, the default page size could be 64k instead of 4k. So, start with a target number of free bytes and then convert to pages.

Signed-off-by: Dennis Zhou
---
 mm/percpu.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index ba13e683d022..287fe3091244 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -80,6 +80,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -107,11 +108,12 @@
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD
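The effect of switching from a page count to a byte target can be sketched with a little arithmetic. This is an illustration only, not the patch's actual code, and the 16 KiB target below is an assumed value:

```shell
# Sketch: derive the number of reserved empty pages from a fixed byte
# target, so a 64k-page architecture keeps fewer pages in reserve than
# a 4k-page one. The 16 KiB target is assumed for illustration.
target_bytes=$((16 * 1024))
for page_size in 4096 65536; do
    pages=$((target_bytes / page_size))
    [ "$pages" -lt 1 ] && pages=1   # always keep at least one page
    echo "page_size=${page_size}: keep ${pages} pages"
done
```

With a 4k page size this keeps 4 pages in reserve; with 64k pages the same byte budget rounds down to a single page, which is the behavior the patch is after on powerpc.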
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
On 17/04/21 3:17 am, Dennis Zhou wrote:

Hello,

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:

On 17/04/21 12:39 am, Roman Gushchin wrote:

On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:

On 17/04/21 12:04 am, Roman Gushchin wrote:

On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:

On 16/04/21 10:43 pm, Roman Gushchin wrote:

On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:

Hello Dennis,

I apologize for the clutter of logs before; I'm pasting the logs from before and after the percpu test, both for the patchset applied on 5.12-rc6 and for the vanilla 5.12-rc6 kernel.

On 16/04/21 7:48 pm, Dennis Zhou wrote:

Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup. My results of the percpu_test are as follows:

Intel KVM 4CPU:4G

Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:     1952 kB
Percpu:   219648 kB
Percpu:   219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:     2080 kB
Percpu:   219712 kB
Percpu:    72672 kB

I'm able to see an improvement comparable to the one you're seeing too. However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration.

POWER9 KVM 4CPU:4G

Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:     5888 kB
Percpu:   118272 kB
Percpu:   118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:     6144 kB
Percpu:   119040 kB
Percpu:   119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing here?

There shouldn't be. Can you send me the percpu_stats debug output before and after?

I'll paste the whole debug stats before and after here.

5.12-rc6 + patchset
-BEFORE-
Percpu Memory Statistics
Allocation Info:

Hm, this looks highly suspicious.
Here is your stats in a more compact form (before / after the test):

Vanilla
  nr_alloc        :   9038    97046
  nr_dealloc      :   6992    94237
  nr_cur_alloc    :   2046     2809
  nr_max_alloc    :   2178    90054
  nr_chunks       :      3       11
  nr_max_chunks   :      3       47
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :      5       29

Patched
  nr_alloc        :   9040    97048
  nr_dealloc      :   6994    95002
  nr_cur_alloc    :   2046     2046
  nr_max_alloc    :   2208    90054
  nr_chunks       :      3       48
  nr_max_chunks   :      3       48
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :     12       61

So it looks like the number of chunks got bigger, as well as the number of empty_pop_pages? This contradicts what you wrote, so can you, please, make sure that the data is correct and we're not mixing up two cases?

So it looks like for some reason sidelined (depopulated) chunks are not getting freed completely. But I struggle to explain why the initial empty_pop_pages is bigger with the same amount of chunks.

So, can you, please, apply the following patch and provide updated statistics?

Unfortunately, I'm not completely well versed in this area, but yes, the empty_pop_pages number doesn't make sense to me either. I re-ran the numbers trying to make sure my experiment setup is sane, but the results remain the same.

Vanilla
  nr_alloc        :   9040    97048
  nr_dealloc      :   6994    94404
  nr_cur_alloc    :   2046     2644
  nr_max_alloc    :   2169    90054
  nr_chunks       :      3       10
  nr_max_chunks   :      3       47
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :      4       32

With the patchset + debug patch the results are as follows:

Patched
  nr_alloc        :   9040
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
Here is your stats in a more compact form (before / after the test):

Vanilla
  nr_alloc        :   9038    97046
  nr_dealloc      :   6992    94237
  nr_cur_alloc    :   2046     2809
  nr_max_alloc    :   2178    90054
  nr_chunks       :      3       11
  nr_max_chunks   :      3       47
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :      5       29

Patched
  nr_alloc        :   9040    97048
  nr_dealloc      :   6994    95002
  nr_cur_alloc    :   2046     2046
  nr_max_alloc    :   2208    90054
  nr_chunks       :      3       48
  nr_max_chunks   :      3       48
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :     12       61

So it looks like the number of chunks got bigger, as well as the number of empty_pop_pages? This contradicts what you wrote, so can you, please, make sure that the data is correct and we're not mixing up two cases?

So it looks like for some reason sidelined (depopulated) chunks are not getting freed completely. But I struggle to explain why the initial empty_pop_pages is bigger with the same amount of chunks.

So, can you, please, apply the following patch and provide updated statistics?

Unfortunately, I'm not completely well versed in this area, but yes, the empty_pop_pages number doesn't make sense to me either. I re-ran the numbers trying to make sure my experiment setup is sane, but the results remain the same.

Vanilla
  nr_alloc        :   9040    97048
  nr_dealloc      :   6994    94404
  nr_cur_alloc    :   2046     2644
  nr_max_alloc    :   2169    90054
  nr_chunks       :      3       10
  nr_max_chunks   :      3       47
  min_alloc_size  :      4        4
  max_alloc_size  :   1072     1072
  empty_pop_pages :      4       32

With the patchset + debug patch the results are as follows:

Patched
  nr_alloc        :   9040    97048
  nr_dealloc      :   6994    94349
  nr_cur_alloc    :   2046     2699
  nr_max_alloc    :   2194    90054
  nr_chunks       :      3
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
Hello Dennis,

I apologize for the clutter of logs before; I'm pasting the logs from before and after the percpu test, both for the patchset applied on 5.12-rc6 and for the vanilla 5.12-rc6 kernel.

On 16/04/21 7:48 pm, Dennis Zhou wrote:

There shouldn't be. Can you send me the percpu_stats debug output before and after?

I'll paste the whole debug stats before and after here.
5.12-rc6 + patchset

-BEFORE-
Percpu Memory Statistics

Allocation Info:
  unit_size       :   655360
  static_size     :   608920
  reserved_size   :        0
  dyn_size        :    46440
  atom_size       :    65536
  alloc_size      :   655360

Global Stats:
  nr_alloc        :     9040
  nr_dealloc      :     6994
  nr_cur_alloc    :     2046
  nr_max_alloc    :     2208
  nr_chunks       :        3
  nr_max_chunks   :        3
  min_alloc_size  :        4
  max_alloc_size  :     1072
  empty_pop_pages :       12

Per Chunk Stats:
Chunk: <- First Chunk
  nr_alloc        :      859
  max_alloc_size  :     1072
  empty_pop_pages :        0
  first_bit       :    16384
  free_bytes      :        0
  contig_bytes    :        0
  sum_frag        :        0
  max_frag        :        0
  cur_min_alloc   :        4
  cur_med_alloc   :        8
  cur_max_alloc   :     1072
  memcg_aware     :        0

Chunk:
  nr_alloc        :      827
  max_alloc_size  :      992
  empty_pop_pages :        8
  first_bit       :      692
  free_bytes      :   645012
  contig_bytes    :   460096
  sum_frag        :   466420
  max_frag        :   460096
  cur_min_alloc   :        4
  cur_med_alloc   :        8
  cur_max_alloc   :      152
  memcg_aware     :        0

Chunk:
  nr_alloc        :      360
  max_alloc_size  :     1072
  empty_pop_pages :        4
  first_bit       :    29207
  free_bytes      :   506640
  contig_bytes    :   506556
  sum_frag        :       84
  max_frag        :       32
  cur_min_alloc   :        4
  cur_med_alloc   :      156
  cur_max_alloc   :     1072
  memcg_aware     :        1

-AFTER-
Percpu Memory Statistics

Allocation Info:
  unit_size       :   655360
  static_size     :   608920
  reserved_size   :        0
  dyn_size        :    46440
  atom_size       :    65536
  alloc_size      :   655360

Global Stats:
  nr_alloc        :    97048
  nr_dealloc      :    95002
  nr_cur_alloc    :     2046
  nr_max_alloc    :    90054
  nr_chunks       :       48
  nr_max_chunks   :       48
  min_alloc_size  :        4
  max_alloc_size  :     1072
  empty_pop_pages :       61

Per Chunk Stats:
Chunk: <- First Chunk
  nr_alloc        :      859
  max_alloc_size  :     1072
  empty_pop_pages :        0
  first_bit       :    16384
  free_bytes      :        0
  contig_bytes    :        0
  sum_frag        :        0
  max_frag        :        0
  cur_min_alloc   :        4
  cur_med_alloc   :        8
  cur_max_alloc   :     1072
  memcg_aware     :        0

Chunk:
  nr_alloc        :      827
  max_alloc_size  :     1072
  empty_pop_pages :        8
  first_bit       :      692
  free_bytes      :
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup. My results of the percpu_test are as follows:

Intel KVM 4CPU:4G

Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:     1952 kB
Percpu:   219648 kB
Percpu:   219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:     2080 kB
Percpu:   219712 kB
Percpu:    72672 kB

I'm able to see an improvement comparable to the one you're seeing too. However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration.

POWER9 KVM 4CPU:4G

Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:     5888 kB
Percpu:   118272 kB
Percpu:   118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:     6144 kB
Percpu:   119040 kB
Percpu:   119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing here? I will also look through the code to find the reason why POWER isn't depopulating pages.

Thank you,
Pratik

On 08/04/21 9:27 am, Roman Gushchin wrote:

In our production experience the percpu memory allocator sometimes struggles with returning memory to the system. A typical example is the creation of several thousand memory cgroups (each has several chunks of percpu data used for vmstats, vmevents, ref counters, etc.). Deletion and complete release of these cgroups doesn't always lead to a shrinkage of the percpu memory, so sometimes several GBs of memory are wasted.

The underlying problem is fragmentation: to release an underlying chunk, all percpu allocations in it have to be released first. The percpu allocator tends to top up chunks to improve utilization. This means new small-ish allocations (e.g. percpu ref counters) are placed onto almost-filled old-ish chunks, effectively pinning them in memory.

This patchset solves the problem by implementing partial depopulation of percpu chunks: chunks with many empty pages are asynchronously depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
        mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
        rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups, and the size after deleting most of them.

Results:

vanilla:
./percpu_test.sh
Percpu:     7488 kB
Percpu:   481152 kB
Percpu:   481152 kB

with this patchset applied:
./percpu_test.sh
Percpu:     7488 kB
Percpu:   481408 kB
Percpu:   135552 kB

So the total size of the percpu memory was reduced by more than 3.5 times.
v3:
  - introduced pcpu_check_chunk_hint()
  - fixed a bug related to the hint check
  - minor cosmetic changes
  - s/pretends/fixes (cc Vlastimil)

v2:
  - depopulated chunks are sidelined
  - depopulation happens in the reverse order
  - depopulate list made per-chunk type
  - better results due to better heuristics

v1:
  - depopulation heuristics changed and optimized
  - chunks are put into a separate list, depopulation scans this list
  - chunk->isolated is introduced, chunk->depopulate is dropped
  - rearranged patches a bit
  - fixed a panic discovered by krobot
  - made pcpu_nr_empty_pop_pages per chunk type
  - minor fixes

rfc: https://lwn.net/Articles/850508/

Roman Gushchin (6):
  percpu: fix a comment about the chunks ordering
  percpu: split __pcpu_balance_workfn()
  percpu: make pcpu_nr_empty_pop_pages per chunk type
  percpu: generalize pcpu_balance_populated()
  percpu: factor out pcpu_check_chunk_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   4 +-
 mm/percpu-stats.c    |   9 +-
 mm/percpu.c          | 306 +++
 3 files changed, 261 insertions(+), 58 deletions(-)
Re: [RFC v3 0/2] CPU-Idle latency selftest framework
Hello Doug,

On 09/04/21 10:53 am, Doug Smythies wrote:

Hi Pratik, I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs. The core to CPU mappings are:
core 0 has CPUs 0 and 6
core 1 has CPUs 1 and 7
core 2 has CPUs 2 and 8
core 3 has CPUs 3 and 9
core 4 has CPUs 4 and 10
core 5 has CPUs 5 and 11
By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4. Wouldn't it make more sense to test each core once?

Ideally it would be better to run on all the CPUs; however, on the larger systems that I'm testing on, with hundreds of cores and a high thread count, the execution time increases while not bringing any additional information to the table. That is why it made sense to run on only one of the threads of each core, making the experiment faster while preserving accuracy.

To handle various thread topologies it may be worthwhile to parse /sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and use this information to run only once per physical core, rather than assuming the topology. What are your thoughts on a mechanism like this?

With the source CPU always 0, I think the results from the destination CPUs 0 and 6, on core 0, bias the results, at least in the deeper idle states. They don't make much difference in the shallow states. Myself, I wouldn't include them in the results.

I agree, the CPU0->CPU0 same-core interaction is causing a bias. I could omit that observation while computing the average. In the verbose mode I'll omit all the threads of CPU0, and in the default (quick) mode just CPU0's latency can be omitted while computing the average.
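The one-thread-per-core idea above could be sketched as follows. This is my own illustration, not code from the patchset: the helper keeps only CPUs that are the first entry of their own thread_siblings_list, and reads "cpu siblings_list" pairs from stdin so the logic is testable without sysfs (on a real system the pairs would come from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list).

```shell
#!/bin/bash
# Pick one CPU per physical core: a CPU is kept only if it is the first
# sibling listed in its own thread_siblings_list (formats like "0,6" or "0-1").
pick_one_per_core()
{
	while read -r cpu siblings; do
		# strip everything from the first ',' or '-' onwards
		first=${siblings%%[-,]*}
		[ "$cpu" = "$first" ] && echo "$cpu"
	done
}

# Example topology from Doug's i5-10600K: core N has CPUs N and N+6
pick_one_per_core <<'EOF'
0 0,6
1 1,7
2 2,8
3 3,9
4 4,10
5 5,11
6 0,6
7 1,7
8 2,8
9 3,9
10 4,10
11 5,11
EOF
```

For this sample topology the sketch prints CPUs 0 through 5, one per core, regardless of how siblings are numbered.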
Thank you, Pratik

Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          101
0         1          790
0         2          609
0         3          595
0         4          737
0         5          759
0         6          780
0         7          741
0         8          574
0         9          681
0         10         527
0         11         552
Baseline Avg IPI latency(ns): 620

suggest 656 here

---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          471
0         2          420
0         3          462
0         4          454
0         5          468
0         6          453
0         7          473
0         8          380
0         9          483
0         10         492
0         11         454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423

suggest 456 here

---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          112
0         1          866
0         2          663
0         3          851
0         4          1090
0         5          1314
0         6          1941
0         7          1458
0         8          687
0         9          802
0         10         1041
0         11         1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009

suggest 1006 here

---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          75
0         1          16362
0         2          16785
0         3          19650
0         4          17356
0         5          17606
0         6          2217
0         7          17958
0         8          17332
0         9          16615
0         10         17382
0         11         17423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 14730

suggest 17447 here

---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          103
0         1          17416
0         2          17961
0         3          16651
0         4          17867
0         5          17726
0         6          2178
0         7          16620
0         8          20951
0         9          16567
0         10         17131
0         11         17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894

suggest 17645 here

Hope this helps.
... Doug
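Doug's suggested averages can be reproduced by dropping the destination CPUs that share a core with the source (CPUs 0 and 6 on his i5-10600K). A small awk sketch of my own (not part of the patchset) that recomputes an average this way from "dest_cpu latency" pairs:

```shell
#!/bin/bash
# Average the latency column while skipping destinations on the source core.
avg_excluding_core0()
{
	awk -v skip="0 6" '
	BEGIN { n_s = split(skip, s, " "); for (i = 1; i <= n_s; i++) drop[s[i]] = 1 }
	{ if (!($1 in drop)) { sum += $2; cnt++ } }
	END { if (cnt) printf "%d\n", sum / cnt }'
}

# dest_cpu latency(ns) pairs from the baseline table in Doug's report
avg_excluding_core0 <<'EOF'
0 101
1 790
2 609
3 595
4 737
5 759
6 780
7 741
8 574
9 681
10 527
11 552
EOF
```

Fed the baseline column, it prints 656, matching Doug's suggested value.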
Re: [RFC v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hello Doug, Thanks for your review.

On 02/04/21 4:57 am, Doug Smythies wrote:

Hi Pratik,

On Thu, Apr 1, 2021 at 4:45 AM Pratik Rajesh Sampat wrote:
...
To run this test specifically: $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

I have not become any smarter than I was with version 1, and still assumed that the "$" meant regular user. Please put it as "#" or separate the two steps, compile and run.

Apologies, I missed incorporating the root user discussion we had. I'll add a sudo past the "$" symbol.

There are a few optinal arguments too that the script can take

optional

Suggest to also specifically mention how to run without re-compile, # ./cpuidle.sh -v

Sure thing, I'll add a comment specifying this.

Note also that the test still leaves all idle states disabled when done.

Yes, I missed enabling all the idle states after the tests are done. I'll spin a new version where I enable idle states at the end of the experiment so that the system stays coherent.

[-h ] [-i ] [-m ] [-o ] [-v (run on all cpus)] Default Output location in: tools/testing/selftest/cpuidle/cpuidle.log
...
+cpu_is_online()
+{
+	cpu=$1
+	if [ ! -f "/sys/devices/system/cpu/cpu$cpu/online" ]; then
+		echo 0

incorrect. should be:

+		echo 1

Right! Thanks for catching this.
... Doug

Thank you, Pratik
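The corrected helper Doug points at could look like the sketch below (my own wording, not the actual patch): a CPU whose sysfs directory has no 'online' file cannot be offlined, typically CPU0, so it should be reported as online.

```shell
#!/bin/bash
# Report 1 if the CPU is online, 0 otherwise. CPUs without an 'online'
# sysfs file (usually CPU0) cannot be offlined and are treated as online.
cpu_is_online()
{
	local cpu=$1
	local f="/sys/devices/system/cpu/cpu$cpu/online"
	if [ ! -f "$f" ]; then
		echo 1	# no online file => cannot be offlined => online
		return
	fi
	cat "$f"
}

cpu_is_online 0
```

Note the edge case this also covers: for CPU numbers that do not exist the file is absent as well, so a caller should still validate the CPU range separately.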
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Doug,

On 20/03/21 8:34 pm, Doug Smythies wrote:

On Wed, Mar 17, 2021 at 11:44 PM Pratik Sampat wrote:

Hi Doug, Thanks for trying these patches out.

On 18/03/21 2:30 am, Doug Smythies wrote:

Hi Pratik, It just so happens that I have been trying Artem's version this last week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
... ...

Other notes: No idle state for CPU 0 ever gets disabled. I assume this is because CPU 0 can never be offline, so that bit of code (Disable all stop states) doesn't find its state. By the way, processor = Intel i5-9600K

I had tried these patches on an IBM POWER 9 processor and disabling CPU0's idle state works there. However, it does make sense for some processors to treat CPU 0 differently. Maybe I could write in a case where, if idle state disabling fails for a CPU, we just skip it?

I didn't try it, I just did a hack so I could continue for this reply.

Sure. In a subsequent version I could write something to cleanly handle online fail checks, maybe even specifically for CPU0.

The system is left with all idle states disabled, well not for CPU 0 as per the above comment. The suggestion is to restore them, otherwise my processor hogs 42 watts instead of 2. My results are highly variable per test.

Question: Do you notice high variability with the IPI test, the Timer test, or both?

The IPI test has less variability than the Timer test.

I can think of two reasons for high run to run variance:

1. If you observe variance in timer tests, then I believe there could be a mechanism of "C-state pre-wake" on some Intel machines at play here, which can pre-wake a CPU from an idle state when timers are armed. I'm not sure if the Intel platform that you're running on does that or not. Artem had described this behavior to me a while ago and I think his wult page describes this behavior in more detail: https://intel.github.io/wult/#c-state-pre-wake

Yes, I have reviewed all the references.
And yes, I think my processors have the pre-wake stuff. I do not have the proper hardware to do the Artem pre-wake workaround method, but might buy it in future. That explains the variability that we are seeing in the Timer tests on the Intel processor you've tried on. Also based on the data pasted below, it means that the IPI tests are more reliable than Timers. Maybe it would be better to not run the Timer test on Intel platforms that support this pre-wakeup feature? However, I don't know how (or if) Intel exposes this information to the userspace and if other platforms like AMD also have this feature in some form. Another way of solving this problem could be to have the timer test as an optional parameter in the selftest for people to use while also printing a disclaimer for x86 users about the potential hardware design? -- Thanks Pratik 2. I have noticed variability in results when there are kernel book-keeping or jitter tasks scheduled from time to time on an otherwise idle core. In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log can you spot any obvious outliers per-CPU state? Yes. I'll just paste in an example cpuidle.log file having used the -v option below, along with my hack job diff. 
doug@s19:~/temp-k-git/linux/tools/testing/selftests/cpuidle$ cat cpuidle.log.v3-1
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          140
0         1          632
0         2          675
0         3          671
0         4          675
0         5          767
0         6          653
0         7          826
0         8          819
0         9          615
0         10         758
0         11         758
Baseline Avg IPI latency(ns): 665
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          484
0         2          494
0         3          539
0         4          498
0         5          491
0         6          474
0         7          434
0         8          544
0         9          476
0         10         447
0         11         467
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 452
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          72
0         1          1081
0         2          821
0         3          1486
0         4          1022
0         5          960
0         6          1634
0         7          933
0         8          1032
0         9          1046
0         10         1430
0         11         1338
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1071
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          264
0         1          30836
0         2          30562
0         3          30748
0         4          35286
0         5          30978
0         6          1952
0         7
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Doug, Thanks for trying these patches out.

On 18/03/21 2:30 am, Doug Smythies wrote:

Hi Pratik, It just so happens that I have been trying Artem's version this last week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
...
To run this test specifically: $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

While I suppose it should have been obvious, I interpreted the "$" sign to mean I could run as a regular user, which I can not.

Ah yes, this does need root privileges; I should have prefixed the command with sudo in the instructions for better understanding.

There are a few optinal arguments too that the script can take [-h ] [-m ] [-o ] [-v (run on all cpus)] Default Output location in: tools/testing/cpuidle/cpuidle.log

Isn't it: tools/testing/selftests/cpuidle/cpuidle.log

My bad, it was a typing error. I missed the "selftest" directory while typing it out.

? At least, that is where my file was.

Other notes: No idle state for CPU 0 ever gets disabled. I assume this is because CPU 0 can never be offline, so that bit of code (Disable all stop states) doesn't find its state. By the way, processor = Intel i5-9600K

I had tried these patches on an IBM POWER 9 processor and disabling CPU0's idle state works there. However, it does make sense for some processors to treat CPU 0 differently. Maybe I could write in a case where, if idle state disabling fails for a CPU, we just skip it?

The system is left with all idle states disabled, well not for CPU 0 as per the above comment. The suggestion is to restore them, otherwise my processor hogs 42 watts instead of 2. My results are highly variable per test.

Question: Do you notice high variability with the IPI test, the Timer test, or both?

I can think of two reasons for high run to run variance:

1.
If you observe variance in timer tests, then I believe there could be a mechanism of "C-state pre-wake" on some Intel machines at play here, which can pre-wake a CPU from an idle state when timers are armed. I'm not sure if the Intel platform that you're running on does that or not. Artem had described this behavior to me a while ago and I think his wult page describes this behavior in more detail: https://intel.github.io/wult/#c-state-pre-wake

2. I have noticed variability in results when there are kernel book-keeping or jitter tasks scheduled from time to time on an otherwise idle core. In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log can you spot any obvious outliers per CPU state?

Also you may want to run the test in verbose mode, which runs for all the threads of each CPU, with the following: "sudo ./cpuidle.sh -v". While latency mostly matters on a per-core basis, it still may be a good idea to see if that changes anything in the observations.

--
Thanks and regards,
Pratik

My system is very idle: Example (from turbostat at 6 seconds sample rate):

Busy%  Bzy_MHz  IRQ  PkgTmp  PkgWatt  RAMWatt
0.03   4600     153  28      2.03     1.89
0.01   4600     103  29      2.03     1.89
0.05   4600     115  29      2.08     1.89
0.01   4600     95   28      2.09     1.89
0.03   4600     114  28      2.11     1.89
0.01   4600     107  29      2.07     1.89
0.02   4600     102  29      2.11     1.89
... ... Doug
Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement
On 03/09/20 8:20 pm, Artem Bityutskiy wrote:

On Thu, 2020-09-03 at 17:30 +0530, Pratik Sampat wrote:

I certainly did not know that the Intel architecture is aware of timers and pre-wakes the CPUs, which makes the timer experiment observations void.

Well, things depend on the platform; it is not really "void", it is just different and it measures an optimized case. The result may be a smaller observed latency. And things depend on the platform.

Of course, this will be for just software observability and hardware can be more complex, with each architecture behaving differently. However, we are also collecting a baseline measurement wherein we run the same test on a 100% busy CPU, and the measurement of latency from that could be considered to be the kernel-userspace overhead. The rest of the measurements would be considered keeping this baseline in mind.

Yes, this should give the idea of the overhead, but still, at least for many Intel platforms I would not be comfortable using the resulting number (measured latency - baseline) for a cpuidle driver, because there are just too many variables there. I am not sure I could assume the baseline measured this way is an invariant - it could be noticeably different depending on whether you use C-states or not.

At least on Intel platforms, this will mean that the IPI method won't cover deep C-states like, say, PC6, because one CPU is busy. Again, not saying this is not interesting, just pointing out the limitation.

That's a valid point. We have similar deep idle states in POWER too. The idea here is that this test should be run on an already idle system; of course there will be kernel jitter along the way which can cause a little skewness in observations across some CPUs, but I believe the observations overall should be stable.

If baseline and cpuidle latency are numbers of the same order of magnitude, and you are measuring on a controlled lab system, maybe yes.
But if baseline is, say, in milliseconds, and you are measuring a 10 microseconds C-state, then probably no.

This makes complete sense. The magnitude of deviations being greater than the scope of the experiment may not be very useful in quantifying the latency metric. One way to minimize the baseline overhead is to make this a kernel module: https://lkml.org/lkml/2020/7/21/567. However, the overhead is unavoidable, but it definitely can be further minimized by using the external approach suggested by you in your LPC talk.

Another solution to this could be using isolcpus, but that just increases the complexity all the more. If you have any suggestions of any other way that could guarantee idleness that would be great.

Well, I did not try to guarantee idleness. I just use timers and an external device (the network card), so no CPU needs to be busy and the system can enter deep C-states. Then I just look at the median, 99th percentile, etc. But by all means IPI is also a very interesting experiment. It just covers a different usage scenario. When I started experimenting in this area, one of my main early takeaways was the realization that C-state latency really depends on the event source.

That is an interesting observation. On POWER systems, where we don't have timer related wakeup optimizations, the readings from this test do signify a difference between the latencies of an IPI versus the latency gathered after a timer interrupt. However, these timer based variations weren't as prominent on my Intel based ThinkPad t480, which is in agreement with your observations.

This discussion does help! Although this approach may not help quantify latency deviations at a hardware-accurate level, it could still be helpful in quantifying this metric from a software observability point of view.

Thanks! Pratik
Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement
Hello Artem,

On 02/09/20 8:55 pm, Artem Bityutskiy wrote:

On Wed, 2020-09-02 at 17:15 +0530, Pratik Rajesh Sampat wrote:

Measure cpuidle latencies on wakeup to determine and compare with the advertised wakeup latencies for each idle state.

Thank you for pointing me to your talk. It was very interesting! I certainly did not know that the Intel architecture is aware of timers and pre-wakes the CPUs, which makes the timer experiment observations void.

It looks like the measurements include more than just the C-state wake; they also include the overhead of waking up the process, the context switch, and potentially any interrupts that happen on that CPU. I am not saying this is not interesting data, it surely is, but it is going to be larger than you see in cpuidle latency tables. Potentially significantly larger.

The measurements will definitely include more overhead than just the C-state wakeup. However, we are also collecting a baseline measurement wherein we run the same test on a 100% busy CPU, and the measurement of latency from that could be considered to be the kernel-userspace overhead. The rest of the measurements would be considered keeping this baseline in mind.

Therefore, I am not sure this program should be advertised as "cpuidle measurement". It really measures the "IPI latency" in case of the IPI method.

Now with the newfound knowledge of timers in Intel, I understand that this really only seems to measure IPI latency and not timer latency, although both the observations shouldn't be too far off anyway. A baseline measurement for each case of IPI and timers is taken at 100 percent CPU usage to quantify the kernel-userspace overhead during execution.

At least on Intel platforms, this will mean that the IPI method won't cover deep C-states like, say, PC6, because one CPU is busy. Again, not saying this is not interesting, just pointing out the limitation.

That's a valid point. We have similar deep idle states in POWER too.
The idea here is that this test should be run on an already idle system; of course there will be kernel jitter along the way which can cause a little skewness in observations across some CPUs, but I believe the observations overall should be stable.

Another solution to this could be using isolcpus, but that just increases the complexity all the more. If you have any suggestions of any other way that could guarantee idleness that would be great.

I was working on somewhat similar stuff for x86 platforms, and I am almost ready to publish that on github. I can notify you when I do so if you are interested. But here is a small presentation of the approach that I did at Plumbers last year: https://youtu.be/Opk92aQyvt0?t=8266 (the link points to the start of my talk)

Sure thing. Do notify me when it comes up. I would be happy to have a look at it.

--
Thanks! Pratik
Re: [PATCH] Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"
On 26/08/20 2:07 pm, Christophe Leroy wrote:

Le 26/08/2020 à 10:29, Pratik Rajesh Sampat a écrit :

The cpuidle stop state implementation has minor optimizations for P10 where hardware preserves more SPR registers compared to P9. The current P9 driver works for P10, although it does a few extra save-restores. The P9 driver can provide the required power management features like SMT thread folding and core level power savings on a P10 platform.

Until the P10 stop driver is available, revert the commit which allows only P9 systems to utilize cpuidle and blocks all idle stop states for P10. CPU idle states are enabled and tested on the P10 platform with this fix.

This reverts commit 8747bf36f312356f8a295a0c39ff092d65ce75ae.

Fixes: 8747bf36f312 ("powerpc/powernv/idle: Replace CPU feature check with PVR check")
Signed-off-by: Pratik Rajesh Sampat
---
@mpe: This revert would resolve a staging issue wherein the P10 stop driver is not yet ready while cpuidle stop states need not be blocked on 5.9 for Power10 systems, which could cause SMT folding related performance issues.

The P10 stop driver is in the works here: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html

 arch/powerpc/platforms/powernv/idle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 77513a80cef9..345ab062b21a 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -1223,7 +1223,7 @@ static void __init pnv_probe_idle_states(void)
 		return;
 	}

-	if (pvr_version_is(PVR_POWER9))
+	if (cpu_has_feature(CPU_FTR_ARCH_300))

Why not something like:

	if (pvr_version_is(PVR_POWER9) || pvr_version_is(PVR_POWER10))
		pnv_power9_idle_init();

In order to use PVR_POWER10 I would need to define it under arch/powerpc/include/asm/reg.h, which is not present in 5.9 yet.
However, if it is okay with @mpe, I could split out Nick's P10 stop driver (https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html) into two parts:

1. This could include minimal code to introduce the P10 PVR and the stop wrappers for it. Although this patch internally still calls into the P9 path, this should gracefully fix the issue.
2. Then later in the series we could introduce the P10 callback methods as they are in Nick's series.

---
Thanks
Pratik

	pnv_power9_idle_init();

	for (i = 0; i < nr_pnv_idle_states; i++)

Christophe
Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
Hello Rafael,

On 27/07/20 7:12 pm, Rafael J. Wysocki wrote:

On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat wrote:

Fire directed smp_call_function_single IPIs from a specified source CPU to the specified target CPU to reduce the noise we have to wade through in the trace log.

And what's the purpose of it?

The idea for this comes from the fact that estimating wake-up latencies and residencies for stop states is not an easy task. The purpose is essentially to determine wakeup latencies that are caused by either an IPI or a timer, and compare them with the advertised wakeup latencies for each stop state. This might help in determining the accuracy of our advertised values and/or whether they need any re-calibration.

The module is based on the idea written by Srivatsa Bhat and maintained by Vaidyanathan Srinivasan internally.

Queue an HR timer and measure jitter. Wakeup latency measurement for idle states using hrtimer. Echo a value in ns to timer_test_function and watch the trace. An hrtimer will be queued, and when it fires the expected wakeup vs actual wakeup is computed and the delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be integrated with selftests.

To include the module, check the option and include as a module:
Kernel hacking -> Cpuidle latency selftests

[srivatsa.b...@linux.vnet.ibm.com: Initial implementation in cpuidle/sysfs]
[sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer and fix some of the time calculation]
[e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and increase the resolution of IPI wakeup]
Signed-off-by: Pratik Rajesh Sampat
Reviewed-by: Gautham R.
Shenoy
---
 drivers/cpuidle/Makefile               |   1 +
 drivers/cpuidle/test-cpuidle_latency.c | 150 +
 lib/Kconfig.debug                      |  10 ++
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index f07800cbb43f..2ae05968078c 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -8,6 +8,7 @@
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)              += dt_idle_states.o
 obj-$(CONFIG_ARCH_HAS_CPU_RELAX)          += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)            += cpuidle-haltpoll.o
+obj-$(CONFIG_IDLE_LATENCY_SELFTEST)       += test-cpuidle_latency.o

 ##
 # ARM SoC drivers

diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c
new file mode 100644
index ..61574665e972
--- /dev/null
+++ b/drivers/cpuidle/test-cpuidle_latency.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Module-based API test facility for cpuidle latency using IPIs and timers

I'd like to see a more detailed description of what it does and how it works here.

Right, I'll add that. Based on comments from Daniel I have also been working on a user-space only variant of this test, as that does seem like a better way to go. The only downside is that the latency will be higher, but as we are taking baseline measurements, the diff of that from our observed reading should still remain the same. Just that the test will take longer to run. I'm yet to accurately confirm this. I would appreciate your thoughts on that.
+ */
+
+#include 
+#include 
+#include 
+
+/* IPI based wakeup latencies */
+struct latency {
+	unsigned int src_cpu;
+	unsigned int dest_cpu;
+	ktime_t time_start;
+	ktime_t time_end;
+	u64 latency_ns;
+} ipi_wakeup;
+
+static void measure_latency(void *info)
+{
+	struct latency *v;
+	ktime_t time_diff;
+
+	v = (struct latency *)info;
+	v->time_end = ktime_get();
+	time_diff = ktime_sub(v->time_end, v->time_start);
+	v->latency_ns = ktime_to_ns(time_diff);
+}
+
+void run_smp_call_function_test(unsigned int cpu)
+{
+	ipi_wakeup.src_cpu = smp_processor_id();
+	ipi_wakeup.dest_cpu = cpu;
+	ipi_wakeup.time_start = ktime_get();
+	smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
+}
+
+/* Timer based wakeup latencies */
+struct timer_data {
+	unsigned int src_cpu;
+	u64 timeout;
+	ktime_t time_start;
+	ktime_t time_end;
+	struct hrtimer timer;
+	u64 timeout_diff_ns;
+} timer_wakeup;
+
+static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
+{
+	struct timer_data *w;
+	ktime_t time_diff;
+
+	w = container_of(hrtimer, struct timer_data, timer);
+	w->time_end = ktime_get();
+
+	time_diff = ktime_sub(w->time_end, w->time_start);
+	time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
+	w->timeout_diff_ns = ktime_to_ns(time_diff);
+	return HRTIMER_NORESTART;
+}
+
+static void run_timer_test(unsigned int ns)
+{
+
Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
On 24/07/20 6:55 am, Michael Neuling wrote:

On Fri, 2020-07-10 at 10:52 +0530, Pratik Rajesh Sampat wrote:

Additional registers DAWR0, DAWRX0 may be lost on Power 10 for stop levels < 4. Therefore save the values of these SPRs before entering a "stop" state and restore their values on wakeup.

Signed-off-by: Pratik Rajesh Sampat
---
 arch/powerpc/platforms/powernv/idle.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..f2e2a6a4c274 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
 	u64 iamr;
 	u64 amor;
 	u64 uamor;
+	u64 dawr0;
+	u64 dawrx0;
 };

 static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
@@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 		sprs.iamr = mfspr(SPRN_IAMR);
 		sprs.amor = mfspr(SPRN_AMOR);
 		sprs.uamor = mfspr(SPRN_UAMOR);
+		if (cpu_has_feature(CPU_FTR_ARCH_31)) {

You are actually viewing an old version of the patches. The main points of change were based on comments from Nick Piggin: I have changed the top level function check from ARCH_300 to a P9 PVR check instead. A similar thing needs to be done for P10; however, as the P10 PVR isn't exposed yet, I've shelved this particular patch.

Nick's comment to check based on PVR: https://lkml.org/lkml/2020/7/13/1018
v4 of the series: https://lkml.org/lkml/2020/7/21/784

Thanks for your review,
Pratik

Can you add a comment here saying even though DAWR0 is ARCH_30, it's only required to be saved on 31. Otherwise this looks pretty odd.
+			sprs.dawr0 = mfspr(SPRN_DAWR0);
+			sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+		}

 	srr1 = isa300_idle_stop_mayloss(psscr);		/* go idle */

@@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 		mtspr(SPRN_IAMR,	sprs.iamr);
 		mtspr(SPRN_AMOR,	sprs.amor);
 		mtspr(SPRN_UAMOR,	sprs.uamor);
+		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+			mtspr(SPRN_DAWR0, sprs.dawr0);
+			mtspr(SPRN_DAWRX0, sprs.dawrx0);
+		}

 	/*
 	 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT
Re: [PATCH v3 0/2] Selftest for cpuidle latency measurement
Hello Daniel,

On 21/07/20 8:27 pm, Daniel Lezcano wrote:

On 21/07/2020 14:42, Pratik Rajesh Sampat wrote:

v2: https://lkml.org/lkml/2020/7/17/369

Changelog v2-->v3
Based on comments from Gautham R. Shenoy, adding the following in the selftest:
1. Grepping modules to determine if already loaded
2. Wrapper to enable/disable states
3. Preventing any operation/test on offlined CPUs
---
The patch series introduces a mechanism to measure wakeup latency for IPI and timer based interrupts. The motivation behind this series is to find significant deviations from advertised latency and residency values.

Why do you want to measure for the timer and the IPI? Whatever the source of the wakeup, the exit latency remains the same, no?

Is all this kernel-ish code really needed? What about using a highres periodic timer and making it expire every e.g. 50ms x 2400, so it is 120 seconds, and measuring the deviation. Repeat the operation for each idle state. And in order to make it as accurate as possible, set the program affinity on a CPU and isolate this one by preventing other processes from being scheduled on it, and migrate the interrupts to the other CPUs. That will be all userspace code, no?

The kernel module may not be needed now that you mention it. IPI latencies could be measured using pipes and threads, using pthread_attr_setaffinity_np to control the experiment, as you suggested. This should internally fire a smp_call_function_single. The original idea was to essentially measure it as closely as possible in the kernel without involving the kernel-->userspace overhead. However, the user-space approach may not be too much of a problem, as we are collecting a baseline and the delta of the latency is what we would be concerned about anyway!

With respect to measuring both timer and IPI latencies: in principle yes, the exit latency should remain the same, but if there is a deviation in reality we may want to measure it.
I'll implement this experiment in the userspace and get back with the numbers to confirm. Thanks for your comments! Best, Pratik
Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Gautham, Thanks for the review.

On 20/07/20 11:22 am, Gautham R Shenoy wrote:

Hi Pratik,

On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:

This patch adds support to trace IPI based and timer based wakeup latency from idle states.

Latches onto the test-cpuidle_latency kernel module using the debugfs interface to send IPIs or schedule a timer based event, which in turn populates the debugfs with the latency measurements.

Currently for the IPI and timer tests, first disable all idle states and then test for latency measurements while incrementally enabling each state.

Signed-off-by: Pratik Rajesh Sampat

A few comments below.

---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 257 +
 tools/testing/selftests/cpuidle/settings   |   1 +
 4 files changed, 265 insertions(+)
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

[..skip..]

+
+ins_mod()
+{
+	if [ ! -f "$MODULE" ]; then
+		printf "$MODULE module does not exist. Exitting\n"

If the module has been compiled into the kernel (due to a localyesconfig, for instance), then it is unlikely that we will find it in /lib/modules. Perhaps you want to check if the debugfs directories created by the module exist, and if so, print a message saying that the module is already loaded or some such?

That's a good idea. I can grep for this module within /proc/modules and not insert it if it is already there.

+		exit $ksft_skip
+	fi
+	printf "Inserting $MODULE module\n\n"
+	insmod $MODULE
+	if [ $? != 0 ]; then
+		printf "Insmod $MODULE failed\n"
+		exit $ksft_skip
+	fi
+}
+
+compute_average()
+{
+	arr=("$@")
+	sum=0
+	size=${#arr[@]}
+	for i in "${arr[@]}"
+	do
+		sum=$((sum + i))
+	done
+	avg=$((sum/size))

It would be good to assert that "size" isn't 0 here.
Sure

+}
+
+# Disable all stop states
+disable_idle()
+{
+	for ((cpu=0; cpu /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable

So, on offlined CPUs, we won't see the /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You should probably perform this operation only on online CPUs.

Right. I should make CPU operations only on online CPUs all over the script.

[..snip..]

Thanks
Pratik
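The empty-input guard Gautham asked for could look like the sketch below (my own rendering, not the actual respin); the result is still returned through the global `avg`, as in the original script.

```shell
#!/bin/bash
# Average a list of integers; refuse to divide by zero on empty input.
compute_average()
{
	local arr=("$@")
	local sum=0
	local size=${#arr[@]}

	if [ "$size" -eq 0 ]; then
		printf "compute_average: no samples\n" >&2
		return 1
	fi
	for i in "${arr[@]}"; do
		sum=$((sum + i))
	done
	avg=$((sum / size))
}

compute_average 100 200 300 && echo "avg=$avg"
```

Returning a non-zero status (rather than exiting) lets the caller decide whether an empty sample set should skip the state or fail the test.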
Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
On 20/07/20 5:27 am, Nicholas Piggin wrote: Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am: Rename the variable "pnv_first_spr_loss_level" to "pnv_first_fullstate_loss_level". pnv_first_spr_loss_level is supposed to be the earliest state that has OPAL_PM_LOSE_FULL_CONTEXT set; however, since shallow states also lose SPR values, the current name is misleading. It also doesn't lose "full" state at this loss level though. From the architecture it could be called "hv state loss level", but in POWER10 even that is not strictly true. Right. Just discovered that deep stop states won't lose full state P10 onwards. Would it be better to rename it "pnv_all_spr_loss_state" instead, so that it stays generic enough while being semantically coherent? Thanks Pratik Signed-off-by: Pratik Rajesh Sampat --- arch/powerpc/platforms/powernv/idle.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index f62904f70fc6..d439e11af101 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -48,7 +48,7 @@ static bool default_stop_found; * First stop state levels when SPR and TB loss can occur. */ static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1; -static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1; +static u64 pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1; /* * psscr value and mask of the deepest stop idle state. @@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) */ mmcr0 = mfspr(SPRN_MMCR0); } - if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) { + if ((psscr & PSSCR_RL_MASK) >= pnv_first_fullstate_loss_level) { sprs.lpcr = mfspr(SPRN_LPCR); sprs.hfscr = mfspr(SPRN_HFSCR); sprs.fscr = mfspr(SPRN_FSCR); @@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) * just always test PSSCR for SPR/TB state loss.
*/ pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT; - if (likely(pls < pnv_first_spr_loss_level)) { + if (likely(pls < pnv_first_fullstate_loss_level)) { if (sprs_saved) atomic_stop_thread_idle(); goto out; @@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void) * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state. */ pnv_first_tb_loss_level = MAX_STOP_STATE + 1; - pnv_first_spr_loss_level = MAX_STOP_STATE + 1; + pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1; for (i = 0; i < nr_pnv_idle_states; i++) { int err; struct pnv_idle_states_t *state = &pnv_idle_states[i]; @@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void) pnv_first_tb_loss_level = psscr_rl; if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) && -(pnv_first_spr_loss_level > psscr_rl)) - pnv_first_spr_loss_level = psscr_rl; +(pnv_first_fullstate_loss_level > psscr_rl)) + pnv_first_fullstate_loss_level = psscr_rl; /* * The idle code does not deal with TB loss occurring @@ -,8 +,8 @@ static void __init pnv_power9_idle_init(void) * compatibility. */ if ((state->flags & OPAL_PM_TIMEBASE_STOP) && -(pnv_first_spr_loss_level > psscr_rl)) - pnv_first_spr_loss_level = psscr_rl; +(pnv_first_fullstate_loss_level > psscr_rl)) + pnv_first_fullstate_loss_level = psscr_rl; err = validate_psscr_val_mask(&state->psscr_val, &state->psscr_mask, @@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void) } pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%llx\n", - pnv_first_spr_loss_level); + pnv_first_fullstate_loss_level); pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%llx\n", pnv_first_tb_loss_level); -- 2.25.4
Re: [PATCH v3 1/3] powerpc/powernv/idle: Replace CPU features checks with PVR checks
On 20/07/20 5:30 am, Nicholas Piggin wrote: Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am: Since the idle framework's architecture is incomplete, check the Processor Version Register (PVR) instead of just the processor type advertised in the device-tree CPU features, so that processor checks can be made at a finer granularity. Signed-off-by: Pratik Rajesh Sampat --- arch/powerpc/platforms/powernv/idle.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index 2dd467383a88..f62904f70fc6 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -92,7 +92,7 @@ static int pnv_save_sprs_for_deep_states(void) if (rc != 0) return rc; - if (cpu_has_feature(CPU_FTR_ARCH_300)) { + if (pvr_version_is(PVR_POWER9)) { rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val); if (rc) return rc; @@ -116,7 +116,7 @@ static int pnv_save_sprs_for_deep_states(void) return rc; /* Only p8 needs to set extra HID registers */ - if (!cpu_has_feature(CPU_FTR_ARCH_300)) { + if (!pvr_version_is(PVR_POWER9)) { rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val); if (rc != 0) What I think you should do is keep using CPU_FTR_ARCH_300 for this stuff which is written for power9 and we know is running on power9, because that's a faster test (static branch, and it does not have to read the PVR). And then... @@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void) return; } - if (cpu_has_feature(CPU_FTR_ARCH_300)) + if (pvr_version_is(PVR_POWER9)) pnv_power9_idle_init(); for (i = 0; i < nr_pnv_idle_states; i++) Here is where you would put the version check. Once we have code that can also handle P10 (either by testing CPU_FTR_ARCH_31, or by adding an entirely new power10 idle function), then you can add the P10 version check here.
Sure, it makes sense to make this check in the top-level function and retain CPU_FTR_ARCH_300 lower in the call chain for speed. I'll make that change. Thanks Pratik Thanks, Nick
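As a rough illustration of the split Nick describes — the finer-grained PVR check once at probe time, the cheap static-branch feature test on the idle hot path — here is a toy model. Only pvr_version_is() and cpu_has_feature(CPU_FTR_ARCH_300) are real kernel APIs; every helper and value below is a mocked stand-in for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for PVR_POWER9 from the kernel headers (illustrative). */
#define MOCK_PVR_POWER9 0x004e

static unsigned int mock_pvr = MOCK_PVR_POWER9;	/* assumed machine */

/* Mock of pvr_version_is(): reads the (slower) PVR. Run it once, at probe. */
static bool mock_pvr_version_is(unsigned int pvr)
{
	return mock_pvr == pvr;
}

/* Mock of cpu_has_feature(CPU_FTR_ARCH_300): in the kernel this is a
 * static-branch test, cheap enough for the idle entry hot path. */
static bool mock_cpu_has_feature_arch_300(void)
{
	return true;
}

static bool p9_idle_ready;

/* Probe path: the one-time, version-specific check uses the PVR. */
static void probe_idle_states_sketch(void)
{
	if (mock_pvr_version_is(MOCK_PVR_POWER9))
		p9_idle_ready = true;
}

/* Hot path: keep the cheaper feature test to pick the SPR set to save. */
static int idle_stop_path_sketch(void)
{
	return mock_cpu_has_feature_arch_300() ? 9 : 8;
}
```

The point of the split is cost: the PVR read happens once at boot, while the per-idle-entry path only pays for a patched branch.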
Re: [PATCH v2 0/3] Power10 basic energy management
On 13/07/20 10:20 pm, Nicholas Piggin wrote: Excerpts from Pratik Sampat's message of July 13, 2020 8:02 pm: Thank you for your comments, On 13/07/20 10:53 am, Nicholas Piggin wrote: Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm: Changelog v1 --> v2: 1. Save-restore DAWR and DAWRX unconditionally as they are lost in shallow idle states too 2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to correct naming terminology Pratik Rajesh Sampat (3): powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10 powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable arch/powerpc/platforms/powernv/idle.c | 34 +-- 1 file changed, 22 insertions(+), 12 deletions(-) These look okay to me, but the CPU_FTR_ARCH_300 test for pnv_power9_idle_init() is actually wrong, it should be a PVR test because idle is not completely architected (not even shallow stop states, unfortunately). It doesn't look like we support POWER10 idle correctly yet, and on older kernels it wouldn't work even if we fixed newer, so ideally the PVR check would be backported as a fix in the front of the series. Sadly, we have no OPAL idle driver yet. Hopefully we will before the next processor shows up :P Thanks, Nick So if I understand this correctly, in powernv/idle.c where we check for CPU_FTR_ARCH_300, we should rather be making a pvr_version_is(PVR_POWER9) check instead? Of course, the P10 PVR and its relevant checks will have to be added then too. Yes I think so, unfortunately. Thanks, Nick Sure, I'll add these checks in. Thanks, Pratik
Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
On 09/07/20 2:39 pm, Gautham R Shenoy wrote: On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote: Additional registers DAWR0, DAWRX0 may be lost on POWER10 for stop levels < 4. Adding Ravi Bangoria to the cc. Therefore save the values of these SPRs before entering a "stop" state and restore their values on wakeup. Signed-off-by: Pratik Rajesh Sampat The saving and restoration looks good to me. --- arch/powerpc/platforms/powernv/idle.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index 19d94d021357..471d4a65b1fa 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -600,6 +600,8 @@ struct p9_sprs { u64 iamr; u64 amor; u64 uamor; + u64 dawr0; + u64 dawrx0; }; static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) sprs.tscr = mfspr(SPRN_TSCR); if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) sprs.ldbar = mfspr(SPRN_LDBAR); + if (cpu_has_feature(CPU_FTR_ARCH_31)) { + sprs.dawr0 = mfspr(SPRN_DAWR0); + sprs.dawrx0 = mfspr(SPRN_DAWRX0); + } But this is within the if condition which says if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) This if condition is meant for stop4 and stop5 since these are stop levels that have OPAL_PM_LOSE_HYP_CONTEXT set. Since we can lose DAWR*, on states that lose limited hypervisor context, such as stop0-2, we need to unconditionally save them like AMR, IAMR etc. Right, shallow states also lose DAWR/X. Thanks for pointing it out. I'll fix this and resend.
sprs_saved = true; @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) mtspr(SPRN_MMCR2, sprs.mmcr2); if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) mtspr(SPRN_LDBAR, sprs.ldbar); + if (cpu_has_feature(CPU_FTR_ARCH_31)) { + mtspr(SPRN_DAWR0, sprs.dawr0); + mtspr(SPRN_DAWRX0, sprs.dawrx0); + } Likewise, we need to unconditionally restore these SPRs. mtspr(SPRN_SPRG3, local_paca->sprg_vdso); -- 2.25.4 Thanks Pratik
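The fix being discussed — moving the DAWR0/DAWRX0 save/restore out of the pnv_first_spr_loss_level branch so it happens unconditionally, like AMR/IAMR — can be modeled as below. mfspr/mtspr are mocked here, and the register indices and values are invented purely for the sketch; this is not the kernel code.

```c
#include <assert.h>

/* Mock SPR file: [0] stands in for DAWR0, [1] for DAWRX0. */
static unsigned long mock_sprs[2];

static unsigned long mock_mfspr(int n)            { return mock_sprs[n]; }
static void mock_mtspr(int n, unsigned long v)    { mock_sprs[n] = v; }

struct sprs_sketch { unsigned long dawr0, dawrx0; };

/* Sketch of the idle entry/exit path: DAWR0/DAWRX0 are saved before
 * stop and restored after, regardless of the requested stop depth,
 * because even shallow states may clobber them. */
static void idle_stop_sketch(struct sprs_sketch *sprs)
{
	/* Unconditional save, alongside AMR/IAMR etc. */
	sprs->dawr0 = mock_mfspr(0);
	sprs->dawrx0 = mock_mfspr(1);

	/* ...stop: model the hardware losing the registers... */
	mock_sprs[0] = 0;
	mock_sprs[1] = 0;

	/* Unconditional restore on wakeup. */
	mock_mtspr(0, sprs->dawr0);
	mock_mtspr(1, sprs->dawrx0);
}
```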
Re: [PATCH 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
[..snip..] + +ins_mod() +{ + if [ ! -f "$MODULE" ]; then + printf "$MODULE module does not exist. Exiting\n" + exit 2 Please use the ksft_skip code to indicate the test is being skipped. Sure thing, I'll use the ksft_skip exit code instead. + fi + printf "Inserting $MODULE module\n\n" + insmod $MODULE + if [ $? != 0 ]; then + printf "Insmod $MODULE failed\n" + exit 2 This is fine since you expect to be able to load the module. Thanks for the review. Pratik [..snip..] thanks, -- Shuah
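Shuah's point maps onto the kselftest exit-code convention, where exit code 4 marks a skipped test. A hedged sketch of the adjusted ins_mod, keeping a hard failure for the insmod error as discussed (the function shape mirrors the quoted snippet; it is not the final script):

```shell
# kselftest convention: exit code 4 means "skipped", not "failed".
ksft_skip=4

ins_mod()
{
	local module="$1"
	if [ ! -f "$module" ]; then
		# Environment cannot run the test (no module file): skip.
		printf "%s module does not exist. Exiting\n" "$module"
		exit "$ksft_skip"
	fi
	printf "Inserting %s module\n\n" "$module"
	if ! insmod "$module"; then
		# We expected to be able to load it: this is a real failure.
		printf "Insmod %s failed\n" "$module"
		exit 2
	fi
}
```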
Re: [RFC 0/1] Alternate history mechanism for the TEO governor
Hello Doug, Thanks a lot for running these benchmarks on an Intel box. On 17/05/20 11:41 pm, Doug Smythies wrote: On 2020.05.11 Pratik Rajesh Sampat wrote: First RFC posting: https://lkml.org/lkml/2020/2/22/27 Summary: On that thread I wrote: > I have done a couple of other tests with this patch set, > but nothing to report yet, as the differences have been > minor so far. I tried your tests, or as close as I could find, and still do not notice much difference. That is quite unfortunate. At least it doesn't seem to regress. Nevertheless, as Rafael suggested, aging is crucial, and this patch doesn't age weights. I do have a version with aging, but it showed a lot of run-to-run variance, so I refrained from posting it. I'm tweaking the logic for aging as well as the distribution of weights; hopefully that may help. For detail, but likely little added value, read on: Kernel: 5.7-rc4: "teo": unmodified kernel. "wtteo": with this patch added. "menu": the menu idle governor, for comparison. CPU frequency scaling driver: intel-cpufreq CPU frequency scaling governor: schedutil CPU idle driver: intel_idle ... Benchmarks: Schbench Benchmarks scheduler wakeup latencies 1. latency 99th percentile - usec I found a Phoronix schbench test. It defaults to the 99.9th percentile.

schbench (usec, 99.9th Latency Percentile, less is better) (8 workers)

threads      teo     wtteo              menu
      2    14197    14194  99.98%     14467 101.90%
      4    46733    46490  99.48%     46554  99.62%
      6    57306    58291 101.72%     57754 100.78%
      8    81408    80768  99.21%     81715 100.38%
     16   157286   156570  99.54%    156621  99.58%
     32   314573   310784  98.80%    315802 100.39%

Powers and other idle statistics were similar. [1]

2. Power - watts (Machine - IBM POWER9)

Latency and Power - Normalized

Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power
      2 |          100 |          101.3  |         85.29
      4 |          100 |         105.06  |        113.63
      8 |          100 |          92.32  |         90.36
     16 |          100 |           99.1  |         92.43

Accuracy

Vanilla TEO Governor - Prediction distribution %

Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct
      2 | 6.12 | 1.08 | 1.76 | 20.41 |  9.2  | 28.74 |   22.51
      4 | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 |   22.67
      8 | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 |   22.95
     16 | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 |   29.34

Threads | OS 1 | OS 2 | OS 3
      2 | 1.77 | 1.27 | 7.14
      4 | 1.8  | 1.31 | 6.71
      8 | 0.65 | 0.72 | 3.2
     16 | 0.63 | 1.71 | 3.68

Weighted TEO Governor - Prediction distribution %

Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct
      2 | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 |   22.13
      4 | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 |   21.01
      8 | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 |   28.9
     16 | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 |   35.92

Threads | OS 1 | OS 2 | OS 3
      2 | 0.39 | 0.42 | 2.31
      4 | 0.45 | 0.51 | 2.89
      8 | 0.53 | 0.66 | 3.06
     16 | 0.97 | 1.9  | 3.05
Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor
On 13/05/20 8:19 pm, Rafael J. Wysocki wrote: On Wed, May 13, 2020 at 7:31 AM Pratik Sampat wrote: Thanks for your comment. On 12/05/20 11:07 pm, Peter Zijlstra wrote: Just a quick note.. On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote: +/* + * Rearrange the weight distribution of the state, increase the weight + * by the LEARNING RATE % for the idle state that was supposed to be + * chosen and reduce by the same amount for rest of the states + * + * If the weights are greater than (100 - LEARNING_RATE) % or lesser + * than LEARNING_RATE %, do not increase or decrease the confidence + * respectively + */ +for (i = 0; i < drv->state_count; i++) { +unsigned int delta; + +if (idx == -1) +break; +if (i == idx) { +delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100; 100 is a crap number to divide by as a computer. We bio-puddings happen to have 10 digits, so 100 makes sense to us, but it does not to our binary friends. Absolutely! I just wrote the code exactly the way I did the Math on paper, definitely need to figure out an optimal way of doing things. There is no particular reason to use percent in computations at all. You may as well use 1/1024 parts instead (and then use shifts instead of divisions). Yes, you're right. Looking at it now, the whole percent system and the divisions seem quite unnecessary; we can achieve the same with bitwise operations. Thanks!
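Rafael's 1/1024 suggestion can be made concrete with shifts: take rate × weight from each losing state and hand the collected sum to the state that turned out to be correct, never dividing by 100. A minimal sketch follows — the WEIGHT_SHIFT/LEARNING_SHIFT names and the flat per-state array are invented for illustration and are not the governor's actual code.

```c
#include <assert.h>

/* Fixed-point weights in 1/1024 parts instead of percent. */
#define WEIGHT_SHIFT	10
#define WEIGHT_ONE	(1U << WEIGHT_SHIFT)	/* 1024 plays the role of 100% */
#define LEARNING_SHIFT	4			/* learning rate = 1/16 ~ 6% */

#define NR_STATES	4

/*
 * Move weight toward the state that was observed to be correct:
 * every other state loses weight >> LEARNING_SHIFT, and the sum of
 * those losses is credited to the correct state. Because only shifts
 * are used, there is no division, and the total weight is preserved.
 */
static void update_weights(unsigned int w[NR_STATES], int correct)
{
	unsigned int taken = 0;
	int i;

	for (i = 0; i < NR_STATES; i++) {
		unsigned int delta = w[i] >> LEARNING_SHIFT;

		if (i == correct)
			continue;
		w[i] -= delta;
		taken += delta;
	}
	w[correct] += taken;
}
```

Starting from a uniform distribution of 256 each (summing to 1024), one update toward state 2 takes 16 from each of the other three states and credits 48 to state 2, keeping the total at WEIGHT_ONE.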