Re: [PATCH v4 0/4] percpu: partial chunk depopulation

2021-04-20 Thread Pratik Sampat

Hello Dennis,


On 20/04/21 8:09 pm, Dennis Zhou wrote:

On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:

On 20/04/21 4:27 am, Dennis Zhou wrote:

On Mon, Apr 19, 2021 at 10:50:43PM +, Dennis Zhou wrote:

Hello,

This series is a continuation of Roman's series in [1]. It aims to solve
chunks holding onto free pages by adding a reclaim process to the percpu
balance work item.

The main difference is that the nr_empty_pop_pages is now managed at
time of isolation instead of intermixed. This helps with deciding which
chunks to free instead of having to interleave returning chunks to
active duty.

The allocation priority is as follows:
1) appropriate chunk slot increasing until fit
2) sidelined chunks
3) full free chunks

The last slot for to_depopulate is never used for allocations.
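The priority order above can be sketched in plain userspace C. This is only an illustration of the described ordering, not kernel code: the slot indices, names, and `pick_slot()` helper are hypothetical.

```c
/* Illustrative sketch of the allocation order described above: scan the
 * appropriate chunk slots in increasing order until one fits, then fall
 * back to sidelined chunks, then to full free chunks. The to_depopulate
 * slot is never used for allocations. All names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum { NR_NORMAL_SLOTS = 4 };
/* hypothetical layout: normal size slots, then sidelined, free, to_depopulate */
enum { SLOT_SIDELINED = NR_NORMAL_SLOTS,
       SLOT_FREE,
       SLOT_TO_DEPOPULATE,
       NR_SLOTS };

/* return the first usable slot, starting at the best-fit slot, or -1 */
static int pick_slot(const bool has_chunk[NR_SLOTS], int start_slot)
{
    /* 1) appropriate chunk slot, increasing until fit */
    for (int s = start_slot; s < NR_NORMAL_SLOTS; s++)
        if (has_chunk[s])
            return s;
    /* 2) sidelined chunks */
    if (has_chunk[SLOT_SIDELINED])
        return SLOT_SIDELINED;
    /* 3) full free chunks */
    if (has_chunk[SLOT_FREE])
        return SLOT_FREE;
    /* to_depopulate is deliberately never considered */
    return -1;
}
```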

A big thanks to Roman for initiating the work and being available for
iterating on these ideas.

This patchset contains the following 4 patches:
0001-percpu-factor-out-pcpu_check_block_hint.patch
0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
0003-percpu-implement-partial-chunk-depopulation.patch
0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch

0001 and 0002 are clean-ups. 0003 implements partial chunk depopulation,
originally from Roman. 0004 adds a reclaim threshold so we do not need to
schedule work for every page freed.
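The reclaim-threshold idea in 0004 can be sketched as follows. This is a hedged userspace illustration only; the threshold value, counter, and function names are hypothetical, and the real kernel uses its workqueue machinery rather than a counter bump.

```c
/* Sketch of the reclaim-threshold idea: instead of scheduling the percpu
 * balance work on every page freed, accumulate freed pages and only
 * schedule once a threshold is crossed. Names and values are hypothetical. */
#include <assert.h>

#define RECLAIM_THRESHOLD_PAGES 4   /* hypothetical threshold */

static int to_depopulate_pages;     /* pages freed since last schedule */
static int work_scheduled;          /* stand-in for schedule_work() calls */

static void on_pages_freed(int pages)
{
    to_depopulate_pages += pages;
    if (to_depopulate_pages >= RECLAIM_THRESHOLD_PAGES) {
        work_scheduled++;           /* would be schedule_work() in-kernel */
        to_depopulate_pages = 0;
    }
}
```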

This series is on top of percpu$for-5.14 67c2669d69fb.

diffstats below:

Dennis Zhou (2):
percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
percpu: use reclaim threshold instead of running for every page

Roman Gushchin (2):
percpu: factor out pcpu_check_block_hint()
percpu: implement partial chunk depopulation

   mm/percpu-internal.h |   5 +
   mm/percpu-km.c   |   5 +
   mm/percpu-stats.c|  20 ++--
   mm/percpu-vm.c   |  30 ++
   mm/percpu.c  | 252 ++-
   5 files changed, 278 insertions(+), 34 deletions(-)

Thanks,
Dennis

Hello Pratik,

Do you mind testing this series again on POWER9? The base is available
here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14

Thanks,
Dennis

Hello Dennis, I have tested this patchset on POWER9.

I have tried variations of the percpu_test with top-level and nested cgroup
creation, as the test with 1000:10 didn't show any benefits.

This is most likely because the 1 in every 11 still pins every page
while 1 in 50 does not. Can you try the patch below on top? I think it
may show slightly better perf as well. If it doesn't I'll just drop it.


I did try it out, although my test only varied the number of inner cgroup
directories; it didn't seem to show any benefit over the previous test without
the patch, in terms of being able to spawn fewer memory cgroup directories and
still see partial memory depopulation.


The following example shows more consistent benefits with the de-allocation
strategy.
Outer: 1000
Inner: 50
# ./percpu_test.sh
Percpu: 6912 kB
Percpu:   532736 kB
Percpu:   278784 kB

I believe it could be a result of bulk freeing within "free_unref_page_commit",
where pages are only freed if pcp->count >= pcp->high. As POWER has a larger
page size, it would end up creating a smaller number of pages, but with the
effects of fragmentation.

This is unrelated to per cpu pages in slab/slub. Percpu is a separate
memory allocator.


You're right, I was actually referencing incorrect code.


Having said that, the patchset and its behavior does look good to me.

Thanks, can I throw the following on the appropriate patches? In the
future it's good to be explicit about this because some prefer to credit
different emails.

Tested-by: Pratik Sampat 


Sure thing, please feel free to add a Tested-by wherever you feel appropriate.
I'll be more explicit about them in the future. Thanks!


Thanks,
Dennis

The following may do a little better on power9:
---
 From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
From: Dennis Zhou 
Date: Tue, 20 Apr 2021 14:25:20 +
Subject: [PATCH] percpu: convert free page float to bytes

The percpu memory allocator keeps around a minimum number of free pages
to ensure we can satisfy atomic allocations. However, we've always kept
this number in terms of pages. On certain architectures like arm and
powerpc, the default page size could be 64k instead of 4k. So, start
with a target number of free bytes and then convert to pages.

Signed-off-by: Dennis Zhou 
---
  mm/percpu.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index ba13e683d022..287fe3091244 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -80,6 +80,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -107,11 +108,12 @@
  /* chunks in slots below this are subject to being sidelined on failed alloc */
  #define PCPU_SLOT_FAIL_THRESHOLD
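The idea in the (truncated) patch above can be sketched in plain userspace C: express the free-page reserve as a byte target and derive the page count from the page size, so 4k and 64k page architectures keep a comparable byte reserve. The constant and function name below are hypothetical, not the kernel's.

```c
/* Illustrative sketch of the bytes->pages conversion: start from a target
 * number of free bytes and convert to pages, clamped to at least one page.
 * PCPU_EMPTY_POP_BYTES is a made-up value for illustration. */
#include <assert.h>

#define PCPU_EMPTY_POP_BYTES (16 * 1024)   /* hypothetical byte target */

static int empty_pop_pages_target(long page_size)
{
    long pages = PCPU_EMPTY_POP_BYTES / page_size;
    return pages >= 1 ? (int)pages : 1;    /* keep at least one page */
}
```

With a 4k page size this yields a 4-page reserve; with a 64k page size it clamps to a single page instead of keeping 4 oversized pages populated.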


Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-17 Thread Pratik Sampat




On 17/04/21 3:17 am, Dennis Zhou wrote:

Hello,

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:


On 17/04/21 12:39 am, Roman Gushchin wrote:

On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:

On 17/04/21 12:04 am, Roman Gushchin wrote:

On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:

On 16/04/21 10:43 pm, Roman Gushchin wrote:

On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:

Hello Dennis,

I apologize for the clutter of logs before; I'm pasting the logs from before
and after the percpu test, with the patchset applied on 5.12-rc6 and on the
vanilla 5.12-rc6 kernel.

On 16/04/21 7:48 pm, Dennis Zhou wrote:

Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:
Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 1952 kB
Percpu:   219648 kB
Percpu:   219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 2080 kB
Percpu:   219712 kB
Percpu:    72672 kB

I'm able to see an improvement comparable to what you're seeing too.

However, on POWERPC I'm unable to reproduce these improvements with the
patchset in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 5888 kB
Percpu:   118272 kB
Percpu:   118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 6144 kB
Percpu:   119040 kB
Percpu:   119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?


There shouldn't be. Can you send me the percpu_stats debug output before
and after?

I'll paste the whole debug stats before and after here.
5.12-rc6 + patchset
-BEFORE-
Percpu Memory Statistics
Allocation Info:

Hm, this looks highly suspicious. Here is your stats in a more compact form:

Vanilla

nr_alloc: 9038 nr_alloc:97046
nr_dealloc  : 6992 nr_dealloc  :94237
nr_cur_alloc: 2046 nr_cur_alloc: 2809
nr_max_alloc: 2178 nr_max_alloc:90054
nr_chunks   :3 nr_chunks   :   11
nr_max_chunks   :3 nr_max_chunks   :   47
min_alloc_size  :4 min_alloc_size  :4
max_alloc_size  : 1072 max_alloc_size  : 1072
empty_pop_pages :5 empty_pop_pages :   29


Patched

nr_alloc: 9040 nr_alloc:97048
nr_dealloc  : 6994 nr_dealloc  :95002
nr_cur_alloc: 2046 nr_cur_alloc: 2046
nr_max_alloc: 2208 nr_max_alloc:90054
nr_chunks   :3 nr_chunks   :   48
nr_max_chunks   :3 nr_max_chunks   :   48
min_alloc_size  :4 min_alloc_size  :4
max_alloc_size  : 1072 max_alloc_size  : 1072
empty_pop_pages :   12 empty_pop_pages :   61


So it looks like the number of chunks got bigger, as well as the number of
empty_pop_pages? This contradicts what you wrote, so can you please make
sure that the data is correct and we're not mixing up two cases?

So it looks like for some reason sidelined (depopulated) chunks are not getting
freed completely. But I struggle to explain why the initial empty_pop_pages is
bigger with the same number of chunks.

So, can you please apply the following patch and provide updated statistics?

Unfortunately, I'm not completely well versed in this area, but yes the empty
pop pages number doesn't make sense to me either.

I re-ran the numbers trying to make sure my experiment setup is sane but
results remain the same.

Vanilla
nr_alloc: 9040 nr_alloc:97048
nr_dealloc  : 6994 nr_dealloc  :94404
nr_cur_alloc: 2046 nr_cur_alloc: 2644
nr_max_alloc: 2169 nr_max_alloc:90054
nr_chunks   :3 nr_chunks   :   10
nr_max_chunks   :3 nr_max_chunks   :   47
min_alloc_size  :4 min_alloc_size  :4
max_alloc_size  : 1072 max_alloc_size  : 1072
empty_pop_pages :4 empty_pop_pages :   32

With the patchset + debug patch the results are as follows:
Patched

nr_alloc: 9040 nr_alloc




Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Pratik Sampat

Hello Dennis,

I apologize for the clutter of logs before; I'm pasting the logs from before
and after the percpu test, with the patchset applied on 5.12-rc6 and on the
vanilla 5.12-rc6 kernel.

On 16/04/21 7:48 pm, Dennis Zhou wrote:

Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:
Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 1952 kB
Percpu:   219648 kB
Percpu:   219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 2080 kB
Percpu:   219712 kB
Percpu:    72672 kB

I'm able to see an improvement comparable to what you're seeing too.

However, on POWERPC I'm unable to reproduce these improvements with the
patchset in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 5888 kB
Percpu:   118272 kB
Percpu:   118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 6144 kB
Percpu:   119040 kB
Percpu:   119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?


There shouldn't be. Can you send me the percpu_stats debug output before
and after?


I'll paste the whole debug stats before and after here.
5.12-rc6 + patchset
-BEFORE-
Percpu Memory Statistics
Allocation Info:

  unit_size   :   655360
  static_size :   608920
  reserved_size   :0
  dyn_size:46440
  atom_size   :65536
  alloc_size  :   655360

Global Stats:

  nr_alloc: 9040
  nr_dealloc  : 6994
  nr_cur_alloc: 2046
  nr_max_alloc: 2208
  nr_chunks   :3
  nr_max_chunks   :3
  min_alloc_size  :4
  max_alloc_size  : 1072
  empty_pop_pages :   12

Per Chunk Stats:

Chunk: <- First Chunk
  nr_alloc:  859
  max_alloc_size  : 1072
  empty_pop_pages :0
  first_bit   :16384
  free_bytes  :0
  contig_bytes:0
  sum_frag:0
  max_frag:0
  cur_min_alloc   :4
  cur_med_alloc   :8
  cur_max_alloc   : 1072
  memcg_aware :0

Chunk:
  nr_alloc:  827
  max_alloc_size  :  992
  empty_pop_pages :8
  first_bit   :  692
  free_bytes  :   645012
  contig_bytes:   460096
  sum_frag:   466420
  max_frag:   460096
  cur_min_alloc   :4
  cur_med_alloc   :8
  cur_max_alloc   :  152
  memcg_aware :0

Chunk:
  nr_alloc:  360
  max_alloc_size  : 1072
  empty_pop_pages :4
  first_bit   :29207
  free_bytes  :   506640
  contig_bytes:   506556
  sum_frag:   84
  max_frag:   32
  cur_min_alloc   :4
  cur_med_alloc   :  156
  cur_max_alloc   : 1072
  memcg_aware :1

-AFTER-
Percpu Memory Statistics
Allocation Info:

  unit_size   :   655360
  static_size :   608920
  reserved_size   :0
  dyn_size:46440
  atom_size   :65536
  alloc_size  :   655360

Global Stats:

  nr_alloc:97048
  nr_dealloc  :95002
  nr_cur_alloc: 2046
  nr_max_alloc:90054
  nr_chunks   :   48
  nr_max_chunks   :   48
  min_alloc_size  :4
  max_alloc_size  : 1072
  empty_pop_pages :   61

Per Chunk Stats:

Chunk: <- First Chunk
  nr_alloc:  859
  max_alloc_size  : 1072
  empty_pop_pages :0
  first_bit   :16384
  free_bytes  :0
  contig_bytes:0
  sum_frag:0
  max_frag:0
  cur_min_alloc   :4
  cur_med_alloc   :8
  cur_max_alloc   : 1072
  memcg_aware :0

Chunk:
  nr_alloc:  827
  max_alloc_size  : 1072
  empty_pop_pages :8
  first_bit   :  692
  free_bytes  :   

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Pratik Sampat

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:
Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 1952 kB
Percpu:   219648 kB
Percpu:   219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 2080 kB
Percpu:   219712 kB
Percpu:    72672 kB

I'm able to see an improvement comparable to what you're seeing too.

However, on POWERPC I'm unable to reproduce these improvements with the
patchset in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu: 5888 kB
Percpu:   118272 kB
Percpu:   118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu: 6144 kB
Percpu:   119040 kB
Percpu:   119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?

I will also look through the code to find the reason why POWER isn't
depopulating pages.

Thank you,
Pratik

On 08/04/21 9:27 am, Roman Gushchin wrote:

In our production experience the percpu memory allocator is sometimes struggling
with returning the memory to the system. A typical example is a creation of
several thousand memory cgroups (each has several chunks of percpu data
used for vmstats, vmevents, ref counters, etc.). Deletion and complete releasing
of these cgroups doesn't always lead to a shrinkage of the percpu memory,
so that sometimes there are several GB's of memory wasted.

The underlying problem is the fragmentation: to release an underlying chunk
all percpu allocations should be released first. The percpu allocator tends
to top up chunks to improve the utilization. It means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.
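The pinning effect described above can be sketched in a few lines of userspace C. The `struct chunk` and helper names below are hypothetical, purely to illustrate why one surviving allocation keeps a whole chunk's pages resident.

```c
/* Illustrative sketch: a chunk can only be released when all of its
 * allocations are gone, so a single surviving small allocation (e.g. a
 * percpu ref counter) pins the whole chunk and its pages in memory.
 * Structures and names here are hypothetical, not the kernel's. */
#include <assert.h>
#include <stdbool.h>

struct chunk {
    int nr_alloc;   /* live allocations in this chunk */
    int nr_pages;   /* pages backing this chunk */
};

static bool chunk_releasable(const struct chunk *c)
{
    return c->nr_alloc == 0;
}

/* total pages that stay resident across a set of chunks */
static int pinned_pages(const struct chunk *chunks, int n)
{
    int pages = 0;
    for (int i = 0; i < n; i++)
        if (!chunk_releasable(&chunks[i]))
            pages += chunks[i].nr_pages;
    return pages;
}
```

A chunk with 500 allocations and a chunk with a single leftover allocation pin exactly the same number of pages, which is why partial depopulation (returning the empty pages without waiting for the chunk to empty) helps.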

This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
 mkdir percpu_test/cg_"${i}"
 for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
 done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
 for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
 done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
 rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
   vanilla:
 ./percpu_test.sh
 Percpu: 7488 kB
 Percpu:   481152 kB
 Percpu:   481152 kB

   with this patchset applied:
 ./percpu_test.sh
 Percpu: 7488 kB
 Percpu:   481408 kB
 Percpu:   135552 kB

So the total size of the percpu memory was reduced by more than 3.5 times.

v3:
   - introduced pcpu_check_chunk_hint()
   - fixed a bug related to the hint check
   - minor cosmetic changes
   - s/pretends/fixes (cc Vlastimil)

v2:
   - depopulated chunks are sidelined
   - depopulation happens in the reverse order
   - depopulate list made per-chunk type
   - better results due to better heuristics

v1:
   - depopulation heuristics changed and optimized
   - chunks are put into a separate list, depopulation scan this list
   - chunk->isolated is introduced, chunk->depopulate is dropped
   - rearranged patches a bit
   - fixed a panic discovered by krobot
   - made pcpu_nr_empty_pop_pages per chunk type
   - minor fixes

rfc:
   https://lwn.net/Articles/850508/


Roman Gushchin (6):
   percpu: fix a comment about the chunks ordering
   percpu: split __pcpu_balance_workfn()
   percpu: make pcpu_nr_empty_pop_pages per chunk type
   percpu: generalize pcpu_balance_populated()
   percpu: factor out pcpu_check_chunk_hint()
   percpu: implement partial chunk depopulation

  mm/percpu-internal.h |   4 +-
  mm/percpu-stats.c|   9 +-
  mm/percpu.c  | 306 +++
  3 files changed, 261 insertions(+), 58 deletions(-)





Re: [RFC v3 0/2] CPU-Idle latency selftest framework

2021-04-09 Thread Pratik Sampat

Hello Doug,

On 09/04/21 10:53 am, Doug Smythies wrote:

Hi Pratik,

I tried V3 on a Intel i5-10600K processor with 6 cores and 12 CPUs.
The core to cpu mappings are:
core 0 has cpus 0 and 6
core 1 has cpus 1 and 7
core 2 has cpus 2 and 8
core 3 has cpus 3 and 9
core 4 has cpus 4 and 10
core 5 has cpus 5 and 11

By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
wouldn't it make more sense to test each core once?


Ideally it would be better to run on all the CPUs; however, on the larger systems
that I'm testing on, with hundreds of cores and a high thread count, the
execution time increases without bringing any additional
information to the table.

That is why it made sense to run on only one of the threads of each core, making
the experiment faster while preserving accuracy.

To handle various thread topologies it may be worthwhile to parse
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
use this information to run only once per physical core, rather than
assuming the topology.

What are your thoughts on a mechanism like this?
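A sketch of that parsing (Python for illustration; it handles both the "a,b" and "a-b" cpulist forms that sysfs emits, and the sysfs path is the one named above):

```python
def parse_cpu_list(s):
    """Parse a sysfs cpulist such as '0,6' or '0-3' into a set of CPUs."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def primary_threads(siblings_by_cpu):
    """Pick one representative (lowest-numbered) CPU per physical core."""
    return sorted({min(parse_cpu_list(s)) for s in siblings_by_cpu.values()})

def read_topology(cpus):
    """Read thread_siblings_list for each online CPU from sysfs."""
    path = "/sys/devices/system/cpu/cpu{}/topology/thread_siblings_list"
    return {cpu: open(path.format(cpu)).read() for cpu in cpus}

# On the i5-10600K layout described above:
print(primary_threads({0: "0,6", 6: "0,6", 1: "1,7", 7: "1,7",
                       2: "2,8", 8: "2,8"}))  # -> [0, 1, 2]
```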


With the source CPU always 0, I think the results from the destination
CPUs 0 and 6, on core 0, bias the results, at
least in the deeper idle states. They don't make much difference in
the shallow states. Myself, I wouldn't include them in the results.


I agree, CPU0->CPU0 same core interaction is causing a bias. I could omit that
observation while computing the average.

In the verbose mode I'll omit all the threads of CPU0 and in the default
(quick) mode just CPU0's latency can be omitted while computing average.

Thank you,
Pratik


Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0        101
0     1        790
0     2        609
0     3        595
0     4        737
0     5        759
0     6        780
0     7        741
0     8        574
0     9        681
0    10        527
0    11        552
Baseline Avg IPI latency(ns): 620  <-- suggest 656 here
---Enabling state: 0---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0         76
0     1        471
0     2        420
0     3        462
0     4        454
0     5        468
0     6        453
0     7        473
0     8        380
0     9        483
0    10        492
0    11        454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423  <-- suggest 456 here
---Enabling state: 1---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0        112
0     1        866
0     2        663
0     3        851
0     4       1090
0     5       1314
0     6       1941
0     7       1458
0     8        687
0     9        802
0    10       1041
0    11       1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009  <-- suggest 1006 here
---Enabling state: 2---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0         75
0     1      16362
0     2      16785
0     3      19650
0     4      17356
0     5      17606
0     6       2217
0     7      17958
0     8      17332
0     9      16615
0    10      17382
0    11      17423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 14730  <-- suggest 17447 here
---Enabling state: 3---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0        103
0     1      17416
0     2      17961
0     3      16651
0     4      17867
0     5      17726
0     6       2178
0     7      16620
0     8      20951
0     9      16567
0    10      17131
0    11      17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894  <-- suggest 17645 here
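For reference, the suggested averages are what you get by dropping the destination CPUs that share core 0 with the source (CPUs 0 and 6) before averaging, e.g. for the baseline run above:

```python
def avg_excluding(latencies_ns, exclude=frozenset({0, 6})):
    """Average IPI latency over destination CPUs, skipping the given set."""
    vals = [ns for cpu, ns in latencies_ns.items() if cpu not in exclude]
    return round(sum(vals) / len(vals))

# Baseline measurements from the run above (dest CPU -> latency in ns).
baseline = {0: 101, 1: 790, 2: 609, 3: 595, 4: 737, 5: 759,
            6: 780, 7: 741, 8: 574, 9: 681, 10: 527, 11: 552}
print(avg_excluding(baseline, set()))  # -> 620, average over all 12 CPUs
print(avg_excluding(baseline))         # -> 656, with CPUs 0 and 6 dropped
```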

Hope this helps.

... Doug




Re: [RFC v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2021-04-04 Thread Pratik Sampat

Hello Doug,

Thanks for your review.

On 02/04/21 4:57 am, Doug Smythies wrote:

Hi Pratik,

On Thu, Apr 1, 2021 at 4:45 AM Pratik Rajesh Sampat
 wrote:
...


To run this test specifically:
$ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

I have not become any smarter than I was with version 1,
and still assumed that the "$" meant regular user.
Please put it as "#" or separate the two steps, compile and run.


Apologies, I missed incorporating the root user discussion we had.
I'll add a sudo after the "$" symbol.


There are a few optinal arguments too that the script can take

optional

Suggest to also specifically mention how to run without re-compile,

# ./cpuidle.sh -v


Sure thing, I'll add a comment specifying this.


Note also that the test still leaves all idle states disabled when done.


Yes, I missed out enabling all the idle states after the tests are done.
I'll spin a new version where I enable idle states at the end of the experiment
so that the system stays coherent.


 [-h ]
 [-i ]
 [-m ]
 [-o ]
 [-v  (run on all cpus)]
Default Output location in: tools/testing/selftest/cpuidle/cpuidle.log

...


+cpu_is_online()
+{
+   cpu=$1
+   if [ ! -f "/sys/devices/system/cpu/cpu$cpu/online" ]; then
+   echo 0

incorrect. should be:


+   echo 1


Right! Thanks for catching this.


... Doug


Thank you,
Pratik



Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2021-03-22 Thread Pratik Sampat

Hi Doug,

On 20/03/21 8:34 pm, Doug Smythies wrote:

On Wed, Mar 17, 2021 at 11:44 PM Pratik Sampat  wrote:

Hi Doug,
Thanks for trying these patches out.

On 18/03/21 2:30 am, Doug Smythies wrote:

Hi Pratik,

It just so happens that I have been trying Artem's version this last
week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat
 wrote:
...

...

Other notes:

No idle state for CPU 0 ever gets disabled.
I assume this is because CPU 0 can never be offline,
so that bit of code (Disable all stop states) doesn't find its state.
By the way, processor = Intel i5-9600K

I had tried these patches on an IBM POWER 9 processor and disabling CPU0's idle
state works there. However, it does make sense for some processors to treat CPU
0 differently.
Maybe I could add a case where, if idle state disabling fails for a CPU, we
just skip it?

I didn't try it, I just did a hack so I could continue for this reply.


Sure. In a subsequent version I could write something to cleanly handle online
fail checks, maybe even specifically for CPU0.


The system is left with all idle states disabled, well not for CPU 0
as per the above comment. The suggestion is to restore them,
otherwise my processor hogs 42 watts instead of 2.

My results are highly variable per test.

Question: Do you notice high variability with IPI test, Timer test or both?

The IPI test has less variability than the Timer test.


I can think of two reasons for high run to run variance:

1. If you observe variance in timer tests, then I believe there could be a
mechanism of "C-state pre-wake" on some Intel machines at play here, which can
pre-wake a CPU from an idle state when timers are armed. I'm not sure if the
Intel platform that you're running on does that or not.

Artem had described this behavior to me a while ago and I think his wult page
describes this behavior in more detail:
https://intel.github.io/wult/#c-state-pre-wake

Yes, I have reviewed all the references.
And yes, I think my processors have the pre-wake stuff.

I do not have the proper hardware to do the Artem pre-wake workaround
method, but might buy it in future.


That explains the variability that we are seeing in the Timer tests on the
Intel processor you've tried on.
Also based on the data pasted below, it means that the IPI tests are more
reliable than Timers.

Maybe it would be better to not run the Timer test on Intel platforms that
support this pre-wakeup feature?
However, I don't know how (or if) Intel exposes this information to the
userspace and if other platforms like AMD also have this feature in some form.

Another way of solving this problem could be to have the timer test as an
optional parameter in the selftest for people to use while also printing a
disclaimer for x86 users about the potential hardware design?

--
Thanks
Pratik


2. I have noticed variability in results when there are kernel book-keeping or
jitter tasks scheduled from time to time on an otherwise idle core.
In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log can you
spot any obvious outliers per-CPU state?

Yes.
I'll just paste in an example cpuidle.log file having used the -v option
below, along with my hack job diff.

doug@s19:~/temp-k-git/linux/tools/testing/selftests/cpuidle$ cat
cpuidle.log.v3-1
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0        140
0     1        632
0     2        675
0     3        671
0     4        675
0     5        767
0     6        653
0     7        826
0     8        819
0     9        615
0    10        758
0    11        758
Baseline Avg IPI latency(ns): 665
---Enabling state: 0---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0         76
0     1        484
0     2        494
0     3        539
0     4        498
0     5        491
0     6        474
0     7        434
0     8        544
0     9        476
0    10        447
0    11        467
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 452
---Enabling state: 1---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0         72
0     1       1081
0     2        821
0     3       1486
0     4       1022
0     5        960
0     6       1634
0     7        933
0     8       1032
0     9       1046
0    10       1430
0    11       1338
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1071
---Enabling state: 2---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0     0        264
0     1      30836
0     2      30562
0     3      30748
0     4      35286
0     5      30978
0     6       1952
07 

Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2021-03-18 Thread Pratik Sampat

Hi Doug,
Thanks for trying these patches out.

On 18/03/21 2:30 am, Doug Smythies wrote:

Hi Pratik,

It just so happens that I have been trying Artem's version this last
week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat
 wrote:
...

To run this test specifically:
$ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

While I suppose it should have been obvious, I interpreted
the "$" sign to mean I could run as a regular user, which I can not.


Ah yes, this does need root privileges, I should have prefixed the command with
sudo in the instructions for better understanding.


There are a few optinal arguments too that the script can take
 [-h ]
 [-m ]
 [-o ]
 [-v  (run on all cpus)]
Default Output location in: tools/testing/cpuidle/cpuidle.log

Isn't it:

tools/testing/selftests/cpuidle/cpuidle.log


My bad, It was a typing error. I missed the "selftest" directory while
typing it out.


? At least, that is where my file was.

Other notes:

No idle state for CPU 0 ever gets disabled.
I assume this is because CPU 0 can never be offline,
so that bit of code (Disable all stop states) doesn't find its state.
By the way, processor = Intel i5-9600K


I had tried these patches on an IBM POWER 9 processor and disabling CPU0's idle
state works there. However, it does make sense for some processors to treat CPU
0 differently.
Maybe I could add a case where, if idle state disabling fails for a CPU, we
just skip it?


The system is left with all idle states disabled, well not for CPU 0
as per the above comment. The suggestion is to restore them,
otherwise my processor hogs 42 watts instead of 2.

My results are highly variable per test.


Question: Do you notice high variability with IPI test, Timer test or both?

I can think of two reasons for high run to run variance:

1. If you observe variance in timer tests, then I believe there could be a
mechanism of "C-state pre-wake" on some Intel machines at play here, which can
pre-wake a CPU from an idle state when timers are armed. I'm not sure if the
Intel platform that you're running on does that or not.

Artem had described this behavior to me a while ago and I think his wult page
describes this behavior in more detail:
https://intel.github.io/wult/#c-state-pre-wake

2. I have noticed variability in results when there are kernel book-keeping or
jitter tasks scheduled from time to time on an otherwise idle core.
In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log can you
spot any obvious outliers per-CPU state?

Also you may want to run the test in verbose mode which runs for all the
threads of each CPU with the following: "sudo ./cpuidle.sh -v". While latency
mostly matters for per-core basis but it still maybe a good idea to see if
that changes anything with the observations.

--
Thanks and regards,
Pratik


My system is very idle:
Example (from turbostat at 6 seconds sample rate):
Busy%   Bzy_MHz IRQ PkgTmp  PkgWatt RAMWatt
0.03    4600    153    28    2.03    1.89
0.01    4600    103    29    2.03    1.89
0.05    4600    115    29    2.08    1.89
0.01    4600     95    28    2.09    1.89
0.03    4600    114    28    2.11    1.89
0.01    4600    107    29    2.07    1.89
0.02    4600    102    29    2.11    1.89

...

... Doug




Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement

2020-09-14 Thread Pratik Sampat




On 03/09/20 8:20 pm, Artem Bityutskiy wrote:

On Thu, 2020-09-03 at 17:30 +0530, Pratik Sampat wrote:

I certainly did not know about that the Intel architecture being aware
of timers and pre-wakes the CPUs which makes the timer experiment
observations void.

Well, things depend on the platform, it is not really "void", it is just
different and it measures an optimized case. The result may be smaller
observed latency. And things depend on the platform.


Of course, this will be for just software observability and hardware
can be more complex with each architecture behaving differently.


However, we are also collecting a baseline measurement wherein we run
the same test on a 100% busy CPU and the measurement of latency from
that could be considered to be the kernel-userspace overhead.
The rest of the measurements would be considered keeping this baseline
in mind.

Yes, this should give the idea of the overhead, but still, at least for
many Intel platforms I would not be comfortable using the resulting
number (measured latency - baseline) for a cpuidle driver, because
there are just too many variables there. I am not sure I could assume
the baseline measured this way is an invariant - it could be noticeably
different depending on whether you use C-states or not.


At least on Intel platforms, this will mean that the IPI method won't
cover deep C-states like, say, PC6, because one CPU is busy. Again, not
saying this is not interesting, just pointing out the limitation.

That's a valid point. We have similar deep idle states in POWER too.
The idea here is that this test should be run on an already idle
system, of course there will be kernel jitters along the way
which can cause a little skewness in observations across some CPUs, but I
believe the observations overall should be stable.

If baseline and cpuidle latency are numbers of the same order of magnitude,
and you are measuring in a controlled lab system, may be yes. But if
baseline is, say, in milliseconds, and you are measuring a 10
microseconds C-state, then probably no.


This makes complete sense. The magnitude of deviations being greater
than the scope of the experiment may not be very useful in quantifying
the latency metric.

One way to minimize the baseline overhead is to make this a kernel
module https://lkml.org/lkml/2020/7/21/567. However, the overhead is
unavoidable but definitely can be further minimized by using an external
approach suggested by you in your LPC talk.


Another solution to this could be using isolcpus, but that just
increases the complexity all the more.
If you have any suggestions of any other way that could guarantee
idleness that would be great.

Well, I did not try to guarantee idleness. I just use timers and
external device (the network card), so no CPUs needs to be busy and the
system can enter deep C-states. Then I just look at median, 99%-th
percentile, etc.
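(Getting the median and 99th percentile out of a latency sample, as described, is straightforward with the Python standard library; a sketch:)

```python
import statistics

def summarize(latencies_ns):
    """Median and ~99th percentile of a wakeup-latency sample."""
    cuts = statistics.quantiles(latencies_ns, n=100)  # 99 cut points
    return {"median": statistics.median(latencies_ns), "p99": cuts[98]}
```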

But by all means IPI is also a very interesting experiment. Just covers
a different usage scenario.

When I started experimenting in this area, one of my main early
takeaways was the realization that C-state latency really depends on the
event source.


That is an interesting observation. On POWER systems, where we don't
have timer-related wakeup optimizations, the readings from this test do
show a difference between the latency of an IPI wakeup and the latency
gathered after a timer interrupt.

However, these timer-based variations weren't as prominent on my Intel-based
ThinkPad T480, which is in line with your observations.

This discussions does help!
Although this approach may not help quantify latency deviations at a
hardware-accurate level, it could still be helpful in quantifying this
metric from a software observability point of view.


Thanks!
Pratik



Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement

2020-09-03 Thread Pratik Sampat

Hello Artem,

On 02/09/20 8:55 pm, Artem Bityutskiy wrote:

On Wed, 2020-09-02 at 17:15 +0530, Pratik Rajesh Sampat wrote:

Measure cpuidle latencies on wakeup to determine and compare with the
advertised wakeup latencies for each idle state.


Thank you for pointing me to your talk. It was very interesting!
I certainly did not know about that the Intel architecture being aware
of timers and pre-wakes the CPUs which makes the timer experiment
observations void.


It looks like the measurements include more than just C-state wake,
they also include the overhead of waking up the process, the context switch,
and potentially any interrupts that happen on that CPU. I am not saying
this is not interesting data, it surely is, but it is going to be
larger than you see in cpuidle latency tables. Potentially
significantly larger.


The measurements will definitely include overhead than just the C-State
wakeup.

However, we are also collecting a baseline measurement wherein we run
the same test on a 100% busy CPU and the measurement of latency from
that could be considered to be the kernel-userspace overhead.
The rest of the measurements would be considered keeping this baseline
in mind.


Therefore, I am not sure this program should be advertised as "cpuidle
measurement". It really measures the "IPI latency" in case of the IPI
method.


Now with the new found knowledge of timers in Intel, I understand that
this really only seems to measure IPI latency and not timer latency,
although both the observations shouldn't be too far off anyways.


A baseline measurement for each case of IPI and timers is taken at
100 percent CPU usage to quantify the kernel-userspace overhead
during execution.

At least on Intel platforms, this will mean that the IPI method won't
cover deep C-states like, say, PC6, because one CPU is busy. Again, not
saying this is not interesting, just pointing out the limitation.


That's a valid point. We have similar deep idle states in POWER too.
The idea here is that this test should be run on an already idle
system, of course there will be kernel jitters along the way
which can cause a little skewness in observations across some CPUs, but I
believe the observations overall should be stable.

Another solution to this could be using isolcpus, but that just
increases the complexity all the more.
If you have any suggestions of any other way that could guarantee
idleness that would be great.



I was working on a somewhat similar stuff for x86 platforms, and I am
almost ready to publish that on github. I can notify you when I do so
if you are interested. But here is a small presentation of the approach
that I did on Plumbers last year:

https://youtu.be/Opk92aQyvt0?t=8266

(the link points to the start of my talk)


Sure thing. Do notify me when it comes up.
I would be happy to have a look at it.

--
Thanks!
Pratik



Re: [PATCH] Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"

2020-08-26 Thread Pratik Sampat




On 26/08/20 2:07 pm, Christophe Leroy wrote:



On 26/08/2020 at 10:29, Pratik Rajesh Sampat wrote:

Cpuidle stop state implementation has minor optimizations for P10
where hardware preserves more SPR registers compared to P9.
The current P9 driver works for P10, although does few extra
save-restores. P9 driver can provide the required power management
features like SMT thread folding and core level power savings
on a P10 platform.

Until the P10 stop driver is available, revert the commit which
allows for only P9 systems to utilize cpuidle and blocks all
idle stop states for P10.
Cpu idle states are enabled and tested on the P10 platform
with this fix.

This reverts commit 8747bf36f312356f8a295a0c39ff092d65ce75ae.

Fixes: 8747bf36f312 ("powerpc/powernv/idle: Replace CPU feature check 
with PVR check")

Signed-off-by: Pratik Rajesh Sampat 
---
  @mpe: This revert would resolve a staging issue wherein the P10 stop
  driver is not yet ready while cpuidle stop states need not be blocked
  on 5.9 for Power10 systems which could cause SMT folding related
  performance issues.

  The P10 stop driver is in the works here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html

  arch/powerpc/platforms/powernv/idle.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c

index 77513a80cef9..345ab062b21a 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -1223,7 +1223,7 @@ static void __init pnv_probe_idle_states(void)
  return;
  }
  -    if (pvr_version_is(PVR_POWER9))
+    if (cpu_has_feature(CPU_FTR_ARCH_300))


Why not something like:

if (pvr_version_is(PVR_POWER9) || pvr_version_is(PVR_POWER10))
    pnv_power9_idle_init(); 


In order to use PVR_POWER10 I would need to define it under
arch/powerpc/include/asm/reg.h, which is not present in 5.9 yet.

However, if it okay with @mpe I could split out Nick's P10 stop driver
(https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html)
into two parts:
1. This could include minimal code to introduce the P10 PVR and the
stop wrappers for it. Although this patch internally still calls into
the P9 path. This should gracefully fix the issue.
2. Then later in this patch we could introduce the p10 callback
methods as they are in Nick's series.

---
Thanks
Pratik




  pnv_power9_idle_init();
    for (i = 0; i < nr_pnv_idle_states; i++)



Christophe




Re: [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states

2020-07-28 Thread Pratik Sampat

Hello Rafael,


On 27/07/20 7:12 pm, Rafael J. Wysocki wrote:

On Tue, Jul 21, 2020 at 2:43 PM Pratik Rajesh Sampat
 wrote:

Fire directed smp_call_function_single IPIs from a specified source
CPU to the specified target CPU to reduce the noise we have to wade
through in the trace log.

And what's the purpose of it?


The idea for this comes from that fact that estimating wake-up
latencies and residencies for stop states is not an easy task.

The purpose is essentially to determine wakeup latencies, that are
caused by either, an IPI or a timer and compare with the advertised
wakeup latencies for each stop state.

This might help in determining the accuracy of our advertised values
and/or if they need any re-calibration.


The module is based on the idea written by Srivatsa Bhat and maintained
by Vaidyanathan Srinivasan internally.

Queue HR timer and measure jitter. Wakeup latency measurement for idle
states using hrtimer.  Echo a value in ns to timer_test_function and
watch trace. A HRtimer will be queued and when it fires the expected
wakeup vs actual wakeup is computed and the delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be
integrated with selftests.

To include the module, check option and include as module
kernel hacking -> Cpuidle latency selftests

[srivatsa.b...@linux.vnet.ibm.com: Initial implementation in
  cpuidle/sysfs]

[sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
  and fix some of the time calculation]

[e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and
  increase the resolution of IPI wakeup]

Signed-off-by: Pratik Rajesh Sampat 
Reviewed-by: Gautham R. Shenoy 
---
  drivers/cpuidle/Makefile   |   1 +
  drivers/cpuidle/test-cpuidle_latency.c | 150 +
  lib/Kconfig.debug  |  10 ++
  3 files changed, 161 insertions(+)
  create mode 100644 drivers/cpuidle/test-cpuidle_latency.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index f07800cbb43f..2ae05968078c 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
  obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o
  obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o
  obj-$(CONFIG_HALTPOLL_CPUIDLE)   += cpuidle-haltpoll.o
+obj-$(CONFIG_IDLE_LATENCY_SELFTEST)  += test-cpuidle_latency.o

  
##
  # ARM SoC drivers
diff --git a/drivers/cpuidle/test-cpuidle_latency.c 
b/drivers/cpuidle/test-cpuidle_latency.c
new file mode 100644
index ..61574665e972
--- /dev/null
+++ b/drivers/cpuidle/test-cpuidle_latency.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Module-based API test facility for cpuidle latency using IPIs and timers

I'd like to see a more detailed description of what it does and how it
works here.


Right, I'll add that.
Based on comments from Daniel I have also been working on a
user-space only variant of this test as that does seem like
a better way to go.

The only downside is that the latency will be higher, but as we are
taking baseline measurements the diff of that from our observed reading
should still remain the same. Just that the test will take longer to run.
I'm yet to accurately confirm this.

I would appreciate your thoughts on that.


+ */
+
+#include 
+#include 
+#include 
+
+/* IPI based wakeup latencies */
+struct latency {
+   unsigned int src_cpu;
+   unsigned int dest_cpu;
+   ktime_t time_start;
+   ktime_t time_end;
+   u64 latency_ns;
+} ipi_wakeup;
+
+static void measure_latency(void *info)
+{
+   struct latency *v;
+   ktime_t time_diff;
+
+   v = (struct latency *)info;
+   v->time_end = ktime_get();
+   time_diff = ktime_sub(v->time_end, v->time_start);
+   v->latency_ns = ktime_to_ns(time_diff);
+}
+
+void run_smp_call_function_test(unsigned int cpu)
+{
+   ipi_wakeup.src_cpu = smp_processor_id();
+   ipi_wakeup.dest_cpu = cpu;
+   ipi_wakeup.time_start = ktime_get();
+   smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
+}
+
+/* Timer based wakeup latencies */
+struct timer_data {
+   unsigned int src_cpu;
+   u64 timeout;
+   ktime_t time_start;
+   ktime_t time_end;
+   struct hrtimer timer;
+   u64 timeout_diff_ns;
+} timer_wakeup;
+
+static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
+{
+   struct timer_data *w;
+   ktime_t time_diff;
+
+   w = container_of(hrtimer, struct timer_data, timer);
+   w->time_end = ktime_get();
+
+   time_diff = ktime_sub(w->time_end, w->time_start);
+   time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
+   w->timeout_diff_ns = ktime_to_ns(time_diff);
+   return HRTIMER_NORESTART;
+}
+
+static void run_timer_test(unsigned int ns)
+{
+   

Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-24 Thread Pratik Sampat




On 24/07/20 6:55 am, Michael Neuling wrote:

On Fri, 2020-07-10 at 10:52 +0530, Pratik Rajesh Sampat wrote:

Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
stop levels < 4.
Therefore save the values of these SPRs before entering a  "stop"
state and restore their values on wakeup.

Signed-off-by: Pratik Rajesh Sampat 
---
  arch/powerpc/platforms/powernv/idle.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c
b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..f2e2a6a4c274 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
u64 iamr;
u64 amor;
u64 uamor;
+   u64 dawr0;
+   u64 dawrx0;
  };
  
  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)

@@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long
psscr, bool mmu_on)
sprs.iamr   = mfspr(SPRN_IAMR);
sprs.amor   = mfspr(SPRN_AMOR);
sprs.uamor  = mfspr(SPRN_UAMOR);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {


You are actually viewing an old version of the patches
The main point of change were based on comments from Nick Piggin, I
have changed the top level function check from ARCH_300 to a P9 PVR
check instead.

A similar thing needs to be done for P10, however as the P10 PVR isn't
exposed yet, I've shelved this particular patch.

Nick's comment to check based on PVR: https://lkml.org/lkml/2020/7/13/1018
v4 of the series: https://lkml.org/lkml/2020/7/21/784

Thanks for your review,
Pratik


Can you add a comment here saying even though DAWR0 is ARCH_30, it's only
required to be saved on 31. Otherwise this looks pretty odd.


+   sprs.dawr0 = mfspr(SPRN_DAWR0);
+   sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+   }
  
  	srr1 = isa300_idle_stop_mayloss(psscr);		/* go idle */
  
@@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long

psscr, bool mmu_on)
mtspr(SPRN_IAMR,sprs.iamr);
mtspr(SPRN_AMOR,sprs.amor);
mtspr(SPRN_UAMOR,   sprs.uamor);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DAWR0, sprs.dawr0);
+   mtspr(SPRN_DAWRX0, sprs.dawrx0);
+   }
  
  		/*

 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT




Re: [PATCH v3 0/2] Selftest for cpuidle latency measurement

2020-07-22 Thread Pratik Sampat

Hello Daniel,

On 21/07/20 8:27 pm, Daniel Lezcano wrote:

On 21/07/2020 14:42, Pratik Rajesh Sampat wrote:

v2: https://lkml.org/lkml/2020/7/17/369
Changelog v2-->v3
Based on comments from Gautham R. Shenoy adding the following in the
selftest,
1. Grepping modules to determine if already loaded
2. Wrapper to enable/disable states
3. Preventing any operation/test on offlined CPUs
---

The patch series introduces a mechanism to measure wakeup latency for
IPI and timer based interrupts
The motivation behind this series is to find significant deviations
from advertised latency and residency values

Why do you want to measure for the timer and the IPI ? Whatever the
source of the wakeup, the exit latency remains the same, no ?

Is all this kernel-ish code really needed ?

What about using a highres periodic timer and make it expire every e.g.
50ms x 2400, so it is 120 seconds, and measure the deviation. Repeat the
operation for each idle state.

And in order to make it as accurate as possible, set the program
affinity to one CPU and isolate it by preventing other processes from
being scheduled on it, and migrate the interrupts to the other CPUs.

That will be all userspace code, no?



The kernel module may not be needed now that you mention it.
IPI latencies could be measured using pipes and threads, using
pthread_attr_setaffinity_np to control the experiment, as you
suggested. This should internally fire a smp_call_function_single.

The original idea was essentially to measure it as closely as possible
in the kernel without involving the kernel-->userspace overhead.
However, the user-space approach may not be too much of a problem, as
we are collecting a baseline and the delta of the latency is what we
would be concerned about anyway!

With respect to measuring both timers and IPI latencies: In principle
yes, the exit latency should remain the same but if there is a
deviation in reality we may want to measure it.

I'll implement this experiment in the userspace and get back with the
numbers to confirm.

Thanks for your comments!
Best,
Pratik








Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2020-07-21 Thread Pratik Sampat

Hi Gautham, Thanks for the review.


On 20/07/20 11:22 am, Gautham R Shenoy wrote:

Hi Pratik,


On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:

This patch adds support to trace IPI based and timer based wakeup
latency from idle states

Latches onto the test-cpuidle_latency kernel module using the debugfs
interface to send IPIs or schedule a timer based event, which in turn
populates the debugfs with the latency measurements.

Currently, for the IPI and timer tests, all idle states are first
disabled and latency is then measured while incrementally enabling each state

Signed-off-by: Pratik Rajesh Sampat 

A few comments below.


---
  tools/testing/selftests/Makefile   |   1 +
  tools/testing/selftests/cpuidle/Makefile   |   6 +
  tools/testing/selftests/cpuidle/cpuidle.sh | 257 +
  tools/testing/selftests/cpuidle/settings   |   1 +
  4 files changed, 265 insertions(+)
  create mode 100644 tools/testing/selftests/cpuidle/Makefile
  create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
  create mode 100644 tools/testing/selftests/cpuidle/settings


[..skip..]


+
+ins_mod()
+{
+   if [ ! -f "$MODULE" ]; then
+   printf "$MODULE module does not exist. Exitting\n"

If the module has been compiled into the kernel (due to a
localyesconfig, for instance), then it is unlikely that we will find
it in /lib/modules. Perhaps you want to check if the debugfs
directories created by the module exist, and if so, print a message
saying that the module is already loaded or some such?


That's a good idea. I can grep for this module in /proc/modules
and not insert it if it is already there


+   exit $ksft_skip
+   fi
+   printf "Inserting $MODULE module\n\n"
+   insmod $MODULE
+   if [ $? != 0 ]; then
+   printf "Insmod $MODULE failed\n"
+   exit $ksft_skip
+   fi
+}
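A sketch of how the reworked check could look ($MODULE and $ksft_skip as in the patch under review; the /proc/modules name derivation follows the kernel's '-' to '_' mapping):

```shell
# Hypothetical rework of ins_mod: skip insertion when the module is
# already loaded (e.g. built in via localyesconfig or inserted earlier).
MODULE=./test-cpuidle_latency.ko
ksft_skip=4

ins_mod()
{
	# /proc/modules lists names with '-' mapped to '_' and no ".ko"
	mod_name=$(basename "$MODULE" .ko | tr '-' '_')
	if grep -q -w "^$mod_name" /proc/modules; then
		printf "%s already loaded, skipping insmod\n" "$mod_name"
		return 0
	fi
	if [ ! -f "$MODULE" ]; then
		printf "%s module does not exist. Exiting\n" "$MODULE"
		exit "$ksft_skip"
	fi
	printf "Inserting %s module\n\n" "$MODULE"
	insmod "$MODULE" || exit "$ksft_skip"
}
```

This also covers the built-in case Gautham mentions only partially; checking for the module's debugfs directory would catch that too.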
+
+compute_average()
+{
+   arr=("$@")
+   sum=0
+   size=${#arr[@]}
+   for i in "${arr[@]}"
+   do
+   sum=$((sum + i))
+   done
+   avg=$((sum/size))

It would be good to assert that "size" isn't 0 here.


Sure


+}
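With the suggested guard added, the helper might look like this (names follow the patch under review):

```shell
# compute_average with an empty-array guard: bail out before the
# division when no samples were collected.
compute_average()
{
	arr=("$@")
	sum=0
	size=${#arr[@]}
	if [ "$size" -eq 0 ]; then
		printf "compute_average: no samples\n" >&2
		return 1
	fi
	for i in "${arr[@]}"
	do
		sum=$((sum + i))
	done
	avg=$((sum / size))
}
```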
+
+# Disable all stop states
+disable_idle()
+{
+   for ((cpu=0; cpu 
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable

So, on offlined CPUs, we won't see
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You
should probably perform this operation only on online CPUs.


Right. I should perform CPU operations only on online CPUs throughout the script
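A sketch of restricting the sweep to online CPUs (cpu0 typically has no "online" sysfs file and is treated as always online; the helper names are made up):

```shell
# Run a callback for each online CPU only; offline CPUs lack the
# cpuidle/state* directories, so writing there would fail.
for_each_online_cpu()
{
	for cpu_path in /sys/devices/system/cpu/cpu[0-9]*; do
		cpu=${cpu_path##*/cpu}
		if [ -f "$cpu_path/online" ] &&
		   [ "$(cat "$cpu_path/online")" -eq 0 ]; then
			continue	# skip offlined CPUs
		fi
		"$@" "$cpu"
	done
}

disable_all_states()
{
	for state_path in /sys/devices/system/cpu/cpu"$1"/cpuidle/state*; do
		[ -d "$state_path" ] && echo 1 > "$state_path/disable"
	done
}

# usage (the actual writes need root):
# for_each_online_cpu disable_all_states
```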

[..snip..]

Thanks
Pratik



Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

2020-07-21 Thread Pratik Sampat




On 20/07/20 5:27 am, Nicholas Piggin wrote:

Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:

Rename the variable "pnv_first_spr_loss_level" to
"pnv_first_fullstate_loss_level".

pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set; however, as shallow states also lose
SPR values, the current name is incorrect terminology.

It also doesn't lose "full" state at this loss level though. From the
architecture it could be called "hv state loss level", but in POWER10
even that is not strictly true.


Right. Just discovered that deep stop states won't lose full state
from P10 onwards.
Would it be better if we renamed it "pnv_all_spr_loss_state" instead,
so that it stays generic enough while being semantically coherent?

Thanks
Pratik


Signed-off-by: Pratik Rajesh Sampat 
---
  arch/powerpc/platforms/powernv/idle.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index f62904f70fc6..d439e11af101 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -48,7 +48,7 @@ static bool default_stop_found;
   * First stop state levels when SPR and TB loss can occur.
   */
  static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
  
  /*

   * psscr value and mask of the deepest stop idle state.
@@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
  */
mmcr0   = mfspr(SPRN_MMCR0);
}
-   if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
+   if ((psscr & PSSCR_RL_MASK) >= pnv_first_fullstate_loss_level) {
sprs.lpcr   = mfspr(SPRN_LPCR);
sprs.hfscr  = mfspr(SPRN_HFSCR);
sprs.fscr   = mfspr(SPRN_FSCR);
@@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
 * just always test PSSCR for SPR/TB state loss.
 */
pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
-   if (likely(pls < pnv_first_spr_loss_level)) {
+   if (likely(pls < pnv_first_fullstate_loss_level)) {
if (sprs_saved)
atomic_stop_thread_idle();
goto out;
@@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 */
pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-   pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+   pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
for (i = 0; i < nr_pnv_idle_states; i++) {
int err;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
@@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
pnv_first_tb_loss_level = psscr_rl;
  
  		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&

-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
  
  		/*

 * The idle code does not deal with TB loss occurring
@@ -,8 +,8 @@ static void __init pnv_power9_idle_init(void)
 * compatibility.
 */
if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
  
		err = validate_psscr_val_mask(&state->psscr_val,
					      &state->psscr_mask,
@@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
}
  
  	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%llx\n",

-   pnv_first_spr_loss_level);
+   pnv_first_fullstate_loss_level);
  
  	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%llx\n",

pnv_first_tb_loss_level);
--
2.25.4






Re: [PATCH v3 1/3] powerpc/powernv/idle: Replace CPU features checks with PVR checks

2020-07-21 Thread Pratik Sampat




On 20/07/20 5:30 am, Nicholas Piggin wrote:

Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:

As the idle framework's architecture is incomplete, instead of
checking for just the processor type advertised in the device tree CPU
features, check the Processor Version Register (PVR) so that finer
granularity can be leveraged while making processor checks.

Signed-off-by: Pratik Rajesh Sampat 
---
  arch/powerpc/platforms/powernv/idle.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..f62904f70fc6 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -92,7 +92,7 @@ static int pnv_save_sprs_for_deep_states(void)
if (rc != 0)
return rc;
  
-		if (cpu_has_feature(CPU_FTR_ARCH_300)) {

+   if (pvr_version_is(PVR_POWER9)) {
rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
if (rc)
return rc;
@@ -116,7 +116,7 @@ static int pnv_save_sprs_for_deep_states(void)
return rc;
  
  			/* Only p8 needs to set extra HID registers */

-   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (!pvr_version_is(PVR_POWER9)) {
  
  rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);

if (rc != 0)

What I think you should do is keep using CPU_FTR_ARCH_300 for this stuff
which is written for power9 and we know is running on power9, because
that's a faster test (static branch and does not have to read PVR. And
then...


@@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
return;
}
  
-	if (cpu_has_feature(CPU_FTR_ARCH_300))

+   if (pvr_version_is(PVR_POWER9))
pnv_power9_idle_init();
  
  	for (i = 0; i < nr_pnv_idle_states; i++)

Here is where you would put the version check. Once we have code that
can also handle P10 (either by testing CPU_FTR_ARCH_31, or by adding
an entirely new power10 idle function), then you can add the P10 version
check here.


Sure, it makes sense to make this check in the top-level function and
retain CPU_FTR_ARCH_300 lower in the call chain for speed.
I'll make that change.

Thanks
Pratik


Thanks,
Nick





Re: [PATCH v2 0/3] Power10 basic energy management

2020-07-13 Thread Pratik Sampat




On 13/07/20 10:20 pm, Nicholas Piggin wrote:

Excerpts from Pratik Sampat's message of July 13, 2020 8:02 pm:

Thank you for your comments,

On 13/07/20 10:53 am, Nicholas Piggin wrote:

Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:

Changelog v1 --> v2:
1. Save-restore DAWR and DAWRX unconditionally as they are lost in
shallow idle states too
2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
correct naming terminology

Pratik Rajesh Sampat (3):
powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

   arch/powerpc/platforms/powernv/idle.c | 34 +--
   1 file changed, 22 insertions(+), 12 deletions(-)

These look okay to me, but the CPU_FTR_ARCH_300 test for
pnv_power9_idle_init() is actually wrong, it should be a PVR test
because idle is not completely architected (not even shallow stop
states, unfortunately).

It doesn't look like we support POWER10 idle correctly yet, and on older
kernels it wouldn't work even if we fixed newer, so ideally the PVR
check would be backported as a fix in the front of the series.

Sadly, we have no OPAL idle driver yet. Hopefully we will before the
next processor shows up :P

Thanks,
Nick

So if I understand this correctly, in powernv/idle.c where we check for
CPU_FTR_ARCH_300, we should rather be making a pvr_version_is(PVR_POWER9)
check instead?

Of course, the P10 PVR and its relevant checks will have to be added then too.

Yes I think so, unfortunately.

Thanks,
Nick


Sure, I'll add these checks in.

Thanks,
Pratik



Re: [PATCH v2 0/3] Power10 basic energy management

2020-07-13 Thread Pratik Sampat

Thank you for your comments,

On 13/07/20 10:53 am, Nicholas Piggin wrote:

Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:

Changelog v1 --> v2:
1. Save-restore DAWR and DAWRX unconditionally as they are lost in
shallow idle states too
2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
correct naming terminology

Pratik Rajesh Sampat (3):
   powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
   powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
   powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

  arch/powerpc/platforms/powernv/idle.c | 34 +--
  1 file changed, 22 insertions(+), 12 deletions(-)

These look okay to me, but the CPU_FTR_ARCH_300 test for
pnv_power9_idle_init() is actually wrong, it should be a PVR test
because idle is not completely architected (not even shallow stop
states, unfortunately).

It doesn't look like we support POWER10 idle correctly yet, and on older
kernels it wouldn't work even if we fixed newer, so ideally the PVR
check would be backported as a fix in the front of the series.

Sadly, we have no OPAL idle driver yet. Hopefully we will before the
next processor shows up :P

Thanks,
Nick


So if I understand this correctly, in powernv/idle.c where we check for
CPU_FTR_ARCH_300, we should rather be making a pvr_version_is(PVR_POWER9)
check instead?

Of course, the P10 PVR and its relevant checks will have to be added then too.

Thanks
Pratik

 



Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-09 Thread Pratik Sampat




On 09/07/20 2:39 pm, Gautham R Shenoy wrote:

On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:

Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
stop levels < 4.

Adding Ravi Bangoria  to the cc.


Therefore save the values of these SPRs before entering a  "stop"
state and restore their values on wakeup.

Signed-off-by: Pratik Rajesh Sampat 


The saving and restoration looks good to me.

---
  arch/powerpc/platforms/powernv/idle.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..471d4a65b1fa 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
u64 iamr;
u64 amor;
u64 uamor;
+   u64 dawr0;
+   u64 dawrx0;
  };

  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
@@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
sprs.tscr   = mfspr(SPRN_TSCR);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
sprs.ldbar = mfspr(SPRN_LDBAR);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   sprs.dawr0 = mfspr(SPRN_DAWR0);
+   sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+   }



But this is within the if condition which says

if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)

This if condition is meant for stop4 and stop5, since these are the stop
levels that have OPAL_PM_LOSE_HYP_CONTEXT set.

Since we can lose DAWR* on states that lose limited hypervisor
context, such as stop0-2, we need to save them unconditionally,
like AMR, IAMR, etc.


Right, shallow states also lose DAWR/X. Thanks for pointing it out.
I'll fix this and resend.


sprs_saved = true;

@@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
mtspr(SPRN_MMCR2,   sprs.mmcr2);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
mtspr(SPRN_LDBAR, sprs.ldbar);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DAWR0, sprs.dawr0);
+   mtspr(SPRN_DAWRX0, sprs.dawrx0);
+   }


Likewise, we need to unconditionally restore these SPRs.



mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);

--
2.25.4


Thanks
Pratik


Re: [PATCH 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2020-07-07 Thread Pratik Sampat

[..snip..]

+
+ins_mod()
+{
+    if [ ! -f "$MODULE" ]; then
+    printf "$MODULE module does not exist. Exitting\n"
+    exit 2


Please use ksft_skip code to indicate the test is being skipped.


Sure thing, I'll use the ksft_skip exit code instead.

+    fi
+    printf "Inserting $MODULE module\n\n"
+    insmod $MODULE
+    if [ $? != 0 ]; then
+    printf "Insmod $MODULE failed\n"
+    exit 2


This is fine since you expect to be able to load the module.


Thanks for the review.
Pratik

[..snip..]




thanks,
-- Shuah




Re: [RFC 0/1] Alternate history mechanism for the TEO governor

2020-05-21 Thread Pratik Sampat

Hello Doug,

Thanks a lot for running these benchmarks on an Intel box.


On 17/05/20 11:41 pm, Doug Smythies wrote:

On 2020.05.11 Pratik Rajesh Sampat wrote:

First RFC posting: https://lkml.org/lkml/2020/2/22/27

Summary:

On that thread I wrote:

   > I have done a couple of other tests with this patch set,
   > but nothing to report yet, as the differences have been
   > minor so far.

I tried your tests, or as close as I could find, and still
do not notice much difference.


That is quite unfortunate. At least it doesn't seem to regress.

Nevertheless, as Rafael suggested, aging is crucial, and this patch doesn't
age the weights. I do have a version with aging, but it had a lot of
run-to-run variance, so I had refrained from posting it.
I'm tweaking the logic for aging as well as the distribution of weights;
hopefully that may help.


For detail, but likely little added value, read on:

Kernel: 5.7-rc4:
"teo": unmodified kernel.
"wtteo": with this patch added.
"menu": the menu idle governor, for comparison.
CPU frequency scaling driver: intel-cpufreq
CPU frequency scaling governor: schedutil
CPU idle driver: intel_idle

...


Benchmarks:
Schbench

Benchmarks scheduler wakeup latencies

1. latency 99th percentile - usec

I found a Phoronix schbench test.
It defaults to 99.9th percentile.

schbench (usec, 99.9th Latency Percentile, less is better)(8 workers)   


threads teo wtteo   menu
2   14197   14194   99.98%  14467   101.90%
4   46733   46490   99.48%  46554   99.62%
6   57306   58291   101.72% 57754   100.78%
8   81408   80768   99.21%  81715   100.38%
16  157286  156570  99.54%  156621  99.58%
32  314573  310784  98.80%  315802  100.39%

Powers and other idle statistics were similar. [1]


2. Power - watts
Machine - IBM Power 9

Latency and Power - Normalized
+-+--+-+---+
| Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
+-+--+-+---+
| 2   | 100  | 101.3   | 85.29 |
+-+--+-+---+
| 4   | 100  | 105.06  | 113.63|
+-+--+-+---+
| 8   | 100  | 92.32   | 90.36 |
+-+--+-+---+
| 16  | 100  | 99.1| 92.43 |
+-+--+-+---+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+-+--+--+--+---+---+---+-+
| Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
+-+--+--+--+---+---+---+-+
| 2   | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
+-+--+--+--+---+---+---+-+
| 4   | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
+-+--+--+--+---+---+---+-+
| 8   | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
+-+--+--+--+---+---+---+-+
| 16  | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
+-+--+--+--+---+---+---+-+
+-+--+--+--+
| Threads | OS 1 | OS 2 | OS 3 |
+-+--+--+--+
| 2   | 1.77 | 1.27 | 7.14 |
+-+--+--+--+
| 4   | 1.8  | 1.31 | 6.71 |
+-+--+--+--+
| 8   | 0.65 | 0.72 | 3.2  |
+-+--+--+--+
| 16  | 0.63 | 1.71 | 3.68 |
+-+--+--+--+

Weighted TEO Governor - Prediction distribution %
+-+--+--+--+---+---+---+-+
| Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
+-+--+--+--+---+---+---+-+
| 2   | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
+-+--+--+--+---+---+---+-+
| 4   | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
+-+--+--+--+---+---+---+-+
| 8   | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9|
+-+--+--+--+---+---+---+-+
| 16  | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
+-+--+--+--+---+---+---+-+
+-+--+--+--+
| Threads | OS 1 | OS 2 | OS 3 |
+-+--+--+--+
| 2   | 0.39 | 0.42 | 2.31 |
+-+--+--+--+
| 4   | 0.45 | 0.51 | 2.89 |
+-+--+--+--+
| 8   | 0.53 | 0.66 | 3.06 |
+-+--+--+--+
| 16  | 0.97 | 1.9  | 3.05 |

Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor

2020-05-14 Thread Pratik Sampat




On 13/05/20 8:19 pm, Rafael J. Wysocki wrote:

On Wed, May 13, 2020 at 7:31 AM Pratik Sampat  wrote:

Thanks for your comment.


On 12/05/20 11:07 pm, Peter Zijlstra wrote:

Just a quick note..

On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:


+/*
+ * Rearrange the weight distribution of the state, increase the weight
+ * by the LEARNING RATE % for the idle state that was supposed to be
+ * chosen and reduce by the same amount for rest of the states
+ *
+ * If the weights are greater than (100 - LEARNING_RATE) % or lesser
+ * than LEARNING_RATE %, do not increase or decrease the confidence
+ * respectively
+ */
+for (i = 0; i < drv->state_count; i++) {
+unsigned int delta;
+
+if (idx == -1)
+break;
+if (i ==  idx) {
+delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;

100 is a crap number to divide by as a computer. We bio-puddings happen
to have 10 digits, so 100 makes sense to us, but it does not to our
binary friends.



Absolutely! I just wrote the code exactly the way I did the Math on paper,
definitely need to figure out an optimal way of doing things.

There is no particular reason to use percent in computations at all.
You may as well use 1/1024 parts instead (and then use shifts instead
of divisions).


Yes, you're right. Looking at it now, the whole percent system and the
divisions do seem quite unnecessary, and we can achieve the same rather
with bitwise operations.

Thanks!



Re: [RFC 1/1] Weighted approach to gather and use history in TEO governor

2020-05-12 Thread Pratik Sampat

Thanks for your comment.


On 12/05/20 11:07 pm, Peter Zijlstra wrote:

Just a quick note..

On Mon, May 11, 2020 at 07:40:55PM +0530, Pratik Rajesh Sampat wrote:


+   /*
+* Rearrange the weight distribution of the state, increase the weight
+* by the LEARNING RATE % for the idle state that was supposed to be
+* chosen and reduce by the same amount for rest of the states
+*
+* If the weights are greater than (100 - LEARNING_RATE) % or lesser
+* than LEARNING_RATE %, do not increase or decrease the confidence
+* respectively
+*/
+   for (i = 0; i < drv->state_count; i++) {
+   unsigned int delta;
+
+   if (idx == -1)
+   break;
+   if (i ==  idx) {
+   delta = (LEARNING_RATE * cpu_data->state_mat[last_idx][i]) / 100;

100 is a crap number to divide by as a computer. We bio-puddings happen
to have 10 digits, so 100 makes sense to us, but it does not to our
binary friends.



Absolutely! I just wrote the code exactly the way I did the Math on paper,
definitely need to figure out an optimal way of doing things.

~Pratik