Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/15/21 1:10 PM, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
>> With memory accounting disabled, the run time was 2.848s. With memory
>> accounting enabled, the run times with the application of various
>> patches in the patchset were:
>>
>>   Applied patches   Run time   Accounting overhead   Overhead %age
>>   ---------------   --------   -------------------   -------------
>>        None          10.800s         7.952s              100.0%
>>        1-2            9.140s         6.292s               79.1%
>>        1-3            7.641s         4.793s               60.3%
>>        1-5            6.801s         3.953s               49.7%
>
> I think this is a misleading way to report the overhead. I would have
> said:
>
> 	10.800s	7.952s	279.2%
> 	 9.140s	6.292s	220.9%
> 	 7.641s	4.793s	168.3%
> 	 6.801s	3.953s	138.8%

What I want to emphasize is the reduction in the accounting overhead part
of the execution time. Your percentage uses the accounting-disabled run
time as the denominator. I think both are valid; I will be more clear
about that in my version of the patch.

Thanks,
Longman
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> With memory accounting disabled, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
>
>   Applied patches   Run time   Accounting overhead   Overhead %age
>   ---------------   --------   -------------------   -------------
>        None          10.800s         7.952s              100.0%
>        1-2            9.140s         6.292s               79.1%
>        1-3            7.641s         4.793s               60.3%
>        1-5            6.801s         3.953s               49.7%

I think this is a misleading way to report the overhead. I would have said:

	10.800s	7.952s	279.2%
	 9.140s	6.292s	220.9%
	 7.641s	4.793s	168.3%
	 6.801s	3.953s	138.8%
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Thu, Apr 15, 2021 at 09:17:37AM -0400, Waiman Long wrote:
> I was focusing on your kernel module benchmark in testing my patch. I will
> try out your pgbench benchmark to see if there can be other tuning that can
> be done.

Thanks a lot!

> BTW, how many NUMA nodes does your test machine have? I did my testing
> with a 2-socket system. The vmstat caching part may be less effective on
> systems with more NUMA nodes. I will try to find a larger 4-socket system
> for testing.

The test machine has one node.

- Masa
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/14/21 11:26 PM, Masayoshi Mizuma wrote:
> Hi Longman,
>
> Thank you for your patches.
> I reran the benchmark with your patches, and it seems that the reduction
> is small... The total duration of the sendto() and recvfrom() system
> calls during the benchmark is as follows.
>
> - sendto
>   - v5.8 vanilla:                      2576.056 msec (100%)
>   - v5.12-rc7 vanilla:                 2988.911 msec (116%)
>   - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)
>
> - recvfrom
>   - v5.8 vanilla:                      2113.156 msec (100%)
>   - v5.12-rc7 vanilla:                 2305.810 msec (109%)
>   - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)
>
> kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times
> during the benchmark.
>
> I ran a loop in a kernel module as follows. Its duration is actually
> reduced by your patches.
>
> ---
> dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
> for (i = 0; i < 140; i++) {
>         p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
>         kmem_cache_free(dummy_cache, p);
> }
> ---
>
> - v5.12-rc7 vanilla:                 110 msec (100%)
> - v5.12-rc7 with your patches (1-5):  85 msec (77%)
>
> It seems that the reduction is small for the benchmark, though...
> Anyway, I can see your patches reduce the overhead.
>
> Please feel free to add:
>
> Tested-by: Masayoshi Mizuma
>
> Thanks!
> Masa

Thanks for the testing.

I was focusing on your kernel module benchmark in testing my patch. I will
try out your pgbench benchmark to see if there can be other tuning that can
be done.

BTW, how many NUMA nodes does your test machine have? I did my testing with
a 2-socket system. The vmstat caching part may be less effective on systems
with more NUMA nodes. I will try to find a larger 4-socket system for
testing.

Cheers,
Longman
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> v3:
>  - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
>    in patch 3.
>  - Remove redundant current_obj_stock() call in patch 5.
>
> v2:
>  - Fix bug found by test robot in patch 5.
>  - Update cover letter and commit logs.
>
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
>
> Workloads that perform a lot of kmemcache allocations and de-allocations
> may therefore experience a performance regression, as illustrated in [1]
> and [2].
>
> A simple kernel module that performs a repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
> module init time is used for benchmarking. The test was run on a
> CascadeLake server with turbo-boosting disabled to reduce run-to-run
> variation.
>
> With memory accounting disabled, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
>
>   Applied patches   Run time   Accounting overhead   Overhead %age
>   ---------------   --------   -------------------   -------------
>        None          10.800s         7.952s              100.0%
>        1-2            9.140s         6.292s               79.1%
>        1-3            7.641s         4.793s               60.3%
>        1-5            6.801s         3.953s               49.7%
>
> Note that this is the best-case scenario, where most updates happen only
> to the percpu stocks. Real workloads will likely have a certain amount
> of updates to the memcg charges and vmstats, so the performance benefit
> will be smaller.
>
> It was found that a big part of the memory accounting overhead was
> caused by the local_irq_save()/local_irq_restore() sequences in updating
> the local stock charge bytes and the vmstat array, at least on x86
> systems. There are two such sequences in kmem_cache_alloc() and two in
> kmem_cache_free(). This patchset tries to reduce the use of such
> sequences as much as possible. In fact, it eliminates them in the common
> case. Another part of this patchset caches the vmstat data updates in
> the local stock as well, which also helps.
>
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Hi Longman,

Thank you for your patches.
I reran the benchmark with your patches, and it seems that the reduction
is small... The total duration of the sendto() and recvfrom() system
calls during the benchmark is as follows.

- sendto
  - v5.8 vanilla:                      2576.056 msec (100%)
  - v5.12-rc7 vanilla:                 2988.911 msec (116%)
  - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)

- recvfrom
  - v5.8 vanilla:                      2113.156 msec (100%)
  - v5.12-rc7 vanilla:                 2305.810 msec (109%)
  - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)

kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times
during the benchmark.

I ran a loop in a kernel module as follows. Its duration is actually
reduced by your patches.

---
dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
for (i = 0; i < 140; i++) {
        p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
        kmem_cache_free(dummy_cache, p);
}
---

- v5.12-rc7 vanilla:                 110 msec (100%)
- v5.12-rc7 with your patches (1-5):  85 msec (77%)

It seems that the reduction is small for the benchmark, though...
Anyway, I can see your patches reduce the overhead.

Please feel free to add:

Tested-by: Masayoshi Mizuma

Thanks!
Masa

> [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
>
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
>
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c            | 199 -
>  mm/percpu.c                |   9 +-
>  mm/slab.h                  |  32 +++---
>  4 files changed, 196 insertions(+), 58 deletions(-)
>
> --
> 2.18.1
[PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
v3:
 - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
   in patch 3.
 - Remove redundant current_obj_stock() call in patch 5.

v2:
 - Fix bug found by test robot in patch 5.
 - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and de-allocations
may therefore experience a performance regression, as illustrated in [1]
and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
module init time is used for benchmarking. The test was run on a
CascadeLake server with turbo-boosting disabled to reduce run-to-run
variation.

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---------------   --------   -------------------   -------------
       None          10.800s         7.952s              100.0%
       1-2            9.140s         6.292s               79.1%
       1-3            7.641s         4.793s               60.3%
       1-5            6.801s         3.953s               49.7%

Note that this is the best-case scenario, where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats, so the performance benefit
will be smaller.

It was found that a big part of the memory accounting overhead was
caused by the local_irq_save()/local_irq_restore() sequences in updating
the local stock charge bytes and the vmstat array, at least on x86
systems. There are two such sequences in kmem_cache_alloc() and two in
kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset caches the vmstat data updates in
the local stock as well, which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 199 -
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 196 insertions(+), 58 deletions(-)

--
2.18.1