Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-29 Thread Jan Kara
On Thu 28-06-18 12:21:26, Shakeel Butt wrote:
> On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
> >
> > On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > > A lot of memory can be consumed by the events generated for the huge or
> > > unlimited queues if there is either no or slow listener.  This can cause
> > > system level memory pressure or OOMs.  So, it's better to account the
> > > fsnotify kmem caches to the memcg of the listener.
> > >
> > > However the listener can be in a different memcg than the memcg of the
> > > producer and these allocations happen in the context of the event
> > > producer. This patch introduces remote memcg charging API which the
> > > producer can use to charge the allocations to the memcg of the listener.
> > >
> > > There are seven fsnotify kmem caches and among them allocations from
> > > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > > inotify_inode_mark_cachep happens in the context of syscall from the
> > > listener.  So, SLAB_ACCOUNT is enough for these caches.
> > >
> > > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > > are small compared to the notification mark or events and it is unclear
> > > whom to account connector to since it is shared by all events attached to
> > > the inode.
> > >
> > > The allocations from the event caches happen in the context of the event
> > > producer.  For such caches we will need to remote charge the allocations
> > > to the listener's memcg.  Thus we save the memcg reference in the
> > > fsnotify_group structure of the listener.
> > >
> > > This patch has also moved the members of fsnotify_group to keep the size
> > > same, at least for 64 bit build, even with additional member by filling
> > > the holes.
> >
> > ...
> >
> > >  static int __init fanotify_user_setup(void)
> > >  {
> > > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > > +  SLAB_PANIC|SLAB_ACCOUNT);
> > >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> > >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> > >   fanotify_perm_event_cachep =
> >
> > Why don't you setup also fanotify_event_cachep and
> > fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> > __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
> >
> 
> Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
> memalloc_use_memcg() makes the code more explicit and readable that we
> want to targeted/remote memcg charging. However if you think
> otherwise, I will replace __GFP_ACCOUNT with SLAB_ACCOUNT.

OK, fair enough.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-29 Thread Jan Kara
On Thu 28-06-18 12:21:26, Shakeel Butt wrote:
> On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
> >
> > On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > > A lot of memory can be consumed by the events generated for the huge or
> > > unlimited queues if there is either no or slow listener.  This can cause
> > > system level memory pressure or OOMs.  So, it's better to account the
> > > fsnotify kmem caches to the memcg of the listener.
> > >
> > > However the listener can be in a different memcg than the memcg of the
> > > producer and these allocations happen in the context of the event
> > > producer. This patch introduces remote memcg charging API which the
> > > producer can use to charge the allocations to the memcg of the listener.
> > >
> > > There are seven fsnotify kmem caches and among them allocations from
> > > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > > inotify_inode_mark_cachep happens in the context of syscall from the
> > > listener.  So, SLAB_ACCOUNT is enough for these caches.
> > >
> > > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > > are small compared to the notification mark or events and it is unclear
> > > whom to account connector to since it is shared by all events attached to
> > > the inode.
> > >
> > > The allocations from the event caches happen in the context of the event
> > > producer.  For such caches we will need to remote charge the allocations
> > > to the listener's memcg.  Thus we save the memcg reference in the
> > > fsnotify_group structure of the listener.
> > >
> > > This patch has also moved the members of fsnotify_group to keep the size
> > > same, at least for 64 bit build, even with additional member by filling
> > > the holes.
> >
> > ...
> >
> > >  static int __init fanotify_user_setup(void)
> > >  {
> > > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > > +  SLAB_PANIC|SLAB_ACCOUNT);
> > >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> > >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> > >   fanotify_perm_event_cachep =
> >
> > Why don't you setup also fanotify_event_cachep and
> > fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> > __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
> >
> 
> Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
> memalloc_use_memcg() makes the code more explicit and readable that we
> want to targeted/remote memcg charging. However if you think
> otherwise, I will replace __GFP_ACCOUNT with SLAB_ACCOUNT.

OK, fair enough.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-29 Thread Michal Hocko
On Thu 28-06-18 12:21:26, Shakeel Butt wrote:
> On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
> >
> > On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > > A lot of memory can be consumed by the events generated for the huge or
> > > unlimited queues if there is either no or slow listener.  This can cause
> > > system level memory pressure or OOMs.  So, it's better to account the
> > > fsnotify kmem caches to the memcg of the listener.
> > >
> > > However the listener can be in a different memcg than the memcg of the
> > > producer and these allocations happen in the context of the event
> > > producer. This patch introduces remote memcg charging API which the
> > > producer can use to charge the allocations to the memcg of the listener.
> > >
> > > There are seven fsnotify kmem caches and among them allocations from
> > > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > > inotify_inode_mark_cachep happens in the context of syscall from the
> > > listener.  So, SLAB_ACCOUNT is enough for these caches.
> > >
> > > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > > are small compared to the notification mark or events and it is unclear
> > > whom to account connector to since it is shared by all events attached to
> > > the inode.
> > >
> > > The allocations from the event caches happen in the context of the event
> > > producer.  For such caches we will need to remote charge the allocations
> > > to the listener's memcg.  Thus we save the memcg reference in the
> > > fsnotify_group structure of the listener.
> > >
> > > This patch has also moved the members of fsnotify_group to keep the size
> > > same, at least for 64 bit build, even with additional member by filling
> > > the holes.
> >
> > ...
> >
> > >  static int __init fanotify_user_setup(void)
> > >  {
> > > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > > +  SLAB_PANIC|SLAB_ACCOUNT);
> > >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> > >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> > >   fanotify_perm_event_cachep =
> >
> > Why don't you setup also fanotify_event_cachep and
> > fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> > __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
> >
> 
> Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
> memalloc_use_memcg() makes the code more explicit and readable that we
> want to targeted/remote memcg charging.

Agreed. If you had an implicit SLAB_ACCOUNT then you could get
inconsistencies when some allocations would get charged to the current
task while others would not.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-29 Thread Michal Hocko
On Thu 28-06-18 12:21:26, Shakeel Butt wrote:
> On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
> >
> > On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > > A lot of memory can be consumed by the events generated for the huge or
> > > unlimited queues if there is either no or slow listener.  This can cause
> > > system level memory pressure or OOMs.  So, it's better to account the
> > > fsnotify kmem caches to the memcg of the listener.
> > >
> > > However the listener can be in a different memcg than the memcg of the
> > > producer and these allocations happen in the context of the event
> > > producer. This patch introduces remote memcg charging API which the
> > > producer can use to charge the allocations to the memcg of the listener.
> > >
> > > There are seven fsnotify kmem caches and among them allocations from
> > > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > > inotify_inode_mark_cachep happens in the context of syscall from the
> > > listener.  So, SLAB_ACCOUNT is enough for these caches.
> > >
> > > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > > are small compared to the notification mark or events and it is unclear
> > > whom to account connector to since it is shared by all events attached to
> > > the inode.
> > >
> > > The allocations from the event caches happen in the context of the event
> > > producer.  For such caches we will need to remote charge the allocations
> > > to the listener's memcg.  Thus we save the memcg reference in the
> > > fsnotify_group structure of the listener.
> > >
> > > This patch has also moved the members of fsnotify_group to keep the size
> > > same, at least for 64 bit build, even with additional member by filling
> > > the holes.
> >
> > ...
> >
> > >  static int __init fanotify_user_setup(void)
> > >  {
> > > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > > +  SLAB_PANIC|SLAB_ACCOUNT);
> > >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> > >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> > >   fanotify_perm_event_cachep =
> >
> > Why don't you setup also fanotify_event_cachep and
> > fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> > __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
> >
> 
> Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
> memalloc_use_memcg() makes the code more explicit and readable that we
> want to targeted/remote memcg charging.

Agreed. If you had an implicit SLAB_ACCOUNT then you could get
inconsistencies when some allocations would get charged to the current
task while others would not.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-28 Thread Shakeel Butt
On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
>
> On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > A lot of memory can be consumed by the events generated for the huge or
> > unlimited queues if there is either no or slow listener.  This can cause
> > system level memory pressure or OOMs.  So, it's better to account the
> > fsnotify kmem caches to the memcg of the listener.
> >
> > However the listener can be in a different memcg than the memcg of the
> > producer and these allocations happen in the context of the event
> > producer. This patch introduces remote memcg charging API which the
> > producer can use to charge the allocations to the memcg of the listener.
> >
> > There are seven fsnotify kmem caches and among them allocations from
> > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > inotify_inode_mark_cachep happens in the context of syscall from the
> > listener.  So, SLAB_ACCOUNT is enough for these caches.
> >
> > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > are small compared to the notification mark or events and it is unclear
> > whom to account connector to since it is shared by all events attached to
> > the inode.
> >
> > The allocations from the event caches happen in the context of the event
> > producer.  For such caches we will need to remote charge the allocations
> > to the listener's memcg.  Thus we save the memcg reference in the
> > fsnotify_group structure of the listener.
> >
> > This patch has also moved the members of fsnotify_group to keep the size
> > same, at least for 64 bit build, even with additional member by filling
> > the holes.
>
> ...
>
> >  static int __init fanotify_user_setup(void)
> >  {
> > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > +  SLAB_PANIC|SLAB_ACCOUNT);
> >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> >   fanotify_perm_event_cachep =
>
> Why don't you setup also fanotify_event_cachep and
> fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
>

Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
memalloc_use_memcg() makes the code more explicit and readable that we
want to targeted/remote memcg charging. However if you think
otherwise, I will replace __GFP_ACCOUNT with SLAB_ACCOUNT.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-28 Thread Shakeel Butt
On Thu, Jun 28, 2018 at 12:03 PM Jan Kara  wrote:
>
> On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> > A lot of memory can be consumed by the events generated for the huge or
> > unlimited queues if there is either no or slow listener.  This can cause
> > system level memory pressure or OOMs.  So, it's better to account the
> > fsnotify kmem caches to the memcg of the listener.
> >
> > However the listener can be in a different memcg than the memcg of the
> > producer and these allocations happen in the context of the event
> > producer. This patch introduces remote memcg charging API which the
> > producer can use to charge the allocations to the memcg of the listener.
> >
> > There are seven fsnotify kmem caches and among them allocations from
> > dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> > inotify_inode_mark_cachep happens in the context of syscall from the
> > listener.  So, SLAB_ACCOUNT is enough for these caches.
> >
> > The objects from fsnotify_mark_connector_cachep are not accounted as they
> > are small compared to the notification mark or events and it is unclear
> > whom to account connector to since it is shared by all events attached to
> > the inode.
> >
> > The allocations from the event caches happen in the context of the event
> > producer.  For such caches we will need to remote charge the allocations
> > to the listener's memcg.  Thus we save the memcg reference in the
> > fsnotify_group structure of the listener.
> >
> > This patch has also moved the members of fsnotify_group to keep the size
> > same, at least for 64 bit build, even with additional member by filling
> > the holes.
>
> ...
>
> >  static int __init fanotify_user_setup(void)
> >  {
> > - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> > + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> > +  SLAB_PANIC|SLAB_ACCOUNT);
> >   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> >   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
> >   fanotify_perm_event_cachep =
>
> Why don't you setup also fanotify_event_cachep and
> fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
> __GFP_ACCOUNT manually? Otherwise the patch looks good to me.
>

Hi Jan, IMHO having a visible __GFP_ACCOUNT along with
memalloc_use_memcg() makes the code more explicit and readable that we
want to targeted/remote memcg charging. However if you think
otherwise, I will replace __GFP_ACCOUNT with SLAB_ACCOUNT.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-28 Thread Jan Kara
On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> A lot of memory can be consumed by the events generated for the huge or
> unlimited queues if there is either no or slow listener.  This can cause
> system level memory pressure or OOMs.  So, it's better to account the
> fsnotify kmem caches to the memcg of the listener.
> 
> However the listener can be in a different memcg than the memcg of the
> producer and these allocations happen in the context of the event
> producer. This patch introduces remote memcg charging API which the
> producer can use to charge the allocations to the memcg of the listener.
> 
> There are seven fsnotify kmem caches and among them allocations from
> dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> inotify_inode_mark_cachep happens in the context of syscall from the
> listener.  So, SLAB_ACCOUNT is enough for these caches.
> 
> The objects from fsnotify_mark_connector_cachep are not accounted as they
> are small compared to the notification mark or events and it is unclear
> whom to account connector to since it is shared by all events attached to
> the inode.
> 
> The allocations from the event caches happen in the context of the event
> producer.  For such caches we will need to remote charge the allocations
> to the listener's memcg.  Thus we save the memcg reference in the
> fsnotify_group structure of the listener.
> 
> This patch has also moved the members of fsnotify_group to keep the size
> same, at least for 64 bit build, even with additional member by filling
> the holes.

...

>  static int __init fanotify_user_setup(void)
>  {
> - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> +  SLAB_PANIC|SLAB_ACCOUNT);
>   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
>   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
>   fanotify_perm_event_cachep =

Why don't you setup also fanotify_event_cachep and
fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
__GFP_ACCOUNT manually? Otherwise the patch looks good to me.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-28 Thread Jan Kara
On Wed 27-06-18 12:12:49, Shakeel Butt wrote:
> A lot of memory can be consumed by the events generated for the huge or
> unlimited queues if there is either no or slow listener.  This can cause
> system level memory pressure or OOMs.  So, it's better to account the
> fsnotify kmem caches to the memcg of the listener.
> 
> However the listener can be in a different memcg than the memcg of the
> producer and these allocations happen in the context of the event
> producer. This patch introduces remote memcg charging API which the
> producer can use to charge the allocations to the memcg of the listener.
> 
> There are seven fsnotify kmem caches and among them allocations from
> dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> inotify_inode_mark_cachep happens in the context of syscall from the
> listener.  So, SLAB_ACCOUNT is enough for these caches.
> 
> The objects from fsnotify_mark_connector_cachep are not accounted as they
> are small compared to the notification mark or events and it is unclear
> whom to account connector to since it is shared by all events attached to
> the inode.
> 
> The allocations from the event caches happen in the context of the event
> producer.  For such caches we will need to remote charge the allocations
> to the listener's memcg.  Thus we save the memcg reference in the
> fsnotify_group structure of the listener.
> 
> This patch has also moved the members of fsnotify_group to keep the size
> same, at least for 64 bit build, even with additional member by filling
> the holes.

...

>  static int __init fanotify_user_setup(void)
>  {
> - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> +  SLAB_PANIC|SLAB_ACCOUNT);
>   fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
>   if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
>   fanotify_perm_event_cachep =

Why don't you setup also fanotify_event_cachep and
fanotify_perm_event_cachep caches with SLAB_ACCOUNT and instead specify
__GFP_ACCOUNT manually? Otherwise the patch looks good to me.

Honza

-- 
Jan Kara 
SUSE Labs, CR


[PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-27 Thread Shakeel Butt
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as they
are small compared to the notification mark or events and it is unclear
whom to account connector to since it is shared by all events attached to
the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

Signed-off-by: Shakeel Butt 
Cc: Michal Hocko 
Cc: Jan Kara 
Cc: Amir Goldstein 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Roman Gushchin 
Cc: Andrew Morton 
Cc: Alexander Viro 
---
Changelog since v7:
- Simplify the remote memcg charging API
- get_mem_cgroup_from_current() returns root_mem_cgroup if
  current->active_memcg is not NULL but css_tryget_online() fails.

Changelog since v6:
- Removed Jan's ACK as the code has changed a lot
- Squashed the separate remote charging API path into this one
- Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
- Changed fsnotify remote charging code to use scope API

Changelog since v5:
- None

Changelog since v4:
- Fixed the build for CONFIG_MEMCG=n

Changelog since v3:
- Rebased over Jan's patches.
- Some cleanup based on Amir's comments.

Changelog since v2:
- None

Changelog since v1:
- no more charging fsnotify_mark_connector objects
- Fixed the build for SLOB

 fs/notify/dnotify/dnotify.c  |  5 ++--
 fs/notify/fanotify/fanotify.c| 14 ---
 fs/notify/fanotify/fanotify_user.c   |  5 +++-
 fs/notify/group.c|  3 +++
 fs/notify/inotify/inotify_fsnotify.c |  7 +-
 fs/notify/inotify/inotify_user.c |  5 +++-
 include/linux/fsnotify_backend.h | 12 ++---
 include/linux/memcontrol.h   | 10 +++-
 include/linux/sched.h|  3 +++
 include/linux/sched/mm.h | 37 
 kernel/fork.c|  3 +++
 mm/memcontrol.c  | 37 +---
 12 files changed, 123 insertions(+), 18 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..a6365e6bc047 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
long arg)
 
 static int __init dnotify_init(void)
 {
-   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
+ SLAB_PANIC|SLAB_ACCOUNT);
+   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..6ff1f75d156d 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "fanotify.h"
 
@@ -140,8 +141,8 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
 struct inode *inode, u32 mask,
 const struct path *path)
 {
-   struct fanotify_event_info *event;
-   gfp_t gfp = GFP_KERNEL;
+   struct fanotify_event_info *event = NULL;
+   gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -151,19 +152,22 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
if (group->max_events == UINT_MAX)
gfp |= __GFP_NOFAIL;
 
+   /* Whoever is interested in the event, pays for the allocation. */
+   memalloc_use_memcg(group->memcg);
+
if (fanotify_is_perm_event(mask)) {

[PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-27 Thread Shakeel Butt
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as they
are small compared to the notification mark or events and it is unclear
whom to account connector to since it is shared by all events attached to
the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

Signed-off-by: Shakeel Butt 
Cc: Michal Hocko 
Cc: Jan Kara 
Cc: Amir Goldstein 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Roman Gushchin 
Cc: Andrew Morton 
Cc: Alexander Viro 
---
Changelog since v7:
- Simplify the remote memcg charging API
- get_mem_cgroup_from_current() returns root_mem_cgroup if
  current->active_memcg is not NULL but css_tryget_online() fails.

Changelog since v6:
- Removed Jan's ACK as the code has changed a lot
- Squashed the separate remote charging API path into this one
- Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
- Changed fsnotify remote charging code to use scope API

Changelog since v5:
- None

Changelog since v4:
- Fixed the build for CONFIG_MEMCG=n

Changelog since v3:
- Rebased over Jan's patches.
- Some cleanup based on Amir's comments.

Changelog since v2:
- None

Changelog since v1:
- no more charging fsnotify_mark_connector objects
- Fixed the build for SLOB

 fs/notify/dnotify/dnotify.c  |  5 ++--
 fs/notify/fanotify/fanotify.c| 14 ---
 fs/notify/fanotify/fanotify_user.c   |  5 +++-
 fs/notify/group.c|  3 +++
 fs/notify/inotify/inotify_fsnotify.c |  7 +-
 fs/notify/inotify/inotify_user.c |  5 +++-
 include/linux/fsnotify_backend.h | 12 ++---
 include/linux/memcontrol.h   | 10 +++-
 include/linux/sched.h|  3 +++
 include/linux/sched/mm.h | 37 
 kernel/fork.c|  3 +++
 mm/memcontrol.c  | 37 +---
 12 files changed, 123 insertions(+), 18 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..a6365e6bc047 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
long arg)
 
 static int __init dnotify_init(void)
 {
-   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
+ SLAB_PANIC|SLAB_ACCOUNT);
+   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..6ff1f75d156d 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "fanotify.h"
 
@@ -140,8 +141,8 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
 struct inode *inode, u32 mask,
 const struct path *path)
 {
-   struct fanotify_event_info *event;
-   gfp_t gfp = GFP_KERNEL;
+   struct fanotify_event_info *event = NULL;
+   gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -151,19 +152,22 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
if (group->max_events == UINT_MAX)
gfp |= __GFP_NOFAIL;
 
+   /* Whoever is interested in the event, pays for the allocation. */
+   memalloc_use_memcg(group->memcg);
+
if (fanotify_is_perm_event(mask)) {

Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-27 Thread Jan Kara
On Tue 26-06-18 12:07:57, Shakeel Butt wrote:
> On Tue, Jun 26, 2018 at 11:55 AM Johannes Weiner  wrote:
> >
> > On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  
> > > wrote:
> > > >
> > > ...
> > > >
> > > > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > > > You can use 'override'/'revert' verbs like override_creds or just call
> > > > memalloc_use_memcg(old_memcg) since there is no reference taken
> > > > anyway in use_memcg and no reference released in unuse_memcg.
> > > >
> > > > Otherwise looks good to me.
> > > >
> > >
> > > Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> > > ignoring the return seems more simple. I will wait for feedback from
> > > other before changing anything.
> >
> > We're not nesting calls to memalloc_use_memcg(), right? So we don't
> > have to return old_memcg and don't have to pass anything to unuse, it
> > can always set current->active_memcg to NULL.
> 
> For buffer_head, the allocation is done with GFP_NOFS. So, I think
> there is no chance of nesting. The fsnotify uses GFP_KERNEL but based
> on my limited understanding of fsnotify, there should not be any
> nesting i.e. the allocation triggering reclaim which trigger fsnotify
> events. Though I would like Amir or Jan to confirm there is no nesting
> possible.

You are correct. Fsnotify events are generated only as a result of some
syscall, not due to reclaim or stuff like that.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-27 Thread Jan Kara
On Tue 26-06-18 12:07:57, Shakeel Butt wrote:
> On Tue, Jun 26, 2018 at 11:55 AM Johannes Weiner  wrote:
> >
> > On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  
> > > wrote:
> > > >
> > > ...
> > > >
> > > > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > > > You can use 'override'/'revert' verbs like override_creds or just call
> > > > memalloc_use_memcg(old_memcg) since there is no reference taken
> > > > anyway in use_memcg and no reference released in unuse_memcg.
> > > >
> > > > Otherwise looks good to me.
> > > >
> > >
> > > Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> > > ignoring the return seems more simple. I will wait for feedback from
> > > other before changing anything.
> >
> > We're not nesting calls to memalloc_use_memcg(), right? So we don't
> > have to return old_memcg and don't have to pass anything to unuse, it
> > can always set current->active_memcg to NULL.
> 
> For buffer_head, the allocation is done with GFP_NOFS. So, I think
> there is no chance of nesting. The fsnotify uses GFP_KERNEL but based
> on my limited understanding of fsnotify, there should not be any
> nesting i.e. the allocation triggering reclaim which trigger fsnotify
> events. Though I would like Amir or Jan to confirm there is no nesting
> possible.

You are correct. Fsnotify events are generated only as a result of some
syscall, not due to reclaim or stuff like that.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Amir Goldstein
On Tue, Jun 26, 2018 at 11:05 PM, Shakeel Butt  wrote:
> On Tue, Jun 26, 2018 at 12:03 PM Johannes Weiner  wrote:
>>
>> On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
>> > @@ -140,8 +141,9 @@ struct fanotify_event_info 
>> > *fanotify_alloc_event(struct fsnotify_group *group,
>> >struct inode *inode, u32 
>> > mask,
>> >const struct path *path)
>> >  {
>> > - struct fanotify_event_info *event;
>> > + struct fanotify_event_info *event = NULL;
>> >   gfp_t gfp = GFP_KERNEL;
>> > + struct mem_cgroup *old_memcg = NULL;
>> >
>> >   /*
>> >* For queues with unlimited length lost events are not expected and
>> > @@ -151,19 +153,25 @@ struct fanotify_event_info 
>> > *fanotify_alloc_event(struct fsnotify_group *group,
>> >   if (group->max_events == UINT_MAX)
>> >   gfp |= __GFP_NOFAIL;
>> >
>> > + /* Whoever is interested in the event, pays for the allocation. */
>> > + if (group->memcg) {
>> > + gfp |= __GFP_ACCOUNT;
>> > + old_memcg = memalloc_use_memcg(group->memcg);
>> > + }
>>
>> group->memcg is only NULL when memcg is disabled or there is some
>> offlining race. Can you make memalloc_use_memcg(NULL) mean that it
>> should charge root_mem_cgroup instead of current->mm->memcg? That way
>> we can make this site unconditional while retaining the behavior:
>>
>> gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
>>
>> memalloc_use_memcg(group->memcg);
>> kmem_cache_alloc(..., gfp);
>> out:
>> memalloc_unuse_memcg();
>>
>> (dropping old_memcg and the unuse parameter as per the other mail)
>>
>
> group->memcg is only NULL when memcg is disabled (i.e.
> get_mem_cgroup_from_mm() returns root_mem_cgroup for offlined
> mm->memcg). Though group->memcg can point to an offlined memcg.
>
> If I understand you correctly this is what we want:
>
> 1. If group->memcg is NULL then __GFP_ACCOUNT is a noop i.e. memcg is 
> disabled.
> 2. If group->memcg is root_mem_cgroup, then __GFP_ACCOUNT again is a
> kind of noop (charges to root_mem_cgroups are bypassed).
> 3. If group->memcg is offlined memcg, then make __GFP_ACCOUNT noop by
> returning root_mem_cgroup from get_mem_cgroup_from_current().
> 4. Else charge group->memcg.
>
> This seems reasonable. After your Ack and Amir's or Jan's answer to
> the nesting query, I will resend the next version of this patch
> series.
>
> In future if we find any use-cases of memalloc_use_memcg nesting then
> we can make it work for nesting.
>

For the fsnotify use case memalloc_use_memcg() certainly doesn't
need to nest, but I wonder, if that facility becomes popular among different
subsystems, how exactly do you intend to monitor that it doesn't grow
nested use cases? I would suggest that you at least leave a
WARN_ON_ONCE if memalloc_use_memcg() is called and
active_memcg is already set.

Thanks,
Amir.


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Amir Goldstein
On Tue, Jun 26, 2018 at 11:05 PM, Shakeel Butt  wrote:
> On Tue, Jun 26, 2018 at 12:03 PM Johannes Weiner  wrote:
>>
>> On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
>> > @@ -140,8 +141,9 @@ struct fanotify_event_info 
>> > *fanotify_alloc_event(struct fsnotify_group *group,
>> >struct inode *inode, u32 
>> > mask,
>> >const struct path *path)
>> >  {
>> > - struct fanotify_event_info *event;
>> > + struct fanotify_event_info *event = NULL;
>> >   gfp_t gfp = GFP_KERNEL;
>> > + struct mem_cgroup *old_memcg = NULL;
>> >
>> >   /*
>> >* For queues with unlimited length lost events are not expected and
>> > @@ -151,19 +153,25 @@ struct fanotify_event_info 
>> > *fanotify_alloc_event(struct fsnotify_group *group,
>> >   if (group->max_events == UINT_MAX)
>> >   gfp |= __GFP_NOFAIL;
>> >
>> > + /* Whoever is interested in the event, pays for the allocation. */
>> > + if (group->memcg) {
>> > + gfp |= __GFP_ACCOUNT;
>> > + old_memcg = memalloc_use_memcg(group->memcg);
>> > + }
>>
>> group->memcg is only NULL when memcg is disabled or there is some
>> offlining race. Can you make memalloc_use_memcg(NULL) mean that it
>> should charge root_mem_cgroup instead of current->mm->memcg? That way
>> we can make this site unconditional while retaining the behavior:
>>
>> gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
>>
>> memalloc_use_memcg(group->memcg);
>> kmem_cache_alloc(..., gfp);
>> out:
>> memalloc_unuse_memcg();
>>
>> (dropping old_memcg and the unuse parameter as per the other mail)
>>
>
> group->memcg is only NULL when memcg is disabled (i.e.
> get_mem_cgroup_from_mm() returns root_mem_cgroup for offlined
> mm->memcg). Though group->memcg can point to an offlined memcg.
>
> If I understand you correctly this is what we want:
>
> 1. If group->memcg is NULL then __GFP_ACCOUNT is a noop i.e. memcg is 
> disabled.
> 2. If group->memcg is root_mem_cgroup, then __GFP_ACCOUNT again is a
> kind of noop (charges to root_mem_cgroups are bypassed).
> 3. If group->memcg is offlined memcg, then make __GFP_ACCOUNT noop by
> returning root_mem_cgroup from get_mem_cgroup_from_current().
> 4. Else charge group->memcg.
>
> This seems reasonable. After your Ack and Amir's or Jan's answer to
> the nesting query, I will resend the next version of this patch
> series.
>
> In future if we find any use-cases of memalloc_use_memcg nesting then
> we can make it work for nesting.
>

For the fsnotify use case memalloc_use_memcg() certainly doesn't
need to nest, but I wonder, if that facility becomes popular among different
subsystems, how exactly do you intend to monitor that it doesn't grow
nested use cases? I would suggest that you at least leave a
WARN_ON_ONCE if memalloc_use_memcg() is called and
active_memcg is already set.

Thanks,
Amir.


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Tue, Jun 26, 2018 at 12:03 PM Johannes Weiner  wrote:
>
> On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
> > @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> > fsnotify_group *group,
> >struct inode *inode, u32 
> > mask,
> >const struct path *path)
> >  {
> > - struct fanotify_event_info *event;
> > + struct fanotify_event_info *event = NULL;
> >   gfp_t gfp = GFP_KERNEL;
> > + struct mem_cgroup *old_memcg = NULL;
> >
> >   /*
> >* For queues with unlimited length lost events are not expected and
> > @@ -151,19 +153,25 @@ struct fanotify_event_info 
> > *fanotify_alloc_event(struct fsnotify_group *group,
> >   if (group->max_events == UINT_MAX)
> >   gfp |= __GFP_NOFAIL;
> >
> > + /* Whoever is interested in the event, pays for the allocation. */
> > + if (group->memcg) {
> > + gfp |= __GFP_ACCOUNT;
> > + old_memcg = memalloc_use_memcg(group->memcg);
> > + }
>
> group->memcg is only NULL when memcg is disabled or there is some
> offlining race. Can you make memalloc_use_memcg(NULL) mean that it
> should charge root_mem_cgroup instead of current->mm->memcg? That way
> we can make this site unconditional while retaining the behavior:
>
> gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
>
> memalloc_use_memcg(group->memcg);
> kmem_cache_alloc(..., gfp);
> out:
> memalloc_unuse_memcg();
>
> (dropping old_memcg and the unuse parameter as per the other mail)
>

group->memcg is only NULL when memcg is disabled (i.e.
get_mem_cgroup_from_mm() returns root_mem_cgroup for offlined
mm->memcg). Though group->memcg can point to an offlined memcg.

If I understand you correctly this is what we want:

1. If group->memcg is NULL then __GFP_ACCOUNT is a noop i.e. memcg is disabled.
2. If group->memcg is root_mem_cgroup, then __GFP_ACCOUNT again is a
kind of noop (charges to root_mem_cgroups are bypassed).
3. If group->memcg is offlined memcg, then make __GFP_ACCOUNT noop by
returning root_mem_cgroup from get_mem_cgroup_from_current().
4. Else charge group->memcg.

This seems reasonable. After your Ack and Amir's or Jan's answer to
the nesting query, I will resend the next version of this patch
series.

In future if we find any use-cases of memalloc_use_memcg nesting then
we can make it work for nesting.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Tue, Jun 26, 2018 at 12:03 PM Johannes Weiner  wrote:
>
> On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
> > @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> > fsnotify_group *group,
> >struct inode *inode, u32 
> > mask,
> >const struct path *path)
> >  {
> > - struct fanotify_event_info *event;
> > + struct fanotify_event_info *event = NULL;
> >   gfp_t gfp = GFP_KERNEL;
> > + struct mem_cgroup *old_memcg = NULL;
> >
> >   /*
> >* For queues with unlimited length lost events are not expected and
> > @@ -151,19 +153,25 @@ struct fanotify_event_info 
> > *fanotify_alloc_event(struct fsnotify_group *group,
> >   if (group->max_events == UINT_MAX)
> >   gfp |= __GFP_NOFAIL;
> >
> > + /* Whoever is interested in the event, pays for the allocation. */
> > + if (group->memcg) {
> > + gfp |= __GFP_ACCOUNT;
> > + old_memcg = memalloc_use_memcg(group->memcg);
> > + }
>
> group->memcg is only NULL when memcg is disabled or there is some
> offlining race. Can you make memalloc_use_memcg(NULL) mean that it
> should charge root_mem_cgroup instead of current->mm->memcg? That way
> we can make this site unconditional while retaining the behavior:
>
> gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;
>
> memalloc_use_memcg(group->memcg);
> kmem_cache_alloc(..., gfp);
> out:
> memalloc_unuse_memcg();
>
> (dropping old_memcg and the unuse parameter as per the other mail)
>

group->memcg is only NULL when memcg is disabled (i.e.
get_mem_cgroup_from_mm() returns root_mem_cgroup for offlined
mm->memcg). Though group->memcg can point to an offlined memcg.

If I understand you correctly this is what we want:

1. If group->memcg is NULL then __GFP_ACCOUNT is a noop i.e. memcg is disabled.
2. If group->memcg is root_mem_cgroup, then __GFP_ACCOUNT again is a
kind of noop (charges to root_mem_cgroups are bypassed).
3. If group->memcg is offlined memcg, then make __GFP_ACCOUNT noop by
returning root_mem_cgroup from get_mem_cgroup_from_current().
4. Else charge group->memcg.

This seems reasonable. After your Ack and Amir's or Jan's answer to
the nesting query, I will resend the next version of this patch
series.

In future if we find any use-cases of memalloc_use_memcg nesting then
we can make it work for nesting.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Tue, Jun 26, 2018 at 11:55 AM Johannes Weiner  wrote:
>
> On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
> > >
> > ...
> > >
> > > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > > You can use 'override'/'revert' verbs like override_creds or just call
> > > memalloc_use_memcg(old_memcg) since there is no reference taken
> > > anyway in use_memcg and no reference released in unuse_memcg.
> > >
> > > Otherwise looks good to me.
> > >
> >
> > Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> > ignoring the return seems more simple. I will wait for feedback from
> > other before changing anything.
>
> We're not nesting calls to memalloc_use_memcg(), right? So we don't
> have to return old_memcg and don't have to pass anything to unuse, it
> can always set current->active_memcg to NULL.

For buffer_head, the allocation is done with GFP_NOFS. So, I think
there is no chance of nesting. The fsnotify uses GFP_KERNEL but based
on my limited understanding of fsnotify, there should not be any
nesting i.e. the allocation triggering reclaim which trigger fsnotify
events. Though I would like Amir or Jan to confirm there is no nesting
possible.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Tue, Jun 26, 2018 at 11:55 AM Johannes Weiner  wrote:
>
> On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
> > >
> > ...
> > >
> > > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > > You can use 'override'/'revert' verbs like override_creds or just call
> > > memalloc_use_memcg(old_memcg) since there is no reference taken
> > > anyway in use_memcg and no reference released in unuse_memcg.
> > >
> > > Otherwise looks good to me.
> > >
> >
> > Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> > ignoring the return seems more simple. I will wait for feedback from
> > other before changing anything.
>
> We're not nesting calls to memalloc_use_memcg(), right? So we don't
> have to return old_memcg and don't have to pass anything to unuse, it
> can always set current->active_memcg to NULL.

For buffer_head, the allocation is done with GFP_NOFS. So, I think
there is no chance of nesting. The fsnotify uses GFP_KERNEL but based
on my limited understanding of fsnotify, there should not be any
nesting i.e. the allocation triggering reclaim which trigger fsnotify
events. Though I would like Amir or Jan to confirm there is no nesting
possible.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Johannes Weiner
On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
> @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>struct inode *inode, u32 mask,
>const struct path *path)
>  {
> - struct fanotify_event_info *event;
> + struct fanotify_event_info *event = NULL;
>   gfp_t gfp = GFP_KERNEL;
> + struct mem_cgroup *old_memcg = NULL;
>  
>   /*
>* For queues with unlimited length lost events are not expected and
> @@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>   if (group->max_events == UINT_MAX)
>   gfp |= __GFP_NOFAIL;
>  
> + /* Whoever is interested in the event, pays for the allocation. */
> + if (group->memcg) {
> + gfp |= __GFP_ACCOUNT;
> + old_memcg = memalloc_use_memcg(group->memcg);
> + }

group->memcg is only NULL when memcg is disabled or there is some
offlining race. Can you make memalloc_use_memcg(NULL) mean that it
should charge root_mem_cgroup instead of current->mm->memcg? That way
we can make this site unconditional while retaining the behavior:

gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;

memalloc_use_memcg(group->memcg);
kmem_cache_alloc(..., gfp);
out:
memalloc_unuse_memcg();

(dropping old_memcg and the unuse parameter as per the other mail)

>   if (fanotify_is_perm_event(mask)) {
>   struct fanotify_perm_event_info *pevent;
>  
>   pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
>   if (!pevent)
> - return NULL;
> + goto out;
>   event = >fae;
>   pevent->response = 0;
>   goto init;
>   }
>   event = kmem_cache_alloc(fanotify_event_cachep, gfp);
>   if (!event)
> - return NULL;
> + goto out;
>  init: __maybe_unused
>   fsnotify_init_event(>fse, inode, mask);
>   event->tgid = get_pid(task_tgid(current));
> @@ -174,6 +182,9 @@ init: __maybe_unused
>   event->path.mnt = NULL;
>   event->path.dentry = NULL;
>   }
> +out:
> + if (group->memcg)
> + memalloc_unuse_memcg(old_memcg);
>   return event;
>  }

Thanks,
Johannes


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Johannes Weiner
On Mon, Jun 25, 2018 at 04:06:58PM -0700, Shakeel Butt wrote:
> @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>struct inode *inode, u32 mask,
>const struct path *path)
>  {
> - struct fanotify_event_info *event;
> + struct fanotify_event_info *event = NULL;
>   gfp_t gfp = GFP_KERNEL;
> + struct mem_cgroup *old_memcg = NULL;
>  
>   /*
>* For queues with unlimited length lost events are not expected and
> @@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>   if (group->max_events == UINT_MAX)
>   gfp |= __GFP_NOFAIL;
>  
> + /* Whoever is interested in the event, pays for the allocation. */
> + if (group->memcg) {
> + gfp |= __GFP_ACCOUNT;
> + old_memcg = memalloc_use_memcg(group->memcg);
> + }

group->memcg is only NULL when memcg is disabled or there is some
offlining race. Can you make memalloc_use_memcg(NULL) mean that it
should charge root_mem_cgroup instead of current->mm->memcg? That way
we can make this site unconditional while retaining the behavior:

gfp_t gfp = GFP_KERNEL | __GFP_ACCOUNT;

memalloc_use_memcg(group->memcg);
kmem_cache_alloc(..., gfp);
out:
memalloc_unuse_memcg();

(dropping old_memcg and the unuse parameter as per the other mail)

>   if (fanotify_is_perm_event(mask)) {
>   struct fanotify_perm_event_info *pevent;
>  
>   pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp);
>   if (!pevent)
> - return NULL;
> + goto out;
>   event = >fae;
>   pevent->response = 0;
>   goto init;
>   }
>   event = kmem_cache_alloc(fanotify_event_cachep, gfp);
>   if (!event)
> - return NULL;
> + goto out;
>  init: __maybe_unused
>   fsnotify_init_event(>fse, inode, mask);
>   event->tgid = get_pid(task_tgid(current));
> @@ -174,6 +182,9 @@ init: __maybe_unused
>   event->path.mnt = NULL;
>   event->path.dentry = NULL;
>   }
> +out:
> + if (group->memcg)
> + memalloc_unuse_memcg(old_memcg);
>   return event;
>  }

Thanks,
Johannes


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Johannes Weiner
On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
> >
> ...
> >
> > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > You can use 'override'/'revert' verbs like override_creds or just call
> > memalloc_use_memcg(old_memcg) since there is no reference taken
> > anyway in use_memcg and no reference released in unuse_memcg.
> >
> > Otherwise looks good to me.
> >
> 
> Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> ignoring the return seems more simple. I will wait for feedback from
> other before changing anything.

We're not nesting calls to memalloc_use_memcg(), right? So we don't
have to return old_memcg and don't have to pass anything to unuse, it
can always set current->active_memcg to NULL.


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Johannes Weiner
On Tue, Jun 26, 2018 at 11:00:53AM -0700, Shakeel Butt wrote:
> On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
> >
> ...
> >
> > The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> > You can use 'override'/'revert' verbs like override_creds or just call
> > memalloc_use_memcg(old_memcg) since there is no reference taken
> > anyway in use_memcg and no reference released in unuse_memcg.
> >
> > Otherwise looks good to me.
> >
> 
> Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
> ignoring the return seems more simple. I will wait for feedback from
> other before changing anything.

We're not nesting calls to memalloc_use_memcg(), right? So we don't
have to return old_memcg and don't have to pass anything to unuse, it
can always set current->active_memcg to NULL.


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
>
...
>
> The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> You can use 'override'/'revert' verbs like override_creds or just call
> memalloc_use_memcg(old_memcg) since there is no reference taken
> anyway in use_memcg and no reference released in unuse_memcg.
>
> Otherwise looks good to me.
>

Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
ignoring the return seems more simple. I will wait for feedback from
other before changing anything.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-26 Thread Shakeel Butt
On Mon, Jun 25, 2018 at 10:49 PM Amir Goldstein  wrote:
>
...
>
> The verb 'unuse' takes an argument memcg and 'uses' it - too weird.
> You can use 'override'/'revert' verbs like override_creds or just call
> memalloc_use_memcg(old_memcg) since there is no reference taken
> anyway in use_memcg and no reference released in unuse_memcg.
>
> Otherwise looks good to me.
>

Thanks for your feedback. Just using memalloc_use_memcg(old_memcg) and
ignoring the return seems more simple. I will wait for feedback from
other before changing anything.

thanks,
Shakeel


Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-25 Thread Amir Goldstein
On Tue, Jun 26, 2018 at 2:06 AM, Shakeel Butt  wrote:
> A lot of memory can be consumed by the events generated for the huge or
> unlimited queues if there is either no or slow listener.  This can cause
> system level memory pressure or OOMs.  So, it's better to account the
> fsnotify kmem caches to the memcg of the listener.
>
> However the listener can be in a different memcg than the memcg of the
> producer and these allocations happen in the context of the event
> producer. This patch introduces remote memcg charging scope API which the
> producer can use to charge the allocations to the memcg of the listener.
>
> There are seven fsnotify kmem caches and among them allocations from
> dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> inotify_inode_mark_cachep happens in the context of syscall from the
> listener.  So, SLAB_ACCOUNT is enough for these caches.
>
> The objects from fsnotify_mark_connector_cachep are not accounted as they
> are small compared to the notification mark or events and it is unclear
> whom to account connector to since it is shared by all events attached to
> the inode.
>
> The allocations from the event caches happen in the context of the event
> producer.  For such caches we will need to remote charge the allocations
> to the listener's memcg.  Thus we save the memcg reference in the
> fsnotify_group structure of the listener.
>
> This patch has also moved the members of fsnotify_group to keep the size
> same, at least for 64 bit build, even with additional member by filling
> the holes.
>
> Signed-off-by: Shakeel Butt 
> Cc: Michal Hocko 
> Cc: Jan Kara 
> Cc: Amir Goldstein 
> Cc: Greg Thelen 
> Cc: Johannes Weiner 
> Cc: Vladimir Davydov 
> Cc: Roman Gushchin 
> ---
> Changelog since v6:
> - Removed Jan's ACK as the code has changed a lot
> - Squashed the separate remote charging API path into this one
> - Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
> - Changed fsnotify remote charging code to use scope API
>
> Changelog since v5:
> - None
>
> Changelog since v4:
> - Fixed the build for CONFIG_MEMCG=n
>
> Changelog since v3:
> - Rebased over Jan's patches.
> - Some cleanup based on Amir's comments.
>
> Changelog since v2:
> - None
>
> Changelog since v1:
> - no more charging fsnotify_mark_connector objects
> - Fixed the build for SLOB
>
>  fs/notify/dnotify/dnotify.c  |  5 ++--
>  fs/notify/fanotify/fanotify.c| 17 ++--
>  fs/notify/fanotify/fanotify_user.c   |  5 +++-
>  fs/notify/group.c|  4 +++
>  fs/notify/inotify/inotify_fsnotify.c | 15 +-
>  fs/notify/inotify/inotify_user.c |  5 +++-
>  include/linux/fsnotify_backend.h | 12 +---
>  include/linux/memcontrol.h   |  7 +
>  include/linux/sched.h|  3 ++
>  include/linux/sched/mm.h | 41 
>  kernel/fork.c|  3 ++
>  mm/memcontrol.c  | 38 +++---
>  12 files changed, 139 insertions(+), 16 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index e2bea2ac5dfb..a6365e6bc047 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
> long arg)
>
>  static int __init dnotify_init(void)
>  {
> -   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
> -   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
> +   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
> + SLAB_PANIC|SLAB_ACCOUNT);
> +   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, 
> SLAB_PANIC|SLAB_ACCOUNT);
>
> dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
> if (IS_ERR(dnotify_group))
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index f90842efea13..d6dfcf0ec21f 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "fanotify.h"
>
> @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>  struct inode *inode, u32 
> mask,
>  const struct path *path)
>  {
> -   struct fanotify_event_info *event;
> +   struct fanotify_event_info *event = NULL;
> gfp_t gfp = GFP_KERNEL;
> +   struct mem_cgroup *old_memcg = NULL;
>
> /*
>  * For queues with unlimited length lost events are not expected and
> @@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
> if (group->max_events == UINT_MAX)
> gfp |= __GFP_NOFAIL;
>
> +   /* Whoever is interested in the event, pays for the allocation. */
> +   if (group->memcg) {
> +   

Re: [PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-25 Thread Amir Goldstein
On Tue, Jun 26, 2018 at 2:06 AM, Shakeel Butt  wrote:
> A lot of memory can be consumed by the events generated for the huge or
> unlimited queues if there is either no or slow listener.  This can cause
> system level memory pressure or OOMs.  So, it's better to account the
> fsnotify kmem caches to the memcg of the listener.
>
> However the listener can be in a different memcg than the memcg of the
> producer and these allocations happen in the context of the event
> producer. This patch introduces remote memcg charging scope API which the
> producer can use to charge the allocations to the memcg of the listener.
>
> There are seven fsnotify kmem caches and among them allocations from
> dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
> inotify_inode_mark_cachep happens in the context of syscall from the
> listener.  So, SLAB_ACCOUNT is enough for these caches.
>
> The objects from fsnotify_mark_connector_cachep are not accounted as they
> are small compared to the notification mark or events and it is unclear
> whom to account connector to since it is shared by all events attached to
> the inode.
>
> The allocations from the event caches happen in the context of the event
> producer.  For such caches we will need to remote charge the allocations
> to the listener's memcg.  Thus we save the memcg reference in the
> fsnotify_group structure of the listener.
>
> This patch has also moved the members of fsnotify_group to keep the size
> same, at least for 64 bit build, even with additional member by filling
> the holes.
>
> Signed-off-by: Shakeel Butt 
> Cc: Michal Hocko 
> Cc: Jan Kara 
> Cc: Amir Goldstein 
> Cc: Greg Thelen 
> Cc: Johannes Weiner 
> Cc: Vladimir Davydov 
> Cc: Roman Gushchin 
> ---
> Changelog since v6:
> - Removed Jan's ACK as the code has changed a lot
> - Squashed the separate remote charging API path into this one
> - Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
> - Changed fsnotify remote charging code to use scope API
>
> Changelog since v5:
> - None
>
> Changelog since v4:
> - Fixed the build for CONFIG_MEMCG=n
>
> Changelog since v3:
> - Rebased over Jan's patches.
> - Some cleanup based on Amir's comments.
>
> Changelog since v2:
> - None
>
> Changelog since v1:
> - no more charging fsnotify_mark_connector objects
> - Fixed the build for SLOB
>
>  fs/notify/dnotify/dnotify.c  |  5 ++--
>  fs/notify/fanotify/fanotify.c| 17 ++--
>  fs/notify/fanotify/fanotify_user.c   |  5 +++-
>  fs/notify/group.c|  4 +++
>  fs/notify/inotify/inotify_fsnotify.c | 15 +-
>  fs/notify/inotify/inotify_user.c |  5 +++-
>  include/linux/fsnotify_backend.h | 12 +---
>  include/linux/memcontrol.h   |  7 +
>  include/linux/sched.h|  3 ++
>  include/linux/sched/mm.h | 41 
>  kernel/fork.c|  3 ++
>  mm/memcontrol.c  | 38 +++---
>  12 files changed, 139 insertions(+), 16 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index e2bea2ac5dfb..a6365e6bc047 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
> long arg)
>
>  static int __init dnotify_init(void)
>  {
> -   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
> -   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
> +   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
> + SLAB_PANIC|SLAB_ACCOUNT);
> +   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, 
> SLAB_PANIC|SLAB_ACCOUNT);
>
> dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
> if (IS_ERR(dnotify_group))
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index f90842efea13..d6dfcf0ec21f 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "fanotify.h"
>
> @@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
>  struct inode *inode, u32 
> mask,
>  const struct path *path)
>  {
> -   struct fanotify_event_info *event;
> +   struct fanotify_event_info *event = NULL;
> gfp_t gfp = GFP_KERNEL;
> +   struct mem_cgroup *old_memcg = NULL;
>
> /*
>  * For queues with unlimited length lost events are not expected and
> @@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
> fsnotify_group *group,
> if (group->max_events == UINT_MAX)
> gfp |= __GFP_NOFAIL;
>
> +   /* Whoever is interested in the event, pays for the allocation. */
> +   if (group->memcg) {
> +   

[PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-25 Thread Shakeel Butt
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging scope API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as they
are small compared to the notification mark or events and it is unclear
whom to account connector to since it is shared by all events attached to
the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

Signed-off-by: Shakeel Butt 
Cc: Michal Hocko 
Cc: Jan Kara 
Cc: Amir Goldstein 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Roman Gushchin 
---
Changelog since v6:
- Removed Jan's ACK as the code has changed a lot
- Squashed the separate remote charging API path into this one
- Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
- Changed fsnotify remote charging code to use scope API

Changelog since v5:
- None

Changelog since v4:
- Fixed the build for CONFIG_MEMCG=n

Changelog since v3:
- Rebased over Jan's patches.
- Some cleanup based on Amir's comments.

Changelog since v2:
- None

Changelog since v1:
- no more charging fsnotify_mark_connector objects
- Fixed the build for SLOB

 fs/notify/dnotify/dnotify.c  |  5 ++--
 fs/notify/fanotify/fanotify.c| 17 ++--
 fs/notify/fanotify/fanotify_user.c   |  5 +++-
 fs/notify/group.c|  4 +++
 fs/notify/inotify/inotify_fsnotify.c | 15 +-
 fs/notify/inotify/inotify_user.c |  5 +++-
 include/linux/fsnotify_backend.h | 12 +---
 include/linux/memcontrol.h   |  7 +
 include/linux/sched.h|  3 ++
 include/linux/sched/mm.h | 41 
 kernel/fork.c|  3 ++
 mm/memcontrol.c  | 38 +++---
 12 files changed, 139 insertions(+), 16 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..a6365e6bc047 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
long arg)
 
 static int __init dnotify_init(void)
 {
-   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
+ SLAB_PANIC|SLAB_ACCOUNT);
+   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..d6dfcf0ec21f 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "fanotify.h"
 
@@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
 struct inode *inode, u32 mask,
 const struct path *path)
 {
-   struct fanotify_event_info *event;
+   struct fanotify_event_info *event = NULL;
gfp_t gfp = GFP_KERNEL;
+   struct mem_cgroup *old_memcg = NULL;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
if (group->max_events == UINT_MAX)
gfp |= __GFP_NOFAIL;
 
+   /* Whoever is interested in the event, pays for the allocation. */
+   if (group->memcg) {
+   gfp |= __GFP_ACCOUNT;
+   old_memcg = memalloc_use_memcg(group->memcg);
+   }
+
if (fanotify_is_perm_event(mask)) {
struct fanotify_perm_event_info *pevent;
 
pevent = kmem_cache_alloc(fanotify_perm_event_cachep, 

[PATCH 1/2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-06-25 Thread Shakeel Butt
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging scope API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as they
are small compared to the notification mark or events and it is unclear
whom to account connector to since it is shared by all events attached to
the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

Signed-off-by: Shakeel Butt 
Cc: Michal Hocko 
Cc: Jan Kara 
Cc: Amir Goldstein 
Cc: Greg Thelen 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Roman Gushchin 
---
Changelog since v6:
- Removed Jan's ACK as the code has changed a lot
- Squashed the separate remote charging API path into this one
- Removed kmalloc* & kmem_cache_alloc* APIs and only kept the scope API
- Changed fsnotify remote charging code to use scope API

Changelog since v5:
- None

Changelog since v4:
- Fixed the build for CONFIG_MEMCG=n

Changelog since v3:
- Rebased over Jan's patches.
- Some cleanup based on Amir's comments.

Changelog since v2:
- None

Changelog since v1:
- no more charging fsnotify_mark_connector objects
- Fixed the build for SLOB

 fs/notify/dnotify/dnotify.c  |  5 ++--
 fs/notify/fanotify/fanotify.c| 17 ++--
 fs/notify/fanotify/fanotify_user.c   |  5 +++-
 fs/notify/group.c|  4 +++
 fs/notify/inotify/inotify_fsnotify.c | 15 +-
 fs/notify/inotify/inotify_user.c |  5 +++-
 include/linux/fsnotify_backend.h | 12 +---
 include/linux/memcontrol.h   |  7 +
 include/linux/sched.h|  3 ++
 include/linux/sched/mm.h | 41 
 kernel/fork.c|  3 ++
 mm/memcontrol.c  | 38 +++---
 12 files changed, 139 insertions(+), 16 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..a6365e6bc047 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -384,8 +384,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned 
long arg)
 
 static int __init dnotify_init(void)
 {
-   dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+   dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
+ SLAB_PANIC|SLAB_ACCOUNT);
+   dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
dnotify_group = fsnotify_alloc_group(_fsnotify_ops);
if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..d6dfcf0ec21f 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "fanotify.h"
 
@@ -140,8 +141,9 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
 struct inode *inode, u32 mask,
 const struct path *path)
 {
-   struct fanotify_event_info *event;
+   struct fanotify_event_info *event = NULL;
gfp_t gfp = GFP_KERNEL;
+   struct mem_cgroup *old_memcg = NULL;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -151,19 +153,25 @@ struct fanotify_event_info *fanotify_alloc_event(struct 
fsnotify_group *group,
if (group->max_events == UINT_MAX)
gfp |= __GFP_NOFAIL;
 
+   /* Whoever is interested in the event, pays for the allocation. */
+   if (group->memcg) {
+   gfp |= __GFP_ACCOUNT;
+   old_memcg = memalloc_use_memcg(group->memcg);
+   }
+
if (fanotify_is_perm_event(mask)) {
struct fanotify_perm_event_info *pevent;
 
pevent = kmem_cache_alloc(fanotify_perm_event_cachep,