Re: [PATCH 2/3] cgroup: duplicate cgroup reference when cloning sockets

2016-09-19 Thread Vladimir Davydov
On Wed, Sep 14, 2016 at 03:48:45PM -0400, Johannes Weiner wrote:
> From: Johannes Weiner 
> 
> When a socket is cloned, the associated sock_cgroup_data is duplicated
> but not its reference on the cgroup. As a result, the cgroup reference
> count will underflow when both sockets are destroyed later on.
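For illustration only (this is not the patch itself), the invariant being
restored here can be sketched in plain C: whenever a structure that holds a
counted reference is copied, the copy must take its own reference, otherwise
the two eventual releases underflow the counter. The struct and function
names below are made up for the example.

	#include <stdatomic.h>

	struct obj {
		atomic_int ref;			/* reference count */
	};

	struct holder {
		struct obj *obj;		/* holds one counted reference */
	};

	static void holder_clone(struct holder *dst, const struct holder *src)
	{
		*dst = *src;				/* duplicates the pointer... */
		atomic_fetch_add(&dst->obj->ref, 1);	/* ...so take a reference too */
	}

Dropping the atomic_fetch_add() is exactly the bug described above: two
holders, one reference, and an underflow when the second holder is released.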
> 
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Cc:  # 4.5+
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 1/3] mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting

2016-09-19 Thread Vladimir Davydov
On Wed, Sep 14, 2016 at 03:48:44PM -0400, Johannes Weiner wrote:
> From: Johannes Weiner 
> 
> During cgroup2 rollout into production, we started encountering css
> refcount underflows and css access crashes in the memory controller.
> Splitting the heavily shared css reference counter into logical users
> narrowed the imbalance down to the cgroup2 socket memory accounting.
> 
> The problem turns out to be the per-cpu charge cache. Cgroup1 had a
> separate socket counter, but the new cgroup2 socket accounting goes
> through the common charge path that uses a shared per-cpu cache for
> all memory that is being tracked. Those caches are safe against
> scheduling preemption, but not against interrupts - such as the newly
> added packet receive path. When cache draining is interrupted by
> network RX taking pages out of the cache, the resuming drain operation
> will put references of in-use pages, thus causing the imbalance.
> 
> Disable IRQs during all per-cpu charge cache operations.
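For reference, a simplified sketch of what "disable IRQs during all per-cpu
charge cache operations" means in practice (not the literal memcontrol.c
hunk; the struct and field names follow the existing stock code):

	static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
	{
		struct memcg_stock_pcp *stock;
		unsigned long flags;
		bool ret = false;

		/*
		 * The cache used to be only preemption-safe; IRQs must be off
		 * so a packet-receive interrupt cannot race with a half-done
		 * drain or refill.
		 */
		local_irq_save(flags);
		stock = this_cpu_ptr(&memcg_stock);
		if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
			stock->nr_pages -= nr_pages;
			ret = true;
		}
		local_irq_restore(flags);

		return ret;
	}

The same local_irq_save()/local_irq_restore() bracketing applies to the
refill and drain paths.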
> 
> Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
> Cc:  # 4.5+
> Signed-off-by: Johannes Weiner 

Acked-by: Vladimir Davydov 


Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg

2016-08-24 Thread Vladimir Davydov
Hello,

On Tue, Aug 23, 2016 at 02:48:11PM +0100, Sudeep K N wrote:
> On Tue, May 24, 2016 at 5:36 PM, Vladimir Davydov
>  wrote:
> > On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> >> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> >> > Unix sockets can consume a significant amount of system memory, hence
> >> > they should be accounted to kmemcg.
> >> >
> >> > Since unix socket buffers are always allocated from process context,
> >> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> >> > sock->sk_allocation mask.
> >>
> >> I have two questions :
> >>
> >> 1) What happens when a buffer, allocated from one socket, lands in a
> >> different socket, maybe owned by another user/process?
> >>
> >> Who owns it now, in terms of kmemcg accounting?
> >
> > We never move memcg charges. E.g. if two processes from different
> > cgroups are sharing a memory region, each page will be charged to the
> > process which touched it first. Or if two processes are working with the
> > same directory tree, inodes and dentries will be charged to the first
> > user. The same holds for unix socket buffers - they will be charged to
> > the sender.
> >
> >>
> >> 2) Has performance impact been evaluated ?
> >
> > I ran netperf STREAM_STREAM with default options in a kmemcg on
> > a 4 core x 2 HT box. The results are below:
> >
> >  # clients    bandwidth (10^6 bits/sec)
> >                   base            patched
> >      1       67643 +-  725    64874 +-  353    -4.0 %
> >      4      193585 +- 2516   186715 +- 1460    -3.5 %
> >      8      194820 +-  377   187443 +- 1229    -3.7 %
> >
> > So the accounting doesn't come for free - it takes ~4% of performance.
> > I believe we could optimize it by using per cpu batching not only on
> > charge, but also on uncharge in memcg core, but that's beyond the scope
> > of this patch set - I'll take a look at this later.
> >
> > Anyway, if performance impact is found to be unacceptable, it is always
> > possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
> > or not use memory cgroups at runtime at all (thanks to jump labels
> > there'll be no overhead even if they are compiled in).
> >
> 
> I started seeing almost 10% degradation in the hackbench score with v4.8-rc1
> Bisecting it resulted in this patch, i.e. commit 3aa9799e1364 ("af_unix: charge buffers to kmemcg") in the mainline.
> 
> As per the commit log, it seems like that's expected but I was not sure about
> the margin. I also see the hackbench score is more inconsistent after this
> patch, but I may be wrong as that's based on limited observation.
> 
> Is this something we can ignore as hackbench is more synthetic compared
> to the gain this patch provides in some real workloads ?

AFAIU hackbench essentially measures the rate of sending data over a
unix socket back and forth between processes running on different cpus,
so it isn't a surprise that the patch resulted in a degradation, as it
makes every skb page allocation/deallocation increment/decrement an atomic
counter inside memcg. The more processes/CPUs in the same cgroup are
involved in this test, the more significant the overhead of this atomic
counter becomes.

The degradation is not unavoidable - it can be fixed by making kmem
charge/uncharge code use per-cpu batches. The infrastructure for this
already exists in memcontrol.c. If it were not for the legacy
mem_cgroup->kmem counter (which is actually useless and will be dropped
in cgroup v2), the issue would be pretty easy to fix. However, this
legacy counter makes a possible implementation quite messy, so I'd like
to postpone it until cgroup v2 has finally settled down.
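For illustration, a rough sketch of the batching idea (hypothetical names,
not actual memcontrol.c code; css reference counting of the cached memcg is
omitted for brevity): uncharges are accumulated per cpu and folded into the
shared page_counter only when the batch is big enough, so the hot path
touches no shared atomics.

	#define UNCHARGE_BATCH	32	/* hypothetical batch size */

	struct uncharge_stock {
		struct mem_cgroup *cached;
		unsigned int nr_pages;
	};
	static DEFINE_PER_CPU(struct uncharge_stock, uncharge_stock);

	static void uncharge_batched(struct mem_cgroup *memcg, unsigned int nr_pages)
	{
		struct uncharge_stock *stock;
		unsigned long flags;

		local_irq_save(flags);
		stock = this_cpu_ptr(&uncharge_stock);
		if (stock->cached != memcg && stock->cached) {
			/* flush a batch belonging to another cgroup */
			page_counter_uncharge(&stock->cached->memory, stock->nr_pages);
			stock->nr_pages = 0;
		}
		stock->cached = memcg;
		stock->nr_pages += nr_pages;
		if (stock->nr_pages >= UNCHARGE_BATCH) {
			page_counter_uncharge(&memcg->memory, stock->nr_pages);
			stock->nr_pages = 0;
		}
		local_irq_restore(flags);
	}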

Regarding your problem: as a workaround, you can either start your
workload in the root memory cgroup or disable kmem accounting for memory
cgroups altogether (via the cgroup.memory=nokmem boot option). If you find
the issue critical, I don't mind reverting the patch - we can always
re-apply it once per-cpu batches are implemented for kmem charges.

Thanks,
Vladimir


Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-27 Thread Vladimir Davydov
On Thu, May 26, 2016 at 07:15:49AM -0700, Eric Dumazet wrote:
> On Thu, 2016-05-26 at 16:59 +0300, Vladimir Davydov wrote:
> > On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> > > On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > > > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > > > ...
> > > > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > > > +  struct pipe_buffer *buf)
> > > > > > > > +{
> > > > > > > > +   struct page *page = buf->page;
> > > > > > > > +
> > > > > > > > +   if (page_count(page) == 1) {
> > > > > > > 
> > > > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > > > 
> > > > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > > > with refcount of 1 in ->steal, that means we are its only user and it
> > > > > > can't be spliced to another pipe.
> > > > > > 
> > > > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > > > kmemcg related checks along the way, so it should be fine.
> > > > > 
> > > > > So you guarantee that no other cpu might have done
> > > > > get_page_unless_zero() right before this test ?
> > > > 
> > > > Each pipe_buffer holds a reference to its page. If we find the page's
> > > > refcount to be 1 here, then it can be referenced only by our
> > > > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > > > because we hold pipe_lock, which rules out splice, and otherwise it's
> > > > impossible to reach the page as it is not on lru. That said, I think I
> > > > can guarantee that this is safe.
> > > 
> > > I don't know kmemcg internal and pipe stuff so my comment might be
> > > totally crap.
> > > 
> > > No one can guarantee that no CPU holds a reference to a page.
> > > Look at the get_page_unless_zero use cases.
> > > 
> > > 1. balloon_page_isolate
> > > 
> > > It can take a reference on a random page and then verify that the page
> > > is a balloon page. Otherwise, it just puts it.
> > > 
> > > 2. page_idle_get_page
> > > 
> > > It has a PageLRU check, but it's racy, so it can take a reference on a
> > > random page and then verify it within zone->lru_lock. If it's not an
> > > LRU page, it just puts it.
> > 
> > Well, I see your concern now - even if a page is not on lru and we
> > locked all structs pointing to it, it can always get accessed by pfn in
> > a completely unrelated thread, like in the examples you gave above.
> > That's a fair point.
> > 
> > However, I still think that it's OK in the case of pipe buffers. What can
> > happen if somebody takes a transient reference to a pipe buffer page? At
> > worst, we'll see page_count > 1 due to the temporary ref and abort
> > stealing, falling back on copying instead. That's OK, because stealing is
> > not guaranteed. Can a function that takes a transient ref to a page by
> > pfn mistakenly assume that this is a page it's interested in? I don't
> > think so, because this page has no marks on it except the special
> > _mapcount value, which should only be set on kmemcg pages.
> 
> Well, all this information deserves to be in the changelog.
> 
> Maybe in 6 months, this will be incredibly useful for bug hunting.
> 
> pipes can be used to exchange data (or pages) between processes in
> different domains.
> 
> If kmemcg is not precise, this could be used by some attackers to force
> some processes to consume all their budget and eventually not be able to
> allocate new pages.
> 

Here goes the patch with updated change log.
---
From: Vladimir Davydov 
Subject: [PATCH] pipe: account to kmemcg

Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This

Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-26 Thread Vladimir Davydov
On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > ...
> > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > +  struct pipe_buffer *buf)
> > > > > > +{
> > > > > > +   struct page *page = buf->page;
> > > > > > +
> > > > > > +   if (page_count(page) == 1) {
> > > > > 
> > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > 
> > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > with refcount of 1 in ->steal, that means we are its only user
> > > > and it can't be spliced to another pipe.
> > > > 
> > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > kmemcg related checks along the way, so it should be fine.
> > > 
> > > So you guarantee that no other cpu might have done
> > > get_page_unless_zero() right before this test ?
> > 
> > Each pipe_buffer holds a reference to its page. If we find the page's
> > refcount to be 1 here, then it can be referenced only by our
> > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > because we hold pipe_lock, which rules out splice, and otherwise it's
> > impossible to reach the page as it is not on lru. That said, I think I
> > can guarantee that this is safe.
> 
> I don't know kmemcg internal and pipe stuff so my comment might be
> totally crap.
> 
> No one can guarantee that no CPU holds a reference to a page.
> Look at the get_page_unless_zero use cases.
> 
> 1. balloon_page_isolate
> 
> It can take a reference on a random page and then verify that the page
> is a balloon page. Otherwise, it just puts it.
> 
> 2. page_idle_get_page
> 
> It has a PageLRU check, but it's racy, so it can take a reference on a
> random page and then verify it within zone->lru_lock. If it's not an
> LRU page, it just puts it.

Well, I see your concern now - even if a page is not on lru and we
locked all structs pointing to it, it can always get accessed by pfn in
a completely unrelated thread, like in the examples you gave above.
That's a fair point.

However, I still think that it's OK in the case of pipe buffers. What can
happen if somebody takes a transient reference to a pipe buffer page? At
worst, we'll see page_count > 1 due to the temporary ref and abort
stealing, falling back on copying instead. That's OK, because stealing is
not guaranteed. Can a function that takes a transient ref to a page by
pfn mistakenly assume that this is a page it's interested in? I don't
think so, because this page has no marks on it except the special
_mapcount value, which should only be set on kmemcg pages.

Thanks,
Vladimir


Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-25 Thread Vladimir Davydov
On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > ...
> > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > +  struct pipe_buffer *buf)
> > > > +{
> > > > +   struct page *page = buf->page;
> > > > +
> > > > +   if (page_count(page) == 1) {
> > > 
> > > This looks racy : some cpu could have temporarily elevated page count.
> > 
> > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > with refcount of 1 in ->steal, that means we are its only user
> > and it can't be spliced to another pipe.
> > 
> > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > kmemcg related checks along the way, so it should be fine.
> 
> So you guarantee that no other cpu might have done
> get_page_unless_zero() right before this test ?

Each pipe_buffer holds a reference to its page. If we find the page's
refcount to be 1 here, then it can be referenced only by our
pipe_buffer. And the refcount cannot be increased by a parallel thread,
because we hold pipe_lock, which rules out splice, and otherwise it's
impossible to reach the page as it is not on lru. That said, I think I
can guarantee that this is safe.

Thanks,
Vladimir


Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-24 Thread Vladimir Davydov
On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
...
> > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > +  struct pipe_buffer *buf)
> > +{
> > +   struct page *page = buf->page;
> > +
> > +   if (page_count(page) == 1) {
> 
> This looks racy : some cpu could have temporarily elevated page count.

All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
with refcount of 1 in ->steal, that means we are its only user
and it can't be spliced to another pipe.

In fact, I just copied the code from generic_pipe_buf_steal, adding
kmemcg related checks along the way, so it should be fine.

Thanks,
Vladimir

> 
> > +   if (memcg_kmem_enabled()) {
> > +   memcg_kmem_uncharge(page, 0);
> > +   __ClearPageKmemcg(page);
> > +   }
> > +   __SetPageLocked(page);
> > +   return 0;
> > +   }
> > +   return 1;
> > +}


Re: [PATCH RESEND 8/8] af_unix: charge buffers to kmemcg

2016-05-24 Thread Vladimir Davydov
On Tue, May 24, 2016 at 06:02:06AM -0700, Eric Dumazet wrote:
> On Tue, 2016-05-24 at 11:49 +0300, Vladimir Davydov wrote:
> > Unix sockets can consume a significant amount of system memory, hence
> > they should be accounted to kmemcg.
> > 
> > Since unix socket buffers are always allocated from process context,
> > all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> > sock->sk_allocation mask.
> 
> I have two questions : 
> 
> 1) What happens when a buffer, allocated from one socket, lands in a
> different socket, maybe owned by another user/process?
> 
> Who owns it now, in terms of kmemcg accounting?

We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with the
same directory tree, inodes and dentries will be charged to the first
user. The same holds for unix socket buffers - they will be charged to
the sender.

> 
> 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on
a 4 core x 2 HT box. The results are below:

 # clients    bandwidth (10^6 bits/sec)
                  base            patched
     1       67643 +-  725    64874 +-  353    -4.0 %
     4      193585 +- 2516   186715 +- 1460    -3.5 %
     8      194820 +-  377   187443 +- 1229    -3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).
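The jump-label point can be illustrated with a small sketch (generic names,
not the actual memcg key, which is memcg_kmem_enabled_key in mm/memcontrol.c):
while the static key is false, the branch in the hot path is patched into a
no-op, so compiled-in but unused accounting costs nothing.

	static DEFINE_STATIC_KEY_FALSE(my_accounting_key);	/* hypothetical key */

	static inline void hot_path(void)
	{
		/* compiles to a straight-line no-op until the key is enabled */
		if (static_branch_unlikely(&my_accounting_key))
			do_accounting();		/* hypothetical helper */
	}

	static void enable_accounting(void)	/* e.g. from an init or boot-option hook */
	{
		static_branch_enable(&my_accounting_key);
	}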

Thanks,
Vladimir


[PATCH RESEND 8/8] af_unix: charge buffers to kmemcg

2016-05-24 Thread Vladimir Davydov
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context,
all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.

Signed-off-by: Vladimir Davydov 
Cc: "David S. Miller" 
---
 net/unix/af_unix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80aa6a3e6817..022bdd3ab7d9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
lockdep_set_class(&sk->sk_receive_queue.lock,
&af_unix_sk_receive_queue_lock_key);
 
+   sk->sk_allocation   = GFP_KERNEL_ACCOUNT;
sk->sk_write_space  = unix_write_space;
sk->sk_max_ack_backlog  = net->unx.sysctl_max_dgram_qlen;
sk->sk_destruct = unix_sock_destructor;
-- 
2.1.4



[PATCH RESEND 6/8] arch: x86: charge page tables to kmemcg

2016-05-24 Thread Vladimir Davydov
Page tables can bite a relatively big chunk off system memory and their
allocations are easy to trigger from userspace, so they should be
accounted to kmemcg.

This patch marks page table allocations as __GFP_ACCOUNT for x86. Note
we must not charge allocations of kernel page tables, because they can
be shared among processes from different cgroups so accounting them to a
particular one can pin other cgroups for indefinitely long. So we clear
__GFP_ACCOUNT flag if a page table is allocated for the kernel.

Signed-off-by: Vladimir Davydov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
---
 arch/x86/include/asm/pgalloc.h | 12 ++--
 arch/x86/mm/pgtable.c  | 11 ---
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index bf7f8b55b0f9..2f531633cb16 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -81,7 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
struct page *page;
-   page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+   gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO;
+
+   if (mm == &init_mm)
+   gfp &= ~__GFP_ACCOUNT;
+   page = alloc_pages(gfp, 0);
if (!page)
return NULL;
if (!pgtable_pmd_page_ctor(page)) {
@@ -125,7 +129,11 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_REPEAT;
+
+   if (mm == &init_mm)
+   gfp &= ~__GFP_ACCOUNT;
+   return (pud_t *)get_zeroed_page(gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 4eb287e25043..421ac6b74d11 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -6,7 +6,8 @@
 #include 
 #include 
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
+#define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_REPEAT | \
+__GFP_ZERO)
 
 #ifdef CONFIG_HIGHPTE
 #define PGALLOC_USER_GFP __GFP_HIGHMEM
@@ -18,7 +19,7 @@ gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-   return (pte_t *)__get_free_page(PGALLOC_GFP);
+   return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -207,9 +208,13 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
 {
int i;
bool failed = false;
+   gfp_t gfp = PGALLOC_GFP;
+
+   if (mm == &init_mm)
+   gfp &= ~__GFP_ACCOUNT;
 
for(i = 0; i < PREALLOCATED_PMDS; i++) {
-   pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
if (!pmd)
failed = true;
if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
-- 
2.1.4



[PATCH RESEND 1/8] mm: remove pointless struct in struct page definition

2016-05-24 Thread Vladimir Davydov
... to reduce the indentation level, thus leaving more space for comments.

Signed-off-by: Vladimir Davydov 
---
 include/linux/mm_types.h | 68 +++-
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d553855503e6..3cc5977a9cab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -60,51 +60,47 @@ struct page {
};
 
/* Second double word */
-   struct {
-   union {
-   pgoff_t index;  /* Our offset within mapping. */
-   void *freelist; /* sl[aou]b first free object */
-   /* page_deferred_list().prev-- second tail page */
-   };
+   union {
+   pgoff_t index;  /* Our offset within mapping. */
+   void *freelist; /* sl[aou]b first free object */
+   /* page_deferred_list().prev-- second tail page */
+   };
 
-   union {
+   union {
 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
-   /* Used for cmpxchg_double in slub */
-   unsigned long counters;
+   /* Used for cmpxchg_double in slub */
+   unsigned long counters;
 #else
-   /*
-* Keep _refcount separate from slub cmpxchg_double
-* data.  As the rest of the double word is protected by
-* slab_lock but _refcount is not.
-*/
-   unsigned counters;
+   /*
+* Keep _refcount separate from slub cmpxchg_double data.
+* As the rest of the double word is protected by slab_lock
+* but _refcount is not.
+*/
+   unsigned counters;
 #endif
+   struct {
 
-   struct {
-
-   union {
-   /*
-* Count of ptes mapped in mms, to show
-* when page is mapped & limit reverse
-* map searches.
-*/
-   atomic_t _mapcount;
-
-   struct { /* SLUB */
-   unsigned inuse:16;
-   unsigned objects:15;
-   unsigned frozen:1;
-   };
-   int units;  /* SLOB */
-   };
+   union {
/*
-* Usage count, *USE WRAPPER FUNCTION*
-* when manual accounting. See page_ref.h
+* Count of ptes mapped in mms, to show when
+* page is mapped & limit reverse map searches.
 */
-   atomic_t _refcount;
+   atomic_t _mapcount;
+
+   unsigned int active;/* SLAB */
+   struct {/* SLUB */
+   unsigned inuse:16;
+   unsigned objects:15;
+   unsigned frozen:1;
+   };
+   int units;  /* SLOB */
};
-   unsigned int active;/* SLAB */
+   /*
+* Usage count, *USE WRAPPER FUNCTION* when manual
+* accounting. See page_ref.h
+*/
+   atomic_t _refcount;
};
};
 
-- 
2.1.4



Re: [PATCH 8/8] af_unix: charge buffers to kmemcg

2016-05-24 Thread Vladimir Davydov
[adding netdev to Cc]

On Mon, May 23, 2016 at 01:20:29PM +0300, Vladimir Davydov wrote:
> Unix sockets can consume a significant amount of system memory, hence
> they should be accounted to kmemcg.
> 
> Since unix socket buffers are always allocated from process context,
> all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
> sock->sk_allocation mask.
> 
> Signed-off-by: Vladimir Davydov 
> Cc: "David S. Miller" 
> ---
>  net/unix/af_unix.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 80aa6a3e6817..022bdd3ab7d9 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -769,6 +769,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
>   lockdep_set_class(&sk->sk_receive_queue.lock,
>   &af_unix_sk_receive_queue_lock_key);
>  
> + sk->sk_allocation   = GFP_KERNEL_ACCOUNT;
>   sk->sk_write_space  = unix_write_space;
>   sk->sk_max_ack_backlog  = net->unx.sysctl_max_dgram_qlen;
>   sk->sk_destruct = unix_sock_destructor;


[PATCH RESEND 2/8] mm: clean up non-standard page->_mapcount users

2016-05-24 Thread Vladimir Davydov
 - Add a proper comment to page->_mapcount.
 - Introduce a macro for generating helper functions.
 - Place all special page->_mapcount values next to each other so that
   readers can see all possible values and so we don't get duplicates.

Signed-off-by: Vladimir Davydov 
---
 include/linux/mm_types.h   |  5 
 include/linux/page-flags.h | 73 --
 scripts/tags.sh|  3 ++
 3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc5977a9cab..16bdef7943e3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -85,6 +85,11 @@ struct page {
/*
 * Count of ptes mapped in mms, to show when
 * page is mapped & limit reverse map searches.
+*
+* Extra information about page type may be
+* stored here for pages that are never mapped,
+* in which case the value MUST BE <= -2.
+* See page-flags.h for more details.
 */
atomic_t _mapcount;
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..9940ade6a25e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -593,54 +593,45 @@ TESTPAGEFLAG_FALSE(DoubleMap)
 #endif
 
 /*
- * PageBuddy() indicate that the page is free and in the buddy system
- * (see mm/page_alloc.c).
- *
- * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
- * -2 so that an underflow of the page_mapcount() won't be mistaken
- * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
- * efficiently by most CPU architectures.
+ * For pages that are never mapped to userspace, page->mapcount may be
+ * used for storing extra information about page type. Any value used
+ * for this purpose must be <= -2, but it's better start not too close
+ * to -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a special page.
  */
-#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
-
-static inline int PageBuddy(struct page *page)
-{
-   return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
+#define PAGE_MAPCOUNT_OPS(uname, lname)  \
+static __always_inline int Page##uname(struct page *page)  \
+{  \
+   return atomic_read(&page->_mapcount) == \
+   PAGE_##lname##_MAPCOUNT_VALUE;  \
+}  \
+static __always_inline void __SetPage##uname(struct page *page)  \
+{  \
+   VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);  \
+   atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE);\
+}  \
+static __always_inline void __ClearPage##uname(struct page *page)  \
+{  \
+   VM_BUG_ON_PAGE(!Page##uname(page), page);   \
+   atomic_set(&page->_mapcount, -1);   \
 }
 
-static inline void __SetPageBuddy(struct page *page)
-{
-   VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-   atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
-}
+/*
+ * PageBuddy() indicate that the page is free and in the buddy system
+ * (see mm/page_alloc.c).
+ */
+#define PAGE_BUDDY_MAPCOUNT_VALUE  (-128)
+PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
 
-static inline void __ClearPageBuddy(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageBuddy(page), page);
-   atomic_set(&page->_mapcount, -1);
-}
+/*
+ * PageBalloon() is set on pages that are on the balloon page list
+ * (see mm/balloon_compaction.c).
+ */
+#define PAGE_BALLOON_MAPCOUNT_VALUE(-256)
+PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
 
 extern bool is_free_buddy_page(struct page *page);
 
-#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)
-
-static inline int PageBalloon(struct page *page)
-{
-   return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
-}
-
-static inline void __SetPageBalloon(struct page *page)
-{
-   VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-   atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
-}
-
-static inline void __ClearPageBalloon(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageBalloon(page), page);
-   atomic_set(&page->_mapcount, -1);
-}
-
 /*
  * If 

[PATCH RESEND 0/8] More stuff to charge to kmemcg

2016-05-24 Thread Vladimir Davydov
[resending with all relevant lists in Cc]

Hi,

This patch implements per kmemcg accounting of page tables (x86-only),
pipe buffers, and unix socket buffers.

Basically, this is v2 of my earlier attempt [1], addressing comments by
Andrew, namely: lack of comments on the non-standard _mapcount usage,
extra overhead even when kmemcg is unused, and wrong handling of stolen
pipe buffer pages.

Patches 1-3 are just cleanups that are not supposed to introduce any
functional changes. Patches 4 and 5 move charge/uncharge to generic page
allocator paths for the sake of accounting pipe and unix socket buffers.
Patches 6-8 make x86 page tables, pipe buffers, and unix socket buffers
accountable.

[1] http://lkml.kernel.org/r/%3ccover.1443262808.git.vdavy...@parallels.com%3E

Thanks,

Vladimir Davydov (8):
  mm: remove pointless struct in struct page definition
  mm: clean up non-standard page->_mapcount users
  mm: memcontrol: cleanup kmem charge functions
  mm: charge/uncharge kmemcg from generic page allocator paths
  mm: memcontrol: teach uncharge_list to deal with kmem pages
  arch: x86: charge page tables to kmemcg
  pipe: account to kmemcg
  af_unix: charge buffers to kmemcg

 arch/x86/include/asm/pgalloc.h |  12 -
 arch/x86/mm/pgtable.c  |  11 ++--
 fs/pipe.c  |  32 ---
 include/linux/gfp.h|  10 +---
 include/linux/memcontrol.h | 103 +++-
 include/linux/mm_types.h   |  73 -
 include/linux/page-flags.h |  78 +--
 kernel/fork.c  |   6 +--
 mm/memcontrol.c| 117 -
 mm/page_alloc.c|  63 +-
 mm/slab.h  |  16 --
 mm/slab_common.c   |   2 +-
 mm/slub.c  |   6 +--
 mm/vmalloc.c   |   6 +--
 net/unix/af_unix.c |   1 +
 scripts/tags.sh|   3 ++
 16 files changed, 245 insertions(+), 294 deletions(-)

-- 
2.1.4



[PATCH RESEND 4/8] mm: charge/uncharge kmemcg from generic page allocator paths

2016-05-24 Thread Vladimir Davydov
Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.
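A minimal usage sketch of the resulting API (illustrative only): the caller
simply passes __GFP_ACCOUNT (or GFP_KERNEL_ACCOUNT) at allocation time, and
the uncharge happens automatically when the last reference is dropped.

	struct page *page;

	page = alloc_page(GFP_KERNEL_ACCOUNT);	/* GFP_KERNEL | __GFP_ACCOUNT */
	if (!page)
		return -ENOMEM;
	/* ... use the page, possibly handing references around ... */
	put_page(page);		/* final put uncharges the page from kmemcg */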

To make it possible, we need to mark pages charged to kmemcg somehow. To
avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to be
mapped to userspace, it should work just fine. There are other (ab)users
of page->_mapcount - buddy and balloon pages - but we don't conflict
with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.
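For orientation, a simplified sketch of the two hooks this adds (details
differ in the real page_alloc.c changes; the PageKmemcg mark is set when the
charge succeeds):

	/* allocation path, after a page has been obtained: */
	if (memcg_kmem_enabled() && page && (gfp_mask & __GFP_ACCOUNT) &&
	    unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
		__free_pages(page, order);
		page = NULL;		/* charge failed - fail the allocation */
	}

	/* free path, before the page goes back to the buddy allocator: */
	if (memcg_kmem_enabled() && PageKmemcg(page)) {
		memcg_kmem_uncharge(page, order);
		__ClearPageKmemcg(page);
	}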

Signed-off-by: Vladimir Davydov 
---
 include/linux/gfp.h| 10 +--
 include/linux/page-flags.h |  7 +
 kernel/fork.c  |  6 ++---
 mm/page_alloc.c| 66 +-
 mm/slab_common.c   |  2 +-
 mm/slub.c  |  6 ++---
 mm/vmalloc.c   |  6 ++---
 7 files changed, 31 insertions(+), 72 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..c29e9d347bc6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -78,8 +78,7 @@ struct vm_area_struct;
  * __GFP_THISNODE forces the allocation to be satisified from the requested
  *   node with no fallbacks or placement policy enforcements.
  *
- * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg (only relevant
- *   to kmem allocations).
+ * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE((__force gfp_t)___GFP_WRITE)
@@ -486,10 +485,6 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
-extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
-extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
- unsigned int order);
-
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
@@ -513,9 +508,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
   unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
 
-extern void __free_kmem_pages(struct page *page, unsigned int order);
-extern void free_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9940ade6a25e..b51e75a47e82 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -630,6 +630,13 @@ PAGE_MAPCOUNT_OPS(Buddy, BUDDY)
 #define PAGE_BALLOON_MAPCOUNT_VALUE(-256)
 PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
 
+/*
+ * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
+ * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
+ */
+#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512)
+PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG)
+
 extern bool is_free_buddy_page(struct page *page);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 66cc2e0e137e..3f3c30f80786 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -162,8 +162,8 @@ void __weak arch_release_thread_info(struct thread_info *ti)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
  int node)
 {
-   struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
- THREAD_SIZE_ORDER);
+   struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+THREAD_SIZE_ORDER);
 
if (page)
memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
@@ -178,7 +178,7 @@ static inline

[PATCH RESEND 5/8] mm: memcontrol: teach uncharge_list to deal with kmem pages

2016-05-24 Thread Vladimir Davydov
Page table pages are batched-freed in release_pages on most
architectures. If we want to charge them to kmemcg (this is what is done
later in this series), we need to teach mem_cgroup_uncharge_list to
handle kmem pages.

Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 42 --
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 482b4a0c97e4..89a421ee4713 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5432,15 +5432,18 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 
 static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
   unsigned long nr_anon, unsigned long nr_file,
-  unsigned long nr_huge, struct page *dummy_page)
+  unsigned long nr_huge, unsigned long nr_kmem,
+  struct page *dummy_page)
 {
-   unsigned long nr_pages = nr_anon + nr_file;
+   unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
unsigned long flags;
 
if (!mem_cgroup_is_root(memcg)) {
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
+   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
+   page_counter_uncharge(&memcg->kmem, nr_kmem);
memcg_oom_recover(memcg);
}
 
@@ -5463,6 +5466,7 @@ static void uncharge_list(struct list_head *page_list)
unsigned long nr_anon = 0;
unsigned long nr_file = 0;
unsigned long nr_huge = 0;
+   unsigned long nr_kmem = 0;
unsigned long pgpgout = 0;
struct list_head *next;
struct page *page;
@@ -5473,8 +5477,6 @@ static void uncharge_list(struct list_head *page_list)
 */
next = page_list->next;
do {
-   unsigned int nr_pages = 1;
-
page = list_entry(next, struct page, lru);
next = page->lru.next;
 
@@ -5493,31 +5495,35 @@ static void uncharge_list(struct list_head *page_list)
if (memcg != page->mem_cgroup) {
if (memcg) {
uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-  nr_huge, page);
-   pgpgout = nr_anon = nr_file = nr_huge = 0;
+  nr_huge, nr_kmem, page);
+   pgpgout = nr_anon = nr_file =
+   nr_huge = nr_kmem = 0;
}
memcg = page->mem_cgroup;
}
 
-   if (PageTransHuge(page)) {
-   nr_pages <<= compound_order(page);
-   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-   nr_huge += nr_pages;
-   }
+   if (!PageKmemcg(page)) {
+   unsigned int nr_pages = 1;
 
-   if (PageAnon(page))
-   nr_anon += nr_pages;
-   else
-   nr_file += nr_pages;
+   if (PageTransHuge(page)) {
+   nr_pages <<= compound_order(page);
+   VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+   nr_huge += nr_pages;
+   }
+   if (PageAnon(page))
+   nr_anon += nr_pages;
+   else
+   nr_file += nr_pages;
+   pgpgout++;
+   } else
+   nr_kmem += 1 << compound_order(page);
 
page->mem_cgroup = NULL;
-
-   pgpgout++;
} while (next != page_list);
 
if (memcg)
uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-  nr_huge, page);
+  nr_huge, nr_kmem, page);
 }
 
 /**
-- 
2.1.4



[PATCH RESEND 3/8] mm: memcontrol: cleanup kmem charge functions

2016-05-24 Thread Vladimir Davydov
 - Handle memcg_kmem_enabled check out to the caller. This reduces the
   number of function definitions making the code easier to follow. At
   the same time it doesn't result in code bloat, because all of these
   functions are used only in one or two places.
 - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
   have to dive deep into memcg implementation to see which allocations
   are charged and which are not.
 - Refresh comments.

Signed-off-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 103 +++--
 mm/memcontrol.c|  75 -
 mm/page_alloc.c|   9 ++--
 mm/slab.h  |  16 +--
 4 files changed, 80 insertions(+), 123 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..2d03975c7dc0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -754,6 +754,13 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 #endif
 
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+void memcg_kmem_put_cache(struct kmem_cache *cachep);
+int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
+   struct mem_cgroup *memcg);
+int memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
+void memcg_kmem_uncharge(struct page *page, int order);
+
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 extern struct static_key_false memcg_kmem_enabled_key;
 
@@ -775,22 +782,6 @@ static inline bool memcg_kmem_enabled(void)
 }
 
 /*
- * In general, we'll do everything in our power to not incur in any overhead
- * for non-memcg users for the kmem functions. Not even a function call, if we
- * can avoid it.
- *
- * Therefore, we'll inline all those functions so that in the best case, we'll
- * see that kmemcg is off for everybody and proceed quickly.  If it is on,
- * we'll still do most of the flag checking inline. We check a lot of
- * conditions, but because they are pretty simple, they are expected to be
- * fast.
- */
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
- struct mem_cgroup *memcg);
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
-void __memcg_kmem_uncharge(struct page *page, int order);
-
-/*
  * helper for accessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
  * will return -1 when this is not a kmem-limited memcg.
@@ -800,67 +791,6 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return memcg ? memcg->kmemcg_id : -1;
 }
 
-struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
-void __memcg_kmem_put_cache(struct kmem_cache *cachep);
-
-static inline bool __memcg_kmem_bypass(void)
-{
-   if (!memcg_kmem_enabled())
-   return true;
-   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-   return true;
-   return false;
-}
-
-/**
- * memcg_kmem_charge: charge a kmem page
- * @page: page to charge
- * @gfp: reclaim mode
- * @order: allocation order
- *
- * Returns 0 on success, an error code on failure.
- */
-static __always_inline int memcg_kmem_charge(struct page *page,
-gfp_t gfp, int order)
-{
-   if (__memcg_kmem_bypass())
-   return 0;
-   if (!(gfp & __GFP_ACCOUNT))
-   return 0;
-   return __memcg_kmem_charge(page, gfp, order);
-}
-
-/**
- * memcg_kmem_uncharge: uncharge a kmem page
- * @page: page to uncharge
- * @order: allocation order
- */
-static __always_inline void memcg_kmem_uncharge(struct page *page, int order)
-{
-   if (memcg_kmem_enabled())
-   __memcg_kmem_uncharge(page, order);
-}
-
-/**
- * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
- * @cachep: the original global kmem cache
- *
- * All memory allocated from a per-memcg cache is charged to the owner memcg.
- */
-static __always_inline struct kmem_cache *
-memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
-{
-   if (__memcg_kmem_bypass())
-   return cachep;
-   return __memcg_kmem_get_cache(cachep, gfp);
-}
-
-static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-   if (memcg_kmem_enabled())
-   __memcg_kmem_put_cache(cachep);
-}
-
 /**
  * memcg_kmem_update_page_stat - update kmem page state statistics
  * @page: the page
@@ -883,15 +813,6 @@ static inline bool memcg_kmem_enabled(void)
return false;
 }
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
-{
-   return 0;
-}
-
-static inline void memcg_kmem_uncharge(struct page *page, int order)
-{
-}
-
 static inline int memcg_cache_id(struct mem_cgro

[PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-24 Thread Vladimir Davydov
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.

Signed-off-by: Vladimir Davydov 
Cc: Alexander Viro 
---
 fs/pipe.c | 32 ++--
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 0d3f5165cb0b..4b32928f5426 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -137,6 +138,22 @@ static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
put_page(page);
 }
 
+static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
+  struct pipe_buffer *buf)
+{
+   struct page *page = buf->page;
+
+   if (page_count(page) == 1) {
+   if (memcg_kmem_enabled()) {
+   memcg_kmem_uncharge(page, 0);
+   __ClearPageKmemcg(page);
+   }
+   __SetPageLocked(page);
+   return 0;
+   }
+   return 1;
+}
+
 /**
  * generic_pipe_buf_steal - attempt to take ownership of a &pipe_buffer
  * @pipe:  the pipe that the buffer belongs to
@@ -219,7 +236,7 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
.can_merge = 1,
.confirm = generic_pipe_buf_confirm,
.release = anon_pipe_buf_release,
-   .steal = generic_pipe_buf_steal,
+   .steal = anon_pipe_buf_steal,
.get = generic_pipe_buf_get,
 };
 
@@ -227,7 +244,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = {
.can_merge = 0,
.confirm = generic_pipe_buf_confirm,
.release = anon_pipe_buf_release,
-   .steal = generic_pipe_buf_steal,
+   .steal = anon_pipe_buf_steal,
.get = generic_pipe_buf_get,
 };
 
@@ -405,7 +422,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
int copied;
 
if (!page) {
-   page = alloc_page(GFP_HIGHUSER);
+   page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
if (unlikely(!page)) {
ret = ret ? : -ENOMEM;
break;
@@ -611,7 +628,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 {
struct pipe_inode_info *pipe;
 
-   pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+   pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
if (pipe) {
unsigned long pipe_bufs = PIPE_DEF_BUFFERS;
struct user_struct *user = get_current_user();
@@ -619,7 +636,9 @@ struct pipe_inode_info *alloc_pipe_info(void)
if (!too_many_pipe_buffers_hard(user)) {
if (too_many_pipe_buffers_soft(user))
pipe_bufs = 1;
-   pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * pipe_bufs, GFP_KERNEL);
+   pipe->bufs = kcalloc(pipe_bufs,
+sizeof(struct pipe_buffer),
+GFP_KERNEL_ACCOUNT);
}
 
if (pipe->bufs) {
@@ -1010,7 +1029,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long nr_pages)
if (nr_pages < pipe->nrbufs)
return -EBUSY;
 
-   bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN);
+   bufs = kcalloc(nr_pages, sizeof(*bufs),
+  GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (unlikely(!bufs))
return -ENOMEM;
 
-- 
2.1.4



Re: [PATCH] mm: memcontrol: MEMCG no longer works with SLOB

2015-12-10 Thread Vladimir Davydov
On Wed, Dec 09, 2015 at 03:01:07PM -0500, Johannes Weiner wrote:
> On Wed, Dec 09, 2015 at 05:32:39PM +0100, Arnd Bergmann wrote:
> > The change to move the kmem accounting into the normal memcg
> > code means we can no longer use memcg with slob, which lacks
> > the memcg_params member in its struct kmem_cache:
> > 
> > ../mm/slab.h: In function 'is_root_cache':
> > ../mm/slab.h:187:10: error: 'struct kmem_cache' has no member named 'memcg_params'

Argh, I completely forgot about this SLOB thing :-(

> > 
> > This enforces the new dependency in Kconfig. Alternatively,
> > we could change the slob code to allow using MEMCG.
> 
> I'm curious, was this a random config or do you actually use
> CONFIG_SLOB && CONFIG_MEMCG?
> 
> Excluding CONFIG_MEMCG completely for slob seems harsh, but I would
> prefer not littering the source with
> 
> #if defined(CONFIG_MEMCG) && (defined(CONFIG_SLAB) || defined(CONFIG_SLUB))
> 
> or
> 
> #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> 
> for such a special case. The #ifdefs are already out of hand in there.
> 
> Vladimir, what would you think of simply doing this?
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 5adec08..0b3ec4b 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -25,6 +25,9 @@ struct kmem_cache {
>   int refcount;   /* Use counter */
>   void (*ctor)(void *);   /* Called on object slot creation */
>   struct list_head list;  /* List of all slab caches on the system */
> +#ifdef CONFIG_MEMCG
> + struct memcg_cache_params memcg_params;
> +#endif
>  };
>  
>  #endif /* CONFIG_SLOB */

I don't like it. This would result in allocation of per memcg arrays for
each list_lru/kmem_cache, which would never be used. This looks
extremely ugly. I'd prefer to make CONFIG_MEMCG depend on SL[AU]B, but
I'm afraid such a change will be frowned upon - who knows who uses
MEMCG & SLOB?

I guess SLOB could be made memcg-aware, but I don't think it's worth the
trouble, although I can take a look in this direction - from a quick
glance at SLOB it shouldn't be difficult. If we decide to go this way, I
think we could use this patch as a temporary fix, which would be
reverted eventually.

Otherwise, no matter how tempting the idea to put all memcg stuff under
CONFIG_MEMCG is, I think it won't fly, so for now we should use ifdefs.
To avoid complex checks, we could define a macro in memcontrol.h, say
MEMCG_KMEM_ENABLED, and use it throughout the code. And I think we
should wrap the list_lru stuff in it as well :-/
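One possible shape of such a helper (hypothetical macro name, shown only to
make the suggestion concrete; the extern below is the existing declaration
from memcontrol.h):

	#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
	#define MEMCG_KMEM_ENABLED 1
	#else
	#define MEMCG_KMEM_ENABLED 0
	#endif

	/* then, instead of repeating the compound condition everywhere: */
	#if MEMCG_KMEM_ENABLED
	extern struct static_key_false memcg_kmem_enabled_key;
	/* ... rest of the kmem accounting declarations ... */
	#endif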

Thanks,
Vladimir


Re: [PATCH 12/13] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-30 Thread Vladimir Davydov
On Mon, Nov 30, 2015 at 10:26:38AM -0500, Johannes Weiner wrote:
> On Mon, Nov 30, 2015 at 01:54:21PM +0300, Vladimir Davydov wrote:
> > On Tue, Nov 24, 2015 at 04:58:44PM -0500, Johannes Weiner wrote:
> > ...
> > > @@ -5520,15 +5557,30 @@ void sock_release_memcg(struct sock *sk)
> > >   */
> > >  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> > >  {
> > > - struct page_counter *counter;
> > > + gfp_t gfp_mask = GFP_KERNEL;
> > >  
> > > - if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > - nr_pages, &counter)) {
> > > - memcg->tcp_mem.memory_pressure = 0;
> > > - return true;
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > > + struct page_counter *counter;
> > > +
> > > + if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > + nr_pages, &counter)) {
> > > + memcg->tcp_mem.memory_pressure = 0;
> > > + return true;
> > > + }
> > > + page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > + memcg->tcp_mem.memory_pressure = 1;
> > > + return false;
> > >   }
> > > - page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > - memcg->tcp_mem.memory_pressure = 1;
> > > +#endif
> > > + /* Don't block in the packet receive path */
> > > + if (in_softirq())
> > > + gfp_mask = GFP_NOWAIT;
> > > +
> > > + if (try_charge(memcg, gfp_mask, nr_pages) == 0)
> > > + return true;
> > > +
> > > + try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > 
> > We won't trigger high reclaim if we get here, because try_charge does
> > not check high threshold if failing or forcing charge. I think this
> > should be fixed regardless of this patch. The fix is attached below.
> 
> We kind of assume that max is either set above high, or not at
> all. That means when max is hit the high limit has already failed and
> it's of limited use to schedule background reclaim.

Yeah, you're right. No point scheduling the work here - it must be
already running.

> 
> > Also, I don't like calling try_charge twice: the second time will go
> > through all the try_charge steps for nothing. What about checking
> > page_counter value after calling try_charge instead:
> > 
> > try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
> > return page_counter_read(&memcg->memory) <= memcg->memory.limit;
> > 
> > or adding an out parameter to try_charge that would inform us if charge
> > was forced?
> 
> That's a complete cold path where we are going to drop the packet in
> all but a few cases. It's not worth the trouble.

Right

> 
> > > @@ -5539,10 +5591,32 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> > >   */
> > >  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> > >  {
> > > - page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > > + page_counter_uncharge(&memcg->tcp_mem.memory_allocated,
> > > +   nr_pages);
> > > + return;
> > > + }
> > > +#endif
> > > + page_counter_uncharge(&memcg->memory, nr_pages);
> > > + css_put_many(&memcg->css, nr_pages);
> > 
> > cancel_charge(memcg, nr_pages);
> 
> It does the same, but it's a weird name for regular uncharging.

Right

> 
> > From: Vladimir Davydov 
> > Subject: [PATCH] memcg: check high threshold if forcing allocation
> > 
> > try_charge() does not result in checking high threshold if it forces
> > charge. This is incorrect, because we could have failed to reclaim
> > memory due to the current context, so we do need to check high threshold
> > and try to compensate for the excess once we are in the safe context.
> > 
> > Signed-off-by: Vladimir Davydov 
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 79a29d564bff..e922965b572b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2112,13 +2112,14 @@ static int tr

Re: [PATCH 13/13] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-30 Thread Vladimir Davydov
On Mon, Nov 30, 2015 at 10:58:38AM -0500, Johannes Weiner wrote:
> On Mon, Nov 30, 2015 at 02:36:28PM +0300, Vladimir Davydov wrote:
> > Suppose we have the following cgroup configuration.
> > 
> > A __ B
> >   \_ C
> > 
> > A is empty (which is natural for the unified hierarchy AFAIU). B has
> > some workload running in it, and C generates socket pressure. Due to the
> > socket pressure coming from C we start reclaim in A, which results in
> > thrashing of B, but we might not put sockets under pressure in A or C,
> > because vmpressure does not account pages scanned/reclaimed in B when
> > generating a vmpressure event for A or C. This might result in
> > aggressive reclaim and thrashing in B w/o generating a signal for C to
> > stop growing socket buffers.
> > 
> > Do you think such a situation is possible? If so, would it make sense to
> > switch to post-order walk in shrink_zone and pass sub-tree
> > scanned/reclaimed stats to vmpressure for each scanned memcg?
> 
> In that case the LRU pages in C would experience pressure as well,
> which would then reign in the sockets in C. There must be some LRU
> pages in there, otherwise who is creating socket pressure?
> 
> The same applies to shrinkers. All secondary reclaim is driven by LRU
> reclaim results.
> 
> I can see that there is some unfairness in distributing memcg reclaim
> pressure purely based on LRU size, because there are scenarios where
> the auxiliary objects (incl. sockets, but mostly shrinker pools)
> amount to a significant portion of the group's memory footprint. But
> substitute group for NUMA node and we've had this behavior for
> years. I'm not sure it's actually a problem in practice.
> 

Fair enough. Let's wait until we hit this problem in the real world, then.

The patch looks good to me.

Reviewed-by: Vladimir Davydov 

Thanks,
Vladimir


Re: [PATCH 13/13] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-30 Thread Vladimir Davydov
On Tue, Nov 24, 2015 at 04:59:40PM -0500, Johannes Weiner wrote:
...
> @@ -2396,6 +2396,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>   memcg = mem_cgroup_iter(root, NULL, &reclaim);
>   do {
>   unsigned long lru_pages;
> + unsigned long reclaimed;
>   unsigned long scanned;
>   struct lruvec *lruvec;
>   int swappiness;
> @@ -2408,6 +2409,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>   lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>   swappiness = mem_cgroup_swappiness(memcg);
> + reclaimed = sc->nr_reclaimed;
>   scanned = sc->nr_scanned;
>  
>   shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
> @@ -2418,6 +2420,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>   memcg, sc->nr_scanned - scanned,
>   lru_pages);
>  
> + /* Record the group's reclaim efficiency */
> + vmpressure(sc->gfp_mask, memcg, false,
> +sc->nr_scanned - scanned,
> +sc->nr_reclaimed - reclaimed);
> +

Suppose we have the following cgroup configuration.

A __ B
  \_ C

A is empty (which is natural for the unified hierarchy AFAIU). B has
some workload running in it, and C generates socket pressure. Due to the
socket pressure coming from C we start reclaim in A, which results in
thrashing of B, but we might not put sockets under pressure in A or C,
because vmpressure does not account pages scanned/reclaimed in B when
generating a vmpressure event for A or C. This might result in
aggressive reclaim and thrashing in B w/o generating a signal for C to
stop growing socket buffers.

Do you think such a situation is possible? If so, would it make sense to
switch to post-order walk in shrink_zone and pass sub-tree
scanned/reclaimed stats to vmpressure for each scanned memcg?
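
Roughly, I mean something like this inside the shrink_zone() loop - a
completely untested sketch that simply pushes each memcg's deltas up to
its ancestors instead of doing a real post-order accumulation (the
double-counting with the target-level vmpressure call further down would
still need to be sorted out):

		struct mem_cgroup *pos;

		/* make each memcg's vmpressure see its whole subtree */
		for (pos = memcg; pos; pos = parent_mem_cgroup(pos))
			vmpressure(sc->gfp_mask, pos, false,
				   sc->nr_scanned - scanned,
				   sc->nr_reclaimed - reclaimed);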

Thanks,
Vladimir

>   /*
>* Direct reclaim and kswapd have to scan all memory
>* cgroups to fulfill the overall scan target for the
> @@ -2449,7 +2456,8 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   reclaim_state->reclaimed_slab = 0;
>   }
>  
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> + /* Record the subtree's reclaim efficiency */
> + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
>  sc->nr_scanned - nr_scanned,
>  sc->nr_reclaimed - nr_reclaimed);
>  


Re: [PATCH 12/13] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-30 Thread Vladimir Davydov
On Tue, Nov 24, 2015 at 04:58:44PM -0500, Johannes Weiner wrote:
...
> @@ -5520,15 +5557,30 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> - struct page_counter *counter;
> + gfp_t gfp_mask = GFP_KERNEL;
>  
> - if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> - nr_pages, &counter)) {
> - memcg->tcp_mem.memory_pressure = 0;
> - return true;
> +#ifdef CONFIG_MEMCG_KMEM
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + struct page_counter *counter;
> +
> + if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> + nr_pages, &counter)) {
> + memcg->tcp_mem.memory_pressure = 0;
> + return true;
> + }
> + page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> + memcg->tcp_mem.memory_pressure = 1;
> + return false;
>   }
> - page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> - memcg->tcp_mem.memory_pressure = 1;
> +#endif
> + /* Don't block in the packet receive path */
> + if (in_softirq())
> + gfp_mask = GFP_NOWAIT;
> +
> + if (try_charge(memcg, gfp_mask, nr_pages) == 0)
> + return true;
> +
> + try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);

We won't trigger high reclaim if we get here, because try_charge does
not check high threshold if failing or forcing charge. I think this
should be fixed regardless of this patch. The fix is attached below.

Also, I don't like calling try_charge twice: the second time will go
through all the try_charge steps for nothing. What about checking
page_counter value after calling try_charge instead:

try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
return page_counter_read(&memcg->memory) <= memcg->memory.limit;

or adding an out parameter to try_charge that would inform us if charge
was forced?
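
With the out parameter, the double call would go away entirely. An
untested sketch of what I mean - the extra 'forced' argument is
hypothetical, of course, and the legacy tcp_mem branch is omitted:

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
		      unsigned int nr_pages, bool *forced);

bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	gfp_t gfp_mask = GFP_KERNEL;
	bool forced = false;

	/* Don't block in the packet receive path */
	if (in_softirq())
		gfp_mask = GFP_NOWAIT;

	/*
	 * Cannot fail with __GFP_NOFAIL; 'forced' would report whether
	 * the charge had to be forced over the limit.
	 */
	try_charge(memcg, gfp_mask | __GFP_NOFAIL, nr_pages, &forced);
	return !forced;
}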

>   return false;
>  }
>  
> @@ -5539,10 +5591,32 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup 
> *memcg, unsigned int nr_pages)
>   */
>  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int 
> nr_pages)
>  {
> - page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
> +#ifdef CONFIG_MEMCG_KMEM
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + page_counter_uncharge(&memcg->tcp_mem.memory_allocated,
> +       nr_pages);
> + return;
> + }
> +#endif
> + page_counter_uncharge(&memcg->memory, nr_pages);
> + css_put_many(&memcg->css, nr_pages);

cancel_charge(memcg, nr_pages);

?

---
From: Vladimir Davydov 
Subject: [PATCH] memcg: check high threshold if forcing allocation

try_charge() does not result in checking high threshold if it forces
charge. This is incorrect, because we could have failed to reclaim
memory due to the current context, so we do need to check high threshold
and try to compensate for the excess once we are in the safe context.

Signed-off-by: Vladimir Davydov 

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79a29d564bff..e922965b572b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2112,13 +2112,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
page_counter_charge(&memcg->memsw, nr_pages);
css_get_many(&memcg->css, nr_pages);
 
-   return 0;
+   goto check_high;
 
 done_restock:
css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
 
+check_high:
/*
 * If the hierarchy is above the normal consumption range, schedule
 * reclaim on returning to userland.  We can perform reclaim here


Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter

2015-11-24 Thread Vladimir Davydov
On Mon, Nov 23, 2015 at 01:20:37PM -0500, Johannes Weiner wrote:
> On Mon, Nov 23, 2015 at 12:36:46PM +0300, Vladimir Davydov wrote:
> > On Fri, Nov 20, 2015 at 01:56:48PM -0500, Johannes Weiner wrote:
> > > I actually had all this at first, but then wondered if it makes more
> > > sense to keep the legacy code in isolation. Don't you think it would
> > > be easier to keep track of what's v1 and what's v2 if we keep the
> > > legacy stuff physically separate as much as possible? In particular I
> > > found that 'tcp_mem.' marker really useful while working on the code.
> > > 
> > > In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
> > > expect it to remain mostly unopened and unchanged in the future. But
> > > if we merge it into memcontrol.c, that code will likely be in the way
> > > and we'd have to make it explicit somehow that this is not actually
> > > part of the new memory controller anymore.
> > > 
> > > What do you think?
> > 
> > There isn't much code left in tcp_memcontrol.c, and not all of it is
> > legacy. We still want to call tcp_init_cgroup and tcp_destroy_cgroup
> > from memcontrol.c - in fact, it's the only call site, so I think we'd
> > better keep these functions there. Apart from init/destroy, there is
> > only stuff for handling legacy files, which is relatively small and
> > isolated. We can just put it along with memsw and kmem legacy files in
> > the end of memcontrol.c adding a comment that it's legacy. Personally,
> > I'd find the code easier to follow then, because currently the logic
> > behind the ACTIVE flag as well as memcg->tcp_mem init/use/destroy turns
> > out to be scattered between two files in different subsystems for no
> > apparent reason now, as it does not need tcp_prot any more. Besides,
> > this would allow us to accurately reuse the ACTIVE flag in init/destroy
> > for inc/dec static branch and probably in sock_update_memcg instead of
> > sprinkling cgroup_subsys_on_dfl all over the place, which would make the
> > code a bit cleaner IMO (in fact, that's why I proposed to drop ACTIVATED
> > bit and replace cg_proto->flags with ->active bool).
> 
> As far as I can see, all of tcp_memcontrol.c is legacy, including the
> init and destroy functions. We only call them to set up the legacy
> tcp_mem state and do legacy jump-label maintenance. Delete it all and
> the unified hierarchy controller would still work. So I don't really
> see the benefits of consolidating it, and more risk of convoluting.
> 
> That being said, if you care strongly about it and see opportunities
> to cut down code and make things more readable, please feel free to
> turn the flags -> bool patch into a followup series and I'll be happy
> to review it.

OK, I'll look into that.

Regarding this patch, I don't have any questions left,

Reviewed-by: Vladimir Davydov 

Thanks,
Vladimir


Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-23 Thread Vladimir Davydov
On Fri, Nov 20, 2015 at 02:25:06PM -0500, Johannes Weiner wrote:
> On Fri, Nov 20, 2015 at 04:10:33PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
> > ...
> > > @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
> > >   */
> > >  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int 
> > > nr_pages)
> > >  {
> > > + unsigned int batch = max(CHARGE_BATCH, nr_pages);
> > >   struct page_counter *counter;
> > > + bool force = false;
> > >  
> > > - if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > - nr_pages, &counter)) {
> > > - memcg->tcp_mem.memory_pressure = 0;
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > > + if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > + nr_pages, &counter)) {
> > > + memcg->tcp_mem.memory_pressure = 0;
> > > + return true;
> > > + }
> > > + page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > + memcg->tcp_mem.memory_pressure = 1;
> > > + return false;
> > > + }
> > > +#endif
> > > + if (consume_stock(memcg, nr_pages))
> > >   return true;
> > > +retry:
> > > + if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > > + goto done;
> > > +
> > > + if (batch > nr_pages) {
> > > + batch = nr_pages;
> > > + goto retry;
> > >   }
> > > - page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > - memcg->tcp_mem.memory_pressure = 1;
> > > - return false;
> > > +
> > > + page_counter_charge(&memcg->memory, batch);
> > > + force = true;
> > > +done:
> > 
> > > + css_get_many(&memcg->css, batch);
> > 
> > Is there any point to get css reference per each charged page? For kmem
> > it is absolutely necessary, because dangling slabs must block
> > destruction of memcg's kmem caches, which are destroyed on css_free. But
> > for sockets there's no such problem: memcg will be destroyed only after
> > all sockets are destroyed and therefore uncharged (since
> > sock_update_memcg pins css).
> 
> I'm afraid we have to when we want to share 'stock' with cache and
> anon pages, which hold individual references. drain_stock() always
> assumes one reference per cached page.

Missed that, you're right.

> 
> > > + if (batch > nr_pages)
> > > + refill_stock(memcg, batch - nr_pages);
> > > +
> > > + schedule_work(&memcg->socket_work);
> > 
> > I think it's suboptimal to schedule the work even if we are below the
> > high threshold.
> 
> Hm, it seemed unnecessary to duplicate the hierarchy check since this
> is in the batch-exhausted slowpath anyway.

Dunno, may be you're right.

I've another question regarding this socket_work: its reclaim target
always equals CHARGE_BATCH. Can't it result in a workload exceeding
memory.high in case there are a lot of allocations coming from different
cpus? In this case the work might not manage to complete before another
allocation happens. Maybe we should accumulate the number of pages to
be reclaimed by the work, as we do in try_charge?
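
I mean something along these lines (untested; socket_work_nr_pages is a
made-up atomic field and the work function body is only illustrative):

/* charge path: accumulate the target instead of relying on CHARGE_BATCH */
atomic_long_add(nr_pages, &memcg->socket_work_nr_pages);
schedule_work(&memcg->socket_work);

/* work function */
static void socket_work_func(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						socket_work);
	unsigned long nr_pages;

	nr_pages = atomic_long_xchg(&memcg->socket_work_nr_pages, 0);
	if (nr_pages)
		try_to_free_mem_cgroup_pages(memcg, nr_pages,
					     GFP_KERNEL, true);
}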

> 
> > BTW why do we need this work at all? Why is reclaim_high called from
> > task_work not enough?
> 
> The problem lies in the memcg association: the random task that gets
> interrupted by an arriving packet might not be in the same memcg as
> the one owning receiving socket. And multiple interrupts could happen
> while we're in the kernel already charging pages. We'd basically have
> to maintain a list of memcgs that need to run reclaim_high associated
> with current.
> 

Right, I think this is worth placing in a comment to memcg->socket_work.
I wonder if we could use it *instead* of task_work for handling every
allocation, not only socket-related. Would it make any sense? May be, it
could reduce the latency experienced by tasks in memory cgroups.

Thanks,
Vladimir


Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter

2015-11-23 Thread Vladimir Davydov
On Fri, Nov 20, 2015 at 01:56:48PM -0500, Johannes Weiner wrote:
> On Fri, Nov 20, 2015 at 03:42:16PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> > > There won't be any separate counters for socket memory consumed by
> > > protocols other than TCP in the future. Remove the indirection and
> > 
> > I really want to believe you're right. And with vmpressure propagation
> > implemented properly you are likely to be right.
> > 
> > However, we might still want to account other socket protos to
> > memcg->memory in the unified hierarchy, e.g. UDP, or SCTP, or whatever
> > else. Adding new consumers should be trivial, but it will break the
> > legacy usecase, where only TCP sockets are supposed to be accounted.
> > What about adding a check to sock_update_memcg() so that it would enable
> > accounting only for TCP sockets in case legacy hierarchy is used?
> 
> Yup, I was thinking the same thing. But we can cross that bridge when
> we come to it and are actually adding further packet types.

Fair enough.

> 
> > For the same reason, I think we'd better rename memcg->tcp_mem to
> > something like memcg->sk_mem or we can even drop the cg_proto struct
> > altogether embedding its fields directly to mem_cgroup struct.
> > 
> > Also, I don't see any reason to have tcp_memcontrol.c file. It's tiny
> > and with this patch it does not depend on tcp code any more. Let's move
> > it to memcontrol.c?
> 
> I actually had all this at first, but then wondered if it makes more
> sense to keep the legacy code in isolation. Don't you think it would
> be easier to keep track of what's v1 and what's v2 if we keep the
> legacy stuff physically separate as much as possible? In particular I
> found that 'tcp_mem.' marker really useful while working on the code.
> 
> In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
> expect it to remain mostly unopened and unchanged in the future. But
> if we merge it into memcontrol.c, that code will likely be in the way
> and we'd have to make it explicit somehow that this is not actually
> part of the new memory controller anymore.
> 
> What do you think?

There isn't much code left in tcp_memcontrol.c, and not all of it is
legacy. We still want to call tcp_init_cgroup and tcp_destroy_cgroup
from memcontrol.c - in fact, it's the only call site, so I think we'd
better keep these functions there. Apart from init/destroy, there is
only stuff for handling legacy files, which is relatively small and
isolated. We can just put it along with memsw and kmem legacy files in
the end of memcontrol.c adding a comment that it's legacy. Personally,
I'd find the code easier to follow then, because currently the logic
behind the ACTIVE flag as well as memcg->tcp_mem init/use/destroy turns
out to be scattered between two files in different subsystems for no
apparent reason now, as it does not need tcp_prot any more. Besides,
this would allow us to accurately reuse the ACTIVE flag in init/destroy
for inc/dec static branch and probably in sock_update_memcg instead of
sprinkling cgroup_subsys_on_dfl all over the place, which would make the
code a bit cleaner IMO (in fact, that's why I proposed to drop ACTIVATED
bit and replace cg_proto->flags with ->active bool).

Regarding the tcp_mem marker, well, currently it's OK, because we don't
account anything but TCP sockets, but when it changes (and I'm pretty
sure it will), we'll have to rename it anyway. For now, I'm OK with
leaving it as is though.

Thanks,
Vladimir


Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
...
> @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> + unsigned int batch = max(CHARGE_BATCH, nr_pages);
>   struct page_counter *counter;
> + bool force = false;
>  
> - if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> - nr_pages, &counter)) {
> - memcg->tcp_mem.memory_pressure = 0;
> +#ifdef CONFIG_MEMCG_KMEM
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> + nr_pages, &counter)) {
> + memcg->tcp_mem.memory_pressure = 0;
> + return true;
> + }
> + page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> + memcg->tcp_mem.memory_pressure = 1;
> + return false;
> + }
> +#endif
> + if (consume_stock(memcg, nr_pages))
>   return true;
> +retry:
> + if (page_counter_try_charge(&memcg->memory, batch, &counter))
> + goto done;
> +
> + if (batch > nr_pages) {
> + batch = nr_pages;
> + goto retry;
>   }
> - page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> - memcg->tcp_mem.memory_pressure = 1;
> - return false;
> +
> + page_counter_charge(&memcg->memory, batch);
> + force = true;
> +done:

> + css_get_many(&memcg->css, batch);

Is there any point to get css reference per each charged page? For kmem
it is absolutely necessary, because dangling slabs must block
destruction of memcg's kmem caches, which are destroyed on css_free. But
for sockets there's no such problem: memcg will be destroyed only after
all sockets are destroyed and therefore uncharged (since
sock_update_memcg pins css).

> + if (batch > nr_pages)
> + refill_stock(memcg, batch - nr_pages);
> +
> + schedule_work(&memcg->socket_work);

I think it's suboptimal to schedule the work even if we are below the
high threshold.
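
I mean something like this (only a sketch; the proper check would
probably have to walk the ancestors the way the regular high check does):

	if (page_counter_read(&memcg->memory) > memcg->high)
		schedule_work(&memcg->socket_work);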

BTW why do we need this work at all? Why is reclaim_high called from
task_work not enough?

Thanks,
Vladimir

> +
> + return !force;
>  }
>  
>  /**


Re: [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:31PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller will account socket
> memory. Move the infrastructure functions accordingly.
> 
> Signed-off-by: Johannes Weiner 
> Acked-by: Michal Hocko 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> There won't be any separate counters for socket memory consumed by
> protocols other than TCP in the future. Remove the indirection and

I really want to believe you're right. And with vmpressure propagation
implemented properly you are likely to be right.

However, we might still want to account other socket protos to
memcg->memory in the unified hierarchy, e.g. UDP, or SCTP, or whatever
else. Adding new consumers should be trivial, but it will break the
legacy usecase, where only TCP sockets are supposed to be accounted.
What about adding a check to sock_update_memcg() so that it would enable
accounting only for TCP sockets in case legacy hierarchy is used?

For the same reason, I think we'd better rename memcg->tcp_mem to
something like memcg->sk_mem or we can even drop the cg_proto struct
altogether embedding its fields directly to mem_cgroup struct.

Also, I don't see any reason to have tcp_memcontrol.c file. It's tiny
and with this patch it does not depend on tcp code any more. Let's move
it to memcontrol.c?

Other than that this patch looks OK to me.

Thanks,
Vladimir

> link sockets directly to their owning memory cgroup.
> 
> Signed-off-by: Johannes Weiner 


Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:27PM -0500, Johannes Weiner wrote:
> There won't be a tcp control soft limit, so integrating the memcg code
> into the global skmem limiting scheme complicates things
> unnecessarily. Replace this with simple and clear charge and uncharge
> calls--hidden behind a jump label--to account skb memory.
> 
> Note that this is not purely aesthetic: as a result of shoehorning the
> per-memcg code into the same memory accounting functions that handle
> the global level, the old code would compare the per-memcg consumption
> against the smaller of the per-memcg limit and the global limit. This
> allowed the total consumption of multiple sockets to exceed the global
> limit, as long as the individual sockets stayed within bounds. After
> this change, the code will always compare the per-memcg consumption to
> the per-memcg limit, and the global consumption to the global limit,
> and thus close this loophole.
> 
> Without a soft limit, the per-memcg memory pressure state in sockets
> is generally questionable. However, we did it until now, so we
> continue to enter it when the hard limit is hit, and packets are
> dropped, to let other sockets in the cgroup know that they shouldn't
> grow their transmit windows, either. However, keep it simple in the
> new callback model and leave memory pressure lazily when the next
> packet is accepted (as opposed to doing it synchronously when packets
> are processed). When packets are dropped, network performance will
> already be in the toilet, so that should be a reasonable trade-off.
> 
> As described above, consumption is now checked on the per-memcg level
> and the global level separately. Likewise, memory pressure states are
> maintained on both the per-memcg level and the global level, and a
> socket is considered under pressure when either level asserts as much.
> 
> Signed-off-by: Johannes Weiner 

It leaves the legacy functionality intact, while making the code look
much better.

Reviewed-by: Vladimir Davydov 


Re: [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:26PM -0500, Johannes Weiner wrote:
> tcp_memcontrol replicates the global sysctl_mem limit array per
> cgroup, but it only ever sets these entries to the value of the
> memory_allocated page_counter limit. Use the latter directly.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:25PM -0500, Johannes Weiner wrote:
> The number of allocated sockets is used for calculations in the soft
> limit phase, where packets are accepted but the socket is under memory
> pressure. Since there is no soft limit phase in tcp_memcontrol, and
> memory pressure is only entered when packets are already dropped, this
> is actually dead code. Remove it.

Actually, we can get into the soft limit phase due to the global limit
(tcp_memory_pressure is set), but then using the per-memcg sockets_allocated
counter is just wrong.

> 
> As this is the last user of parent_cg_proto(), remove that too.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation

2015-11-20 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:23PM -0500, Johannes Weiner wrote:
> When a cgroup currently breaches its socket memory limit, it enters
> memory pressure mode for itself and its *ancestors*. This throttles
> transmission in unrelated sibling and cousin subtrees that have
> nothing to do with the breached limit.
> 
> On the contrary, breaching a limit should make that group and its
> *children* enter memory pressure mode. But this happens already,
> albeit lazily: if an ancestor limit is breached, siblings will enter
> memory pressure on their own once the next packet arrives for them.

Hmm, we still call sk_prot->enter_memory_pressure, which might hurt a
workload in the root cgroup AFAICS. Strange. You fix it in patch 8
though.

> 
> So no additional hierarchy code is needed. Remove the bogus stuff.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-18 Thread Vladimir Davydov
On Tue, Nov 17, 2015 at 05:22:17PM -0500, Johannes Weiner wrote:
> On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> > AFAIK vmpressure was designed to allow userspace to tune hard limits of
> > cgroups in accordance with their demands, in which case the way how
> > vmpressure notifications work makes sense.
> 
> You can still do that when the reporting happens on the reclaim level,
> it's easy to figure out where the pressure comes from once a group is
> struggling to reclaim its LRU pages.

Right, one only needs to check usage-vs-limit in the cgroup that
received a notification and all its ancestors, but I doubt existing
applications do that.
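
Not that the check itself is hard - a trivial (untested) userspace sketch
for the legacy hierarchy, to be run for the notified cgroup directory and
then for each of its parents (the helpers are just illustrative):

#include <limits.h>
#include <stdio.h>

static long long memcg_read(const char *cgdir, const char *file)
{
	char path[PATH_MAX];
	long long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", cgdir, file);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

/* returns 1 if this level looks like the source of the pressure */
static int memcg_at_limit(const char *cgdir)
{
	long long usage = memcg_read(cgdir, "memory.usage_in_bytes");
	long long limit = memcg_read(cgdir, "memory.limit_in_bytes");

	return usage >= 0 && limit > 0 && usage >= limit - limit / 100;
}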

> 
> Reporting on the pressure level does nothing but destroy valuable
> information that would be useful in scenarios other than tuning a
> hierarchical memory limit.

Agree.

> 
> > > But you guys were wary about the patch that changed it, and this
> > 
> > Changing vmpressure semantics as you proposed in v1 would result in
> > userspace getting notifications even if cgroup does not hit its limit.
> > May be it could be useful to someone (e.g. it could help tuning
> > memory.low), but I am pretty sure this would also result in breakages
> > for others.
> 
> Maybe. I'll look into a two-layer vmpressure recording/reporting model
> that would give us reclaim-level events internally while retaining
> pressure-level events for the existing userspace interface.

It would be great. I think vmpressure, as you propose to implement it,
could be useful even in the unified hierarchy for tuning memory.low and
memory.high.

> 
> > > series has kicked up enough dust already, so I backed it out.
> > > 
> > > But this will still be useful. Yes, it won't help in rebalancing an
> > > regularly working system, which would be cool, but it'll still help
> > > contain a workload that is growing beyond expectations, which is the
> > > scenario that kickstarted this work.
> > 
> > I haven't looked through all the previous patches in the series, but
> > AFAIU they should do the trick, no? Notifying sockets about vmpressure
> > is rather needed to protect a workload from itself.
> 
> No, the only critical thing is to protect the system from OOM
> conditions caused by what should be containerized processes.
> 
> That's a correctness issue.
> 
> How much we mitigate the consequences inside the container when the
> workload screws up is secondary. But even that is already much better
> in this series compared to memcg v1, while leaving us with all the
> freedom to continue improving this internal mitigation in the future.
> 
> > And with this patch it will work this way, but only if sum limits <
> > total ram, which is rather rare in practice. On tightly packed
> > systems it does nothing.
> 
> That's not true, it's still useful when things go south inside a
> cgroup, even with overcommitted limits. See above.

I meant solely this patch here, not the rest of the patch set. In the
overcommitted case there is no difference if we have the last patch or
not AFAIU.

> 
> We can optimize the continuous global pressure rebalancing later on;
> whether that'll be based on a modified vmpressure implementation, or
> adding reclaim efficiency to the shrinker API or whatever.
> 
> > That said, I don't think we should commit this particular patch. Neither
> > do I think socket accounting should be enabled by default in the unified
> > hierarchy for now, since the implementation is still incomplete. IMHO.
> 
> I don't see a technical basis for either of those suggestions.
> 

IMHO users switching to the unified hierarchy don't expect that
something gets broken in the default setup unless it's a bug. They
expect API changes, new functionality appeared, some features dropped,
but not breakages.

With this patch set, one gets socket accounting enabled by default,
which would be OK if it always worked right, at least in theory. But it
does not if the node is overcommitted - one might get unexpected local
OOMs due to growing socket buffers, which have never been seen in the
legacy hierarchy.

You say that it will help coping with global OOM, which is true, but it
looks like trading an old problem for a new one, which is unacceptable in
this particular case IMHO, because the legacy hierarchy has been used
for years and people are likely to be used to old problems such as lack
of socket buffers accounting - they might already work around this
problem by tuning global tcp limits for instance. After switching to the
unified hierarchy they'll get a new problem in the default setup, which
is no good IMHO.

I'm not against enabling socket

Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-17 Thread Vladimir Davydov
On Mon, Nov 16, 2015 at 01:53:16PM -0500, Johannes Weiner wrote:
> On Sun, Nov 15, 2015 at 04:54:57PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> > > Let the networking stack know when a memcg is under reclaim pressure
> > > so that it can clamp its transmit windows accordingly.
> > > 
> > > Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> > > enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> > > pressure state in the socket and tcp memory code that tells it to curb
> > > consumption growth from sockets associated with said control group.
> > > 
> > > vmpressure events are naturally edge triggered, so for hysteresis
> > > assert socket pressure for a second to allow for subsequent vmpressure
> > > events to occur before letting the socket code return to normal.
> > 
> > AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
> > which means socket_pressure will only be set when cgroup hits its
> > high/hard limit. On tightly packed system, this is unlikely IMO -
> > cgroups will mostly experience pressure due to memory shortage at the
> > global level and/or their low limit configuration, in which case no
> > vmpressure events will be triggered and therefore tcp window won't be
> > clamped accordingly.
> 
> Yeah, this is an inherent problem in the vmpressure design and it
> makes the feature significantly less useful than it could be IMO.

AFAIK vmpressure was designed to allow userspace to tune hard limits of
cgroups in accordance with their demands, in which case the way how
vmpressure notifications work makes sense.

> 
> But you guys were wary about the patch that changed it, and this

Changing vmpressure semantics as you proposed in v1 would result in
userspace getting notifications even if cgroup does not hit its limit.
Maybe it could be useful to someone (e.g. it could help tuning
memory.low), but I am pretty sure this would also result in breakages
for others.

> series has kicked up enough dust already, so I backed it out.
> 
> But this will still be useful. Yes, it won't help in rebalancing an
> regularly working system, which would be cool, but it'll still help
> contain a workload that is growing beyond expectations, which is the
> scenario that kickstarted this work.

I haven't looked through all the previous patches in the series, but
AFAIU they should do the trick, no? Notifying sockets about vmpressure
is rather needed to protect a workload from itself. And with this patch
it will work this way, but only if sum limits < total ram, which is
rather rare in practice. On tightly packed systems it does nothing.

That said, I don't think we should commit this particular patch. Neither
do I think socket accounting should be enabled by default in the unified
hierarchy for now, since the implementation is still incomplete. IMHO.

Thanks,
Vladimir

> 
> > May be, we could use a per memcg slab shrinker to detect memory
> > pressure? This looks like abusing shrinkers API though.
> 
> Actually, I thought about doing this long-term.
> 
> Shrinkers are a nice way to export VM pressure to auxiliary allocators
> and caches. But currently, the only metric we export is LRU scan rate,
> whose application is limited to ageable caches: it doesn't make sense
> to cause auxiliary workingsets to shrink when the VM is merely picking
> up the drop-behind pages of a one-off page cache stream. I think it
> would make sense for shrinkers to include reclaim efficiency so that
> they can be used by caches that don't have 'accessed' bits and object
> rotation, but are able to shrink based on the cost they're imposing.
> 
> But a change like this is beyond the scope of this series, IMO.
> 


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-15 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> Let the networking stack know when a memcg is under reclaim pressure
> so that it can clamp its transmit windows accordingly.
> 
> Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> pressure state in the socket and tcp memory code that tells it to curb
> consumption growth from sockets associated with said control group.
> 
> vmpressure events are naturally edge triggered, so for hysteresis
> assert socket pressure for a second to allow for subsequent vmpressure
> events to occur before letting the socket code return to normal.

AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
which means socket_pressure will only be set when cgroup hits its
high/hard limit. On tightly packed system, this is unlikely IMO -
cgroups will mostly experience pressure due to memory shortage at the
global level and/or their low limit configuration, in which case no
vmpressure events will be triggered and therefore tcp window won't be
clamped accordingly.

Maybe we could use a per-memcg slab shrinker to detect memory
pressure? This looks like abusing shrinkers API though.
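
Just to illustrate what I mean by abusing the API - a completely untested
sketch, where mem_cgroup_enter_sk_pressure() is a made-up hook that would
do what vmpressure does in this patch:

static unsigned long sk_pressure_count(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	/*
	 * Pretend there is always something to scan so that
	 * ->scan_objects() runs whenever this memcg gets reclaimed.
	 */
	return SWAP_CLUSTER_MAX;
}

static unsigned long sk_pressure_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	mem_cgroup_enter_sk_pressure(sc->memcg);	/* made-up hook */
	return SHRINK_STOP;
}

static struct shrinker sk_pressure_shrinker = {
	.count_objects	= sk_pressure_count,
	.scan_objects	= sk_pressure_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE,
};

It would probably "work" after register_shrinker(), but the shrinker
would be lying about its object counts, which is exactly the abuse I
mean.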

Thanks,
Vladimir


Re: [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:24PM -0500, Johannes Weiner wrote:
> Move the jump-label from sock_update_memcg() and sock_release_memcg()
> to the callsite, and so eliminate those function calls when socket
> accounting is not enabled.

I don't believe this patch is necessary, because these functions aren't
hot paths. Neither do I think it makes the code look better. Anyway,
it's rather a matter of personal preference, and the patch looks correct
to me, so

Reviewed-by: Vladimir Davydov 

> 
> This also eliminates the need for dummy functions because the calls
> will be optimized away if the Kconfig options are not enabled.
> 
> Signed-off-by: Johannes Weiner 


Re: [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:29PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller is going to use this jump
> label as well to control the networking callbacks. Move it to the
> memory controller code and give it a more generic name.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:30PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller doesn't expose the memory+swap
> counter to userspace, but its accounting is hardcoded in all charge
> paths right now, including the per-cpu charge cache ("the stock").
> 
> To avoid adding yet more pointless memory+swap accounting with the
> socket memory support in unified hierarchy, disable the counter
> altogether when in unified hierarchy mode.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:22PM -0500, Johannes Weiner wrote:
> When charging socket memory, the code currently checks only the local
> page counter for excess to determine whether the memcg is under socket
> pressure. But even if the local counter is fine, one of the ancestors
> could have breached its limit, which should also force this child to
> enter socket pressure. This currently doesn't happen.
> 
> Fix this by using page_counter_try_charge() first. If that fails, it
> means that either the local counter or one of the ancestors are in
> excess of their limit, and the child should enter socket pressure.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 

For the record: it was broken by commit 3e32cb2e0a12 ("mm: memcontrol:
lockless page counters").


Re: [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:21PM -0500, Johannes Weiner wrote:
...
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a4507ec..e4f5b3c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   struct shrinker *shrinker;
>   unsigned long freed = 0;
>  
> + /* Global shrinker mode */
> + if (memcg == root_mem_cgroup)
> + memcg = NULL;
> +
>   if (memcg && !memcg_kmem_is_active(memcg))
>   return 0;
>  
> @@ -2410,11 +2414,22 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
>   zone_lru_pages += lru_pages;
>  
> - if (memcg && is_classzone)
> + /*
> +  * Shrink the slab caches in the same proportion that
> +  * the eligible LRU pages were scanned.
> +  */
> + if (is_classzone) {
>   shrink_slab(sc->gfp_mask, zone_to_nid(zone),
>   memcg, sc->nr_scanned - scanned,
>   lru_pages);
>  
> + if (reclaim_state) {
> + sc->nr_reclaimed +=
> + reclaim_state->reclaimed_slab;
> + reclaim_state->reclaimed_slab = 0;
> + }
> + }
> +
>   /*
>* Direct reclaim and kswapd have to scan all memory
>* cgroups to fulfill the overall scan target for the
> @@ -2432,20 +2447,6 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   }
>   } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> - /*
> -  * Shrink the slab caches in the same proportion that
> -  * the eligible LRU pages were scanned.
> -  */
> - if (global_reclaim(sc) && is_classzone)
> - shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
> - sc->nr_scanned - nr_scanned,
> - zone_lru_pages);
> -
> - if (reclaim_state) {
> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> - reclaim_state->reclaimed_slab = 0;
> - }
> -

AFAICS this patch badly breaks the memcg-unaware shrinkers vs LRU balance:
currently we scan (*total* LRU scanned / *total* LRU pages) of all such
objects; with this patch we'd use the numbers from the root cgroup
instead. If most processes reside in memory cgroups, the root cgroup
will have only a few LRU pages and hence the pressure exerted upon such
objects will be unfairly severe.

Thanks,
Vladimir

>   vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
>  sc->nr_scanned - nr_scanned,
>  sc->nr_reclaimed - nr_reclaimed);


Re: [PATCH 01/14] mm: memcontrol: export root_mem_cgroup

2015-11-14 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:20PM -0500, Johannes Weiner wrote:
> A later patch will need this symbol in files other than memcontrol.c,
> so export it now and replace mem_cgroup_root_css at the same time.
> 
> Signed-off-by: Johannes Weiner 
> Acked-by: Michal Hocko 

Reviewed-by: Vladimir Davydov 


Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

2015-11-06 Thread Vladimir Davydov
On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
...
> > 3) keep only some (safe) cache types enabled by default with the current
> >failing semantic and require an explicit enabling for the complete
> >kmem accounting. [di]cache code paths should be quite robust to
> >handle allocation failures.
> 
> Vladimir, what would be your opinion on this?

I'm all for this option. Actually, I've been thinking about this since I
introduced the __GFP_NOACCOUNT flag. Not because of the failing
semantics, since we can always let kmem allocations breach the limit.
This shouldn't be critical, because I don't think it's possible to issue
a series of kmem allocations w/o a single user page allocation, which
would reclaim/kill the excess.

The point is there are allocations that are shared system-wide and
therefore shouldn't go to any memcg. Most obvious examples are: mempool
users and radix_tree/idr preloads. Accounting them to memcg is likely to
result in noticeable memory overhead as memory cgroups are
created/destroyed, because they pin dead memory cgroups with all their
kmem caches, which aren't tiny.

Another funny example is objects destroyed lazily for performance
reasons, e.g. vmap_area. Such objects are usually very small, so
delaying destruction of a bunch of them will normally go unnoticed.
However, if kmemcg is used the effective memory consumption caused by
such objects can be multiplied by many times due to dangling kmem
caches.

We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
problem is they are tricky to identify, because they are scattered all
over the kernel source tree. E.g. Dave Chinner mentioned that XFS
internals do a lot of allocations that are shared among all XFS
filesystems and therefore should not be accounted (BTW that's why
list_lru's used by XFS are not marked as memcg-aware). There must be
more out there. Besides, kernel developers don't usually even know about
kmemcg (they just write the code for their subsys, so why should they?)
so they won't care thinking about using __GFP_NOACCOUNT, and hence new
falsely-accounted allocations are likely to appear.

That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
(__GFP_ACCOUNT) kmem accounting policy we would make the system more
predictable and robust IMO. OTOH what would we lose? Security? Well,
containers aren't secure IMHO. In fact, I doubt they will ever be (as
secure as VMs). Anyway, if a runaway allocation is reported, it should
be trivial to fix by adding __GFP_ACCOUNT where appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start from obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.
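
Opting in would then be a one-liner per call site, e.g. (the flag is the
proposal here, nothing of this exists yet, and the cache names are just
the usual ones from my memory of the call sites):

	/* e.g. the fs_struct allocation */
	fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL | __GFP_ACCOUNT);

	/* e.g. the dentry allocation */
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL | __GFP_ACCOUNT);

Everything else would stay unaccounted by default.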

Thanks,
Vladimir


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-11-02 Thread Vladimir Davydov
On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote:
...
> Now, you mentioned that you'd rather see the socket buffers accounted
> at the allocator level, but I looked at the different allocation paths
> and network protocols and I'm not convinced that this makes sense. We
> don't want to be in the hotpath of every single packet when a lot of
> them are small, short-lived management blips that don't involve user
> space to let the kernel dispose of them.
> 
> __sk_mem_schedule() on the other hand is already wired up to exactly
> those consumers we are interested in for memory isolation: those with
> bigger chunks of data attached to them and those that have exploding
> receive queues when userspace fails to read(). UDP and TCP.
> 
> I mean, there is a reason why the global memory limits apply to only
> those types of packets in the first place: everything else is noise.
> 
> I agree that it's appealing to account at the allocator level and set
> page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
> of noise, and I don't want to pay that just for aesthetics. In this
> case it's better to track ownership on the socket level and only count
> packets that can accumulate a significant amount of memory consumed.

Sigh, you seem to be right. Moreover, I can't even think of a neat way
to account skb pages to memcg, because rcv skbs are generated in device
drivers, where we don't know which socket/memcg they will go to. We could
recharge individual pages when skb gets to the network or transport
layer, but it would result in unjustified overhead.

> 
> > > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > > for sure, but it's horrendous for network performance. There is no
> > > "stop growing" phase, it just keeps going full throttle until it hits
> > > the wall hard.
> > > 
> > > Now, we could probably try to replicate the global knobs and add a
> > > per-memcg soft limit. But you know better than anyone else how hard it
> > > is to estimate the overall workingset size of a workload, and the
> > > margins on containerized loads are razor-thin. Performance is much
> > > more sensitive to input errors, and often times parameters must be
> > > adjusted continuously during the runtime of a workload. It'd be
> > > disasterous to rely on yet more static, error-prone user input here.
> > 
> > Yeah, but the dynamic approach proposed in your patch set doesn't
> > guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> > reduces this possibility. Of course, memcg OOM is far not as disastrous
> > as the global one, but still it usually means the workload breakage.
> 
> Right now, the entire machine breaks. Confining it to a faulty memcg,
> as well as reducing the likelihood of that OOM in many cases seems
> like a good move in the right direction, no?

It seems. However, memcg OOM is also bad, we should strive to avoid it
if we can.

> 
> And how likely are memcg OOMs because of this anyway? There is of

Frankly, I've no idea. Your arguments below sound reassuring though.

> course a scenario imaginable where the packets pile up, followed by
> some *other* part of the workload, the one that doesn't read() and
> process packets, trying to expand--which then doesn't work and goes
> OOM. But that seems like a complete corner case. In the vast majority
> of cases, the application will be in full operation and just fail to
> read() fast enough--because the network bandwidth is enormous compared
> to the container's size, or because it shares the CPU with thousands
> of other workloads and there is scheduling latency.
> 
> This would be the perfect point to rein in the transmit window...
> 
> > The static approach is error-prone for sure, but it has existed for
> > years and worked satisfactory AFAIK.
> 
> ...but that point is not a fixed amount of memory consumed. It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
> 
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in case of virtual machines, but it doesn't seem to work with
containers. You added memory.low and memory.high knobs. VMs don't have
anything like that. How is one supposed to set them? Depends on the
workload, I guess. Also, there is the pids cgroup for limiting the
number of pids that can be used by a cgroup, because p

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-29 Thread Vladimir Davydov
On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> > Then you'd better not touch existing tcp limits at all, because they
> > just work, and the logic behind them is very close to that of global tcp
> > limits. I don't think one can simplify it somehow.
> 
> Uhm, no, there is a crapload of boilerplate code and complication that
> seems entirely unnecessary. The only thing missing from my patch seems
> to be the part where it enters memory pressure state when the limit is
> hit. I'm adding this for completeness, but I doubt it even matters.
> 
> > Moreover, frankly I still have my reservations about this vmpressure
> > propagation to skb you're proposing. It might work, but I doubt it
> > will allow us to throw away explicit tcp limit, as I explained
> > previously. So, even with your approach I think we can still need
> > per memcg tcp limit *unless* you get rid of global tcp limit
> > somehow.
> 
> Having the hard limit as a failsafe (or a minimum for other consumers)
> is one thing, and certainly something I'm open to for cgroupv2, should
> we have problems with load startup up after a socket memory landgrab.
> 
> That being said, if the VM is struggling to reclaim pages, or is even
> swapping, it makes perfect sense to let the socket memory scheduler
> know it shouldn't continue to increase its footprint until the VM
> recovers. Regardless of any hard limitations/minimum guarantees.
> 
> This is what my patch does and it seems pretty straight-forward to
> me. I don't really understand why this is so controversial.

I'm not arguing that the idea behind this patch set is necessarily bad.
Quite the contrary, it does look interesting to me. I'm just saying that
IMO it can't replace hard/soft limits. It probably could if it was
possible to shrink buffers, but I don't think it's feasible, even
theoretically. That's why I propose not to change the behavior of the
existing per memcg tcp limit at all. And frankly I don't get why you are
so keen on simplifying it. You say it's a "crapload of boilerplate
code". Well, I don't see how it is - it just replicates global knobs and
I don't see how it could be done in a better way. The code is hidden
behind jump labels, so the overhead is zero if it isn't used. If you
really dislike this code, we can isolate it under a separate config
option. But all right, I don't rule out the possibility that the code
could be simplified. If you do that w/o breaking it, that'll be OK to
me, but I don't see why it should be related to this particular patch
set.

> 
> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.

Again, I don't think we can *reclaim* network memory, but you're right.

> 
> > > So this seemed like a good way to prove a new mechanism before rolling
> > > it out to every single Linux setup, rather than switch everybody over
> > > after the limited scope testing I can do as a developer on my own.
> > > 
> > > Keep in mind that my patches are not committing anything in terms of
> > > interface, so we retain all the freedom to fix and tune the way this
> > > is implemented, including the freedom to re-add tcp window limits in
> > > case the pressure balancing is not a comprehensive solution.
> > 
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people and then say that it just works
> > and hence should be done globally, but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistakes in
> > future. Note, I say "w/o having a proper discussion with networking
> > people", because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
> 
> Come on, Dave is the first To and netdev is CC'd. They might not care
> about memcg, but "pushing things covertly" is a bit of a stretch.

Sorry if it sounded rude to you. I just look back at my experience
patching slab internals to make kmem accountable, and AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-28 Thread Vladimir Davydov
On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote:
...
> > > But regardless of tcp window control, we need to account socket memory
> > > in the main memory accounting pool where pressure is shared (to the
> > > best of our abilities) between all accounted memory consumers.
> > > 
> > 
> > No objections to this point. However, I really don't like the idea to
> > charge tcp window size to memory.current instead of charging individual
> > pages consumed by the workload for storing socket buffers, because it is
> > inconsistent with what we have now. Can't we charge individual skb pages
> > as we do in case of other kmem allocations?
> 
> Absolutely, both work for me. I chose that route because it's where
> the networking code already tracks and accounts memory consumed, so it
> seemed like a better site to hook into.
> 
> But I understand your concerns. We want to track this stuff as close
> to the memory allocators as possible.

Exactly.

> 
> > > But also, there are people right now for whom the socket buffers cause
> > > system OOM, but the existing memcg's hard tcp window limitq that
> > > exists absolutely wrecks network performance for them. It's not usable
> > > the way it is. It'd be much better to have the socket buffers exert
> > > pressure on the shared pool, and then propagate the overall pressure
> > > back to individual consumers with reclaim, shrinkers, vmpressure etc.
> > 
> > This might or might not work. I'm not an expert to judge. But if you do
> > this only for memcg leaving the global case as it is, networking people
> > won't budge IMO. So could you please start such a major rework from the
> > global case? Could you please try to deprecate the tcp window limits not
> > only in the legacy memcg hierarchy, but also system-wide in order to
> > attract attention of networking experts?
> 
> I'm definitely interested in addressing this globally as well.
> 
> The idea behind this was to use the memcg part as a testbed. cgroup2
> is going to be new and people are prepared for hiccups when migrating
> their applications to it; and they can roll back to cgroup1 and tcp
> window limits at any time should they run into problems in production.

Then you'd better not touch the existing tcp limits at all, because
they just work, and the logic behind them is very close to that of the
global tcp limits - I don't see how it could be simplified. Moreover,
frankly, I still have my reservations about the vmpressure propagation
to skb allocations you're proposing. It might work, but I doubt it
will let us throw away the explicit tcp limit, as I explained
previously. So even with your approach I think we may still need a
per-memcg tcp limit *unless* you get rid of the global tcp limit
somehow.

> 
> So this seemed like a good way to prove a new mechanism before rolling
> it out to every single Linux setup, rather than switch everybody over
> after the limited scope testing I can do as a developer on my own.
> 
> Keep in mind that my patches are not committing anything in terms of
> interface, so we retain all the freedom to fix and tune the way this
> is implemented, including the freedom to re-add tcp window limits in
> case the pressure balancing is not a comprehensive solution.
> 

I really dislike this kind of proof. It looks like you're trying to push
something you think is right covertly, w/o having a proper discussion
with networking people and then say that it just works and hence should
be done globally, but what if it won't? Revert it? We already have a lot
of dubious stuff in memcg that should be reverted, so let's please try
to avoid this kind of mistakes in future. Note, I say "w/o having a
proper discussion with networking people", because I don't think they
will really care *unless* you change the global logic, simply because
most of them aren't very interested in memcg AFAICS.

That effectively means you lose the chance to hear from networking
experts, who could point out design flaws and propose improvements
right away. Let's please not miss such an opportunity. You said that
you'd seen this problem happen w/o cgroups, so you have a use case that
might need fixing at the global level. IMO it shouldn't be difficult to
prepare an RFC patch for the global case first and see what people think
about it.

Thanks,
Vladimir


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-27 Thread Vladimir Davydov
On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> > Hi Johannes,
> > 
> > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> > ...
> > > Patch #5 adds accounting and tracking of socket memory to the unified
> > > hierarchy memory controller, as described above. It uses the existing
> > > per-cpu charge caches and triggers high limit reclaim asynchronously.
> > > 
> > > Patch #8 uses the vmpressure extension to equalize pressure between
> > > the pages tracked natively by the VM and socket buffer pages. As the
> > > pool is shared, it makes sense that while natively tracked pages are
> > > under duress the network transmit windows are also not increased.
> > 
> > First of all, I've no experience in networking, so I'm likely to be
> > mistaken. Nevertheless I beg to disagree that this patch set is a step
> > in the right direction. Here goes why.
> > 
> > I admit that your idea to get rid of the explicit tcp window control
> > knobs and size the windows dynamically based on memory pressure instead
> > does sound tempting, but I don't think it would always work. The problem
> > is that, in contrast to, say, dcache, we can't shrink tcp buffers AFAIU,
> > we can only stop growing them. Now suppose a system hasn't experienced
> > memory pressure for a while. Without an explicit tcp window limit, tcp
> > buffers on such a system might have eaten almost all available memory
> > (because of network load/problems). If a user workload that needs a
> > significant amount of memory is then suddenly started, the network code
> > will receive a notification and surely stop growing buffers, but the
> > buffers already accumulated won't disappear instantly. As a result, the
> > workload might be unable to find enough free memory and have no choice
> > but to invoke the OOM killer. This looks unexpected from the user's POV.
> 
> I'm not getting rid of those knobs, I'm just reusing the old socket
> accounting infrastructure in an attempt to make the memory accounting
> feature useful to more people in cgroups v2 (unified hierarchy).
> 

My understanding is that in the meantime you effectively break the
existing per memcg tcp window control logic.

> We can always come back to think about per-cgroup tcp window limits in
> the unified hierarchy, my patches don't get in the way of this. I'm
> not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> 
> But regardless of tcp window control, we need to account socket memory
> in the main memory accounting pool where pressure is shared (to the
> best of our abilities) between all accounted memory consumers.
> 

No objections to this point. However, I really don't like the idea to
charge tcp window size to memory.current instead of charging individual
pages consumed by the workload for storing socket buffers, because it is
inconsistent with what we have now. Can't we charge individual skb pages
as we do in case of other kmem allocations?

> From an interface standpoint alone, I don't think it's reasonable to
> ask users per default to limit different consumers on a case by case
> basis. I certainly have no problem with finetuning for scenarios you
> describe above, but with memory.current, memory.high, memory.max we
> are providing a generic interface to account and contain memory
> consumption of workloads. This has to include all major memory
> consumers to make semantical sense.

We can propose a reasonable default as we do in the global case.

> 
> But also, there are people right now for whom the socket buffers cause
> > system OOM, but the existing memcg's hard tcp window limit that
> exists absolutely wrecks network performance for them. It's not usable
> the way it is. It'd be much better to have the socket buffers exert
> pressure on the shared pool, and then propagate the overall pressure
> back to individual consumers with reclaim, shrinkers, vmpressure etc.
> 

This might or might not work. I'm not an expert to judge. But if you do
this only for memcg leaving the global case as it is, networking people
won't budge IMO. So could you please start such a major rework from the
global case? Could you please try to deprecate the tcp window limits not
only in the legacy memcg hierarchy, but also system-wide in order to
attract attention of networking experts?

Thanks,
Vladimir


Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting

2015-10-23 Thread Vladimir Davydov
On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote:
> > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> > > The tcp memory controller has extensive provisions for future memory
> > > accounting interfaces that won't materialize after all. Cut the code
> > > base down to what's actually used, now and in the likely future.
> > > 
> > > - There won't be any different protocol counters in the future, so a
> > >   direct sock->sk_memcg linkage is enough. This eliminates a lot of
> > >   callback maze and boilerplate code, and restores most of the socket
> > >   allocation code to pre-tcp_memcontrol state.
> > > 
> > > - There won't be a tcp control soft limit, so integrating the memcg
> > 
> > In fact, the code is ready for the "soft" limit (I mean min, pressure,
> > max tuple), it just lacks a knob.
> 
> Yeah, but that's not going to materialize if the entire interface for
> dedicated tcp throttling is considered obsolete.

Maybe it shouldn't be. My current understanding is that per-memcg tcp
window control is necessary, because:

 - We need to be able to protect a containerized workload from its
   growing network buffers. Using vmpressure notifications for that does
   not look reassuring to me.

 - We need a way to limit network buffers of a particular container,
   otherwise it can fill the system-wide window throttling other
   containers, which is unfair.
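
To illustrate both points, this is roughly the invariant the existing
scheme maintains on every charge (a simplified sketch, not the actual
code - in reality the global side is an atomic counter checked against
tcp_mem, but the effect is the same):

    static bool sk_try_charge(struct page_counter *memcg_tcp,
                              struct page_counter *global_tcp,
                              unsigned long nr_pages)
    {
            struct page_counter *fail;

            /* the container is capped by its own window limit ... */
            if (memcg_tcp &&
                !page_counter_try_charge(memcg_tcp, nr_pages, &fail))
                    return false;

            /* ... and still cannot monopolize the system-wide window */
            if (!page_counter_try_charge(global_tcp, nr_pages, &fail)) {
                    if (memcg_tcp)
                            page_counter_uncharge(memcg_tcp, nr_pages);
                    return false;
            }
            return true;
    }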

> 
> > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const 
> > > struct sock *sk)
> > >   if (!sk->sk_prot->memory_pressure)
> > >   return false;
> > >  
> > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > > - return !!sk->sk_cgrp->memory_pressure;
> > > -
> > 
> > AFAIU, now we won't shrink the window on hitting the limit, i.e. this
> > patch subtly changes the behavior of the existing knobs, potentially
> > breaking them.
> 
> Hm, but there is no grace period in which something meaningful could
> happen with the window shrinking, is there? Any buffer allocation is
> still going to fail hard.

AFAIU when we hit the limit, we not only throttle the socket which
allocates, but also try to release space reserved by other sockets.
After your patch we won't. This looks unfair to me.

Thanks,
Vladimir


Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure

2015-10-22 Thread Vladimir Davydov
On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote:
...
> @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work)
>   vmpr->reclaimed = 0;
>   spin_unlock(&vmpr->sr_lock);
>  
> + level = vmpressure_calc_level(scanned, reclaimed);
> +
> + if (level > VMPRESSURE_LOW) {

So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL?

> + struct mem_cgroup *memcg;
> + /*
> +  * Let the socket buffer allocator know that we are
> +  * having trouble reclaiming LRU pages.
> +  *
> +  * For hysteresis, keep the pressure state asserted
> +  * for a second in which subsequent pressure events
> +  * can occur.
> +  *
> +  * XXX: is vmpressure a global feature or part of
> +  * memcg? There shouldn't be anything memcg-specific
> +  * about exporting reclaim success ratios from the VM.
> +  */
> + memcg = container_of(vmpr, struct mem_cgroup, vmpressure);
> + if (memcg != root_mem_cgroup)
> + memcg->socket_pressure = jiffies + HZ;

Why 1 second?
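
For context, my reading of the consumer side of this timestamp is
essentially the check below (a sketch, the helper name is approximate),
i.e. the window is pure hysteresis during which the charge path keeps
treating the cgroup as under pressure:

    static bool memcg_under_socket_pressure(struct mem_cgroup *memcg)
    {
            /* asserted for the whole window armed by vmpressure above */
            return time_before(jiffies, memcg->socket_pressure);
    }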

Thanks,
Vladimir

> + }
> +
>   do {
> - if (vmpressure_event(vmpr, scanned, reclaimed))
> + if (vmpressure_event(vmpr, level))
>   break;
>   /*
>* If not handled, propagate the event upward into the


Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

2015-10-22 Thread Vladimir Davydov
On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote:
...
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> + unsigned int batch = max(CHARGE_BATCH, nr_pages);
>   struct page_counter *counter;
> + bool force = false;
>  
> - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + return true;
> + page_counter_charge(&memcg->skmem, nr_pages);
> + return false;
> + }
> +
> + if (consume_stock(memcg, nr_pages))
>   return true;
> +retry:
> + if (page_counter_try_charge(&memcg->memory, batch, &counter))
> + goto done;

Currently, we use memcg->memory only for charging memory pages.
Besides, every page charged to this counter (including kmem) has its
->mem_cgroup field set appropriately. This looks consistent and nice.
As an extra benefit, we can track all pages charged to a memory cgroup
via /proc/kpagecgroup.

Now, you charge "window size" to it, which AFAIU isn't necessarily equal
to the amount of memory actually consumed by the cgroup for socket
buffers. I think this looks ugly and inconsistent with the existing
behavior. I agree that we need to charge socket buffers to ->memory,
but IMO we should do that per skb page, using memcg_kmem_charge_kmem
somewhere in alloc_skb_with_frags and invoking the reclaimer just as
we do for kmalloc, while tcp window size control stays aside.
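
Something along these lines is what I have in mind (a rough sketch, not
a patch; the charge helper and its signature are approximate and error
handling is trimmed):

    static struct page *sk_alloc_frag_page(struct mem_cgroup *memcg,
                                           gfp_t gfp, int order)
    {
            struct page *page = alloc_pages(gfp, order);

            /* charge the pages actually backing the skb, letting the
             * charge path run the memcg reclaimer just like kmalloc */
            if (page && memcg &&
                memcg_kmem_charge_kmem(memcg, gfp, 1 << order)) {
                    __free_pages(page, order);
                    page = NULL;
            }
            return page;
    }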

Thanks,
Vladimir


Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity

2015-10-22 Thread Vladimir Davydov
On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote:
...
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   }
>   }
>  
> + vmpressure(sc->gfp_mask, memcg,
> +sc->nr_scanned - scanned,
> +sc->nr_reclaimed - reclaimed);
> +
>   /*
>* Direct reclaim and kswapd have to scan all memory
>* cgroups to fulfill the overall scan target for the
> @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   }
>   } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> -sc->nr_scanned - nr_scanned,
> -sc->nr_reclaimed - nr_reclaimed);
> -
>   if (sc->nr_reclaimed - nr_reclaimed)
>   reclaimable = true;
>  

I may be mistaken, but AFAIU this patch subtly changes the behavior of
vmpressure as seen from userspace: w/o this patch a userspace process
receives a notification for a memory cgroup only if *this* memory
cgroup invokes the reclaimer; with this patch a notification will be
issued even if the reclaimer is invoked by a cgroup further up the
hierarchy.
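
For reference, the interface I'm concerned about is the cgroup-v1
eventfd registration; a minimal listener looks roughly like this (the
path and the "medium" level are just an example, error handling is
trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
            const char *cg = "/sys/fs/cgroup/memory/test"; /* example */
            char buf[256];
            uint64_t cnt;
            int efd = eventfd(0, 0);

            snprintf(buf, sizeof(buf), "%s/memory.pressure_level", cg);
            int pfd = open(buf, O_RDONLY);

            snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
            int cfd = open(buf, O_WRONLY);

            /* "<eventfd> <pressure_level fd> <level>" registers us */
            snprintf(buf, sizeof(buf), "%d %d medium", efd, pfd);
            write(cfd, buf, strlen(buf));

            /* one 8-byte counter read per batch of pressure events */
            while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
                    printf("medium pressure, count %llu\n",
                           (unsigned long long)cnt);
            return 0;
    }

With this patch, such a listener would start firing whenever reclaim
initiated anywhere up the hierarchy scans this group, which may
surprise existing users.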

Thanks,
Vladimir


Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting

2015-10-22 Thread Vladimir Davydov
On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
> 
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
> 
> - There won't be a tcp control soft limit, so integrating the memcg

In fact, the code is ready for the "soft" limit (I mean min, pressure,
max tuple), it just lacks a knob.

>   code into the global skmem limiting scheme complicates things
>   unnecessarily. Replace all that with simple and clear charge and
>   uncharge calls--hidden behind a jump label--to account skb memory.
> 
> - The previous jump label code was an elaborate state machine that
>   tracked the number of cgroups with an active socket limit in order
>   to enable the skmem tracking and accounting code only when actively
>   necessary. But this is overengineered: it was meant to protect the
>   people who never use this feature in the first place. Simply enable
>   the branches once when the first limit is set until the next reboot.
> 
...
> @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const 
> struct sock *sk)
>   if (!sk->sk_prot->memory_pressure)
>   return false;
>  
> - if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> - return !!sk->sk_cgrp->memory_pressure;
> -

AFAIU, now we won't shrink the window on hitting the limit, i.e. this
patch subtly changes the behavior of the existing knobs, potentially
breaking them.

Thanks,
Vladimir


Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-22 Thread Vladimir Davydov
Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified
> hierarchy memory controller, as described above. It uses the existing
> per-cpu charge caches and triggers high limit reclaim asynchronously.
> 
> Patch #8 uses the vmpressure extension to equalize pressure between
> the pages tracked natively by the VM and socket buffer pages. As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless I beg to disagree that this patch set is a step
in the right direction. Here goes why.

I admit that your idea to get rid of the explicit tcp window control
knobs and size the windows dynamically based on memory pressure instead
does sound tempting, but I don't think it would always work. The problem
is that, in contrast to, say, dcache, we can't shrink tcp buffers AFAIU,
we can only stop growing them. Now suppose a system hasn't experienced
memory pressure for a while. Without an explicit tcp window limit, tcp
buffers on such a system might have eaten almost all available memory
(because of network load/problems). If a user workload that needs a
significant amount of memory is then suddenly started, the network code
will receive a notification and surely stop growing buffers, but the
buffers already accumulated won't disappear instantly. As a result, the
workload might be unable to find enough free memory and have no choice
but to invoke the OOM killer. This looks unexpected from the user's POV.

That said, I think we do need per-memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my own RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then: now I think I was mistaken, and luckily I was stopped.
However, I may be mistaken again :-)

Thanks,
Vladimir