Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-13 Thread Michal Hocko
On Wed 12-02-14 16:28:36, Roman Gushchin wrote:
> Hi, Michal!
> 
> Sorry for the long reply.
> 
> At Wed, 29 Jan 2014 19:22:59 +0100,
> Michal Hocko wrote:
> > > As you may remember, I proposed introducing low limits about a year
> > > ago.
> > > 
> > > We had a small discussion at that time: 
> > > http://marc.info/?t=13619522664 .
> > 
> > yes I remember that discussion and vaguely remember the proposed
> > approach. I really wanted to prevent the introduction of a new knob, but
> > things evolved differently than I planned since then and it turned out
> > that the new knob is unavoidable. That's why I came up with this
> > approach, which is quite different from yours AFAIR.
> >  
> > > Since that time we have been using low limits intensively in our
> > > production (on thousands of machines). So I'm very interested in
> > > merging this functionality upstream.
> > 
> > Have you tried to use this implementation? Would this work as well?
> > My very vague recollection of your patch is that it didn't cover both
> > global and target reclaim, and that it didn't fit into the reclaim code
> > very naturally because it used its own scaling method. I will have to
> > refresh my memory though.
> 
> IMHO, the main problem with your implementation is the following:
> the number of reclaimed pages is not limited at all
> if a cgroup is over its low memory limit. So a significant number
> of pages can be reclaimed even if the memory usage is only a bit
> (e.g. one page) above the low limit.

Yes, but this is the same problem as with regular reclaim.
We do not have any guarantee that we will reclaim only the required
amount of memory. As the reclaim priority falls, we can overreclaim.
Global reclaim tries to avoid this problem by keeping the priority
as high as possible. And target reclaim is not a big deal because we
limit the number of reclaimed pages to the swap cluster.

I do not see this as a practical problem of the low_limit though,
because it protects those that are below the limit, not above it. Small
fluctuations around the limit should be tolerable.
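
For reference, a minimal sketch of the swap-cluster cap Michal refers
to: SWAP_CLUSTER_MAX is the real batching constant, but the helper
itself is illustrative, not the actual memcg target-reclaim code.

/*
 * Sketch: the hard-limit (target) reclaim path asks for only about one
 * swap cluster's worth of pages per pass, so a single pass cannot
 * overreclaim by much even when the scan priority drops.
 */
static unsigned long target_reclaim_goal(unsigned long nr_pages)
{
	return max(nr_pages, (unsigned long)SWAP_CLUSTER_MAX);
}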

> In my case, this problem is solved by scaling the number of scanned pages.
> 
> I think an ideal solution is to limit the number of reclaimed pages by
> the low-limit excess value. This would allow discarding my scaling code
> while preserving the strict semantics of the low limit under memory
> pressure. The main problem here is how to balance scanning pressure
> between cgroups and LRUs.
> 
> Maybe we should calculate the number of pages to scan in an LRU based
> on the low-limit excess value instead of the number of pages...
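
A minimal sketch of the clamping Roman describes; all names here are
illustrative, with low_limit_excess() assumed to return usage above
the low limit (usage - low_limit):

/*
 * Sketch: never target more pages than the cgroup's excess over its
 * low limit, so reclaim cannot push the group below the guarantee.
 */
static unsigned long clamp_reclaim_target(struct mem_cgroup *memcg,
					  unsigned long nr_to_reclaim)
{
	unsigned long excess = low_limit_excess(memcg);

	return min(nr_to_reclaim, excess);
}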

I do not like it much and I expect other mm people to feel similarly. We
already scale scanning based on the priority. Adding a new variable into
the picture would only make the whole thing more complicated without a
very good reason for it.

[...]
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-12 Thread Roman Gushchin
Hi, Michal!

Sorry for the long reply.

At Wed, 29 Jan 2014 19:22:59 +0100,
Michal Hocko wrote:
> > As you may remember, I proposed introducing low limits about a year ago.
> > 
> > We had a small discussion at that time: http://marc.info/?t=13619522664 .
> 
> yes I remember that discussion and vaguely remember the proposed
> approach. I really wanted to prevent the introduction of a new knob, but
> things evolved differently than I planned since then and it turned out
> that the new knob is unavoidable. That's why I came up with this
> approach, which is quite different from yours AFAIR.
>  
> > Since that time we have been using low limits intensively in our
> > production (on thousands of machines). So I'm very interested in
> > merging this functionality upstream.
> 
> Have you tried to use this implementation? Would this work as well?
> My very vague recollection of your patch is that it didn't cover both
> global and target reclaim, and that it didn't fit into the reclaim code
> very naturally because it used its own scaling method. I will have to
> refresh my memory though.

IMHO, the main problem with your implementation is the following:
the number of reclaimed pages is not limited at all
if a cgroup is over its low memory limit. So a significant number
of pages can be reclaimed even if the memory usage is only a bit
(e.g. one page) above the low limit.

In my case, this problem is solved by scaling the number of scanned pages.

I think an ideal solution is to limit the number of reclaimed pages by
the low-limit excess value. This would allow discarding my scaling code
while preserving the strict semantics of the low limit under memory
pressure. The main problem here is how to balance scanning pressure
between cgroups and LRUs.

Maybe we should calculate the number of pages to scan in an LRU based
on the low-limit excess value instead of the number of pages...

> > In my experience, low limits also require some changes in memcg page
> > accounting policy. For instance, an application in a protected cgroup
> > should have a guarantee that its file cache belongs to its cgroup and
> > is therefore protected by the low limit. If the file cache was created
> > by another application in another cgroup, that may not be the case.
> > I've solved this problem by implementing optional page re-accounting
> > on page faults and reads/writes.
> 
> Memory sharing is a separate issue and we should discuss that
> separately. 
> 
> > I can prepare my current version of patchset, if someone is interested.
> 
> Sure, having something to compare with is always valuable.


Subject: [PATCH] memcg: low limits for memory cgroups

A low limit for a memory cgroup can be used to limit memory pressure on
it. If the memory usage of a cgroup is under its low limit, it will not
be affected by global reclaim. If it reaches its low limit from above,
the reclaim speed will be reduced exponentially.

Low limits don't affect soft reclaim.
Also, it's possible that a cgroup with memory usage under its low limit
will still be reclaimed slowly at very low scanning priorities.
---
 include/linux/memcontrol.h  |  7 ++
 include/linux/res_counter.h | 17 +
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 60 +
 mm/vmscan.c                 |  9 +++
 5 files changed, 95 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113..3905e95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -231,6 +231,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -427,6 +429,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 					struct page *newpage)
 {
 }
+
+static inline unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a697..7a16c2a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,10 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the secured guaranteed minimal limit of resource
+	 */
+	unsigned long long low_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -88,6 +92,7 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_LIMIT,
 };
 
 /*
@@ -224,4 +229,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+			  unsigned long long low_limit)
+{
+	unsigned long flags;
+
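
The mm/vmscan.c hunk is not shown above; a minimal sketch of how the
scale factor declared in the memcontrol.h hunk could be applied when
computing scan counts, assuming mem_cgroup_low_limit_scale() returns a
shift as suggested by the changelog's "reduced exponentially". This is
illustrative, not the original hunk:

/*
 * Sketch: scale the number of pages scanned on an LRU by the cgroup's
 * low-limit shift, so a group that has just crossed its low limit is
 * scanned exponentially less.
 */
static unsigned long scan_with_low_limit(struct mem_cgroup *memcg,
					 unsigned long nr_to_scan)
{
	unsigned int shift = mem_cgroup_low_limit_scale(memcg);

	/* shift == 0 means the group is well above its low limit */
	return nr_to_scan >> shift;
}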


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-03 Thread Greg Thelen
On Mon, Feb 03 2014, Michal Hocko wrote:

> On Thu 30-01-14 16:28:27, Greg Thelen wrote:
>> On Thu, Jan 30 2014, Michal Hocko wrote:
>> 
>> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
>> > [...]
>> >> The series looks useful.  We (Google) have been using something similar.
>> >> In practice such a low_limit (or memory guarantee) doesn't nest very
>> >> well.
>> >> 
>> >> Example:
>> >>   - parent_memcg: limit 500, low_limit 500, usage 500
>> >> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>> >>   - child_memcg: limit 500, low_limit 500, usage 499
>> >
>> > I am not sure this is a good example. Your setup basically says that no
>> > single page should be reclaimed. I can imagine this might be useful in
>> > some cases and I would like to allow it but it sounds too extreme (e.g.
>> > a load which would start thrashing heavily once the reclaim starts and it
>> > makes more sense to start it again rather than crawl - think about some
>> > mathematical simulation which might diverge).
>> 
>> Pages will still be reclaimed if usage_in_bytes exceeds
>> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
>> reclaim my memory due to external pressure, but internal pressure is
>> different.
>
> That sounds strange and very confusing to me. What if the internal
> pressure comes from child memcgs? Lowlimit is intended for protecting
> a group from reclaim and it shouldn't matter whether the reclaim is a
> result of internal or external pressure.
>
>> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> >> page cache it will lead to an oom kill instead of reclaiming. 
>> >
>> > Does it make any sense to protect all of such memory although it is
>> > easily reclaimable?
>> 
>> I think protection makes sense in this case.  If I know my workload
>> needs 500 to operate well, then I reserve 500 using low_limit.  My app
>> doesn't want to run with less than its reservation.
>> 
>> >> One could argue that this is working as intended because child_memcg
>> >> was promised 500 but can only get 499.  So child_memcg is oom killed
>> >> rather than being forced to operate below its promised low limit.
>> >> 
>> >> This has led to various internal workarounds like:
>> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
>> >>   with reparented memory inherited to parent from child during cgroup
>> >>   deletion.
>> >
>> > Do those need any protection at all?
>> 
>> Interior tree nodes don't need protection from their children.  But
>> children and interior nodes need protection from siblings and parents.
>
> Why? They contain only reparented pages in the above case. Those would
> be the #1 candidate for reclaim in most cases, no?

I think we're on the same page.  My example interior node has reparented
pages and is a #1 candidate for reclaim induced from charges against
parent_memcg, but not a candidate for reclaim due to global memory
pressure induced by a sibling of parent_memcg.

>> >> - don't set low_limit on non leafs (e.g. do not set low limit on
>> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>> >>   customers want to purchase $MEM and setup their workload with a few
>> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
>> >>   customers are able to partition their workload with sub memcg below
>> >>   child_memcg.  Example:
>> >>  parent_memcg
>> >>  \
>> >>   child_memcg
>> >> / \
>> >> server   backup
>> >
>> > I think that the low_limit makes sense where you actually want to
>> > protect something from reclaim. And backup sounds like a bad fit for
>> > that.
>> 
>> The backup job would presumably have a small low_limit, but it may still
>> have a minimum working set required to make useful forward progress.
>> 
>> Example:
>>   parent_memcg
>>   \
>>    child_memcg limit 500, low_limit 500, usage 500
>>  / \
>>  |   backup   limit 10, low_limit 10, usage 10
>>  |
>>   server limit 490, low_limit 490, usage 490
>> 
>> One could argue that problems appear when
>> server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
>> configuration is to leave some padding:
>>   server.low_limit + backup.low_limit + padding = child_memcg.limit
>> but this just defers the problem.  As memory is reparented into the
>> parent, the padding must grow.
>
> Which all sounds like a drawback of the internal vs. external pressure
> semantics you mentioned above.

Huh?  I probably confused matters with the internal vs external talk
above.  Forgetting about that, I'm happy with the following
configuration assuming low_limit_fallback (ll_fallback) is eventually
available.

   parent_memcg
   \
    child_memcg limit 500, low_limit 500, usage 500, ll_fallback 0
  / \
  |   backup   limit 10, low_limit 10, usage 10, ll_fallback 1
  |
   server limit 490, low_limit 490, usage 490, ll_fallback 1

Thereafter customers often want some weak isolation between server and
backup.  To avoid 

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-03 Thread Michal Hocko
On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> On Thu, Jan 30 2014, Michal Hocko wrote:
> 
> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> > [...]
> >> The series looks useful.  We (Google) have been using something similar.
> >> In practice such a low_limit (or memory guarantee) doesn't nest very
> >> well.
> >> 
> >> Example:
> >>   - parent_memcg: limit 500, low_limit 500, usage 500
> >> 1 privately charged non-reclaimable page (e.g. mlock, slab)
> >>   - child_memcg: limit 500, low_limit 500, usage 499
> >
> > I am not sure this is a good example. Your setup basically says that no
> > single page should be reclaimed. I can imagine this might be useful in
> > some cases and I would like to allow it but it sounds too extreme (e.g.
> > a load which would start thrashing heavily once the reclaim starts and it
> > makes more sense to start it again rather than crawl - think about some
> > mathematical simulation which might diverge).
> 
> Pages will still be reclaimed if usage_in_bytes exceeds
> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
> reclaim my memory due to external pressure, but internal pressure is
> different.

That sounds strange and very confusing to me. What if the internal
pressure comes from child memcgs? Lowlimit is intended for protecting
a group from reclaim and it shouldn't matter whether the reclaim is a
result of internal or external pressure.

> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> >> page cache it will lead to an oom kill instead of reclaiming. 
> >
> > Does it make any sense to protect all of such memory although it is
> > easily reclaimable?
> 
> I think protection makes sense in this case.  If I know my workload
> needs 500 to operate well, then I reserve 500 using low_limit.  My app
> doesn't want to run with less than its reservation.
> 
> >> One could argue that this is working as intended because child_memcg
> >> was promised 500 but can only get 499.  So child_memcg is oom killed
> >> rather than being forced to operate below its promised low limit.
> >> 
> >> This has led to various internal workarounds like:
> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
> >>   with reparented memory inherited to parent from child during cgroup
> >>   deletion.
> >
> > Do those need any protection at all?
> 
> Interior tree nodes don't need protection from their children.  But
> children and interior nodes need protection from siblings and parents.

Why? They contain only reparented pages in the above case. Those would
be the #1 candidate for reclaim in most cases, no?

> >> - don't set low_limit on non leafs (e.g. do not set low limit on
> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
> >>   customers want to purchase $MEM and setup their workload with a few
> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
> >>   customers are able to partition their workload with sub memcg below
> >>   child_memcg.  Example:
> >>  parent_memcg
> >>  \
> >>   child_memcg
> >> / \
> >> server   backup
> >
> > I think that the low_limit makes sense where you actually want to
> > protect something from reclaim. And backup sounds like a bad fit for
> > that.
> 
> The backup job would presumably have a small low_limit, but it may still
> have a minimum working set required to make useful forward progress.
> 
> Example:
>   parent_memcg
>   \
>child_memcg limit 500, low_limit 500, usage 500
>  / \
>  |   backup   limit 10, low_limit 10, usage 10
>  |
>   server limit 490, low_limit 490, usage 490
> 
> One could argue that problems appear when
> server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
> configuration is to leave some padding:
>   server.low_limit + backup.low_limit + padding = child_memcg.limit
> but this just defers the problem.  As memory is reparented into the
> parent, the padding must grow.

Which all sounds like a drawback of the internal vs. external pressure
semantics you mentioned above.

> >>   Thereafter customers often want some weak isolation between server and
> >>   backup.  To avoid undesired oom kills the server/backup isolation is
> >>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
> >>   limit acts like the low_limit until priority becomes desperate.
> >
> > Johannes was already suggesting that the low_limit should allow for a
> > weaker semantic as well. I am not very much inclined to that but I can
> > live with a knob which would say oom_on_lowlimit (on by default but
> > allowed to be set to 0). We would fall back to the full reclaim if
> > no groups turn out to be reclaimable.
> 
> I like the strong semantic of your low_limit at least at level:1 cgroups
> (direct children of root).  But I have also encountered situations where
> a strict guarantee is too strict and a mere preference is desirable.
> Perhaps the best plan is to continue with the 


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-30 Thread Greg Thelen
On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee) doesn't nest very
>> well.
>> 
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically says that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start thrashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crawl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed if usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming. 
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>> 
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and setup their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>  parent_memcg
>>  \
>>   child_memcg
>> / \
>> server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
  \
   child_memcg limit 500, low_limit 500, usage 500
 / \
 |   backup   limit 10, low_limit 10, usage 10
 |
  server limit 490, low_limit 490, usage 490

One could argue that problems appear when
server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into the
parent, the padding must grow.

>>   Thereafter customers often want some weak isolation between server and
>>   backup.  To avoid undesired oom kills the server/backup isolation is
>>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>>   limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> live with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0). We would fall back to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a variety of use
cases.

I'm thinking of something like:

bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
				 struct mem_cgroup *root,
				 int priority)
{
	do {
		if (memcg == root)
			break;
		if (!res_counter_low_limit_excess(&memcg->res))
			return false;
		if ((priority = DEF_PRIORITY - 
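
The snippet is cut off above; a speculative, self-contained completion
under the assumption that res_counter_low_limit_excess() returns the
usage above low_limit and that the guarantee is dropped once the scan
priority becomes desperate. Everything here is illustrative, not Greg's
actual code:

/*
 * Illustrative completion: honour the low limit at normal scan
 * priorities (priority counts down from DEF_PRIORITY as reclaim gets
 * desperate) and walk up the hierarchy so a protected parent shields
 * its children.
 */
bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
				 struct mem_cgroup *root,
				 int priority)
{
	do {
		if (memcg == root)
			break;
		if (!res_counter_low_limit_excess(&memcg->res) &&
		    priority > DEF_PRIORITY - 2)
			return false;	/* still protected */
	} while ((memcg = parent_mem_cgroup(memcg)));

	return true;
}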

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-30 Thread Michal Hocko
On Wed 29-01-14 11:08:46, Greg Thelen wrote:
[...]
> The series looks useful.  We (Google) have been using something similar.
> In practice such a low_limit (or memory guarantee) doesn't nest very
> well.
> 
> Example:
>   - parent_memcg: limit 500, low_limit 500, usage 500
> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>   - child_memcg: limit 500, low_limit 500, usage 499

I am not sure this is a good example. Your setup basically says that no
single page should be reclaimed. I can imagine this might be useful in
some cases and I would like to allow it but it sounds too extreme (e.g.
a load which would start thrashing heavily once the reclaim starts and it
makes more sense to start it again rather than crawl - think about some
mathematical simulation which might diverge).
 
> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> page cache it will lead to an oom kill instead of reclaiming. 

Does it make any sense to protect all of such memory although it is
easily reclaimable?

> One could
> argue that this is working as intended because child_memcg was promised
> 500 but can only get 499.  So child_memcg is oom killed rather than
> being forced to operate below its promised low limit.
> 
> This has led to various internal workarounds like:
> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>   only charge memory to cgroup leafs.  This gets tricky when dealing
>   with reparented memory inherited to parent from child during cgroup
>   deletion.

Do those need any protection at all?

> - don't set low_limit on non leafs (e.g. do not set low limit on
>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>   customers want to purchase $MEM and setup their workload with a few
>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>   for top-level containers (e.g. parent_memcg).  Thereafter such
>   customers are able to partition their workload with sub memcg below
>   child_memcg.  Example:
>  parent_memcg
>  \
>   child_memcg
> / \
> server   backup

I think that the low_limit makes sense where you actually want to
protect something from reclaim. And backup sounds like a bad fit for
that.

>   Thereafter customers often want some weak isolation between server and
>   backup.  To avoid undesired oom kills the server/backup isolation is
>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>   limit acts like the low_limit until priority becomes desperate.

Johannes was already suggesting that the low_limit should allow for a
weaker semantic as well. I am not very much inclined to that but I can
live with a knob which would say oom_on_lowlimit (on by default but
allowed to be set to 0). We would fall back to the full reclaim if
no groups turn out to be reclaimable.
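
A minimal sketch of the fallback described here, assuming an
oom_on_lowlimit knob and a hypothetical reclaim_pass() helper; none of
these names come from the actual patches:

/*
 * Sketch: honour low limits on the first pass; if nothing was
 * reclaimable and the knob permits it, retry with the limits ignored
 * instead of declaring OOM.
 */
static unsigned long reclaim_with_fallback(struct scan_control *sc)
{
	unsigned long reclaimed;

	sc->honour_low_limit = true;
	reclaimed = reclaim_pass(sc);

	if (!reclaimed && !sysctl_oom_on_lowlimit) {
		sc->honour_low_limit = false;	/* weak-guarantee fallback */
		reclaimed = reclaim_pass(sc);
	}
	return reclaimed;
}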
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-29 Thread Greg Thelen
On Wed, Dec 11 2013, Michal Hocko wrote:

> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun during the last kernel summit.
>
> This patchset introduces such a low limit that is functionally similar to
> a minimum guarantee. Memcgs which are under their lowlimit are not
> considered eligible for reclaim (both global and hardlimit). The default
> value of the limit is 0, so all groups are eligible by default, and an
> interested party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill them and start again (or
> restart from a checkpoint) rather than thrash. This would be trivial with
> the low limit set to unlimited, and the OOM killer will handle the
> situation as required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle the situation when no group is eligible because all groups are
> below their low limit. This is not a big deal for hardlimit reclaim
> because we simply retry the reclaim a few times and then trigger the
> memcg OOM killer path. It would blow up in the global case, where we
> would either loop without making any progress or trigger the OOM killer.
> I would consider a configuration leading to this state invalid, but we
> should handle it gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of loops in the direct
> reclaim and lets kswapd sleep because it wouldn't make any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
>   memcg, mm: introduce lowlimit reclaim
>   mm, memcg: allow OOM if no memcg is eligible during direct reclaim
>   memcg: Allow setting low_limit
>   mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
>  include/linux/memcontrol.h  | 14 +++
>  include/linux/res_counter.h | 40 ++
>  kernel/res_counter.c        |  2 ++
>  mm/memcontrol.c             | 60 +-
>  mm/vmscan.c                 | 59 +---
>  5 files changed, 170 insertions(+), 5 deletions(-)

The series looks useful.  We (Google) have been using something similar.
In practice such a low_limit (or memory guarantee) doesn't nest very
well.

Example:
  - parent_memcg: limit 500, low_limit 500, usage 500
    1 privately charged non-reclaimable page (e.g. mlock, slab)
  - child_memcg: limit 500, low_limit 500, usage 499

If a streaming file cache workload (e.g. sha1sum) starts gobbling up
page cache it will lead to an oom kill instead of reclaiming.  One could
argue that this is working as intended because child_memcg was promised
500 but can only get 499.  So child_memcg is oom killed rather than
being forced to operate below its promised low limit.

This has led to various internal workarounds like:
- don't charge any memory to interior tree nodes (e.g. parent_memcg);
  only charge memory to cgroup leafs.  This gets tricky when dealing
  with reparented memory inherited to parent from child during cgroup
  deletion.
- don't set low_limit on non leafs (e.g. do not set low limit on
  parent_memcg).  This constrains the cgroup layout a bit.  Some
  customers want to purchase $MEM and setup their workload with a few
  child cgroups.  A system daemon hands out $MEM by setting low_limit
  for top-level containers (e.g. parent_memcg).  Thereafter such
  customers are able to partition their workload with sub memcg below
  child_memcg.  Example:
 parent_memcg
 \
  child_memcg
/ \
server   backup
  Thereafter customers often want some weak isolation between server and
  backup.  To avoid undesired oom kills the server/backup isolation is
  provided with a softer memory guarantee (e.g. soft_limit).  The soft
  limit acts like the low_limit until priority becomes desperate.

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-29 Thread Michal Hocko
On Fri 24-01-14 15:07:02, Roman Gushchin wrote:
> Hi, Michal!

Hi,

> As you may remember, I proposed introducing low limits about a year ago.
> 
> We had a small discussion at that time: http://marc.info/?t=13619522664 .

yes I remember that discussion and vaguely remember the proposed
approach. I really wanted to prevent the introduction of a new knob, but
things evolved differently than I planned since then and it turned out
that the new knob is unavoidable. That's why I came up with this
approach, which is quite different from yours AFAIR.
 
> Since then, we have been using low limits intensively in our production
> (on thousands of machines). So I'm very interested in merging this
> functionality upstream.

Have you tried to use this implementation? Would this work as well?
My very vague recollection of your patch is that it didn't cover both
global and target reclaims, and that it didn't fit into the reclaim code
very naturally; it used its own scaling method. I will have to refresh
my memory though.

> In my experience, low limits also require some changes in memcg page
> accounting policy. For instance, an application in a protected cgroup
> should have a guarantee that its file cache belongs to its cgroup and
> is therefore protected by the low limit. If the file cache was created
> by another application in another cgroup, this may not be the case.
> I've solved this problem by implementing optional page reaccounting on
> page faults and reads/writes.

Memory sharing is a separate issue and we should discuss that
separately. 

> I can prepare my current version of the patchset, if anyone is interested.

Sure, having something to compare with is always valuable.

> Regards,
> Roman
-- 
Michal Hocko
SUSE Labs


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-24 Thread Roman Gushchin

Hi, Michal!

As you may remember, I proposed introducing low limits about a year ago.

We had a small discussion at that time: http://marc.info/?t=13619522664 .

Since then, we have been using low limits intensively in our production
(on thousands of machines). So I'm very interested in merging this
functionality upstream.

In my experience, low limits also require some changes in memcg page
accounting policy. For instance, an application in a protected cgroup
should have a guarantee that its file cache belongs to its cgroup and is
therefore protected by the low limit. If the file cache was created by
another application in another cgroup, this may not be the case. I've
solved this problem by implementing optional page reaccounting on page
faults and reads/writes.
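
For concreteness, the reaccounting idea could look roughly like the
sketch below; every helper name in it is hypothetical, since the patch
itself is not shown here:

/*
 * Sketch (all names hypothetical): on a file page fault, move the
 * page's charge from whichever memcg first created the cache page to
 * the memcg of the faulting task, so the low limit protects it.
 */
static void memcg_reaccount_on_fault(struct page *page,
				     struct mm_struct *mm)
{
	struct mem_cgroup *from = page_memcg(page);	/* current owner */
	struct mem_cgroup *to = get_mem_cgroup_from_mm(mm);

	if (from != to)
		memcg_move_page_charge(page, from, to);	/* hypothetical */
}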

I can prepare my current version of the patchset, if anyone is interested.

Regards,
Roman

On 11.12.2013 18:15, Michal Hocko wrote:

Hi,
previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach to protecting memory allocated to processes executing within
a memory cgroup controller. It is based on a new tunable that was
discussed with Johannes and Tejun during the last kernel summit.

This patchset introduces such a low limit, which is functionally similar to a
minimum guarantee. Memcgs which are under their lowlimit are not considered
eligible for the reclaim (both global and hardlimit). The default value of
the limit is 0 so all groups are eligible by default and an interested
party has to explicitly set the limit.

The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock but it is not suitable
for many loads and generally requires application awareness. Such
application awareness can be complex. It effectively forbids the
use of memory overcommit as the application must explicitly manage
memory residency.
With low limits, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.

Another use case might be unreclaimable groups. Some loads might be so
sensitive to reclaim that it is better to kill them and start again (or
restart from a checkpoint) rather than thrash. This would be trivial with
the low limit set to unlimited: the OOM killer will handle the situation
as required (e.g. kill and restart).

The hierarchical behavior of the lowlimit is described in the first
patch. It is followed by a direct reclaim fix which is necessary to
handle the situation when no group is eligible because all groups are
below their low limit. This is not a big deal for hardlimit reclaim
because we simply retry the reclaim a few times and then trigger the
memcg OOM killer path. It would blow up in the global case, where we
would loop without making any progress or trigger the OOM killer. I
would consider a configuration leading to this state invalid, but we
should handle it gracefully.

The third patch finally allows setting the lowlimit.

The last patch expedites OOM if it is clear that no group is eligible
for reclaim. It basically breaks out of the loops in the direct reclaim
path and lets kswapd sleep, because it wouldn't make any progress anyway.

Thoughts?

Short log says:
Michal Hocko (4):
   memcg, mm: introduce lowlimit reclaim
   mm, memcg: allow OOM if no memcg is eligible during direct reclaim
   memcg: Allow setting low_limit
   mm, memcg: expedite OOM if no memcg is reclaimable

And a diffstat
  include/linux/memcontrol.h  | 14 +++
  include/linux/res_counter.h | 40 ++
  kernel/res_counter.c        |  2 ++
  mm/memcontrol.c             | 60 -
  mm/vmscan.c                 | 59 +---
  5 files changed, 170 insertions(+), 5 deletions(-)



[RFC 0/4] memcg: Low-limit reclaim

2013-12-11 Thread Michal Hocko
Hi,
previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach to protecting memory allocated to processes executing within
a memory cgroup controller. It is based on a new tunable that was
discussed with Johannes and Tejun during the last kernel summit.

This patchset introduces such a low limit, which is functionally similar to a
minimum guarantee. Memcgs which are under their lowlimit are not considered
eligible for the reclaim (both global and hardlimit). The default value of
the limit is 0 so all groups are eligible by default and an interested
party has to explicitly set the limit.

The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock but it is not suitable
for many loads and generally requires application awareness. Such
application awareness can be complex. It effectively forbids the
use of memory overcommit as the application must explicitly manage
memory residency.
With low limits, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.

Another use case might be unreclaimable groups. Some loads might be so
sensitive to reclaim that it is better to kill them and start again (or
restart from a checkpoint) rather than thrash. This would be trivial with
the low limit set to unlimited: the OOM killer will handle the situation
as required (e.g. kill and restart).

The hierarchical behavior of the lowlimit is described in the first
patch. It is followed by a direct reclaim fix which is necessary to
handle the situation when no group is eligible because all groups are
below their low limit. This is not a big deal for hardlimit reclaim
because we simply retry the reclaim a few times and then trigger the
memcg OOM killer path. It would blow up in the global case, where we
would loop without making any progress or trigger the OOM killer. I
would consider a configuration leading to this state invalid, but we
should handle it gracefully.
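
As an illustration of that eligibility rule (not the patch itself), the
check used when iterating memcgs during reclaim might look like the
following; RES_LOW_LIMIT is an assumption based on the diffstat below:

/*
 * Illustration only: RES_LOW_LIMIT is an assumed res_counter member;
 * the real patch may structure this differently.
 */
static bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
					struct mem_cgroup *root)
{
	/* A group is protected while it, or any ancestor up to the
	 * reclaim root, is still below its low limit. */
	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (res_counter_read_u64(&memcg->res, RES_USAGE) <
		    res_counter_read_u64(&memcg->res, RES_LOW_LIMIT))
			return false;
		if (memcg == root)
			break;
	}
	return true;
}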

The third patch finally allows setting the lowlimit.
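
The knob would presumably surface as a per-memcg control file; the
wiring below is a guess from the diffstat (file name, RES_LOW_LIMIT,
reuse of the existing handlers), not the series' actual interface:

/*
 * Guessed wiring for the new knob: a file next to limit_in_bytes,
 * backed by the res_counter like the other limits.
 */
static struct cftype mem_cgroup_low_limit_file = {
	.name		= "low_limit_in_bytes",
	.private	= MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),	/* assumed */
	.write_string	= mem_cgroup_write,	/* existing write handler */
	.read		= mem_cgroup_read,	/* existing read handler */
};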

The last patch expedites OOM if it is clear that no group is eligible
for reclaim. It basically breaks out of the loops in the direct reclaim
path and lets kswapd sleep, because it wouldn't make any progress anyway.
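
A sketch of that break-out, with mem_cgroup_all_protected() as an
assumed predicate rather than anything from the series:

/*
 * Sketch: in the direct reclaim loop, give up early when every group
 * is protected by its low limit, so the caller can proceed to the OOM
 * path instead of looping without progress.
 */
if (!sc->nr_reclaimed && mem_cgroup_all_protected())
	return 0;	/* no progress possible; let the OOM path decide */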

Thoughts?

Short log says:
Michal Hocko (4):
  memcg, mm: introduce lowlimit reclaim
  mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  memcg: Allow setting low_limit
  mm, memcg: expedite OOM if no memcg is reclaimable

And a diffstat
 include/linux/memcontrol.h  | 14 +++
 include/linux/res_counter.h | 40 ++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 60 -
 mm/vmscan.c                 | 59 +---
 5 files changed, 170 insertions(+), 5 deletions(-)