On Tue 20-01-15 10:31:55, Johannes Weiner wrote:
> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
> 
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
> 
> The control files are thus:
> 
>   - memory.current shows the current consumption of the cgroup and its
>     descendants, in bytes.
> 
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.
> 
>   - memory.high configures the upper end of the cgroup's expected
>     memory consumption range.  A cgroup whose consumption grows beyond
>     this threshold is forced into direct reclaim, to work off the
>     excess and to throttle new allocations heavily, but is generally
>     allowed to continue and the OOM killer is not invoked.
> 
>   - memory.max configures the hard maximum amount of memory that the
>     cgroup is allowed to consume before the OOM killer is invoked.
> 
>   - memory.events shows event counters that indicate how often the
>     cgroup was reclaimed while below memory.low, how often it was
>     forced to reclaim excess beyond memory.high, how often it hit
>     memory.max, and how often it entered OOM due to memory.max.  This
>     allows users to identify configuration problems when observing a
>     degradation in workload performance.  An overcommitted system will
>     have an increased rate of low boundary breaches, whereas increased
>     rates of high limit breaches, maximum hits, or even OOM situations
>     will indicate internally overcommitted cgroups.
> 
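For anyone wanting to watch these counters from userspace, here is a
minimal sketch of a parser for the memory.events text.  It assumes the
"key value" per-line format printed by memory_events_show() in this
patch; the struct and function names are made up for illustration and
are not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters exposed by memory.events, in the order the patch prints them. */
struct memcg_events {
	unsigned long low;
	unsigned long high;
	unsigned long max;
	unsigned long oom;
};

/*
 * Parse the text of a memory.events file ("low N\nhigh N\n...").
 * Returns the number of recognized counters; 4 means a complete read.
 * Unknown keys are skipped.  Hypothetical helper, not a kernel API.
 */
static int parse_memcg_events(const char *text, struct memcg_events *ev)
{
	char key[16];
	unsigned long val;
	int consumed, found = 0;

	assert(text && ev);
	memset(ev, 0, sizeof(*ev));

	/* %n records how many characters this key/value pair consumed. */
	while (sscanf(text, "%15s %lu%n", key, &val, &consumed) == 2) {
		if (!strcmp(key, "low")) {
			ev->low = val;
			found++;
		} else if (!strcmp(key, "high")) {
			ev->high = val;
			found++;
		} else if (!strcmp(key, "max")) {
			ev->max = val;
			found++;
		} else if (!strcmp(key, "oom")) {
			ev->oom = val;
			found++;
		}
		text += consumed;
	}
	return found;
}
```

A monitoring tool could read the file periodically and compare
successive snapshots: a growing "low" count hints at an overcommitted
system, while growing "high", "max" or "oom" counts point at a cgroup
that is too small for its workload.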
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
> 
>   - The original lower boundary, the soft limit, is defined as a limit
>     that is per default unset.  As a result, the set of cgroups that
>     global reclaim prefers is opt-in, rather than opt-out.  The costs
>     for optimizing these mostly negative lookups are so high that the
>     implementation, despite its enormous size, does not even provide
>     the basic desirable behavior.  First off, the soft limit has no
>     hierarchical meaning.  All configured groups are organized in a
>     global rbtree and treated like equal peers, regardless of where they
>     are located in the hierarchy.  This makes subtree delegation
>     impossible.  Second, the soft limit reclaim pass is so aggressive
>     that it not just introduces high allocation latencies into the
>     system, but also impacts system performance due to overreclaim, to
>     the point where the feature becomes self-defeating.
> 
>     The memory.low boundary on the other hand is a top-down allocated
>     reserve.  A cgroup enjoys reclaim protection when it and all its
>     ancestors are below their low boundaries, which makes delegation
>     of subtrees possible.  Secondly, new cgroups have no reserve per
>     default and in the common case most cgroups are eligible for the
>     preferred reclaim pass.  This allows the new low boundary to be
>     efficiently implemented with just a minor addition to the generic
>     reclaim code, without the need for out-of-band data structures and
>     reclaim passes.  Because the generic reclaim code considers all
>     cgroups except for the ones running low in the preferred first
>     reclaim pass, overreclaim of individual groups is eliminated as
>     well, resulting in much better overall workload performance.
> 
>   - The original high boundary, the hard limit, is defined as a strict
>     limit that can not budge, even if the OOM killer has to be called.
>     But this generally goes against the goal of making the most out of
>     the available memory.  The memory consumption of workloads varies
>     during runtime, and that requires users to overcommit.  But doing
>     that with a strict upper limit requires either a fairly accurate
>     prediction of the working set size or adding slack to the limit.
>     Since working set size estimation is hard and error prone, and
>     getting it wrong results in OOM kills, most users tend to err on
>     the side of a looser limit and end up wasting precious resources.
> 
>     The memory.high boundary on the other hand can be set much more
>     conservatively.  When hit, it throttles allocations by forcing
>     them into direct reclaim to work off the excess, but it never
>     invokes the OOM killer.  As a result, a high boundary that is
>     chosen too aggressively will not terminate the processes, but
>     instead it will lead to gradual performance degradation.  The user
>     can monitor this and make corrections until the minimal memory
>     footprint that still gives acceptable performance is found.
> 
>     In extreme cases, with many concurrent allocations and a complete
>     breakdown of reclaim progress within the group, the high boundary
>     can be exceeded.  But even then it's mostly better to satisfy the
>     allocation from the slack available in other groups or the rest of
>     the system than killing the group.  Otherwise, memory.max is there
>     to limit this type of spillover and ultimately contain buggy or
>     even malicious applications.
> 
>   - The original control file names are unwieldy and inconsistent in
>     many different ways.  For example, the upper boundary hit count is
>     exported in the memory.failcnt file, but an OOM event count has to
>     be manually counted by listening to memory.oom_control events, and
>     lower boundary / soft limit events have to be counted by first
>     setting a threshold for that value and then counting those events.
>     Also, usage and limit files encode their units in the filename.
>     That makes the filenames very long, even though this is not
>     information that a user needs to be reminded of every time they
>     type out those names.
> 
>     To address these naming issues, as well as to signal clearly that
>     the new interface carries a new configuration model, the naming
>     conventions in it necessarily differ from the old interface.
> 
>   - The original limit files indicate the state of an unset limit with
>     a very high number, and a configured limit can be unset by echoing
>     -1 into those files.  But that very high number is implementation
>     and architecture dependent and not very descriptive.  And while -1
>     can be understood as an underflow into the highest possible value,
>     -2 or -10M etc. do not work, so it's not consistent.
> 
>     memory.low, memory.high, and memory.max will use the string
>     "infinity" to indicate and set the highest possible value.
> 
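To illustrate the "infinity" convention from the userspace side, a
rough sketch of how a tool might parse a value read from (or destined
for) memory.low, memory.high or memory.max.  This is a simplification
and an assumption on my part: it accepts only "infinity" or a plain
decimal byte count, whereas the kernel helper used in the patch,
page_counter_memparse(), may accept more.  LIMIT_INFINITY and
parse_limit() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's PAGE_COUNTER_MAX. */
#define LIMIT_INFINITY ULLONG_MAX

/*
 * Parse "infinity" or a plain decimal byte count into *out.
 * Returns 0 on success and -EINVAL on anything else, including
 * negative numbers such as "-1" and suffixed values such as "10M".
 * The caller is expected to have stripped surrounding whitespace.
 */
static int parse_limit(const char *s, unsigned long long *out)
{
	unsigned long long v;
	char *end;

	assert(s && out);

	if (!strcmp(s, "infinity")) {
		*out = LIMIT_INFINITY;
		return 0;
	}

	/* strtoull() would silently accept signs and spaces; reject them. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;

	*out = v;
	return 0;
}
```

The point of the explicit string is visible here: there is exactly one
spelling for "no limit", and it does not depend on the word size of the
machine.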
> [[email protected]: use seq_puts() for basic strings]
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
>  Documentation/cgroups/unified-hierarchy.txt |  79 ++++++++++
>  include/linux/memcontrol.h                  |  32 ++++
>  mm/memcontrol.c                             | 229 ++++++++++++++++++++++++++--
>  mm/vmscan.c                                 |  22 ++-
>  4 files changed, 348 insertions(+), 14 deletions(-)
> 
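As a reading aid for the memory.high semantics described in the
changelog, here is a small userspace model of the walk that the
try_charge() hunk adds: after a charge, every group on the path to the
top whose usage exceeds its high boundary records an event and is
asked to reclaim.  This is only a sketch under simplifying
assumptions: struct group, reclaim() and charge() are hypothetical,
reclaim always succeeds, and memory.max and the OOM path are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a cgroup; usage is hierarchical, in pages. */
struct group {
	struct group *parent;		/* NULL for the toplevel group */
	unsigned long usage;
	unsigned long high;
	unsigned long high_events;
};

/*
 * Stand-in for reclaim: free up to nr pages from g.  Because usage is
 * hierarchical, the freed pages are uncharged from all ancestors too.
 */
static void reclaim(struct group *g, unsigned long nr)
{
	unsigned long freed = nr < g->usage ? nr : g->usage;

	for (; g; g = g->parent) {
		assert(g->usage >= freed);
		g->usage -= freed;
	}
}

/* Charge nr pages to g, then trim every level that is above "high". */
static void charge(struct group *g, unsigned long nr)
{
	struct group *pos;

	assert(g);
	for (pos = g; pos; pos = pos->parent)
		pos->usage += nr;

	for (pos = g; pos; pos = pos->parent) {
		if (pos->usage <= pos->high)
			continue;
		pos->high_events++;
		reclaim(pos, nr);
	}
}
```

Note that the charge itself always goes through in this model; the
group is only slowed down by the reclaim work, which matches the
"throttle, don't kill" intent of the high boundary.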
> diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
> index 4f4563277864..71daa35ec2d9 100644
> --- a/Documentation/cgroups/unified-hierarchy.txt
> +++ b/Documentation/cgroups/unified-hierarchy.txt
> @@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
>  - use_hierarchy is on by default and the cgroup file for the flag is
>    not created.
>  
> +- The original lower boundary, the soft limit, is defined as a limit
> +  that is per default unset.  As a result, the set of cgroups that
> +  global reclaim prefers is opt-in, rather than opt-out.  The costs
> +  for optimizing these mostly negative lookups are so high that the
> +  implementation, despite its enormous size, does not even provide the
> +  basic desirable behavior.  First off, the soft limit has no
> +  hierarchical meaning.  All configured groups are organized in a
> +  global rbtree and treated like equal peers, regardless of where they
> +  are located in the hierarchy.  This makes subtree delegation
> +  impossible.  Second, the soft limit reclaim pass is so aggressive
> +  that it not just introduces high allocation latencies into the
> +  system, but also impacts system performance due to overreclaim, to
> +  the point where the feature becomes self-defeating.
> +
> +  The memory.low boundary on the other hand is a top-down allocated
> +  reserve.  A cgroup enjoys reclaim protection when it and all its
> +  ancestors are below their low boundaries, which makes delegation of
> +  subtrees possible.  Secondly, new cgroups have no reserve per
> +  default and in the common case most cgroups are eligible for the
> +  preferred reclaim pass.  This allows the new low boundary to be
> +  efficiently implemented with just a minor addition to the generic
> +  reclaim code, without the need for out-of-band data structures and
> +  reclaim passes.  Because the generic reclaim code considers all
> +  cgroups except for the ones running low in the preferred first
> +  reclaim pass, overreclaim of individual groups is eliminated as
> +  well, resulting in much better overall workload performance.
> +
> +- The original high boundary, the hard limit, is defined as a strict
> +  limit that can not budge, even if the OOM killer has to be called.
> +  But this generally goes against the goal of making the most out of
> +  the available memory.  The memory consumption of workloads varies
> +  during runtime, and that requires users to overcommit.  But doing
> +  that with a strict upper limit requires either a fairly accurate
> +  prediction of the working set size or adding slack to the limit.
> +  Since working set size estimation is hard and error prone, and
> +  getting it wrong results in OOM kills, most users tend to err on the
> +  side of a looser limit and end up wasting precious resources.
> +
> +  The memory.high boundary on the other hand can be set much more
> +  conservatively.  When hit, it throttles allocations by forcing them
> +  into direct reclaim to work off the excess, but it never invokes the
> +  OOM killer.  As a result, a high boundary that is chosen too
> +  aggressively will not terminate the processes, but instead it will
> +  lead to gradual performance degradation.  The user can monitor this
> +  and make corrections until the minimal memory footprint that still
> +  gives acceptable performance is found.
> +
> +  In extreme cases, with many concurrent allocations and a complete
> +  breakdown of reclaim progress within the group, the high boundary
> +  can be exceeded.  But even then it's mostly better to satisfy the
> +  allocation from the slack available in other groups or the rest of
> +  the system than killing the group.  Otherwise, memory.max is there
> +  to limit this type of spillover and ultimately contain buggy or even
> +  malicious applications.
> +
> +- The original control file names are unwieldy and inconsistent in
> +  many different ways.  For example, the upper boundary hit count is
> +  exported in the memory.failcnt file, but an OOM event count has to
> +  be manually counted by listening to memory.oom_control events, and
> +  lower boundary / soft limit events have to be counted by first
> +  setting a threshold for that value and then counting those events.
> +  Also, usage and limit files encode their units in the filename.
> +  That makes the filenames very long, even though this is not
> +  information that a user needs to be reminded of every time they type
> +  out those names.
> +
> +  To address these naming issues, as well as to signal clearly that
> +  the new interface carries a new configuration model, the naming
> +  conventions in it necessarily differ from the old interface.
> +
> +- The original limit files indicate the state of an unset limit with a
> +  Very High Number, and a configured limit can be unset by echoing -1
> +  into those files.  But that very high number is implementation and
> +  architecture dependent and not very descriptive.  And while -1 can
> +  be understood as an underflow into the highest possible value, -2 or
> +  -10M etc. do not work, so it's not consistent.
> +
> +  memory.low, memory.high, and memory.max will use the string
> +  "infinity" to indicate and set the highest possible value.
>  
>  5. Planned Changes
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 76f489fad640..72dff5fb0d0c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
>       unsigned int generation;
>  };
>  
> +enum mem_cgroup_events_index {
> +     MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
> +     MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
> +     MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
> +     MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> +     MEM_CGROUP_EVENTS_NSTATS,
> +     /* default hierarchy events */
> +     MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
> +     MEMCG_HIGH,
> +     MEMCG_MAX,
> +     MEMCG_OOM,
> +     MEMCG_NR_EVENTS,
> +};
> +
>  #ifdef CONFIG_MEMCG
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +                    enum mem_cgroup_events_index idx,
> +                    unsigned int nr);
> +
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
> +
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>                         gfp_t gfp_mask, struct mem_cgroup **memcgp);
>  void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -175,6 +195,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> +                                  enum mem_cgroup_events_index idx,
> +                                  unsigned int nr)
> +{
> +}
> +
> +static inline bool mem_cgroup_low(struct mem_cgroup *root,
> +                               struct mem_cgroup *memcg)
> +{
> +     return false;
> +}
> +
>  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>                                       gfp_t gfp_mask,
>                                       struct mem_cgroup **memcgp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a3592a756ad9..5730886e3b0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
>       "swap",
>  };
>  
> -enum mem_cgroup_events_index {
> -     MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
> -     MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
> -     MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
> -     MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> -     MEM_CGROUP_EVENTS_NSTATS,
> -};
> -
>  static const char * const mem_cgroup_events_names[] = {
>       "pgpgin",
>       "pgpgout",
> @@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
>  
>  struct mem_cgroup_stat_cpu {
>       long count[MEM_CGROUP_STAT_NSTATS];
> -     unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> +     unsigned long events[MEMCG_NR_EVENTS];
>       unsigned long nr_page_events;
>       unsigned long targets[MEM_CGROUP_NTARGETS];
>  };
> @@ -284,6 +276,10 @@ struct mem_cgroup {
>       struct page_counter memsw;
>       struct page_counter kmem;
>  
> +     /* Normal memory consumption range */
> +     unsigned long low;
> +     unsigned long high;
> +
>       unsigned long soft_limit;
>  
>       /* vmpressure notifications */
> @@ -2327,6 +2323,8 @@ retry:
>       if (!(gfp_mask & __GFP_WAIT))
>               goto nomem;
>  
> +     mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> +
>       nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>                                                   gfp_mask, may_swap);
>  
> @@ -2368,6 +2366,8 @@ retry:
>       if (fatal_signal_pending(current))
>               goto bypass;
>  
> +     mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
> +
>       mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
>  nomem:
>       if (!(gfp_mask & __GFP_NOFAIL))
> @@ -2379,6 +2379,16 @@ done_restock:
>       css_get_many(&memcg->css, batch);
>       if (batch > nr_pages)
>               refill_stock(memcg, batch - nr_pages);
> +     /*
> +      * If the hierarchy is above the normal consumption range,
> +      * make the charging task trim their excess contribution.
> +      */
> +     do {
> +             if (page_counter_read(&memcg->memory) <= memcg->high)
> +                     continue;
> +             mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +             try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +     } while ((memcg = parent_mem_cgroup(memcg)));
>  done:
>       return ret;
>  }
> @@ -4304,7 +4314,7 @@ out_kfree:
>       return ret;
>  }
>  
> -static struct cftype mem_cgroup_files[] = {
> +static struct cftype mem_cgroup_legacy_files[] = {
>       {
>               .name = "usage_in_bytes",
>               .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> @@ -4580,6 +4590,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>       if (parent_css == NULL) {
>               root_mem_cgroup = memcg;
>               page_counter_init(&memcg->memory, NULL);
> +             memcg->high = PAGE_COUNTER_MAX;
>               memcg->soft_limit = PAGE_COUNTER_MAX;
>               page_counter_init(&memcg->memsw, NULL);
>               page_counter_init(&memcg->kmem, NULL);
> @@ -4625,6 +4636,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  
>       if (parent->use_hierarchy) {
>               page_counter_init(&memcg->memory, &parent->memory);
> +             memcg->high = PAGE_COUNTER_MAX;
>               memcg->soft_limit = PAGE_COUNTER_MAX;
>               page_counter_init(&memcg->memsw, &parent->memsw);
>               page_counter_init(&memcg->kmem, &parent->kmem);
> @@ -4635,6 +4647,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>                */
>       } else {
>               page_counter_init(&memcg->memory, NULL);
> +             memcg->high = PAGE_COUNTER_MAX;
>               memcg->soft_limit = PAGE_COUNTER_MAX;
>               page_counter_init(&memcg->memsw, NULL);
>               page_counter_init(&memcg->kmem, NULL);
> @@ -4710,6 +4723,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>       mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
>       mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
>       memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> +     memcg->low = 0;
> +     memcg->high = PAGE_COUNTER_MAX;
>       memcg->soft_limit = PAGE_COUNTER_MAX;
>  }
>  
> @@ -5296,6 +5311,147 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
>               mem_cgroup_from_css(root_css)->use_hierarchy = true;
>  }
>  
> +static u64 memory_current_read(struct cgroup_subsys_state *css,
> +                            struct cftype *cft)
> +{
> +     return mem_cgroup_usage(mem_cgroup_from_css(css), false);
> +}
> +
> +static int memory_low_show(struct seq_file *m, void *v)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +     unsigned long low = ACCESS_ONCE(memcg->low);
> +
> +     if (low == PAGE_COUNTER_MAX)
> +             seq_puts(m, "infinity\n");
> +     else
> +             seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
> +
> +     return 0;
> +}
> +
> +static ssize_t memory_low_write(struct kernfs_open_file *of,
> +                             char *buf, size_t nbytes, loff_t off)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +     unsigned long low;
> +     int err;
> +
> +     buf = strstrip(buf);
> +     err = page_counter_memparse(buf, "infinity", &low);
> +     if (err)
> +             return err;
> +
> +     memcg->low = low;
> +
> +     return nbytes;
> +}
> +
> +static int memory_high_show(struct seq_file *m, void *v)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +     unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +     if (high == PAGE_COUNTER_MAX)
> +             seq_puts(m, "infinity\n");
> +     else
> +             seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
> +
> +     return 0;
> +}
> +
> +static ssize_t memory_high_write(struct kernfs_open_file *of,
> +                              char *buf, size_t nbytes, loff_t off)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +     unsigned long high;
> +     int err;
> +
> +     buf = strstrip(buf);
> +     err = page_counter_memparse(buf, "infinity", &high);
> +     if (err)
> +             return err;
> +
> +     memcg->high = high;
> +
> +     return nbytes;
> +}
> +
> +static int memory_max_show(struct seq_file *m, void *v)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +     unsigned long max = ACCESS_ONCE(memcg->memory.limit);
> +
> +     if (max == PAGE_COUNTER_MAX)
> +             seq_puts(m, "infinity\n");
> +     else
> +             seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
> +
> +     return 0;
> +}
> +
> +static ssize_t memory_max_write(struct kernfs_open_file *of,
> +                             char *buf, size_t nbytes, loff_t off)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +     unsigned long max;
> +     int err;
> +
> +     buf = strstrip(buf);
> +     err = page_counter_memparse(buf, "infinity", &max);
> +     if (err)
> +             return err;
> +
> +     err = mem_cgroup_resize_limit(memcg, max);
> +     if (err)
> +             return err;
> +
> +     return nbytes;
> +}
> +
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> +     struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> +     seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> +     seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> +     seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> +     seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> +     return 0;
> +}
> +
> +static struct cftype memory_files[] = {
> +     {
> +             .name = "current",
> +             .read_u64 = memory_current_read,
> +     },
> +     {
> +             .name = "low",
> +             .flags = CFTYPE_NOT_ON_ROOT,
> +             .seq_show = memory_low_show,
> +             .write = memory_low_write,
> +     },
> +     {
> +             .name = "high",
> +             .flags = CFTYPE_NOT_ON_ROOT,
> +             .seq_show = memory_high_show,
> +             .write = memory_high_write,
> +     },
> +     {
> +             .name = "max",
> +             .flags = CFTYPE_NOT_ON_ROOT,
> +             .seq_show = memory_max_show,
> +             .write = memory_max_write,
> +     },
> +     {
> +             .name = "events",
> +             .flags = CFTYPE_NOT_ON_ROOT,
> +             .seq_show = memory_events_show,
> +     },
> +     { }     /* terminate */
> +};
> +
>  struct cgroup_subsys memory_cgrp_subsys = {
>       .css_alloc = mem_cgroup_css_alloc,
>       .css_online = mem_cgroup_css_online,
> @@ -5306,7 +5462,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
>       .cancel_attach = mem_cgroup_cancel_attach,
>       .attach = mem_cgroup_move_task,
>       .bind = mem_cgroup_bind,
> -     .legacy_cftypes = mem_cgroup_files,
> +     .dfl_cftypes = memory_files,
> +     .legacy_cftypes = mem_cgroup_legacy_files,
>       .early_init = 0,
>  };
>  
> @@ -5341,6 +5498,56 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +/**
> + * mem_cgroup_events - count memory events against a cgroup
> + * @memcg: the memory cgroup
> + * @idx: the event index
> + * @nr: the number of events to account for
> + */
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +                    enum mem_cgroup_events_index idx,
> +                    unsigned int nr)
> +{
> +     this_cpu_add(memcg->stat->events[idx], nr);
> +}
> +
> +/**
> + * mem_cgroup_low - check if memory consumption is below the normal range
> + * @root: the highest ancestor to consider
> + * @memcg: the memory cgroup to check
> + *
> + * Returns %true if memory consumption of @memcg, and that of all
> + * configurable ancestors up to @root, is below the normal range.
> + */
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
> +{
> +     if (mem_cgroup_disabled())
> +             return false;
> +
> +     /*
> +      * The toplevel group doesn't have a configurable range, so
> +      * it's never low when looked at directly, and it is not
> +      * considered an ancestor when assessing the hierarchy.
> +      */
> +
> +     if (memcg == root_mem_cgroup)
> +             return false;
> +
> +     if (page_counter_read(&memcg->memory) > memcg->low)
> +             return false;
> +
> +     while (memcg != root) {
> +             memcg = parent_mem_cgroup(memcg);
> +
> +             if (memcg == root_mem_cgroup)
> +                     break;
> +
> +             if (page_counter_read(&memcg->memory) > memcg->low)
> +                     return false;
> +     }
> +     return true;
> +}
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  /**
>   * mem_cgroup_swapout - transfer a memsw charge to swap
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b89097185f46..f62ec654d4c5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -91,6 +91,9 @@ struct scan_control {
>       /* Can pages be swapped as part of reclaim? */
>       unsigned int may_swap:1;
>  
> +     /* Can cgroups be reclaimed below their normal consumption range? */
> +     unsigned int may_thrash:1;
> +
>       unsigned int hibernation_mode:1;
>  
>       /* One of the zones is ready for compaction */
> @@ -2333,6 +2336,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>                       struct lruvec *lruvec;
>                       int swappiness;
>  
> +                     if (mem_cgroup_low(root, memcg)) {
> +                             if (!sc->may_thrash)
> +                                     continue;
> +                             mem_cgroup_events(memcg, MEMCG_LOW, 1);
> +                     }
> +
>                       lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>                       swappiness = mem_cgroup_swappiness(memcg);
>                       scanned = sc->nr_scanned;
> @@ -2360,8 +2369,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>                               mem_cgroup_iter_break(root, memcg);
>                               break;
>                       }
> -                     memcg = mem_cgroup_iter(root, memcg, &reclaim);
> -             } while (memcg);
> +             } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
>               /*
>                * Shrink the slab caches in the same proportion that
> @@ -2559,10 +2567,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>                                         struct scan_control *sc)
>  {
> +     int initial_priority = sc->priority;
>       unsigned long total_scanned = 0;
>       unsigned long writeback_threshold;
>       bool zones_reclaimable;
> -
> +retry:
>       delayacct_freepages_start();
>  
>       if (global_reclaim(sc))
> @@ -2612,6 +2621,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>       if (sc->compaction_ready)
>               return 1;
>  
> +     /* Untapped cgroup reserves?  Don't OOM, retry. */
> +     if (!sc->may_thrash) {
> +             sc->priority = initial_priority;
> +             sc->may_thrash = 1;
> +             goto retry;
> +     }
> +
>       /* Any of the zones still reclaimable?  Don't OOM. */
>       if (zones_reclaimable)
>               return 1;
> -- 
> 2.2.0
> 
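The hierarchical protection rule implemented by mem_cgroup_low() can
be modelled outside the kernel roughly as follows.  This is a sketch
for discussion, not the kernel code: struct group and group_is_low()
are hypothetical, usage and low are plain page counts, and a NULL
parent stands in for root_mem_cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cgroup; usage is hierarchical, in pages. */
struct group {
	struct group *parent;	/* NULL for the toplevel group */
	unsigned long usage;
	unsigned long low;
};

/*
 * Return true if g, and every configurable ancestor up to and
 * including root, is at or below its low boundary.  The toplevel
 * group has no configurable range: it is never low itself and is
 * not consulted as an ancestor.
 */
static bool group_is_low(const struct group *root, const struct group *g)
{
	assert(root && g);

	if (!g->parent)
		return false;

	if (g->usage > g->low)
		return false;

	while (g != root) {
		g = g->parent;

		if (!g->parent)
			break;

		if (g->usage > g->low)
			return false;
	}
	return true;
}
```

A child that sits below its own low boundary is therefore still
reclaimable in the first pass if any ancestor is above its boundary,
which is what makes it safe to hand a subtree to someone else to
configure.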

-- 
Michal Hocko
SUSE Labs