Re: [PATCH] printk: Add loglevel for "do not print to consoles".

2020-05-13 Thread Michal Hocko
On Wed 13-05-20 12:04:13, Petr Mladek wrote:
> What is so special about  OOM dump task so that it would deserve such
> complications?

Nothing really. Except for the potential amount of the output. But as
you've said there are two ways around that. Disable this output if you
do not need it or make it a lower loglevel. A completely different
method of tagging messages just to distinguish different backends of the
printk ring buffers sounds like a bad idea to me because it a) adds a
policy to the kernel and b) makes it incredibly hard to judge when to
use such a feature. I simply cannot tell whether somebody considers
the dump_tasks output important information to be printed on consoles.

If there is any need to control which messages should be routed to which
backend then the proper solution would be to filter messages per log
level per backend. But I have no idea how feasible this is for the
existing infrastructure - or maybe it already exists...
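
For what it is worth, per-backend loglevel filtering can be sketched in a few
lines of userspace C. This is purely illustrative - all names below are made
up and it is not how printk is structured today:

/*
 * Minimal sketch of "filter messages per log level per backend".
 * All names are invented for illustration; this is not printk code.
 */
#include <stdio.h>

struct backend {
	const char *name;
	int max_level;	/* emit only messages with level <= max_level (0 = most severe) */
};

static void emit(const struct backend *b, int level, const char *msg)
{
	if (level <= b->max_level)
		printf("[%s] <%d> %s\n", b->name, level, msg);
}

int main(void)
{
	struct backend console = { "console", 4 };	/* warnings and worse only */
	struct backend logfile = { "logfile", 7 };	/* everything up to debug */

	/* A verbose dump_tasks entry tagged as informational (level 6) would
	 * then reach the file backend but stay off the slow console. */
	emit(&console, 6, "oom: dump_tasks entry ...");
	emit(&logfile, 6, "oom: dump_tasks entry ...");
	return 0;
}
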
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, swap: Use prandom_u32_max()

2020-05-12 Thread Michal Hocko
On Tue 12-05-20 15:14:46, Huang, Ying wrote:
> Michal Hocko  writes:
> 
> > On Tue 12-05-20 14:41:46, Huang Ying wrote:
> >> To improve the code readability and get random number with higher
> >> quality.
> >
> > I understand the readability argument but why should prandom_u32_max
> > (which I was not aware of) provide a higher quality randomness?
> 
> I am not an expert on random number generators.  I have heard that the
> randomness of the low-order bits of some random number generators isn't
> good enough.  Anyway, by using the common implementation, the real
> random number generator experts can fix the possible issue once for all
> users.

Please drop the quality argument if you cannot really justify it. This
will likely just confuse future readers the same way it confused me
here. Because prandom_u32_max uses the same source of randomness, the
only difference is how modulo vs. u64 overflow arithmetic is
used for distributing values. I am not aware that the latter would be
a way to achieve higher quality randomness. If the interval
distribution is better with the latter then it would be great to have it
documented.
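
Just to illustrate what is being compared, here is a small userspace sketch of
the two mappings (rand32() is only a stand-in for prandom_u32(); the
multiply-shift variant mirrors the trick prandom_u32_max uses):

/*
 * Two ways of mapping a 32-bit random value into [0, bound).
 * Both consume the same randomness; only the mapping differs.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t rand32(void)
{
	/* placeholder source of 32 random bits, not prandom_u32() */
	return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

static uint32_t map_modulo(uint32_t bound)
{
	return rand32() % bound;			/* keeps the low-order bits */
}

static uint32_t map_mulshift(uint32_t bound)
{
	return (uint32_t)(((uint64_t)rand32() * bound) >> 32);	/* driven by the high-order bits */
}

int main(void)
{
	srand(1);
	printf("modulo:    %u\n", map_modulo(1000));
	printf("mul-shift: %u\n", map_mulshift(1000));
	return 0;
}

Both mappings are slightly biased whenever bound does not divide 2^32; the
difference is merely which bits of the generator end up being used, which is
why the quality claim needs an explicit justification.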

> >> Signed-off-by: "Huang, Ying" 
> >> Cc: Michal Hocko 
> >> Cc: Minchan Kim 
> >> Cc: Tim Chen 
> >> Cc: Hugh Dickins 
> >
> > To the change itself
> > Acked-by: Michal Hocko 
> 
> Thanks!
> 
> Best Regards,
> Huang, Ying
> 
> >> ---
> >>  mm/swapfile.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index a0a123e59ce6..2ec8b21201d6 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> >> specialfile, int, swap_flags)
> >> * select a random position to start with to help wear leveling
> >> * SSD
> >> */
> >> -  p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
> >> +  p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
> >>nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> >>  
> >>cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
> >> -- 
> >> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, swap: Use prandom_u32_max()

2020-05-11 Thread Michal Hocko
On Tue 12-05-20 14:41:46, Huang Ying wrote:
> To improve the code readability and get random number with higher
> quality.

I understand the readability argument but why should prandom_u32_max
(which I was not aware of) provide a higher quality randomness?

> Signed-off-by: "Huang, Ying" 
> Cc: Michal Hocko 
> Cc: Minchan Kim 
> Cc: Tim Chen 
> Cc: Hugh Dickins 

To the change itself
Acked-by: Michal Hocko 

> ---
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a0a123e59ce6..2ec8b21201d6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>* select a random position to start with to help wear leveling
>* SSD
>*/
> - p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
> + p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>   nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>  
>       cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-11 Thread Michal Hocko
On Mon 11-05-20 12:34:00, Konstantin Khlebnikov wrote:
> 
> 
> On 11/05/2020 11.39, Michal Hocko wrote:
> > On Fri 08-05-20 17:16:29, Konstantin Khlebnikov wrote:
> > > Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
> > > back to the charge path") cgroup oom killer is no longer invoked only from
> > > page faults. Now it implements the same semantics as global OOM killer:
> > > allocation context invokes OOM killer and keeps retrying until success.
> > > 
> > > Signed-off-by: Konstantin Khlebnikov 
> > 
> > Acked-by: Michal Hocko 
> > 
> > > ---
> > >   Documentation/admin-guide/cgroup-v2.rst |   17 -
> > >   1 file changed, 8 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> > > b/Documentation/admin-guide/cgroup-v2.rst
> > > index bcc80269bb6a..1bb9a8f6ebe1 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
> > >   Under certain circumstances, the usage may go over the limit
> > >   temporarily.
> > > + In default configuration regular 0-order allocation always
> > > + succeed unless OOM killer choose current task as a victim.
> > > +
> > > + Some kinds of allocations don't invoke the OOM killer.
> > > + Caller could retry them differently, return into userspace
> > > + as -ENOMEM or silently ignore in cases like disk readahead.
> > 
> > I would probably add -EFAULT but the less error codes we document the
> > better.
> 
> Yeah, EFAULT was the most obscure result of memory shortage.
> Fortunately with the new behaviour this shouldn't happen a lot.

Yes, it shouldn't really happen very often. gup was the most prominent
example but this one should be taken care of by triggering the OOM
killer. But I wouldn't bet my hat there are no potential cases anymore.

> Actually, where is it still possible? THP always falls back to 0-order.
> I mean EFAULT could appear inside the kernel only if the task is killed, so
> nobody would see it.

Yes, fatal_signal_pending paths are ok. And no, I do not have any specific
examples. But as you've said, EFAULT was a real surprise so I thought it
would be nice to still keep a reference for it around, even when it is
unlikely.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-11 Thread Michal Hocko
On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> Currently the reclaim of excessive usage over memory.high is scheduled
> to run on returning to the userland. The main reason behind this
> approach was simplicity i.e. always reclaim with GFP_KERNEL context.
> However the underlying assumptions behind this approach are: the current
> task shares the memcg hierarchy with the given memcg and the memcg of
> the current task most probably will not change on return to userland.
> 
> With the remote charging, the first assumption breaks and it allows the
> usage to grow way beyond the memory.high as the reclaim and the
> throttling becomes ineffective.
> 
> This patch forces the synchronous reclaim and potentially throttling for
> the callers with context that allows blocking. For unblockable callers
> or whose synch high reclaim is still not successful, a high reclaim is
> scheduled either to return-to-userland if current task shares the
> hierarchy with the given memcg or to system work queue.
> 
> Signed-off-by: Shakeel Butt 

Acked-by: Michal Hocko 

I would just make the early break a bit more clear.

[...]
> @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>   schedule_work(&memcg->high_work);
>   break;
>   }
> - current->memcg_nr_pages_over_high += batch;
> - set_notify_resume(current);
> +
> + if (gfpflags_allow_blocking(gfp_mask))
> + reclaim_over_high(memcg, gfp_mask, batch);
> +

/*
 * reclaim_over_high reclaims parents up the
 * hierarchy so we can break out early here.
 */
> + if (page_counter_read(&memcg->memory) <=
> + READ_ONCE(memcg->high))
> + break;
> + /*
> +  * The above reclaim might not be able to do much. Punt
> +  * the high reclaim to return to userland if the current
> +  * task shares the hierarchy.
> +  */
> + if (current->mm && mm_match_cgroup(current->mm, memcg)) 
> {
> + current->memcg_nr_pages_over_high += batch;
> + set_notify_resume(current);
> + } else
> + schedule_work(&memcg->high_work);
>   break;
>   }
>   } while ((memcg = parent_mem_cgroup(memcg)));
> -- 
> 2.26.2.526.g744177e7f7-goog
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-11 Thread Michal Hocko
On Fri 08-05-20 17:16:29, Konstantin Khlebnikov wrote:
> Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
> back to the charge path") cgroup oom killer is no longer invoked only from
> page faults. Now it implements the same semantics as global OOM killer:
> allocation context invokes OOM killer and keeps retrying until success.
> 
> Signed-off-by: Konstantin Khlebnikov 

Acked-by: Michal Hocko 

> ---
>  Documentation/admin-guide/cgroup-v2.rst |   17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> b/Documentation/admin-guide/cgroup-v2.rst
> index bcc80269bb6a..1bb9a8f6ebe1 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
>   Under certain circumstances, the usage may go over the limit
>   temporarily.
>  
> + In default configuration regular 0-order allocation always
> + succeed unless OOM killer choose current task as a victim.
> +
> + Some kinds of allocations don't invoke the OOM killer.
> + Caller could retry them differently, return into userspace
> + as -ENOMEM or silently ignore in cases like disk readahead.

I would probably add -EFAULT but the less error codes we document the
better.

> +
>   This is the ultimate protection mechanism.  As long as the
>   high limit is used and monitored properly, this limit's
>   utility is limited to providing the final safety net.
> @@ -1228,17 +1235,9 @@ PAGE_SIZE multiple when read back.
>   The number of time the cgroup's memory usage was
>   reached the limit and allocation was about to fail.
>  
> - Depending on context result could be invocation of OOM
> - killer and retrying allocation or failing allocation.
> -
> - Failed allocation in its turn could be returned into
> - userspace as -ENOMEM or silently ignored in cases like
> - disk readahead.  For now OOM in memory cgroup kills
> - tasks iff shortage has happened inside page fault.
> -
>   This event is not raised if the OOM killer is not
>   considered as an option, e.g. for failed high-order
> - allocations.
> +     allocations or if caller asked to not retry attempts.
>  
> oom_kill
>   The number of processes belonging to this cgroup

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-07 Thread Michal Hocko
On Thu 07-05-20 10:00:07, Shakeel Butt wrote:
> On Thu, May 7, 2020 at 9:47 AM Michal Hocko  wrote:
> >
> > On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> > [...]
> > > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, 
> > > gfp_t gfp_mask,
> > >   schedule_work(&memcg->high_work);
> > >   break;
> > >   }
> > > - current->memcg_nr_pages_over_high += batch;
> > > - set_notify_resume(current);
> > > +
> > > + if (gfpflags_allow_blocking(gfp_mask))
> > > + reclaim_over_high(memcg, gfp_mask, batch);
> > > +
> > > + if (page_counter_read(&memcg->memory) <=
> > > + READ_ONCE(memcg->high))
> > > + break;
> >
> > I am halfway to a long weekend so bear with me. Shouldn't this be
> > continue? The parent memcg might still be in excess even if the child
> > got reclaimed, right?
> >
> 
> The reclaim_high() actually already does this walk up to the root and
> reclaims from the ones that are still over their high limit. Though having
> 'continue' here is correct too.

Ohh, right. As I've said weekend brain. I will have a proper look next
week. This just hit my eyes.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-07 Thread Michal Hocko
On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
[...]
> @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>   schedule_work(&memcg->high_work);
>   break;
>   }
> - current->memcg_nr_pages_over_high += batch;
> - set_notify_resume(current);
> +
> + if (gfpflags_allow_blocking(gfp_mask))
> + reclaim_over_high(memcg, gfp_mask, batch);
> +
> + if (page_counter_read(&memcg->memory) <=
> + READ_ONCE(memcg->high))
> + break;

I am halfway to a long weekend so bear with me. Shouldn't this be continue? The
parent memcg might still be in excess even if the child got reclaimed,
right?

> + /*
> +  * The above reclaim might not be able to do much. Punt
> +  * the high reclaim to return to userland if the current
> +  * task shares the hierarchy.
> +  */
> + if (current->mm && mm_match_cgroup(current->mm, memcg)) 
> {
> + current->memcg_nr_pages_over_high += batch;
> + set_notify_resume(current);
> + } else
> + schedule_work(&memcg->high_work);
>   break;
>   }
>   } while ((memcg = parent_mem_cgroup(memcg)));
> -- 
> 2.26.2.526.g744177e7f7-goog
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Documentation: update numastat explanation

2020-05-07 Thread Michal Hocko
On Thu 07-05-20 14:02:17, Vlastimil Babka wrote:
> During recent patch discussion [1] it became apparent that the "other_node"
> definition in the numastat documentation has always been different from actual
> implementation. It was also noted that the stats can be inaccurate on systems
> with memoryless nodes.
> 
> This patch corrects the other_node definition (with minor tweaks to two more
> definitions), adds a note about memoryless nodes and also two introductory
> paragraphs to the numastat documentation.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20200504070304.127361-1-sandi...@linux.ibm.com/T/#u
> 
> Signed-off-by: Vlastimil Babka 

Acked-by: Michal Hocko 

Thanks!

> ---
>  Documentation/admin-guide/numastat.rst | 31 +++---
>  1 file changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/numastat.rst 
> b/Documentation/admin-guide/numastat.rst
> index aaf1667489f8..08ec2c2bdce3 100644
> --- a/Documentation/admin-guide/numastat.rst
> +++ b/Documentation/admin-guide/numastat.rst
> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>  
>  All units are pages. Hugepages have separate counters.
>  
> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
> +is incremented on the preferred node, otherwise numa_foreign is incremented 
> on
> +the preferred node and numa_miss on the node where allocation succeeded.
> +
> +Usually preferred node is the one local to the CPU where the process 
> executes,
> +but restrictions such as mempolicies can change that, so there are also two
> +counters based on CPU local node. local_node is similar to numa_hit and is
> +incremented on allocation from a node by CPU on the same node. other_node is
> +similar to numa_miss and is incremented on the node where allocation succeeds
> +from a CPU from a different node. Note there is no counter analogical to
> +numa_foreign.
> +
> +In more detail:
> +
>  === 
>  numa_hit A process wanted to allocate memory from this node,
>   and succeeded.
> @@ -14,11 +29,13 @@ numa_miss A process wanted to allocate memory from 
> another node,
>   but ended up with memory from this node.
>  
>  numa_foreign A process wanted to allocate on this node,
> - but ended up with memory from another one.
> + but ended up with memory from another node.
>  
> -local_node   A process ran on this node and got memory from it.
> +local_node   A process ran on this node's CPU,
> + and got memory from this node.
>  
> -other_node   A process ran on this node and got memory from another node.
> +other_node   A process ran on a different node's CPU
> + and got memory from this node.
>  
>  interleave_hit   Interleaving wanted to allocate from this node
>   and succeeded.
> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from 
> the numactl package
>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>  well right now on machines with a small number of CPUs.
>  
> +Note that on systems with memoryless nodes (where a node has CPUs but no
> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
> +heavily. In the current kernel implementation, if a process prefers a
> +memoryless node (i.e.  because it is running on one of its local CPU), the
> +implementation actually treats one of the nearest nodes with memory as the
> +preferred node. As a result, such allocation will not increase the 
> numa_foreign
> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
> +numa_foreign statistics of the nearest node.
> -- 
> 2.26.2
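
As a reading aid for the definitions above, the accounting can be modelled in
a few lines of userspace C (toy model only; node numbers and the scenario are
made up, this is not kernel code):

/* Toy model of the numastat accounting described in the patch above. */
#include <stdio.h>

#define MAX_NODES 4

static unsigned long numa_hit[MAX_NODES], numa_miss[MAX_NODES], numa_foreign[MAX_NODES];
static unsigned long local_node[MAX_NODES], other_node[MAX_NODES];

/*
 * preferred: node the allocation wanted
 * result:    node that actually supplied the memory
 * cpu_node:  node of the CPU doing the allocation
 */
static void account_alloc(int preferred, int result, int cpu_node)
{
	if (result == preferred) {
		numa_hit[result]++;
	} else {
		numa_miss[result]++;		/* on the node that supplied the memory */
		numa_foreign[preferred]++;	/* on the node that was wanted */
	}

	if (result == cpu_node)
		local_node[result]++;
	else
		other_node[result]++;		/* on the node that supplied the memory */
}

int main(void)
{
	account_alloc(0, 0, 0);	/* local hit */
	account_alloc(0, 1, 0);	/* node 0 was full: miss on 1, foreign on 0 */

	for (int n = 0; n < 2; n++)
		printf("node%d hit=%lu miss=%lu foreign=%lu local=%lu other=%lu\n",
		       n, numa_hit[n], numa_miss[n], numa_foreign[n],
		       local_node[n], other_node[n]);
	return 0;
}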

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-05 Thread Michal Hocko
On Tue 05-05-20 08:35:45, Shakeel Butt wrote:
> On Tue, May 5, 2020 at 8:27 AM Johannes Weiner  wrote:
> >
> > On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> > > On Mon, May 4, 2020 at 9:06 AM Michal Hocko  wrote:
> > > > I really hate to repeat myself but this is no different from a regular
> > > > oom situation.
> > >
> > > Conceptually yes there is no difference but there is no *divine
> > > restriction* to not make a difference if there is a real world
> > > use-case which would benefit from it.
> >
> > I would wholeheartedly agree with this in general.
> >
> > However, we're talking about the very semantics that set memory.max
> > apart from memory.high: triggering OOM kills to enforce the limit.
> >
> > > > when the kernel cannot act and mentions that along with the
> > > > oom report so that whoever consumes that information can debug or act on
> > > > that fact.
> > > >
> > > > Silencing the oom report is simply removing a potentially useful
> > > > aid to debug further a potential problem.
> > >
> > > *Potentially* useful for debugging versus actually beneficial for
> > > "sweep before tear down" use-case. Also I am not saying to make "no
> > > dumps for memory.max when no eligible tasks" a set in stone rule. We
> > > can always reevaluate when such information will actually be useful.
> > >
> > > Johannes/Andrew, what's your opinion?
> >
> > I still think that if you want to sweep without triggering OOMs,
> > memory.high has the matching semantics.
> >
> > As you pointed out, it doesn't work well for foreign charges, but that
> > is more of a limitation in the implementation than in the semantics:
> >
> > /*
> >  * If the hierarchy is above the normal consumption range, schedule
> >  * reclaim on returning to userland.  We can perform reclaim here
> >  * if __GFP_RECLAIM but let's always punt for simplicity and so that
> >  * GFP_KERNEL can consistently be used during reclaim.  @memcg is
> >  * not recorded as it most likely matches current's and won't
> >  * change in the meantime.  As high limit is checked again before
> >  * reclaim, the cost of mismatch is negligible.
> >  */
> >
> > Wouldn't it be more useful to fix that instead? It shouldn't be much
> > of a code change to do sync reclaim in try_charge().
> 
> Sync reclaim would really simplify the remote charging case. Though
> should sync reclaim only be done for remote charging or for all?

The simplest way around that would be to always do sync reclaim for
unpopulated memcgs because those do not have any user context to reclaim
from and for restricted allocation contexts like atomic allocations
maybe also for GFP_NO{IO,FS}.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-05 Thread Michal Hocko
On Mon 04-05-20 12:23:51, Shakeel Butt wrote:
[...]
> *Potentially* useful for debugging versus actually beneficial for
> "sweep before tear down" use-case.

I definitely do not want to prevent you from achieving what you
want/need. Let's get back to your argument on why you cannot use
memory.high for this purpose and what is the actual difference from
memory.max on the "sweep before removal". You've said

: Yes that would work but remote charging concerns me. Remote charging
: can still happen after the memcg is offlined and at the moment, high
: reclaim does not work for remote memcg and the usage can go till max
: or global pressure. This is most probably a misconfiguration and we
: might not receive the warnings in the log ever. Setting memory.max to
: 0 will definitely give such warnings.

So essentially the only reason you are not using memory.high, which would
effectively achieve the same level of reclaim for your usecase, is that
potential future remote charges could go unnoticed. I have proposed to
warn when charging to an offline memcg because that looks like a sign of
a bug to me. Having the hard limit clamped to 0 (or some other small
value) would complain loudly via the oom report and the no eligible tasks
message but it is unlikely to help stop such usage because, well,
there is nothing reclaimable and we force the charge in that case. So
you are effectively in a memory.high-like situation.

So instead of potentially removing useful information can we focus on
the remote charging side of the problem and deal with it in a sensible
way? That would make memory.high usable for your usecase and I still
believe that this is what you should be using in the first place.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Mon 04-05-20 08:35:57, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 8:00 AM Michal Hocko  wrote:
> >
> > On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
[...]
> > > I am trying to see if "no eligible task" is really an issue and should
> > > be warned for the "other use cases". The only real use-case I can
> > > think of are resource managers adjusting the limit dynamically. I
> > > don't see "no eligible task" a concerning reason for such use-case.
> >
> > It is very much a concerning reason to notify about like any other OOM
> > situation due to hard limit breach. In this case it is worse in some
> > sense because the limit cannot be trimmed down because there is no
> > directly reclaimable memory at all. Such an oom situation is
> > effectively conserved.
> > --
> 
> Let me make a more precise statement and tell me if you agree. The "no
> eligible task" is concerning for the charging path but not for the
> writer of memory.max. The writer can read the usage and
> cgroup.[procs|events] to figure out the situation if needed.

I really hate to repeat myself but this is no different from a regular
oom situation. Admin sets the hard limit and the kernel tries to act
upon that.

You cannot make any assumption about what admin wanted or didn't want
to see. We simply trigger the oom killer on memory.max and this is a
documented behavior. No eligible task or no task at all is simply a
corner case when the kernel cannot act and mentions that along with the
oom report so that whoever consumes that information can debug or act on
that fact.

Silencing the oom report is simply removing a potentially useful
aid to debug further a potential problem. But let me repeat: this is not
really any different from a regular oom situation when the oom killer
is able to act.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 7:11 AM Michal Hocko  wrote:
> >
> > On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> > > On Sun, May 3, 2020 at 11:56 PM Michal Hocko  wrote:
> > > >
> > > > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > > succeed. However if oom-killer does not find a process for killing, it
> > > > > dumps a lot of warnings.
> > > >
> > > > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > > > there is "Out of memory and no killable processes..." message printed as
> > > > well but is that a real problem?
> > > >
> > > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > > linger till there is a memory pressure. One normal way to proactively
> > > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > > operation can trigger an oom-kill without any process and thus can 
> > > > > log a
> > > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > > >
> > > > OK, I can see why you might want to use memory.max for that purpose but
> > > > I do not really understand why the oom report is a problem here.
> > >
> > > It may not be a problem for an individual or small scale deployment
> > > but when "sweep before tear down" is the part of the workflow for
> > > thousands of machines cycling through hundreds of thousands of cgroups
> > > then we can potentially flood the logs with not useful dumps and may
> > > hide (or overflow) any useful information in the logs.
> >
> > If you are doing this in a large scale and the oom report is really a
> > problem then you shouldn't be resetting hard limit to 0 in the first
> > place.
> >
> 
> I think I have pretty clearly described why we want to reset the hard
> limit to 0, so, unless there is an alternative I don't see why we
> should not be doing this.

I am not saying you shouldn't be doing that. I am just saying that if
you do then you have to live with oom reports.

> > > > memory.max can trigger the oom kill and user should be expecting the oom
> > > > report under that condition. Why is "no eligible task" so special? Is it
> > > > because you know that there won't be any tasks for your particular case?
> > > > What about other use cases where memory.max is not used as a "sweep
> > > > before tear down"?
> > >
> > > What other such use-cases would be? The only use-case I can envision
> > > of adjusting limits dynamically of a live cgroup are resource
> > > managers. However for cgroup v2, memory.high is the recommended way to
> > > limit the usage, so, why would resource managers be changing
> > > memory.max instead of memory.high? I am not sure. What do you think?
> >
> > There are different reasons to use the hard limit. Mostly to contain
> > potential runaways. While the high limit might be a sufficient measure to
> > achieve that as well, the hard limit is the last resort. And it clearly
> > has the oom killer semantic so I am not really sure why you are
> > comparing the two.
> >
> 
> I am trying to see if "no eligible task" is really an issue and should
> be warned for the "other use cases". The only real use-case I can
> think of are resource managers adjusting the limit dynamically. I
> don't see "no eligible task" a concerning reason for such use-case.

It is very much a concerning reason to notify about like any other OOM
situation due to hard limit breach. In this case it is worse in some
sense because the limit cannot be trimmed down because there is no
directly reclaimable memory at all. Such an oom situation is
effectively conserved.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> On Sun, May 3, 2020 at 11:56 PM Michal Hocko  wrote:
> >
> > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> >
> > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > there is "Out of memory and no killable processes..." message printed as
> > well but is that a real problem?
> >
> > > Deleting a memcg does not reclaim memory from it and the memory can
> > > linger till there is a memory pressure. One normal way to proactively
> > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > memcg. However if some of the memcg's memory is pinned by others, this
> > > operation can trigger an oom-kill without any process and thus can log a
> > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > OK, I can see why you might want to use memory.max for that purpose but
> > I do not really understand why the oom report is a problem here.
> 
> It may not be a problem for an individual or small scale deployment
> but when "sweep before tear down" is the part of the workflow for
> thousands of machines cycling through hundreds of thousands of cgroups
> then we can potentially flood the logs with not useful dumps and may
> hide (or overflow) any useful information in the logs.

If you are doing this in a large scale and the oom report is really a
problem then you shouldn't be resetting hard limit to 0 in the first
place.

> > memory.max can trigger the oom kill and user should be expecting the oom
> > report under that condition. Why is "no eligible task" so special? Is it
> > because you know that there won't be any tasks for your particular case?
> > What about other use cases where memory.max is not used as a "sweep
> > before tear down"?
> 
> What other such use-cases would be? The only use-case I can envision
> of adjusting limits dynamically of a live cgroup are resource
> managers. However for cgroup v2, memory.high is the recommended way to
> limit the usage, so, why would resource managers be changing
> memory.max instead of memory.high? I am not sure. What do you think?

There are different reasons to use the hard limit. Mostly to contain
potential runaways. While the high limit might be a sufficient measure to
achieve that as well, the hard limit is the last resort. And it clearly
has the oom killer semantic so I am not really sure why you are
comparing the two.

> FB is moving away from limits setting, so, not sure if they have
> thought of these cases.
> 
> BTW for such use-cases, shouldn't we be taking the memcg's oom_lock?

This is a good question. I would have to go and double check the code
but I suspect that this is an omission.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-04 Thread Michal Hocko
On Thu 30-04-20 12:48:20, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-04-29 14:22:11]:
> 
> > On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > > 
> > > > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 
> > > > > 0 is
> > > > > always online.
> > > > > 
> > > > > ...
> > > > >
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > > >   */
> > > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > > >   [N_POSSIBLE] = NODE_MASK_ALL,
> > > > > +#ifdef CONFIG_NUMA
> > > > > + [N_ONLINE] = NODE_MASK_NONE,
> > > > > +#else
> > > > >   [N_ONLINE] = { { [0] = 1UL } },
> > > > > -#ifndef CONFIG_NUMA
> > > > >   [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > > >  #ifdef CONFIG_HIGHMEM
> > > > >   [N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > > 
> > > > So on all other NUMA machines, when does node 0 get marked online?
> > > > 
> > > > This change means that for some time during boot, such machines will
> > > > now be running with node 0 marked as offline.  What are the
> > > > implications of this?  Will something break?
> > > 
> > > Till the nodes are detected, marking Node 0 as online tends to be 
> > > redundant.
> > > Because the system doesn't know if it's a NUMA or a non-NUMA system.
> > > Once we detect the nodes, we online them immediately. Hence I don't see 
> > > any
> > > side-effects or negative implications of this change.
> > > 
> > > However if I am missing anything, please do let me know.
> > > 
> > > From my part, I have tested this on
> > > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > > 3. NUMA Multi node but with CPUs and memory from node 0.
> > > 4. NUMA Multi node but with no CPUs and memory from node 0.
> > 
> > Have you tested on something else than ppc? Each arch does the NUMA
> > setup separately and this is a big mess. E.g. x86 marks even memoryless
> > nodes (see init_memory_less_node) as online.
> > 
> 
> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> enabled/disabled on both single node and multi node machines.
> However, I dont have a cpuless/memoryless x86 system.

This should be able to emulate inside kvm, I believe.

> > Honestly I have a hard time evaluating the effect of this patch. It makes
> > some sense to assume all nodes offline before they get online but this
> > is land mine territory.
> > 
> > I am also not sure what kind of problem this is going to address. You
> > have mentioned numa balancing without many details.
> 
> 1. On a machine with just one node with node number not being 0,
> the current setup will end up showing 2 online nodes. And when there are
> more than one online nodes, numa_balancing gets enabled.
> 
> Without patch
> $ grep numa /proc/vmstat
> numa_hit 95179
> numa_miss 0
> numa_foreign 0
> numa_interleave 3764
> numa_local 95179
> numa_other 0
> numa_pte_updates 1206973 <--
> numa_huge_pte_updates 4654 <--
> numa_hint_faults 19560 <--
> numa_hint_faults_local 19560 <--
> numa_pages_migrated 0
> 
> 
> With patch
> $ grep numa /proc/vmstat 
> numa_hit 322338756
> numa_miss 0
> numa_foreign 0
> numa_interleave 3790
> numa_local 322338756
> numa_other 0
> numa_pte_updates 0     <--
> numa_huge_pte_updates 0 <--
> numa_hint_faults 0 <--
> numa_hint_faults_local 0 <--
> numa_pages_migrated 0
> 
> So we have redundant page hinting numa faults which we can avoid.

Interesting. Does this lead to any observable differences? Btw. it would
be really great to describe how the online state influences the numa
balancing.

> 2. A few people have complained about the existence of this dummy node when
> parsing lscpu and numactl output. They somehow start to think that the tools
> are reporting incorrectly or the kernel is not able to recognize resources
> connected to the node.

Please be more specific.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Mon 04-05-20 15:40:18, Yafang Shao wrote:
> On Mon, May 4, 2020 at 3:35 PM Michal Hocko  wrote:
> >
> > On Mon 04-05-20 15:26:52, Yafang Shao wrote:
[...]
> > > As explained above, no eligible task is different from no task.
> > > If there are some candidates but no one is eligible, the system will 
> > > panic.
> > > While if there's no task, it is definitely no OOM, because that's an
> > > impossible thing for the system.
> >
> > This is very much a possible situation when all eligible tasks have
> > already been killed but they didn't really help to resolve the oom situation
> > - e.g. in kernel memory leak or unbounded shmem consumption etc...
> >
> 
> That's still an impossible thing, because many tasks are invisible to
> the oom killer.
> See oom_unkillable_task().

I do not follow, really. oom_unkillable_task only says that global init
cannot be killed and that it doesn't make any sense to kill kernel
threads as they do not own any mm normally.
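
Put as a toy model (not the kernel source), the check described above boils
down to the following:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the oom_unkillable_task() logic mentioned above: only
 * global init and kernel threads (which normally own no mm) are treated
 * as unkillable. Everything else remains a potential OOM candidate.
 */
struct task {
	int pid;
	bool kernel_thread;	/* PF_KTHREAD in the real task_struct */
};

static bool toy_oom_unkillable_task(const struct task *p)
{
	if (p->pid == 1)	/* global init */
		return true;
	if (p->kernel_thread)
		return true;
	return false;
}

int main(void)
{
	struct task init = { 1, false }, kworker = { 42, true }, app = { 1234, false };

	printf("init: %d kworker: %d app: %d\n",
	       toy_oom_unkillable_task(&init),
	       toy_oom_unkillable_task(&kworker),
	       toy_oom_unkillable_task(&app));
	return 0;
}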

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Mon 04-05-20 15:26:52, Yafang Shao wrote:
> On Mon, May 4, 2020 at 3:03 PM Michal Hocko  wrote:
> >
> > On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> > > On Fri, May 1, 2020 at 2:27 AM Shakeel Butt  wrote:
> > > >
> > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > succeed. However if oom-killer does not find a process for killing, it
> > > > dumps a lot of warnings.
> > > >
> > >
> > > I have been confused by this behavior for several months and I think
> > > it will confuse more memcg users.
> >
> > Could you be more specific what has caused the confusion?
> >
> 
> No task is different from no eligible task.
> No eligible task means there are some candidates but no one is eligible.
> While no task means there is no candidate.

I really fail to see a difference. It is clear that the one is a subset of
the other, but in practical life tasks might come and go at any time, and if
you try to reduce the hard limit and there are no tasks that could be
reclaimed then I believe we should complain, whether there are only oom
disabled tasks or no tasks at all. It is certainly an unexpected situation
in some cases because there are resources which are bound to the memcg
without any task we can act on.

> > > We should keep the memcg oom behavior consistent with system oom - no
> > > oom kill if no process.
> >
> > This is not the global memcg behavior. We do complain loudly on no
> > eligible tasks and actually panic the system. Memcg cannot simply
> > do the same by default for obvious reasons.
> >
> 
> As explained above, no eligible task is different from no task.
> If there are some candidates but no one is eligible, the system will panic.
> While if there's no task, it is definitely no OOM, because that's an
> impossible thing for the system.

This is very much a possible situation when all eligible tasks have
already been killed but they didn't really help to resolve the oom situation
- e.g. in kernel memory leak or unbounded shmem consumption etc...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, memcg: Avoid stale protection values when cgroup is above protection

2020-05-04 Thread Michal Hocko
On Fri 01-05-20 07:59:57, Yafang Shao wrote:
> On Thu, Apr 30, 2020 at 10:57 PM Michal Hocko  wrote:
> >
> > On Wed 29-04-20 12:56:27, Johannes Weiner wrote:
> > [...]
> > > I think to address this, we need a more comprehensive solution and
> > > introduce some form of serialization. I'm not sure yet how that would
> > > look like yet.
> >
> > Yeah, that is what I've tried to express earlier and that is why I would
> > rather go with an uglier workaround for now and think about a more
> > robust effective values calculation on top.
> >
> 
> Agreed.
> If there's a more robust effective values calculation on top, then we
> don't need to hack it here and there.
> 
> > > I'm still not sure it's worth having a somewhat ugly workaround in
> > > mem_cgroup_protection() to protect against half of the bug. If you
> > > think so, the full problem should at least be documented and marked
> > > XXX or something.
> >
> > Yes, this makes sense to me. What about the following?
> 
> Many thanks for the explanation on this workaround.
> With this explanation, I think the others will have a clear idea why
> we must add this ugly workaround here.

OK, this would be the patch with the full changelog. If both Chris and
Johannes are ok with this I would suggest replacing the one Andrew took
already.


From dfcdbfd336d2d23195ec9d90e6e58898f49f8998 Mon Sep 17 00:00:00 2001
From: Yafang Shao 
Date: Mon, 4 May 2020 09:10:03 +0200
Subject: [PATCH] mm, memcg: Avoid stale protection values when cgroup is above
 protection

A cgroup can have both memory protection and a memory limit to isolate
it from its siblings in both directions - for example, to prevent it
from being shrunk below 2G under high pressure from outside, but also
from growing beyond 4G under low pressure.

Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
implemented proportional scan pressure so that multiple siblings in
excess of their protection settings don't get reclaimed equally but
instead in accordance to their unprotected portion.

During limit reclaim, this proportionality shouldn't apply of course:
there is no competition, all pressure is from within the cgroup and
should be applied as such. Reclaim should operate at full efficiency.

However, mem_cgroup_protected() never expected anybody to look at the
effective protection values when it indicated that the cgroup is above
its protection. As a result, a query during limit reclaim may return
stale protection values that were calculated by a previous reclaim cycle
in which the cgroup did have siblings.

When this happens, reclaim is unnecessarily hesitant and potentially
slow to meet the desired limit. In theory this could lead to premature
OOM kills, although it's not obvious this has occurred in practice.

Workaround the problem by special casing reclaim roots in
mem_cgroup_protection. These memcgs are never participating in the
reclaim protection because the reclaim is internal.

We have to ignore effective protection values for reclaim roots because
mem_cgroup_protected might be called from racing reclaim contexts with
different roots. Calculation is relying on root -> leaf tree traversal
therefore top-down reclaim protection invariants should hold. The only
exception is the reclaim root which should have effective protection set
to 0 but that would be problematic for the following setup:
 Let's have global and A's reclaim in parallel:
  |
  A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
  |\
  | C (low = 1G, usage = 2.5G)
  B (low = 1G, usage = 0.5G)

 for A reclaim we have
 B.elow = B.low
 C.elow = C.low

 For the global reclaim
 A.elow = A.low
 B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
 C.elow = min(C.usage, C.low)

 With the effective values resetting we have A reclaim
 A.elow = 0
 B.elow = B.low
 C.elow = C.low

 and global reclaim could see the above and then
 B.elow = C.elow = 0 because children_low_usage > A.elow

Which means that protected memcgs would get reclaimed.

In future we would like to make mem_cgroup_protected more robust against
racing reclaim contexts but that is likely a more complex solution than
this simple workaround.

[han...@cmpxchg.org - large part of the changelog]
[mho...@suse.com - workaround explanation]
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Yafang Shao 
Signed-off-by: Michal Hocko 
---
 include/linux/memcontrol.h | 36 
 mm/memcontrol.c|  8 
 2 files changed, 44 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b4150ff64be..50ffbc17cdd8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -350,6 +350,42 @@ s

Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Michal Hocko
On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> On Fri, May 1, 2020 at 2:27 AM Shakeel Butt  wrote:
> >
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
> >
> 
> I have been confused by this behavior for several months and I think
> it will confuse more memcg users.

Could you be more specific what has caused the confusion?

> We should keep the memcg oom behavior consistent with system oom - no
> oom kill if no process.

This is not the global memcg behavior. We do complain loudly on no
eligible tasks and actually panic the system. Memcg cannot simply
do the same by default for obvious reasons.

> What about bellow change ?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e28098e13f1c..25fbc37a747f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
> kernfs_open_file *of,
> continue;
> }
> 
> +   if (!cgroup_is_populated(memcg->css.cgroup))
> +   break;
> +
> memcg_memory_event(memcg, MEMCG_OOM);
> if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> break;

I am not a great fan to be honest. The warning might be useful for other
usecases when it is not clear that the memcg is empty.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-03 Thread Michal Hocko
On Thu 30-04-20 13:20:10, Shakeel Butt wrote:
> On Thu, Apr 30, 2020 at 12:29 PM Johannes Weiner  wrote:
> >
> > On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> > >
> > > Deleting a memcg does not reclaim memory from it and the memory can
> > > linger till there is a memory pressure. One normal way to proactively
> > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > memcg. However if some of the memcg's memory is pinned by others, this
> > > operation can trigger an oom-kill without any process and thus can log a
> > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > Can't you set memory.high=0 instead? It does the reclaim portion of
> > memory.max, without the actual OOM killing that causes you problems.
> 
> Yes that would work but remote charging concerns me. Remote charging
> can still happen after the memcg is offlined and at the moment, high
> reclaim does not work for remote memcg and the usage can go till max
> or global pressure. This is most probably a misconfiguration and we
> might not receive the warnings in the log ever. Setting memory.max to
> 0 will definitely give such warnings.

Can we add a warning for the remote charging on dead memcgs?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-03 Thread Michal Hocko
On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> Lowering memory.max can trigger an oom-kill if the reclaim does not
> succeed. However if oom-killer does not find a process for killing, it
> dumps a lot of warnings.

It shouldn't dump much more than the regular OOM report AFAICS. Sure
there is "Out of memory and no killable processes..." message printed as
well but is that a real problem?

> Deleting a memcg does not reclaim memory from it and the memory can
> linger till there is a memory pressure. One normal way to proactively
> reclaim such memory is to set memory.max to 0 just before deleting the
> memcg. However if some of the memcg's memory is pinned by others, this
> operation can trigger an oom-kill without any process and thus can log a
> lot un-needed warnings. So, ignore all such warnings from memory.max.

OK, I can see why you might want to use memory.max for that purpose but
I do not really understand why the oom report is a problem here.
memory.max can trigger the oom kill and user should be expecting the oom
report under that condition. Why is "no eligible task" so special? Is it
because you know that there won't be any tasks for your particular case?
What about other use cases where memory.max is not used as a "sweep
before tear down"?

> Signed-off-by: Shakeel Butt 
> ---
>  include/linux/oom.h | 3 +++
>  mm/memcontrol.c | 9 +
>  mm/oom_kill.c   | 2 +-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index c696c265f019..6345dc55df64 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -52,6 +52,9 @@ struct oom_control {
>  
>   /* Used to print the constraint info. */
>   enum oom_constraint constraint;
> +
> + /* Do not warn even if there is no process to be killed. */
> + bool no_warn;
>  };
>  
>  extern struct mutex oom_lock;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 317dbbaac603..a1f00d9b9bb0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
>  }
>  
>  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
> -  int order)
> +  int order, bool no_warn)
>  {
>   struct oom_control oc = {
>   .zonelist = NULL,
> @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
> *memcg, gfp_t gfp_mask,
>   .memcg = memcg,
>   .gfp_mask = gfp_mask,
>   .order = order,
> + .no_warn = no_warn,
>   };
>   bool ret;
>  
> @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup 
> *memcg, gfp_t mask, int
>   mem_cgroup_oom_notify(memcg);
>  
>   mem_cgroup_unmark_under_oom(memcg);
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order, false))
>   ret = OOM_SUCCESS;
>   else
>   ret = OOM_FAILED;
> @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>   mem_cgroup_unmark_under_oom(memcg);
>   finish_wait(&memcg_oom_waitq, &owait.wait);
>   mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> -  current->memcg_oom_order);
> +  current->memcg_oom_order, false);
>   } else {
>   schedule();
>   mem_cgroup_unmark_under_oom(memcg);
> @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file 
> *of,
>   }
>  
>   memcg_memory_event(memcg, MEMCG_OOM);
> - if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> + if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
>   break;
>   }
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 463b3d74a64a..5ace39f6fe1e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
>  
>   select_bad_process(oc);
>   /* Found nothing?!?! */
> - if (!oc->chosen) {
> + if (!oc->chosen && !oc->no_warn) {
>   dump_header(oc, NULL);
>   pr_warn("Out of memory and no killable processes...\n");
>   /*
> -- 
> 2.26.2.526.g744177e7f7-goog

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, memcg: Avoid stale protection values when cgroup is above protection

2020-04-30 Thread Michal Hocko
On Wed 29-04-20 12:56:27, Johannes Weiner wrote:
[...]
> I think to address this, we need a more comprehensive solution and
> introduce some form of serialization. I'm not sure yet how that would
> look like yet.

Yeah, that is what I've tried to express earlier and that is why I would
rather go with an uglier workaround for now and think about a more
robust effective values calculation on top.
 
> I'm still not sure it's worth having a somewhat ugly workaround in
> mem_cgroup_protection() to protect against half of the bug. If you
> think so, the full problem should at least be documented and marked
> XXX or something.

Yes, this makes sense to me. What about the following?
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b4150ff64be..50ffbc17cdd8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -350,6 +350,42 @@ static inline unsigned long mem_cgroup_protection(struct 
mem_cgroup *memcg,
if (mem_cgroup_disabled())
return 0;
 
+   /*
+* There is no reclaim protection applied to a targeted reclaim.
+* We are special casing this specific case here because
+* mem_cgroup_protected calculation is not robust enough to keep
+* the protection invariant for calculated effective values for
+* parallel reclaimers with different reclaim target. This is
+* especially a problem for tail memcgs (as they have pages on LRU)
+* which would want to have effective values 0 for targeted reclaim
+* but a different value for external reclaim.
+*
+* Example
+* Let's have global and A's reclaim in parallel:
+*  |
+*  A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
+*  |\
+*  | C (low = 1G, usage = 2.5G)
+*  B (low = 1G, usage = 0.5G)
+*
+* For the global reclaim
+* A.elow = A.low
+* B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
+* C.elow = min(C.usage, C.low)
+*
+* With the effective values resetting we have A reclaim
+* A.elow = 0
+* B.elow = B.low
+* C.elow = C.low
+*
+* If the global reclaim races with A's reclaim then
+* B.elow = C.elow = 0 because children_low_usage > A.elow)
+* is possible and reclaiming B would be violating the protection.
+*
+*/
+   if (memcg == root)
+   return 0;
+
if (in_low_reclaim)
return READ_ONCE(memcg->memory.emin);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..df88a22f09bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6385,6 +6385,14 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
 
if (!root)
root = root_mem_cgroup;
+
+   /*
+* Effective values of the reclaim targets are ignored so they
+* can be stale. Have a look at mem_cgroup_protection for more
+* details.
+* TODO: calculation should be more robust so that we do not need
+* that special casing.
+*/
if (memcg == root)
return MEMCG_PROT_NONE;
 

> In practice, I doubt this matters all that much because limit reclaim
> and global reclaim tend to occur in complementary
> containerization/isolation strategies, not heavily simultaneously.

I would expect that as well but this is always hard to tell.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, memcg: Avoid stale protection values when cgroup is above protection

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 10:03:30, Johannes Weiner wrote:
> On Wed, Apr 29, 2020 at 12:15:10PM +0200, Michal Hocko wrote:
> > On Tue 28-04-20 19:26:47, Chris Down wrote:
> > > From: Yafang Shao 
> > > 
> > > A cgroup can have both memory protection and a memory limit to isolate
> > > it from its siblings in both directions - for example, to prevent it
> > > from being shrunk below 2G under high pressure from outside, but also
> > > from growing beyond 4G under low pressure.
> > > 
> > > Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > > implemented proportional scan pressure so that multiple siblings in
> > > excess of their protection settings don't get reclaimed equally but
> > > instead in accordance to their unprotected portion.
> > > 
> > > During limit reclaim, this proportionality shouldn't apply of course:
> > > there is no competition, all pressure is from within the cgroup and
> > > should be applied as such. Reclaim should operate at full efficiency.
> > > 
> > > However, mem_cgroup_protected() never expected anybody to look at the
> > > effective protection values when it indicated that the cgroup is above
> > > its protection. As a result, a query during limit reclaim may return
> > > stale protection values that were calculated by a previous reclaim cycle
> > > in which the cgroup did have siblings.
> > > 
> > > When this happens, reclaim is unnecessarily hesitant and potentially
> > > slow to meet the desired limit. In theory this could lead to premature
> > > OOM kills, although it's not obvious this has occurred in practice.
> > 
> > Thanks, this describes the underlying problem. I would also be explicit
> > that the issue should be visible only on tail memcgs which have both
> > max/high and protection configured and the effect depends on the
> > difference between the two (the smaller it is the larger the effect).
> > 
> > There is no mention about the fix. The patch resets effective values for
> > the reclaim root and I've had some concerns about that
> > http://lkml.kernel.org/r/20200424162103.gk11...@dhcp22.suse.cz.
> > Johannes has argued that other races are possible and I didn't get to
> > think about it thoroughly. But this patch is introducing a new
> > possibility of breaking protection. If we want to have a quick and
> > simple fix that would be easier to backport to older kernels then I
> > would feel much better if we simply workedaround the problem as
> > suggested earlier 
> > http://lkml.kernel.org/r/20200423061629.24185-1-laoar.s...@gmail.com
> > We can rework the effective values calculation to be more robust against
> > races on top of that because this is likely a more tricky thing to do.
> 
> Well, can you please *do* think more thoroughly about what I wrote,
> instead of pushing for an alternative patch on gut feeling alone?
> 
> Especially when you imply that this should be a stable patch.

The patch has a Fixes tag and so it is not unrealistic to assume that it
will hit older trees. I wasn't really implying a stable tree backport and
I do not think this is stable material.

All I was arguing here is that a fix/workaround which doesn't add new
side effects is a safer option.

> Not only does your alternative patch not protect against the race you
> are worried about, the race itself doesn't matter. Racing reclaimers
> will write their competing views of the world into the shared state on
> all other levels anyway.
> 
> And that's okay. If the configuration and memory usage is such that
> there is at least one reclaimer that scans without any protection
> (like a limit reclaimer), it's not a problem when a second reclaimer
> that meant to do protected global reclaim will also do one iteration
> without protection. It's no different than if a second thread had
> entered limit reclaim through another internal allocation.

Yes I do agree here.

> There is no semantical violation with the race in your patch or the
> race in this patch. Any effective protection that becomes visible is
> 1) permitted by the configuration, but 2) also triggered *right now*
> by an acute need to reclaim memory with these parameters.
> 
> The *right now* part is important. That's what's broken before either
> patch, and that's what we're fixing: to see really, really *old* stale
> that might not be representative of the config semantics anymore.

No disagreement here either. But please remember that the example I've
given is a clear violation of the protection. Let me p

Re: [PATCH v3] mm/vmscan.c: change prototype for shrink_page_list

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 19:20:59, Maninder Singh wrote:
> 'commit 3c710c1ad11b ("mm, vmscan:
> extract shrink_page_list reclaim counters into a struct")'
> 
> changed data type for the function,
> so changing return type for function and its caller.
> 
> Signed-off-by: Vaneet Narang 
> Signed-off-by: Maninder Singh 

You could have kept my ack from v1
Acked-by: Michal Hocko 

Thanks!

> ---
> v1 -> v2: position of variable changed mistakenly, thus reverted.
> v2 -> v3: Don't change position of any variable, thus reverted.
> if required then need to send by separate patch.
> 
>  mm/internal.h   |  2 +-
>  mm/page_alloc.c |  2 +-
>  mm/vmscan.c | 24 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index b5634e7..c3eeec8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -527,7 +527,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct 
> file *, unsigned long,
>  unsigned long, unsigned long);
>  
>  extern void set_pageblock_order(void);
> -unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> +unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   struct list_head *page_list);
>  /* The ALLOC_WMARK bits are used as an index to zone->watermark */
>  #define ALLOC_WMARK_MIN  WMARK_MIN
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1385d78..f17d88c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8416,7 +8416,7 @@ static int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   unsigned long start, unsigned long end)
>  {
>   /* This function is based on compact_zone() from compaction.c. */
> - unsigned long nr_reclaimed;
> + unsigned int nr_reclaimed;
>   unsigned long pfn = start;
>   unsigned int tries = 0;
>   int ret = 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b06868f..7631725 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1066,17 +1066,17 @@ static void page_check_dirty_writeback(struct page 
> *page,
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> -static unsigned long shrink_page_list(struct list_head *page_list,
> -   struct pglist_data *pgdat,
> -   struct scan_control *sc,
> -   enum ttu_flags ttu_flags,
> -   struct reclaim_stat *stat,
> -   bool ignore_references)
> +static unsigned int shrink_page_list(struct list_head *page_list,
> +  struct pglist_data *pgdat,
> +  struct scan_control *sc,
> +  enum ttu_flags ttu_flags,
> +  struct reclaim_stat *stat,
> +  bool ignore_references)
>  {
>   LIST_HEAD(ret_pages);
>   LIST_HEAD(free_pages);
> - unsigned nr_reclaimed = 0;
> - unsigned pgactivate = 0;
> + unsigned int nr_reclaimed = 0;
> + unsigned int pgactivate = 0;
>  
>   memset(stat, 0, sizeof(*stat));
>   cond_resched();
> @@ -1483,7 +1483,7 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   return nr_reclaimed;
>  }
>  
> -unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> +unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   struct list_head *page_list)
>  {
>   struct scan_control sc = {
> @@ -1492,7 +1492,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone 
> *zone,
>   .may_unmap = 1,
>   };
>   struct reclaim_stat dummy_stat;
> - unsigned long ret;
> + unsigned int ret;
>   struct page *page, *next;
>   LIST_HEAD(clean_pages);
>  
> @@ -1900,7 +1900,7 @@ static int current_may_throttle(void)
>  {
>   LIST_HEAD(page_list);
>   unsigned long nr_scanned;
> - unsigned long nr_reclaimed = 0;
> + unsigned int nr_reclaimed = 0;
>   unsigned long nr_taken;
>   struct reclaim_stat stat;
>   int file = is_file_lru(lru);
> @@ -2096,7 +2096,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  unsigned long reclaim_pages(struct list_head *page_list)
>  {
>   int nid = NUMA_NO_NODE;
> - unsigned long nr_reclaimed = 0;
> + unsigned int nr_reclaimed = 0;
>   LIST_HEAD(node_page_list);
>   struct reclaim_stat dummy_stat;
>   struct page *page;
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] printk: Add loglevel for "do not print to consoles".

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 01:23:15, Tetsuo Handa wrote:
> On 2020/04/29 0:45, Michal Hocko wrote:
> > On Tue 28-04-20 22:11:19, Tetsuo Handa wrote:
> >> Existing KERN_$LEVEL allows a user to determine whether he/she wants that 
> >> message
> >> to be printed on consoles (even if it spams his/her operation doing on 
> >> consoles), and
> >> at the same time constrains that user whether that message is saved to log 
> >> files.
> >> KERN_NO_CONSOLES allows a user to control whether he/she wants that 
> >> message to be
> >> saved to log files (without spamming his/her operation doing on consoles).
> > 
> > I understand that. But how do I know whether the user considers the
> > particular information important enough to be dumped on the console.
> > This sounds like a policy in the kernel to me.
> 
> I'm still unable to understand your question.

I am trying to say that KERN_NO_CONSOLES resembles a policy more than a
priority, because I as a developer have no idea whether the message is
suitable for the console or not.

> >I simply cannot foresee
> > any console configuration to tell whether my information is going to
> > swamp the console to no use or not.
> 
> Neither can I.
> 
> > Compare that to KERN_$LEVEL instead.
> > I know that an information is of low/high importance. It is the user
> > policy to decide and base some filtering on top of that priority.
> 
> Whether to use KERN_NO_CONSOLES is not per-importance basis but per-content 
> basis.
> 
> Since both pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", 
> ...) from dump_tasks() and
> pr_info("oom-kill:constraint=%s,nodemask=%*pbl", ...) from dump_oom_summary() 
> use KERN_INFO importance,
> existing KERN_$LEVEL-based approach cannot handle these messages differently. 
> Since changing the former to
> e.g. KERN_DEBUG will cause userspace to discard the messages, we effectively 
> can't change KERN_$LEVEL.

I believe we are free to change kernel log levels as we see fit. I
was not aware that KERN_DEBUG messages are automatically filtered out.
Even if this is the case, it doesn't prevent admins from allowing
KERN_DEBUG into log files. The dump of the oom eligible tasks is
arguably debugging output anyway, so I disagree with your statement.

> If the kernel allows the former to use KERN_NO_CONSOLES in addition to 
> KERN_INFO, the administrator can
> select from two choices: printing "both the former and the latter" or "only 
> the latter" to consoles.

I am not really familiar with all the possibilities admins have when
setting up filtering for different consoles, but KERN_NO_CONSOLES sounds
rather alien to the existing priority based approach. You can fine-tune
priorities and that is all right because they should reflect
importance. But a global no-consoles marking doesn't really fit in here,
because each console might require a different policy while the marking
is unconditional and largely unaware of the existing consoles.
-- 
Michal Hocko
SUSE Labs


Re: (2) [PATCH 1/1] mm/vmscan.c: change prototype for shrink_page_list

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 18:59:40, Vaneet Narang wrote:
> Hi Michal, 
> 
> >> >
> >> >Acked-by: Michal Hocko 
> >> >
> >> >Is there any reason to move declarations here?
> >> >
> >> 
> >> "unsigned int ret" was changed mistakenely, sending V2.
> >> and "unsigned int nr_reclaimed" is changed to remove hole.
>  
> >Could you be more specific? Have you seen a better stack allocation
> >footprint?
> 
> We didn't check stack allocation footprint, we did changes just by looking 
> into the code.
> we thought changing reclaimed type from long to int on 64 bit platform will 
> add 
> hole of 4 bytes between two stack variables nr_reclaimed & nr_taken.
> 
> So we tried to remove that hole by packing it with bool.
> 
>   unsigned long nr_scanned; --> Size and alignment 8 byte 
> for long 
> - unsigned long nr_reclaimed = 0;   --> Changing to int will make 
> its size as 4  
>   unsigned long nr_taken;   --> nr_taken needs alignment 
> of 8 so will add hole.
>   struct reclaim_stat stat;
>   int file = is_file_lru(lru);
>   enum vm_event_item item;
>   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + unsigned int nr_reclaimed = 0;--> So moving to this place 
> to pack it along with bool 
>   bool stalled = false;
> 
> 
> Overall stack footprint might not change as compiler makes function stack 
> pointer as 16 byte aligned but we did this 
> as we normally follow this coding convention when defining structures or 
> stack variables. 

My understanding is that gcc can and does play tricks when allocating
space for local variables. It can use registers and there is no dictated
layout or ordering for their placement (unlike for structures).

Anyway, I would prefer if the patch did one thing at a time.
If you can see some improvement from the reordering (have a look at
./scripts/bloat-o-meter), make it a separate patch with some numbers
attached.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/1] mm/vmscan.c: change prototype for shrink_page_list

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 18:23:23, Maninder Singh wrote:
> 
> Hi,
> 
> >
> >Acked-by: Michal Hocko 
> >
> >Is there any reason to move declarations here?
> >
> 
> "unsigned int ret" was changed mistakenely, sending V2.
> and "unsigned int nr_reclaimed" is changed to remove hole.

Could you be more specific? Have you seen a better stack allocation
footprint?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > 
> > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > > always online.
> > > 
> > > ...
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > >   */
> > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > >   [N_POSSIBLE] = NODE_MASK_ALL,
> > > +#ifdef CONFIG_NUMA
> > > + [N_ONLINE] = NODE_MASK_NONE,
> > > +#else
> > >   [N_ONLINE] = { { [0] = 1UL } },
> > > -#ifndef CONFIG_NUMA
> > >   [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > >  #ifdef CONFIG_HIGHMEM
> > >   [N_HIGH_MEMORY] = { { [0] = 1UL } },
> > 
> > So on all other NUMA machines, when does node 0 get marked online?
> > 
> > This change means that for some time during boot, such machines will
> > now be running with node 0 marked as offline.  What are the
> > implications of this?  Will something break?
> 
> Till the nodes are detected, marking Node 0 as online tends to be redundant.
> Because the system doesn't know if its a NUMA or a non-NUMA system.
> Once we detect the nodes, we online them immediately. Hence I don't see any
> side-effects or negative implications of this change.
> 
> However if I am missing anything, please do let me know.
> 
> >From my part, I have tested this on
> 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> 3. NUMA Multi node but with CPUs and memory from node 0.
> 4. NUMA Multi node but with no CPUs and memory from node 0.

Have you tested on something other than ppc? Each arch does the NUMA
setup separately and this is a big mess. E.g. x86 marks even memoryless
nodes (see init_memory_less_node) as online.

Honestly, I have a hard time evaluating the effect of this patch. It
makes some sense to assume all nodes are offline before they get
onlined, but this is land mine territory.

I am also not sure what kind of problem this is going to address. You
have mentioned NUMA balancing without many details.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/1] mm/vmscan.c: change prototype for shrink_page_list

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 11:29:27, Maninder Singh wrote:
> 'commit 3c710c1ad11b ("mm, vmscan:
> extract shrink_page_list reclaim counters into a struct")'
> 
> changed data type for the function,
> so changing return type for function and its caller.
> 
> Signed-off-by: Vaneet Narang 
> Signed-off-by: Maninder Singh 

Acked-by: Michal Hocko 

Is there any reason to move declarations here?

> -unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> +unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   struct list_head *page_list)
>  {
> + unsigned int ret;
>   struct scan_control sc = {
>   .gfp_mask = GFP_KERNEL,
>   .priority = DEF_PRIORITY,
>   .may_unmap = 1,
>   };
>   struct reclaim_stat dummy_stat;
> - unsigned long ret;
>   struct page *page, *next;
>   LIST_HEAD(clean_pages);
>  
> @@ -1900,13 +1900,13 @@ static int current_may_throttle(void)
>  {
>   LIST_HEAD(page_list);
>   unsigned long nr_scanned;
> - unsigned long nr_reclaimed = 0;
>   unsigned long nr_taken;
>   struct reclaim_stat stat;
>   int file = is_file_lru(lru);
>   enum vm_event_item item;
>   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + unsigned int nr_reclaimed = 0;
>   bool stalled = false;
>  
>   while (unlikely(too_many_isolated(pgdat, file, sc))) {
> @@ -2096,7 +2096,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  unsigned long reclaim_pages(struct list_head *page_list)
>  {
>   int nid = NUMA_NO_NODE;
> - unsigned long nr_reclaimed = 0;
> + unsigned int nr_reclaimed = 0;
>   LIST_HEAD(node_page_list);
>   struct reclaim_stat dummy_stat;
>   struct page *page;
> -- 
> 1.9.1

-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 19:45:07, Tetsuo Handa wrote:
> On 2020/04/29 18:04, Michal Hocko wrote:
> > Completely agreed! The in kernel OOM killer is to deal with situations
> > when memory is desperately depleted without any sign of a forward
> > progress. If there is a reclaimable memory then we are not there yet.
> > If a workload can benefit from early oom killing based on response time
> > then we have facilities to achieve that (e.g. PSI).
> 
> Can PSI work even if userspace process cannot avoid reclaimable memory
> allocations (e.g. page fault, file read) is already stalling?

The userspace itself would have to be careful and use mlock of course.
But collecting the PSI information itself should be pretty independent
of memory allocations, as monitoring the system memory state is one of
the main use cases.

> I'm not sure
> whether PSI allows responding quickly enough to "keep reclaimable memory
> allocations not to reclaim" despite there is still reclaimable memory...

PSI is supposed to monitor the time spent in the memory allocator (among
other things) and report the trend. This should be a sufficient
metric to judge that a large part of the userspace is not making forward
progress.
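
For reference, a userspace monitor along those lines is only a handful
of lines. A minimal sketch, assuming the documented /proc/pressure/memory
trigger interface; the 150ms-per-1s threshold is just an example value:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* fire when tasks stall on memory for 150ms out of every 1s */
	const char trigger[] = "some 150000 1000000";
	struct pollfd fds;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0)
		return 1;
	if (write(fds.fd, trigger, strlen(trigger) + 1) < 0)
		return 1;

	fds.events = POLLPRI;
	while (poll(&fds, 1, -1) > 0) {
		if (fds.revents & POLLERR)
			break;		/* the trigger got released */
		if (fds.revents & POLLPRI)
			printf("memory pressure threshold crossed\n");
	}
	close(fds.fd);
	return 0;
}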
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, memcg: Avoid stale protection values when cgroup is above protection

2020-04-29 Thread Michal Hocko
On Tue 28-04-20 19:26:47, Chris Down wrote:
> From: Yafang Shao 
> 
> A cgroup can have both memory protection and a memory limit to isolate
> it from its siblings in both directions - for example, to prevent it
> from being shrunk below 2G under high pressure from outside, but also
> from growing beyond 4G under low pressure.
> 
> Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> implemented proportional scan pressure so that multiple siblings in
> excess of their protection settings don't get reclaimed equally but
> instead in accordance to their unprotected portion.
> 
> During limit reclaim, this proportionality shouldn't apply of course:
> there is no competition, all pressure is from within the cgroup and
> should be applied as such. Reclaim should operate at full efficiency.
> 
> However, mem_cgroup_protected() never expected anybody to look at the
> effective protection values when it indicated that the cgroup is above
> its protection. As a result, a query during limit reclaim may return
> stale protection values that were calculated by a previous reclaim cycle
> in which the cgroup did have siblings.
> 
> When this happens, reclaim is unnecessarily hesitant and potentially
> slow to meet the desired limit. In theory this could lead to premature
> OOM kills, although it's not obvious this has occurred in practice.

Thanks, this describes the underlying problem. I would also be explicit
that the issue should be visible only on tail memcgs which have both
max/high and protection configured, and that the effect depends on the
difference between the two (the smaller it is, the larger the effect).

There is no mention of the fix. The patch resets effective values for
the reclaim root and I've had some concerns about that
http://lkml.kernel.org/r/20200424162103.gk11...@dhcp22.suse.cz.
Johannes has argued that other races are possible and I didn't get to
think about it thoroughly. But this patch is introducing a new
possibility of breaking protection. If we want to have a quick and
simple fix that would be easier to backport to older kernels then I
would feel much better if we simply worked around the problem as
suggested earlier
http://lkml.kernel.org/r/20200423061629.24185-1-laoar.s...@gmail.com
We can rework the effective values calculation to be more robust against
races on top of that, because that is likely a trickier thing to do.

> Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> Signed-off-by: Yafang Shao 
> Signed-off-by: Chris Down 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Roman Gushchin 
> 
> [han...@cmpxchg.org: rework code comment]
> [han...@cmpxchg.org: changelog]
> [ch...@chrisdown.name: fix store tear]
> [ch...@chrisdown.name: retitle]
> ---
>  mm/memcontrol.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0be00826b832..b0374be44e9e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6392,8 +6392,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
> mem_cgroup *root,
>  
>   if (!root)
>   root = root_mem_cgroup;
> - if (memcg == root)
> + if (memcg == root) {
> + /*
> +  * The cgroup is the reclaim root in this reclaim
> +  * cycle, and therefore not protected. But it may have
> +  * stale effective protection values from previous
> +  * cycles in which it was not the reclaim root - for
> +  * example, global reclaim followed by limit reclaim.
> +  * Reset these values for mem_cgroup_protection().
> +  */
> + WRITE_ONCE(memcg->memory.emin, 0);
> + WRITE_ONCE(memcg->memory.elow, 0);
>   return MEMCG_PROT_NONE;
> + }
>  
>   usage = page_counter_read(&memcg->memory);
>   if (!usage)
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm, memcg: Decouple e{low,min} state mutations from protection checks

2020-04-29 Thread Michal Hocko
On Tue 28-04-20 19:27:00, Chris Down wrote:
> mem_cgroup_protected currently is both used to set effective low and min
> and return a mem_cgroup_protection based on the result. As a user, this
> can be a little unexpected: it appears to be a simple predicate
> function, if not for the big warning in the comment above about the
> order in which it must be executed.
> 
> This change makes it so that we separate the state mutations from the
> actual protection checks, which makes it more obvious where we need to
> be careful mutating internal state, and where we are simply checking and
> don't need to worry about that.
> 
> Signed-off-by: Chris Down 
> Suggested-by: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Roman Gushchin 
> Cc: Yafang Shao 

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 48 +-
>  mm/memcontrol.c| 30 +++-
>  mm/vmscan.c| 17 --
>  3 files changed, 49 insertions(+), 46 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d630af1a4e17..88576b1235b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -50,12 +50,6 @@ enum memcg_memory_event {
>   MEMCG_NR_MEMORY_EVENTS,
>  };
>  
> -enum mem_cgroup_protection {
> - MEMCG_PROT_NONE,
> - MEMCG_PROT_LOW,
> - MEMCG_PROT_MIN,
> -};
> -
>  struct mem_cgroup_reclaim_cookie {
>   pg_data_t *pgdat;
>   unsigned int generation;
> @@ -357,8 +351,26 @@ static inline unsigned long mem_cgroup_protection(struct 
> mem_cgroup *memcg,
>  READ_ONCE(memcg->memory.elow));
>  }
>  
> -enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> - struct mem_cgroup *memcg);
> +void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> +  struct mem_cgroup *memcg);
> +
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +{
> + if (mem_cgroup_disabled())
> + return false;
> +
> + return READ_ONCE(memcg->memory.elow) >=
> + page_counter_read(&memcg->memory);
> +}
> +
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> +{
> + if (mem_cgroup_disabled())
> + return false;
> +
> + return READ_ONCE(memcg->memory.emin) >=
> + page_counter_read(&memcg->memory);
> +}
>  
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **memcgp,
> @@ -838,13 +850,27 @@ static inline void memcg_memory_event_mm(struct 
> mm_struct *mm,
>  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> bool in_low_reclaim)
>  {
> +
> +
> +static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
> +struct mem_cgroup *memcg);
> +{
> +}
> +
> +static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
> +  bool in_low_reclaim)
> +{
>   return 0;
>  }
>  
> -static inline enum mem_cgroup_protection mem_cgroup_protected(
> - struct mem_cgroup *root, struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +{
> + return false;
> +}
> +
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
>  {
> - return MEMCG_PROT_NONE;
> + return false;
>  }
>  
>  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct 
> *mm,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b0374be44e9e..317dbbaac603 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6368,27 +6368,21 @@ static unsigned long effective_protection(unsigned 
> long usage,
>  }
>  
>  /**
> - * mem_cgroup_protected - check if memory consumption is in the normal range
> + * mem_cgroup_calculate_protection - calculate and cache effective low and 
> min
>   * @root: the top ancestor of the sub-tree being checked
>   * @memcg: the memory cgroup to check
>   *
>   * WARNING: This function is not stateless! It can only be used as part
>   *  of a top-down tree iteration, not for isolated queries.
> - *
> - * Returns one of the following:
> - *   MEMCG_PROT_NONE: cgroup memory is not protected
> - *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
> - * an unprotected supply of reclaimable memory from other cgroups.
> - *   MEMCG_PROT_MIN: cgroup memory is prot

Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 09:51:39, Vlastimil Babka wrote:
> On 4/28/20 11:48 PM, David Rientjes wrote:
> > On Tue, 28 Apr 2020, Vlastimil Babka wrote:
> > 
> > Yes, order-0 reclaim capture is interesting since the issue being reported 
> > here is userspace going out to lunch because it loops for an unbounded 
> > amount of time trying to get above a watermark where it's allowed to 
> > allocate and other consumers are depleting that resource.
> > 
> > We actually prefer to oom kill earlier rather than being put in a 
> > perpetual state of aggressive reclaim that affects all allocators and the 
> > unbounded nature of those allocations leads to very poor results for 
> > everybody.
> 
> Sure. My vague impression is that your (and similar cloud companies) kind of
> workloads are designed to maximize machine utilization, and overshooting and
> killing something as a result is no big deal. Then you perhaps have more
> probability of hitting this state, and on the other hand, even an occasional
> premature oom kill is not a big deal?
> 
> My concers are workloads not designed in such a way, where premature oom kill
> due to temporary higher reclaim activity together with burst of incoming 
> network
> packets will result in e.g. killing an important database. There, the tradeoff
> looks different.

Completely agreed! The in kernel OOM killer is to deal with situations
when memory is desperately depleted without any sign of a forward
progress. If there is a reclaimable memory then we are not there yet.
If a workload can benefit from early oom killing based on response time
then we have facilities to achieve that (e.g. PSI).
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 10:31:41, peter enderborg wrote:
> On 4/28/20 9:43 AM, Michal Hocko wrote:
> > On Mon 27-04-20 16:35:58, Andrew Morton wrote:
> > [...]
> >> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
> >> it.
> >> Subsystems such as networking will consume a certain amount and
> >> will then start recycling it.  The total amount in-flight will vary
> >> over the longer term as workloads change.  A dynamically tuning
> >> threshold system will need to adapt rapidly enough to sudden load
> >> shifts, which might require unreasonable amounts of headroom.
> > I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
> > reserves under memory pressure. Then allocations start failing very
> > quickly and users have to cope with that, usually by deferring to a
> > sleepable context. Tuning reserves dynamically for heavy reserves
> > consumers would be possible but I am worried that this is far from
> > trivial.
> >
> > We definitely need to understand what is going on here.  Why doesn't
> > kswapd + N*direct reclaimers do not provide enough memory to satisfy
> > both N threads + reserves consumers? How many times those direct
> > reclaimers have to retry?
> 
> Was this not supposed to be avoided with PSI, user-space should
> a fair change to take actions before it goes bad in user-space?

Yes, PSI is certainly a tool to help userspace take action on heavy
reclaim. And I agree that if there is a desire to trigger the oom killer
early, as David states elsewhere in the thread, then this approach should
be considered.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] printk: Add loglevel for "do not print to consoles".

2020-04-28 Thread Michal Hocko
On Tue 28-04-20 22:11:19, Tetsuo Handa wrote:
> On 2020/04/28 21:18, Michal Hocko wrote:
> > On Tue 28-04-20 20:33:21, Tetsuo Handa wrote:
> >> On 2020/04/27 15:21, Sergey Senozhatsky wrote:
> >>>> KERN_NO_CONSOLES is for type of messages where "saved for later 
> >>>> analysis" is
> >>>> important but "printed for immediate notification" is not important.
> >>>> In other words, KERN_NO_CONSOLES is NOT for dying messages where 
> >>>> "printed for
> >>>> immediate notification" is important.
> >>>
> >>> per-console loglevel is a user configurable parameter.
> >>> KERN_NO_CONSOLES is a hard-coded policy.
> >>
> >> But given that whether to use KERN_NO_CONSOLES is configurable via e.g. 
> >> sysctl,
> >> KERN_NO_CONSOLES will become a user configurable parameter. What's still 
> >> wrong?
> > 
> > How do I as a kernel developer know that KERN_NO_CONSOLES should be
> > used? In other words, how can I assume what a user will consider
> > important on the console?
> > 
> 
> Existing KERN_$LEVEL allows a user to determine whether he/she wants that 
> message
> to be printed on consoles (even if it spams his/her operation doing on 
> consoles), and
> at the same time constrains that user whether that message is saved to log 
> files.
> KERN_NO_CONSOLES allows a user to control whether he/she wants that message 
> to be
> saved to log files (without spamming his/her operation doing on consoles).

I understand that. But how do I know whether the user considers the
particular information important enough to be dumped on the console?
This sounds like a policy in the kernel to me. I simply cannot foresee
every console configuration to tell whether my information is going to
swamp the console to no use or not. Compare that to KERN_$LEVEL instead:
I know that a piece of information is of low/high importance, and it is
the user's policy to decide and base some filtering on top of that
priority.
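
To make that concrete, a sketch (the messages here are made up): the call
site only encodes importance, while which of these lines reaches a given
console is decided by the user via the console loglevel (the kernel.printk
sysctl or the loglevel=/quiet boot parameters). All of them still land in
the kernel log buffer:

	printk(KERN_ERR   "thermal shutdown imminent\n"); /* high importance */
	printk(KERN_INFO  "oom-kill summary line\n");     /* informational */
	printk(KERN_DEBUG "per-task oom dump entry\n");   /* debugging detail */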
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] printk: Add loglevel for "do not print to consoles".

2020-04-28 Thread Michal Hocko
On Tue 28-04-20 20:33:21, Tetsuo Handa wrote:
> On 2020/04/27 15:21, Sergey Senozhatsky wrote:
> >> KERN_NO_CONSOLES is for type of messages where "saved for later analysis" 
> >> is
> >> important but "printed for immediate notification" is not important.
> >> In other words, KERN_NO_CONSOLES is NOT for dying messages where "printed 
> >> for
> >> immediate notification" is important.
> > 
> > per-console loglevel is a user configurable parameter.
> > KERN_NO_CONSOLES is a hard-coded policy.
> 
> But given that whether to use KERN_NO_CONSOLES is configurable via e.g. 
> sysctl,
> KERN_NO_CONSOLES will become a user configurable parameter. What's still 
> wrong?

How do I as a kernel developer know that KERN_NO_CONSOLES should be
used? In other words, how can I assume what a user will consider
important on the console?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 1/5] kernel/sysctl: support setting sysctl parameters from kernel command line

2020-04-28 Thread Michal Hocko
On Tue 28-04-20 10:09:37, Vlastimil Babka wrote:
> On 4/27/20 8:33 PM, Andrew Morton wrote:
> > On Mon, 27 Apr 2020 20:04:29 +0200 Vlastimil Babka  wrote:
> > 
> > > ...
> > > + sysctl.*=   [KNL]
> > > + Set a sysctl parameter, right before loading the init
> > > + process, as if the value was written to the respective
> > > + /proc/sys/... file. Both '.' and '/' are recognized as
> > > + separators. Unrecognized parameters and invalid values
> > > + are reported in the kernel log. Sysctls registered
> > > + later by a loaded module cannot be set this way.
> > > + Example: sysctl.vm.swappiness=40
> > 
> > Why support "."?  I think only supporting "/" is perfectly adequate and
> > simplifies documentation.  It aligns the command-line syntax with the
> > rest of the sysctl documentation.  I'm not seeing the need to provide
> > two ways of doing the same thing?
> 
> AFAIK the "." is traditional, and "/" is a newer artefact of moving from the
> binary syscall form to procfs based form. So by "command-line syntax" you
> mean echo and cat, not sysctl tool? Because "man sysctl" says:
> 
> variable
>   The name of a key to read from.  An example is kernel.ostype.  The '/'
> separator is also accepted in place of a '.'.
> 
> So I'm not strongly against supporting only / but I expect most people are
> used to the . and it will take them two attempts to pass the sysctl boot
> parameter correctly if they don't use it regularly - first trying . form,
> wonder why it doesn't work, then read the doc and realize it's not
> supported?

Yes, I do agree. I have only recently learned that sysctl supports '/' as
well. Most people are simply used to the '.' notation. Copying the argument
and substituting '.' -> '/' is a trivial operation and I do not think it is
a real reason to introduce an unnecessarily harder to use interface.
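
To illustrate just how trivial, a sketch of that normalization (the
function name and the surrounding handling of the copied parameter are
assumptions for illustration, not the actual patch):

#include <linux/slab.h>
#include <linux/string.h>

static int sysctl_boot_param_set(const char *param, const char *val)
{
	/* accept both "kernel.ostype" and "kernel/ostype" spellings */
	char *path = kstrdup(param, GFP_KERNEL);

	if (!path)
		return -ENOMEM;
	strreplace(path, '.', '/');	/* existing helper in lib/string.c */
	/* ... resolve "path" under /proc/sys/ and write "val" there ... */
	kfree(path);
	return 0;
}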
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-28 Thread Michal Hocko
On Mon 27-04-20 16:35:58, Andrew Morton wrote:
[...]
> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
> it.
> Subsystems such as networking will consume a certain amount and
> will then start recycling it.  The total amount in-flight will vary
> over the longer term as workloads change.  A dynamically tuning
> threshold system will need to adapt rapidly enough to sudden load
> shifts, which might require unreasonable amounts of headroom.

I do agree. __GFP_HIGH/__GFP_ATOMIC are bounded by the size of the
reserves under memory pressure. Then allocations start failing very
quickly and users have to cope with that, usually by deferring to a
sleepable context. Tuning reserves dynamically for heavy reserves
consumers would be possible but I am worried that this is far from
trivial.
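
The usual shape of that coping, as a sketch (retry_work and the error
handling are made up for illustration, not tied to any real subsystem):

	buf = kmalloc(len, GFP_ATOMIC | __GFP_NOWARN);
	if (!buf) {
		/* reserves are depleted - back off, retry from a sleepable context */
		schedule_work(&retry_work);
		return -ENOMEM;
	}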

We definitely need to understand what is going on here. Why don't
kswapd + N direct reclaimers provide enough memory to satisfy both the
N threads and the reserves consumers? How many times do those direct
reclaimers have to retry?

We used to have the allocation stall warning, as David mentioned in the
patch description, and I have seen it triggering without heavy reserves
consumers (i.e. the reported free pages corresponded to the min watermark).
The underlying problem was usually kswapd being stuck on some FS locks,
direct reclaimers stuck in shrinkers, or a way too overloaded system with
dozens if not hundreds of processes stuck in the page allocator, each
racing with the reclaim and betting on luck. The last problem is the
most annoying because it is really hard to tune for.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 13:34:23, Waiman Long wrote:
[...]
> @@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct 
> seq_file *m,
>   }
>   seq_putc(m, '\n');
>   }
> +
> + /*
> +  * List total free blocks per order
> +  */
> + seq_printf(m, "Node %4d, zone %8s, total ",
> +pgdat->node_id, zone->name);
> + for (order = 0; order < MAX_ORDER; ++order) {
> + area = &(zone->free_area[order]);
> + seq_printf(m, "%6lu ", area->nr_free);
> + }
> + seq_putc(m, '\n');

This is essentially duplicating /proc/buddyinfo. Do we really need that?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 13:34:22, Waiman Long wrote:
> With a threshold of 10, it is still possible that the zone lock
> will be held for a very long time in the worst case scenario where all
> the counts are just below the threshold. With up to 6 migration types
> and 11 orders, it means up to 6.6 millions.
> 
> Track the total number of list iterations done since the acquisition
> of the zone lock and release it whenever 10 iterations or more have
> been completed. This will cap the lock hold time to no more than 200,000
> list iterations.
> 
> Signed-off-by: Waiman Long 
> ---
>  mm/vmstat.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 57ba091e5460..c5b82fdf54af 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file 
> *m,
>   pg_data_t *pgdat, struct zone *zone)
>  {
>   int order, mtype;
> + unsigned long iteration_count = 0;
>  
>   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct 
> seq_file *m,
>* of pages in this order should be more than
>* sufficient
>*/
> - if (++freecount >= 10) {
> + if (++freecount > 10) {
>   overflow = true;
> - spin_unlock_irq(&zone->lock);
> - cond_resched();
> - spin_lock_irq(&zone->lock);
> + freecount--;
>   break;
>   }
>   }
>   seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
> freecount);
> + /*
> +  * Take a break and release the zone lock when
> +  * 10 or more entries have been iterated.
> +  */
> + iteration_count += freecount;
> + if (iteration_count >= 10) {
> + iteration_count = 0;
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }

Aren't you overengineering this a bit? If you are still worried then we
can simply cond_resched() for each order:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c156ce24a322..ddb89f4e0486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
 */
if (++freecount >= 10) {
overflow = true;
-   spin_unlock_irq(&zone->lock);
-   cond_resched();
-   spin_lock_irq(&zone->lock);
break;
}
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
freecount);
+   spin_unlock_irq(&zone->lock);
+   cond_resched();
+   spin_lock_irq(&zone->lock);
}
seq_putc(m, '\n');
}

I do not have a strong opinion here but I can fold this into my patch 2.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 12:27:29, Mike Christie wrote:
> On 10/23/2019 02:11 AM, Michal Hocko wrote:
> > On Wed 23-10-19 07:43:44, Dave Chinner wrote:
> >> On Tue, Oct 22, 2019 at 06:33:10PM +0200, Michal Hocko wrote:
> > 
> > Thanks for more clarifiation regarding PF_LESS_THROTTLE.
> > 
> > [...]
> >>> PF_IO_FLUSHER would mean that the user
> >>> context is a part of the IO path and therefore there are certain reclaim
> >>> recursion restrictions.
> >>
> >> If PF_IO_FLUSHER just maps to PF_LESS_THROTTLE|PF_MEMALLOC_NOIO,
> >> then I'm not sure we need a new definition. Maybe that's the ptrace
> >> flag name, but in the kernel we don't need a PF_IO_FLUSHER process
> >> flag...
> > 
> > Yes, the internal implementation would do something like that. I was
> > more interested in the user space visible API at this stage. Something
> > generic enough because exporting MEMALLOC flags is just a bad idea IMHO
> > (especially PF_MEMALLOC).
> 
> Do you mean we would do something like:
> 
> prctl()
> 
> case PF_SET_IO_FLUSHER:
> current->flags |= PF_MEMALLOC_NOIO;
> 

yes, along with PF_LESS_THROTTLE.
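
Spelled out, something along these lines, as a sketch only (the constant
name follows the snippet above and the capability check is an assumption;
the final flag handling may well end up different):

	case PF_SET_IO_FLUSHER:
		if (!capable(CAP_SYS_RESOURCE))
			return -EPERM;
		if (arg2 == 1)
			current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
		else if (arg2 == 0)
			current->flags &= ~(PF_LESS_THROTTLE | PF_MEMALLOC_NOIO);
		else
			return -EINVAL;
		break;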

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
Here it is with the brown paper bag bug fixed. I have also added a note
about a low number of pages being more important, as per Vlastimil's
feedback.

From 0282f604144a5c06fdf3cf0bb2df532411e7f8c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 23 Oct 2019 12:13:02 +0200
Subject: [PATCH] mm, vmstat: reduce zone->lock holding time by
 /proc/pagetypeinfo

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering that pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, and a low number
of pages is much more interesting, therefore putting a bound on the
number of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node6, zone   Normal, type  Movable >10 >10 >10 >10  
41019  31560  23996  10054   3229983648

instead of
Node6, zone   Normal, type  Movable 399568 294127 221558 102119  41019  
31560  23996  10054   3229983648

The limit has been chosen arbitrarily and it is subject to future
change should there be a need for that.

Suggested-by: Andrew Morton 
Acked-by: Mel Gorman 
Signed-off-by: Michal Hocko 
---
 mm/vmstat.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,12 +1383,29 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
unsigned long freecount = 0;
struct free_area *area;
struct list_head *curr;
+   bool overflow = false;
 
area = &(zone->free_area[order]);
 
-   list_for_each(curr, &area->free_list[mtype])
-   freecount++;
-   seq_printf(m, "%6lu ", freecount);
+   list_for_each(curr, &area->free_list[mtype]) {
+   /*
+* Cap the free_list iteration because it might
+* be really large and we are under a spinlock
+* so a long time spent here could trigger a
+* hard lockup detector. Anyway this is a
+* debugging tool so knowing there is a handful
+* of pages in this order should be more than
+* sufficient
+*/
+   if (++freecount >= 10) {
+   overflow = true;
+   spin_unlock_irq(&zone->lock);
+   cond_resched();
+   spin_lock_irq(&zone->lock);
+   break;
+   }
+   }
+   seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
freecount);
}
seq_putc(m, '\n');
}
-- 
2.20.1

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 10:56:30, Waiman Long wrote:
> On 10/23/19 6:27 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> >
> > pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> >
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the outuput is to see
> > how pageblocks are spread among different migratetypes therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> >
> > The new output will simply tell
> > [...]
> > Node6, zone   Normal, type  Movable >10 >10 >10 >10 
> >  41019  31560  23996  10054   3229983648
> >
> > instead of
> > Node6, zone   Normal, type  Movable 399568 294127 221558 102119  
> > 41019  31560  23996  10054   3229983648
> >
> > The limit has been chosen arbitrary and it is a subject of a future
> > change should there be a need for that.
> >
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/vmstat.c | 19 ++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct 
> > seq_file *m,
> >  
> > area = &(zone->free_area[order]);
> >  
> > -   list_for_each(curr, &area->free_list[mtype])
> > +   list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > +   /*
> > +* Cap the free_list iteration because it might
> > +* be really large and we are under a spinlock
> > +* so a long time spent here could trigger a
> > +* hard lockup detector. Anyway this is a
> > +* debugging tool so knowing there is a handful
> > +* of pages in this order should be more than
> > +* sufficient
> > +*/
> > +   if (freecount > 10) {
> > +   seq_printf(m, ">%6lu ", freecount);
> > +   spin_unlock_irq(&zone->lock);
> > +   cond_resched();
> > +   spin_lock_irq(&zone->lock);
> > +   continue;
> list_for_each() is a for loop. The continue statement will just iterate
> the rest with the possibility that curr will be stale. Should we use
> goto to jump after the seq_printf() below?

You are right. Kinda brown paper bag material. Sorry about that. What
about this on top?
--- 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 762034fc3b83..c156ce24a322 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1383,11 +1383,11 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
unsigned long freecount = 0;
struct free_area *area;
struct list_head *curr;
+   bool overflow = false;
 
area = &(zone->free_area[order]);
 
list_for_each(curr, &area->free_list[mtype]) {
-   freecount++;
/*
 * Cap the free_list iteration because it might
 * be really large and we are under a spinlock
@@ -1397,15 +1397,15 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
 * of pages in this order should be more than
 * sufficient
 */
-   if (freecount > 10) {
-   seq_printf(m, ">%6lu ", freecount);
+   if (++freecount >= 10) {
+   overflow = true;
spin_unlock_irq(&zone->lock);
        cond_resched();
spin_lock_irq(&zone->lock);
-   continue;
+   break;
}
}
-   seq_printf(m, "%6lu ", freecount);
+   seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
freecount);
}
seq_putc(m, '\n');
}

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/page_alloc: fix gcc compile warning

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 21:48:28, Chen Wandun wrote:
> From: Chenwandun 
> 
> mm/page_alloc.o: In function `page_alloc_init_late':
> mm/page_alloc.c:1956: undefined reference to `zone_pcp_update'
> mm/page_alloc.o:(.debug_addr+0x8350): undefined reference to `zone_pcp_update'
> make: *** [vmlinux] Error 1
> 
> zone_pcp_update is defined in CONFIG_MEMORY_HOTPLUG,
> so add ifdef when calling zone_pcp_update.

David has already pointed out that this has been fixed already, but let
me note one more thing. This patch is wrong on two levels. The first is
that it simply does the wrong thing: the comment above the code explains
why zone_pcp_update is called, and the ifdef makes it depend on
MEMORY_HOTPLUG support, which has nothing to do with that code.
The second problem, and I find it a recurring one (and that's why
I am writing this reply), is that the changelog explains what but it
doesn't say why. It would certainly be unsatisfactory to say "because
the compiler complains". So there would have to be a clarification that
zone_pcp_update is only available with CONFIG_MEMORY_HOTPLUG and why the
code only applies to that configuration. That would be much harder to
justify because there is no real connection.

That being said, a proper changelog with a good explanation of what and
why can help to discover a wrong patch early.

> Signed-off-by: Chenwandun 
> ---
>  mm/page_alloc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f9488ef..8513150 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1952,8 +1952,10 @@ void __init page_alloc_init_late(void)
>* so the pcpu batch and high limits needs to be updated or the limits
>* will be artificially small.
>*/
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   for_each_populated_zone(zone)
>   zone_pcp_update(zone);
> +#endif
>  
>   /*
>* We initialized the rest of the deferred pages.  Permanently disable
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 15:48:36, Vlastimil Babka wrote:
> On 10/23/19 3:37 PM, Michal Hocko wrote:
> > On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> >> On 10/23/19 12:27 PM, Michal Hocko wrote:
> >>> From: Michal Hocko 
> >>>
> >>> pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
> >>> This is not really nice because it blocks both any interrupts on that
> >>> cpu and the page allocator. On large machines this might even trigger
> >>> the hard lockup detector.
> >>>
> >>> Considering the pagetypeinfo is a debugging tool we do not really need
> >>> exact numbers here. The primary reason to look at the outuput is to see
> >>> how pageblocks are spread among different migratetypes therefore putting
> >>> a bound on the number of pages on the free_list sounds like a reasonable
> >>> tradeoff.
> >>>
> >>> The new output will simply tell
> >>> [...]
> >>> Node6, zone   Normal, type  Movable >10 >10 >10 
> >>> >10  41019  31560  23996  10054   3229983648
> >>>
> >>> instead of
> >>> Node6, zone   Normal, type  Movable 399568 294127 221558 102119  
> >>> 41019  31560  23996  10054   3229983648
> >>>
> >>> The limit has been chosen arbitrary and it is a subject of a future
> >>> change should there be a need for that.
> >>>
> >>> Suggested-by: Andrew Morton 
> >>> Signed-off-by: Michal Hocko 
> >>
> >> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> >> option than do this. Or move it to /sys/kernel/debug ?
> > 
> > But those wouldn't really help to prevent from the lockup, right?
> 
> No, but it would perhaps help ensure that only people who know what they
> are doing (or been told so by a developer e.g. on linux-mm) will try to
> collect the data, and not some automatic monitoring tools taking
> periodic snapshots of stuff in /proc that looks interesting.

Well, we do trust root doesn't do harm, right?

> > Besides that who would enable that config and how much of a difference
> > would root only vs. debugfs make?
> 
> I would hope those tools don't scrap debugfs as much as /proc, but I
> might be wrong of course :)
> 
> > Is the incomplete value a real problem?
> 
> Hmm perhaps not. If the overflow happens only for one migratetype, one
> can use also /proc/buddyinfo to get to the exact count, as was proposed
> in this thread for Movable migratetype.

Let's say this won't be the case. What is the worst case that the
imprecision would cause? In other words, does it really matter whether
we have 100k pages on the free list of a specific migrate type and
order, or say 200k?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:48:02, Johannes Weiner wrote:
> This function is getting long and unwieldy, split out the memcg bits.
> 
> The updated shrink_node() handles the generic (node) reclaim aspects:
>   - global vmpressure notifications
>   - writeback and congestion throttling
>   - reclaim/compaction management
>   - kswapd giving up on unreclaimable nodes
> 
> It then calls a new shrink_node_memcgs() which handles cgroup specifics:
>   - the cgroup tree traversal
>   - memory.low considerations
>   - per-cgroup slab shrinking callbacks
>   - per-cgroup vmpressure notifications
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index db073b40c432..65baa89740dd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2722,18 +2722,10 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, 
> struct mem_cgroup *memcg)
>   (memcg && memcg_congested(pgdat, memcg));
>  }
>  
> -static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
> - struct reclaim_state *reclaim_state = current->reclaim_state;
>   struct mem_cgroup *root = sc->target_mem_cgroup;
> - unsigned long nr_reclaimed, nr_scanned;
> - bool reclaimable = false;
>   struct mem_cgroup *memcg;
> -again:
> - memset(&sc->nr, 0, sizeof(sc->nr));
> -
> - nr_reclaimed = sc->nr_reclaimed;
> - nr_scanned = sc->nr_scanned;
>  
>   memcg = mem_cgroup_iter(root, NULL, NULL);
>   do {
> @@ -2786,6 +2778,22 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>  sc->nr_reclaimed - reclaimed);
>  
>   } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> +}
> +
> +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +{
> + struct reclaim_state *reclaim_state = current->reclaim_state;
> + struct mem_cgroup *root = sc->target_mem_cgroup;
> + unsigned long nr_reclaimed, nr_scanned;
> + bool reclaimable = false;
> +
> +again:
> + memset(&sc->nr, 0, sizeof(sc->nr));
> +
> + nr_reclaimed = sc->nr_reclaimed;
> + nr_scanned = sc->nr_scanned;
> +
> + shrink_node_memcgs(pgdat, sc);
>  
>   if (reclaim_state) {
>   sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> @@ -2793,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>   }
>  
>   /* Record the subtree's reclaim efficiency */
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> + vmpressure(sc->gfp_mask, root, true,
>  sc->nr_scanned - nr_scanned,
>  sc->nr_reclaimed - nr_reclaimed);
>  
> -- 
> 2.23.0
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:48:01, Johannes Weiner wrote:
> A lruvec holds LRU pages owned by a certain NUMA node and cgroup.
> Instead of awkwardly passing around a combination of a pgdat and a
> memcg pointer, pass down the lruvec as soon as we can look it up.
> 
> Nested callers that need to access node or cgroup properties can look
> them them up if necessary, but there are only a few cases.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 235d1fc72311..db073b40c432 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2280,9 +2280,10 @@ enum scan_balance {
>   * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
>   * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>   */
> -static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> -struct scan_control *sc, unsigned long *nr)
> +static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> +unsigned long *nr)
>  {
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>   int swappiness = mem_cgroup_swappiness(memcg);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>   u64 fraction[2];
> @@ -2530,13 +2531,8 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>   }
>  }
>  
> -/*
> - * This is a basic per-node page freer.  Used by both kswapd and direct 
> reclaim.
> - */
> -static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup 
> *memcg,
> -   struct scan_control *sc)
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>   unsigned long nr[NR_LRU_LISTS];
>   unsigned long targets[NR_LRU_LISTS];
>   unsigned long nr_to_scan;
> @@ -2546,7 +2542,7 @@ static void shrink_node_memcg(struct pglist_data 
> *pgdat, struct mem_cgroup *memc
>   struct blk_plug plug;
>   bool scan_adjusted;
>  
> - get_scan_count(lruvec, memcg, sc, nr);
> + get_scan_count(lruvec, sc, nr);
>  
>   /* Record the original scan target for proportional adjustments later */
>   memcpy(targets, nr, sizeof(nr));
> @@ -2741,6 +2737,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>  
>   memcg = mem_cgroup_iter(root, NULL, NULL);
>   do {
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>   unsigned long reclaimed;
>   unsigned long scanned;
>  
> @@ -2777,7 +2774,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>  
>   reclaimed = sc->nr_reclaimed;
>   scanned = sc->nr_scanned;
> - shrink_node_memcg(pgdat, memcg, sc);
> +
> + shrink_lruvec(lruvec, sc);
>  
>   shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
>   sc->priority);
> @@ -3281,6 +3279,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
> *memcg,
>   pg_data_t *pgdat,
>   unsigned long *nr_scanned)
>  {
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>   struct scan_control sc = {
>   .nr_to_reclaim = SWAP_CLUSTER_MAX,
>   .target_mem_cgroup = memcg,
> @@ -3307,7 +3306,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
> *memcg,
>* will pick up pages from other mem cgroup's as well. We hack
>* the priority and make it zero.
>*/
> - shrink_node_memcg(pgdat, memcg, &sc);
> + shrink_lruvec(lruvec, &sc);
>  
>   trace_mm_vmscan_memcg_softlimit_reclaim_end(
>   cgroup_ino(memcg->css.cgroup),
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump

2019-10-23 Thread Michal Hocko
aimed - nr_reclaimed)
> + reclaimable = true;
>  
> - /*
> -  * If kswapd scans pages marked marked for immediate
> -  * reclaim and under writeback (nr_immediate), it
> -  * implies that pages are cycling through the LRU
> -  * faster than they are written so also forcibly stall.
> -  */
> - if (sc->nr.immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> - }
> + if (current_is_kswapd()) {
> + /*
> +  * If reclaim is isolating dirty pages under writeback,
> +  * it implies that the long-lived page allocation rate
> +  * is exceeding the page laundering rate. Either the
> +  * global limits are not being effective at throttling
> +  * processes due to the page distribution throughout
> +  * zones or there is heavy usage of a slow backing
> +  * device. The only option is to throttle from reclaim
> +  * context which is not ideal as there is no guarantee
> +  * the dirtying process is throttled in the same way
> +  * balance_dirty_pages() manages.
> +  *
> +  * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> +  * count the number of pages under pages flagged for
> +  * immediate reclaim and stall if any are encountered
> +  * in the nr_immediate check below.
> +  */
> + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> + set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  
>   /*
> -  * Legacy memcg will stall in page writeback so avoid forcibly
> -  * stalling in wait_iff_congested().
> +  * Tag a node as congested if all the dirty pages
> +  * scanned were backed by a congested BDI and
> +  * wait_iff_congested will stall.
>*/
> - if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> - sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> - set_memcg_congestion(pgdat, root, true);
> + if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> + /* Allow kswapd to start writing pages during reclaim.*/
> + if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> + set_bit(PGDAT_DIRTY, &pgdat->flags);
>  
>   /*
> -  * Stall direct reclaim for IO completions if underlying BDIs
> -  * and node is congested. Allow kswapd to continue until it
> -  * starts encountering unqueued dirty pages or cycling through
> -  * the LRU too quickly.
> +  * If kswapd scans pages marked marked for immediate
> +  * reclaim and under writeback (nr_immediate), it
> +  * implies that pages are cycling through the LRU
> +  * faster than they are written so also forcibly stall.
>*/
> - if (!sc->hibernation_mode && !current_is_kswapd() &&
> -current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> - wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> + if (sc->nr.immediate)
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }
> +
> + /*
> +  * Legacy memcg will stall in page writeback so avoid forcibly
> +  * stalling in wait_iff_congested().
> +  */
> + if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> + sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> + set_memcg_congestion(pgdat, root, true);
> +
> + /*
> +  * Stall direct reclaim for IO completions if underlying BDIs
> +  * and node is congested. Allow kswapd to continue until it
> +  * starts encountering unqueued dirty pages or cycling through
> +  * the LRU too quickly.
> +  */
> + if (!sc->hibernation_mode && !current_is_kswapd() &&
> + current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> + wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  
> - } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> -  sc));
> + if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> + sc))
> + goto again;
>  
>   /*
>* Kswapd gives up on balancing particular nodes after too
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:47:59, Johannes Weiner wrote:
> Seven years after introducing the global_reclaim() function, I still
> have to double take when reading a callsite. I don't know how others
> do it, this is a terrible name.

I somehow never had a problem with that but ...
> 
> Invert the meaning and rename it to cgroup_reclaim().
> 
> [ After all, "global reclaim" is just regular reclaim invoked from the
>   page allocator. It's reclaim on behalf of a cgroup limit that is a
>   special case of reclaim, and should be explicit - not the reverse. ]

... this is a valid point.

> sane_reclaim() isn't very descriptive either: it tests whether we can
> use the regular writeback throttling - available during regular page
> reclaim or cgroup2 limit reclaim - or need to use the broken
> wait_on_page_writeback() method. Use "writeback_throttling_sane()".

I do have a stronger opinion on this one. sane_reclaim is really a
terrible name. As you say the only thing this should really tell is
whether writeback throttling is available, so I would rather go with
has_writeback_throttling() or writeback_throttling_{enabled,available}.
If you insist on having sane in the name then I won't object, but it just
raises the question whether we have several levels of throttling with
different levels of sanity.

> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 38 ++
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 622b77488144..302dad112f75 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker 
> *shrinker)
>   up_write(&shrinker_rwsem);
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> - return !sc->target_mem_cgroup;
> + return sc->target_mem_cgroup;
>  }
>  
>  /**
> - * sane_reclaim - is the usual dirty throttling mechanism operational?
> + * writeback_throttling_sane - is the usual dirty throttling mechanism 
> available?
>   * @sc: scan_control in question
>   *
>   * The normal page dirty throttling mechanism in balance_dirty_pages() is
> @@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
>   * This function tests whether the vmscan currently in progress can assume
>   * that the normal dirty throttling mechanism is operational.
>   */
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
> - struct mem_cgroup *memcg = sc->target_mem_cgroup;
> -
> - if (!memcg)
> + if (!cgroup_reclaim(sc))
>   return true;
>  #ifdef CONFIG_CGROUP_WRITEBACK
>   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> @@ -302,12 +300,12 @@ static void unregister_memcg_shrinker(struct shrinker 
> *shrinker)
>  {
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> - return true;
> + return false;
>  }
>  
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>   return true;
>  }
> @@ -1227,7 +1225,7 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   goto activate_locked;
>  
>   /* Case 2 above */
> - } else if (sane_reclaim(sc) ||
> + } else if (writeback_throttling_sane(sc) ||
>   !PageReclaim(page) || !may_enter_fs) {
>   /*
>* This is slightly racy - end_page_writeback()
> @@ -1821,7 +1819,7 @@ static int too_many_isolated(struct pglist_data *pgdat, 
> int file,
>   if (current_is_kswapd())
>   return 0;
>  
> - if (!sane_reclaim(sc))
> + if (!writeback_throttling_sane(sc))
>   return 0;
>  
>   if (file) {
> @@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   reclaim_stat->recent_scanned[file] += nr_taken;
>  
>   item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> - if (global_reclaim(sc))
> + if (!cgroup_reclaim(sc))
>   __count_vm_events(item, nr_scanned);
>   __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>   spin_unlock_irq(&pgdat->lru_lock);
> @@ -1985,7 +1983,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   spin_lock_irq(&pgdat->lru_lock);
>  

Re: [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:47:58, Johannes Weiner wrote:
> inactive_list_is_low() should be about one thing: checking the ratio
> between inactive and active list. Kitchensink checks like the one for
> swap space makes the function hard to use and modify its
> callsites. Luckly, most callers already have an understanding of the
> swap situation, so it's easy to clean up.
> 
> get_scan_count() has its own, memcg-aware swap check, and doesn't even
> get to the inactive_list_is_low() check on the anon list when there is
> no swap space available.
> 
> shrink_list() is called on the results of get_scan_count(), so that
> check is redundant too.
> 
> age_active_anon() has its own totalswap_pages check right before it
> checks the list proportions.
> 
> The shrink_node_memcg() site is the only one that doesn't do its own
> swap check. Add it there.
> 
> Then delete the swap check from inactive_list_is_low().
> 
> Signed-off-by: Johannes Weiner 

OK, makes sense to me.
Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index be3c22c274c1..622b77488144 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2226,13 +2226,6 @@ static bool inactive_list_is_low(struct lruvec 
> *lruvec, bool file,
>   unsigned long refaults;
>   unsigned long gb;
>  
> - /*
> -  * If we don't have swap space, anonymous page deactivation
> -  * is pointless.
> -  */
> - if (!file && !total_swap_pages)
> - return false;
> -
>   inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
>   active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
>  
> @@ -2653,7 +2646,7 @@ static void shrink_node_memcg(struct pglist_data 
> *pgdat, struct mem_cgroup *memc
>* Even if we did not try to evict anon pages at all, we want to
>* rebalance the anon lru active/inactive ratio.
>*/
> - if (inactive_list_is_low(lruvec, false, sc, true))
> + if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
>   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  sc, LRU_ACTIVE_ANON);
>  }
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:47:57, Johannes Weiner wrote:
> There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
> used is somewhat confusing right now, and it's easy to make mistakes -
> especially when it comes to global reclaim.
> 
> How it works: when memory cgroups are enabled, we always use the
> root_mem_cgroup's per-node lruvecs. When memory cgroups are not
> compiled in or disabled at runtime, we use pgdat->lruvec.
> 
> Document that in a comment.
> 
> Due to the way the reclaim code is generalized, all lookups use the
> mem_cgroup_lruvec() helper function, and nobody should have to find
> the right lruvec manually right now. But to avoid future mistakes,
> rename the pgdat->lruvec member to pgdat->__lruvec and delete the
> convenience wrapper that suggests it's a commonly accessed member.
> 
> While in this area, swap the mem_cgroup_lruvec() argument order. The
> name suggests a memcg operation, yet it takes a pgdat first and a
> memcg second. I have to double take every time I call this. Fix that.
> 
> Signed-off-by: Johannes Weiner 

I do agree that node_lruvec() adds confusion and it is better to get rid
of it.

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 26 +-
>  include/linux/mmzone.h | 15 ---
>  mm/memcontrol.c| 12 ++--
>  mm/page_alloc.c|  2 +-
>  mm/slab.h  |  4 ++--
>  mm/vmscan.c|  6 +++---
>  mm/workingset.c|  8 
>  7 files changed, 37 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2b34925fc19d..498cea07cbb1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -393,22 +393,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
>  }
>  
>  /**
> - * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
> - * @node: node of the wanted lruvec
> + * mem_cgroup_lruvec - get the lru list vector for a memcg & node
>   * @memcg: memcg of the wanted lruvec
> + * @node: node of the wanted lruvec
>   *
> - * Returns the lru list vector holding pages for a given @node or a given
> - * @memcg and @zone. This can be the node lruvec, if the memory controller
> - * is disabled.
> + * Returns the lru list vector holding pages for a given @memcg &
> + * @node combination. This can be the node lruvec, if the memory
> + * controller is disabled.
>   */
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +struct pglist_data *pgdat)
>  {
>   struct mem_cgroup_per_node *mz;
>   struct lruvec *lruvec;
>  
>   if (mem_cgroup_disabled()) {
> - lruvec = node_lruvec(pgdat);
> + lruvec = &pgdat->__lruvec;
>   goto out;
>   }
>  
> @@ -727,7 +727,7 @@ static inline void __mod_lruvec_page_state(struct page 
> *page,
>   return;
>   }
>  
> - lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
> + lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
>   __mod_lruvec_state(lruvec, idx, val);
>  }
>  
> @@ -898,16 +898,16 @@ static inline void mem_cgroup_migrate(struct page *old, 
> struct page *new)
>  {
>  }
>  
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> - struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +struct pglist_data *pgdat)
>  {
> - return node_lruvec(pgdat);
> + return &pgdat->__lruvec;
>  }
>  
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>   struct pglist_data *pgdat)
>  {
> - return &pgdat->lruvec;
> + return &pgdat->__lruvec;
>  }
>  
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d4ca03b93373..449a44171026 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -777,7 +777,13 @@ typedef struct pglist_data {
>  #endif
>  
>   /* Fields commonly accessed by the page reclaim scanner */
> - struct lruvec   lruvec;
> +
> + /*
> +  * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
> +  *
> +  * Use mem_cgroup_lruvec() to look up lruvecs.
> +  */
> + struct lruvec   __lruvec;

Re: [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size()

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 10:47:56, Johannes Weiner wrote:
> This function currently takes the node or lruvec size and subtracts
> the zones that are excluded by the classzone index of the
> allocation. It uses four different types of counters to do this.
> 
> Just add up the eligible zones.

The original intention was to optimize this for GFP_KERNEL-like
allocations by reducing the number of zones that need to be subtracted.
But considering this is not called from hot paths I do agree that simpler
code is preferable.
 
> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 21 +
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1154b3a2b637..57f533b808f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -351,32 +351,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int 
> zone_idx)
>  {
> - unsigned long lru_size = 0;
> + unsigned long size = 0;
>   int zid;
>  
> - if (!mem_cgroup_disabled()) {
> - for (zid = 0; zid < MAX_NR_ZONES; zid++)
> - lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, 
> zid);
> - } else
> - lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + 
> lru);
> -
> - for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> + for (zid = 0; zid <= zone_idx; zid++) {
>   struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
> - unsigned long size;
>  
>   if (!managed_zone(zone))
>   continue;
>  
>   if (!mem_cgroup_disabled())
> - size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> + size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
>   else
> - size = 
> zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
> -NR_ZONE_LRU_BASE + lru);
> - lru_size -= min(size, lru_size);
> +     size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
>   }
> -
> - return lru_size;
> -
> + return size;
>  }
>  
>  /*
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 15:32:05, Vlastimil Babka wrote:
> On 10/23/19 12:27 PM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
> > This is not really nice because it blocks both any interrupts on that
> > cpu and the page allocator. On large machines this might even trigger
> > the hard lockup detector.
> > 
> > Considering the pagetypeinfo is a debugging tool we do not really need
> > exact numbers here. The primary reason to look at the output is to see
> > how pageblocks are spread among different migratetypes, therefore putting
> > a bound on the number of pages on the free_list sounds like a reasonable
> > tradeoff.
> > 
> > The new output will simply tell
> > [...]
> > Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000 
> >  41019  31560  23996  10054   3229    983    648
> > 
> > instead of
> > Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  
> > 41019  31560  23996  10054   3229    983    648
> > 
> > The limit has been chosen arbitrarily and it is subject to future
> > change should there be a need for that.
> > 
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Michal Hocko 
> 
> Hmm dunno, I would rather e.g. hide the file behind some config or boot
> option than do this. Or move it to /sys/kernel/debug ?

But those wouldn't really help to prevent the lockup, right?
Besides that, who would enable that config and how much of a difference
would root-only vs. debugfs make?

Is the incomplete value a real problem?

> > ---
> >  mm/vmstat.c | 19 ++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 4e885ecd44d1..762034fc3b83 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct 
> > seq_file *m,
> >  
> > area = &(zone->free_area[order]);
> >  
> > -   list_for_each(curr, &area->free_list[mtype])
> > +   list_for_each(curr, &area->free_list[mtype]) {
> > freecount++;
> > +   /*
> > +* Cap the free_list iteration because it might
> > +* be really large and we are under a spinlock
> > +* so a long time spent here could trigger a
> > +* hard lockup detector. Anyway this is a
> > +* debugging tool so knowing there is a handful
> > +* of pages in this order should be more than
> > +* sufficient
> > +*/
> > +   if (freecount > 100000) {
> > +   seq_printf(m, ">%6lu ", freecount);
> > +       spin_unlock_irq(&zone->lock);
> > +   cond_resched();
> > +   spin_lock_irq(&zone->lock);
> > +   continue;
> > +   }
> > +   }
> > seq_printf(m, "%6lu ", freecount);
> > }
> > seq_putc(m, '\n');
> > 

-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] mm: add page preemption

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 19:53:50, Hillf Danton wrote:
> 
> On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
[...]
> > This doesn't really answer my question.
> > Why can't you use memcgs as they are now?
> 
> No prio provided.
> 
> > Why exactly do you need a fixed priority?
> 
> Prio comparison in global reclaim is what was added. Because every task has
> prio makes that comparison possible.

That still doesn't answer the question because it doesn't explain why
the priority is really necessary. I am sorry but I have more important
things to deal with than asking the same question again and again.
-- 
Michal Hocko
SUSE Labs


[RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 10:56:08, Mel Gorman wrote:
> On Wed, Oct 23, 2019 at 11:04:22AM +0200, Michal Hocko wrote:
> > So can we go with this to address the security aspect of this and have
> > something trivial to backport?
> > 
> 
> Yes.

Ok, patch 1 in reply to this email.

> > > > > There is a free_area structure associated with each page order. There
> > > > > is also a nr_free count within the free_area for all the different
> > > > > migration types combined. Tracking the number of free list entries
> > > > > for each migration type will probably add some overhead to the fast
> > > > > paths like moving pages from one migration type to another which may
> > > > > not be desirable.
> > > > 
> > > > Have you tried to measure that overhead?
> > > >  
> > > 
> > > I would prefer this option not be taken. It would increase the cost of
> > > watermark calculations which is a relatively fast path.
> > 
> > Is the change for the wmark check going to require more than
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..5d95313ba4a5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int 
> > order, unsigned long mark,
> > struct free_area *area = &z->free_area[o];
> > int mt;
> >  
> > -   if (!area->nr_free)
> > -   continue;
> > -
> > for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > if (!free_area_empty(area, mt))
> > return true;
> > 
> > Is this really going to be visible in practice? Sure we are going to do
> > more checks but most orders tend to have at least some memory in a
> > reasonably balanced system and we can hardly expect an optimal
> > allocation path on those that are not.
> >  
> 
> You also have to iterate over them all later in the same function.  The the
> free counts are per migrate type then they would have to be iterated over
> every time.
> 
> Similarly, there would be multiple places where all the counters would
> have to be iterated -- find_suitable_fallback, show_free_areas,
> fast_isolate_freepages, fill_contig_page_info, zone_init_free_lists etc.
> 
> It'd be a small cost but given that it's aimed at fixing a problem with
> reading pagetypeinfo, is it really worth it? I don't think so.

Fair enough.

[...]
> > As pointed out in another email, the problem with this patch is that it
> > hasn't really removed the iteration over the whole free_list which is
> > the primary problem. So I think that we should either consider this a
> > non-issue and make it "admin knows this is potentially expensive" or do
> > something like Andrew was suggesting if we do not want to change the
> > nr_free accounting.
> > 
> 
> Again, the cost is when reading a proc file. From what Andrew said,
> the lock is necessary to safely walk the list but if anything. I would
> be ok with limiting the length of the walk but honestly, I would also
> be ok with simply deleting the proc file. The utility for debugging a
> problem with it is limited now (it was more important when fragmentation
> avoidance was first introduced) and there is little an admin can do with
> the information. I can't remember the last time I asked for the contents
> of the file when trying to debug a problem. There is a possibility that
> someone will complain but I'm not aware of any utility that reads the
> information and does something useful with it. In the unlikely event
> something breaks, the file can be re-added with a limited walk.

I went with a bound to the pages iteratred over in the free_list. See
patch 2.

-- 
Michal Hocko
SUSE Labs



[RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
From: Michal Hocko 

pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, therefore putting
a bound on the number of pages on the free_list sounds like a reasonable
tradeoff.

The new output will simply tell
[...]
Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  
41019  31560  23996  10054   3229    983    648

instead of
Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  
31560  23996  10054   3229    983    648

The limit has been chosen arbitrarily and it is subject to future
change should there be a need for that.

Suggested-by: Andrew Morton 
Signed-off-by: Michal Hocko 
---
 mm/vmstat.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4e885ecd44d1..762034fc3b83 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,8 +1386,25 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
 
area = &(zone->free_area[order]);
 
-   list_for_each(curr, &area->free_list[mtype])
+   list_for_each(curr, &area->free_list[mtype]) {
freecount++;
+   /*
+* Cap the free_list iteration because it might
+* be really large and we are under a spinlock
+* so a long time spent here could trigger a
+* hard lockup detector. Anyway this is a
+* debugging tool so knowing there is a handful
+* of pages in this order should be more than
+* sufficient
+*/
+   if (freecount > 100000) {
+   seq_printf(m, ">%6lu ", freecount);
+   spin_unlock_irq(&zone->lock);
+   cond_resched();
+   spin_lock_irq(&zone->lock);
+   continue;
+   }
+   }
seq_printf(m, "%6lu ", freecount);
}
seq_putc(m, '\n');
-- 
2.20.1



[RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users

2019-10-23 Thread Michal Hocko
From: Michal Hocko 

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt fragmentation. It is not very useful for
anything else so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that
a) interferes with the page allocator and its users and b) can lead to
hard lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.

Reported-by: Waiman Long 
Cc: stable
Signed-off-by: Michal Hocko 
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..4e885ecd44d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1972,7 +1972,7 @@ void __init init_mm_internals(void)
 #endif
 #ifdef CONFIG_PROC_FS
proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
-   proc_create_seq("pagetypeinfo", 0444, NULL, &pagetypeinfo_op);
+   proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
proc_create_seq("vmstat", 0444, NULL, &vmstat_op);
proc_create_seq("zoneinfo", 0444, NULL, &zoneinfo_op);
 #endif
-- 
2.20.1



Re: [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 16:02:09, David Hildenbrand wrote:
[...]
> >>> MEM_CANCEL_OFFLINE could gain the reference back to balance the
> >>> MEM_GOING_OFFLINE step.
> >>
> >> The pages are already unisolated and could be used by the buddy. But again,
> >> I think you have an idea that tries to avoid putting pages to the buddy.
> > 
> > Yeah, set_page_count(page, 0) if you do not want to release that page
> > from the notifier context to reflect that the page is ok to be offlined
> > with the rest.
> >   
> 
> I neither see how you deal with __test_page_isolated_in_pageblock() nor with
> __offline_isolated_pages(). Sorry, but what I read is incomplete and you
> probably have a full proposal in your head. Please read below how I think
> you want to solve it.

Yeah, sorry that I am throwing incomplete ideas at you. I am just trying
to really nail down how to deal with reference counting here because it
is an important aspect.
 
> >>> explicit control via the reference count which is the standard way to
> >>> control the struct page life cycle.
> >>>
> >>> Anyway hooking into __put_page (which tends to be a hot path with
> >>> something that is barely used on most systems) doesn't sound nice to me.
> >>> This is the whole point which made me think about the whole reference
> >>> count approach in the first place.
> >>
> >> Again, the race I think that is possible
> >>
> >> somebody: get_page_unless_zero(page)
> >> virtio_mem: page_ref_dec(pfn_to_page(pfn)
> >> somebody: put_page() -> straight to the buddy
> > 
> > Who is that somebody? I thought that only the owner/driver has
> > control over the page. Also the above is not possible as long as the
> > owner/driver keeps a reference to the PageOffline page throughout the
> > time it is marked that way.
> >   
> 
> I was reading
> 
> include/linux/mm_types.h:
> 
> "If you want to use the refcount field, it must be used in such a way
>  that other CPUs temporarily incrementing and then decrementing the
>  refcount does not cause problems"
> 
> And that made me think "anybody can go ahead and try get_page_unless_zero()".
> 
> If I am missing something here and this can indeed not happen (e.g.,
> because PageOffline() pages are never mapped to user space), then I'll
> happily remove this code.

The point is that if the owner of the page is holding the only reference
to the page then it is clear that nothing like that has happened.
 
[...]

> Let's recap what I suggest:
> 
> "PageOffline() pages that have a reference count of 0 will be treated
>  like free pages when offlining pages, allowing the containing memory
>  block to get offlined. In case a driver wants to revive such a page, it
>  has to synchronize against memory onlining/offlining (e.g., using memory
>  notifiers) while incrementing the reference count. Also, a driver that
>  relies in this feature is aware that re-onlining the memory will require
>  to re-set the pages PageOffline() - e.g., via the online_page_callback_t."

OK
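
To make the driver side of that recap concrete, here is a minimal sketch of
the notifier based synchronization. The my_drv_* helpers are made up for
illustration; this is deliberately not virtio-mem code, just the shape of
the scheme:

static int my_drv_memory_notifier(struct notifier_block *nb,
				  unsigned long action, void *arg)
{
	struct memory_notify *mhp = arg;

	switch (action) {
	case MEM_GOING_OFFLINE:
		/* drop our references so the PageOffline pages look "free" */
		my_drv_drop_page_refs(mhp->start_pfn, mhp->nr_pages);
		break;
	case MEM_CANCEL_OFFLINE:
		/* offlining failed, take the references back */
		my_drv_get_page_refs(mhp->start_pfn, mhp->nr_pages);
		break;
	}
	return NOTIFY_OK;
}

The block would be registered with register_memory_notifier(), and on
re-onlining the driver re-marks its pages PageOffline() from the
online_page_callback as the recap above says.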

[...]
> d) __put_page() is modified to not return pages to the buddy in any
>case as a safety net. We might be able to get rid of that.

I do not like exactly this part
 
> What I think you suggest:
> 
> a) has_unmovable_pages() skips over all PageOffline() pages.
>This results in a lot of false negatives when trying to offline. Might be 
> ok.
> 
> b) The driver decrements the reference count of the PageOffline pages
>in MEM_GOING_OFFLINE.

Well, the driver should make the page unreferenced or fail. What is done
really depends on the specific driver.

> c) The driver increments the reference count of the PageOffline pages
>in MEM_CANCEL_OFFLINE. One issue might be that the pages are no longer
>isolated once we get that call. Might be ok.

Only previously PageBuddy pages are returned to the allocator IIRC, mostly
because of MovablePage().

> d) How to make __test_page_isolated_in_pageblock() succeed?
>Like I propose in this patch (PageOffline() + refcount == 0)?

Yep

> e) How to make __offline_isolated_pages() succeed?
>Like I propose in this patch (PageOffline() + refcount == 0)?

Simply skip over PageOffline pages. Reference count should never be != 0
at this stage.
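
To be explicit about what I mean, the hunk inside the pfn walk of
__offline_isolated_pages() could be as simple as something like the
following (untested sketch, not a patch):

	/*
	 * Treat driver-owned PageOffline() pages like free pages: the
	 * driver has already dropped its references from the
	 * MEM_GOING_OFFLINE notifier, so a non-zero refcount here
	 * would be a bug.
	 */
	if (PageOffline(page)) {
		BUG_ON(page_count(page));
		pfn++;
		continue;
	}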
 
> In summary, is what you suggest simply delaying setting the reference count 
> to 0
> in MEM_GOING_OFFLINE instead of right away when the driver unpluggs the pages?

Yes

> What's the big benefit you see and I fail to see?

Apart from avoiding hooks into __put_page, it also gives explicit control over
the page via reference counting. Do you see any downsides?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 09:31:43, Mel Gorman wrote:
> On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> > [Cc Mel]
> > 
> > On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates each of the free lists to get counts.  There are
> > > bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> > > file just because it takes too long to iterate all the free lists within
> > > a zone while holding the zone lock with irq disabled.
> > > 
> > > Given the fact that /proc/pagetypeinfo is readable by all, the possibility
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Should we make the file 0400? It is a useful thing when debugging but
> > not something regular users would really need for life.
> > 
> 
> I think this would be useful in general. The information is not that
> useful outside of debugging. Even then it's only useful when trying to
> get a handle on why a path like compaction is taking too long.

So can we go with this to address the security aspect of this and have
something trivial to backport?

> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > 
> > Have you tried to measure that overhead?
> >  
> 
> I would prefer this option not be taken. It would increase the cost of
> watermark calculations which is a relatively fast path.

Is the change for the wmark check going to require more than
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0b2e0306720..5d95313ba4a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int 
order, unsigned long mark,
struct free_area *area = &z->free_area[o];
int mt;
 
-   if (!area->nr_free)
-   continue;
-
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
if (!free_area_empty(area, mt))
return true;

Is this really going to be visible in practice? Sure we are going to do
more checks but most orders tend to have at least some memory in a
reasonably balanced system and we can hardly expect an optimal
allocation path on those that are not.
 
> > > we can actually skip iterating the list of one of the migration types
> > > and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > 
> > But you are still iterating through the whole free_list at once so if it
> > gets really large then this is still possible. I think it would be
> > preferable to use per migratetype nr_free if it doesn't cause any
> > regressions.
> > 
> 
> I think it will. The patch as it is contains the overhead within the
> reader of the pagetypeinfo proc file which is a non-critical path. The
> page allocator paths on the other hand is very important.

As pointed out in another email, the problem with this patch is that it
hasn't really removed the iteration over the whole free_list which is
the primary problem. So I think that we should either consider this a
non-issue and make it "admin knows this is potentially expensive" or do
something like Andrew was suggesting if we do not want to change the
nr_free accounting.

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..83

Re: [RFC v1] mm: add page preemption

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> 
> On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > 
> > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > 
> > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > [...]
> > > > Why do we care and which workloads would benefit and how much.
> > > 
> > > Page preemption, disabled by default, should be turned on by those
> > > who wish that the performance of their workloads can survive memory
> > > pressure to certain extent.
> > 
> > I am sorry but this doesn't say anything to me. How come not all
> > workloads would fit that description?
> 
> That means pp plays a role when kswapd becomes active, and it may
> prevent too much jitter in active lru pages.

This is still too vague to be useful in any way.

> > > The number of pp users is supposed near the people who change the
> > > nice value of their apps either to -1 or higher at least once a week,
> > > less than vi users among UK's undergraduates.
> > > 
> > > > And last but not least why the existing infrastructure doesn't help
> > > > (e.g. if you have clearly defined workloads with different
> > > > memory consumption requirements then why don't you use memory cgroups to
> > > > reflect the priority).
> > > 
> > > Good question:)
> > > 
> > > Though pp is implemented by preventing any task from reclaiming as many
> > > pages as possible from other tasks that are higher on priority, it is
> > > trying to introduce prio into page reclaiming, to add a feature.
> > > 
> > > Page and memcg are different objects after all; pp is being added at
> > > the page granularity. It should be an option available in environments
> > > without memcg enabled.
> > 
> > So do you actually want to establish LRUs per priority?
> 
> No, no change other than the prio for every lru page was added. LRU per prio
> is too much to implement.

Well, considering that per page priority is a no go as already pointed
out by Willy, then you do not have any other choice, right?

> > Why using memcgs is not an option?
> 
> I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
> nice added in memcg, and realised a couple days ago that its dependence on
> soft limit reclaim is not acceptable.
> 
> But we can't do that without determining how to define memcg's prio.
> What is in mind now is the highest (or lowest) prio of tasks in a memcg
> with a knob offered to userspace.
> 
> If you like, I want to have a talk about it sometime later.

This doesn't really answer my question. Why can't you use memcgs as
they are now? Why exactly do you need a fixed priority?

> > This is the main facility to partition reclaimable
> > memory in the first place. You should really focus on explaining on why
> > a much more fine grained control is needed much more thoroughly.
> > 
> > > What is way different from the protections offered by memory cgroup
> > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > of memory pressure. Such guarantee is not available under pp as it only
> > > suggests an extra factor to consider on deactivating lru pages.
> > 
> > Well, low limit can be breached if there is no eligible memcg to be
> > reclaimed. That means that you can shape some sort of priority by
> > setting the low limit already.
> > 
> > [...]
> > 
> > > What was added on the reclaimer side is
> > > 
> > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > >and allocator in terms of prio, to the lowest value before taking
> > >a nap.
> > > 
> > > 2, any allocator is able to wake up the reclaimer because of the
> > >lowest prio, and it starts reclaiming pages using the waker's prio.
> > > 
> > > 3, allocator comes while kswapd is active, its prio is checked and
> > >no-op if kswapd is higher on prio; otherwise switch is updated
> > >with the higher prio.
> > > 
> > > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> > >it is checked if there is pending update of switch; and kswapd's
> > >prio steps up if there is a pending one, thus its prio never steps
> > >down. Nor prio inversion. 
> > > 
> > > 5, goto 1 when kswapd finishes its work.
> > 
> > What about the direct reclaim?
> 
> Their prio will not change before reclaiming finishes, so leave it be.

This doesn't answer my question.

> > What if pages of a lower priority are
> > hard to reclaim? Do you want a process of a higher priority stall more
> > just because it has to wait for those lower priority pages?
> 
> The problems above are not introduced by pp, let Mr. Kswapd take care of
> them.

No, this is not an answer.

-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] memcg: add memcg lru for page reclaiming

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 12:44:48, Hillf Danton wrote:
> 
> On Tue, 22 Oct 2019 15:58:32 +0200 Michal Hocko wrote:
> > 
> > On Tue 22-10-19 21:30:50, Hillf Danton wrote:
[...]
> > > in this RFC after ripping pages off
> > > the first victim, the work finishes with the first ancestor of the victim
> > > added to lru.
> > > 
> > > Reclaiming is deferred until kswapd becomes active.
> > 
> > This is a wrong assumption because high limit might be configured way
> > before kswapd is woken up.
> 
> This change was introduced because a high limit breach does not look like a
> serious problem in the absence of memory pressure. Let's do the hard work,
> reclaiming one memcg at a time up through the hierarchy, when kswapd becomes
> active. It also explains the BH introduced.

But this goes against the main motivation for the high limit - to
throttle. It does not matter much whether there is global memory
pressure or not. The preventive high limit reclaim is there to make sure
that the specific memcg is kept reasonably contained.
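
Just to illustrate why kswapd is irrelevant here - the high limit is
enforced straight from the charge path. Condensed from the tail of
try_charge() around this time, so treat the exact shape as an
approximation:

	/* charge succeeded; if we are above high, punt reclaim to return-to-user */
	do {
		if (page_counter_read(&memcg->memory) > memcg->high) {
			/* don't bother a random interrupted task */
			if (in_interrupt()) {
				schedule_work(&memcg->high_work);
				break;
			}
			current->memcg_nr_pages_over_high += batch;
			set_notify_resume(current);
			break;
		}
	} while ((memcg = parent_mem_cgroup(memcg)));

So the actual reclaim (mem_cgroup_handle_over_high) runs when the charging
task returns to userspace, not when kswapd wakes up.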
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm: memcontrol: try harder to set a new memory.high

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 16:15:18, Johannes Weiner wrote:
> Setting a memory.high limit below the usage makes almost no effort to
> shrink the cgroup to the new target size.
> 
> While memory.high is a "soft" limit that isn't supposed to cause OOM
> situations, we should still try harder to meet a user request through
> persistent reclaim.
> 
> For example, after setting a 10M memory.high on an 800M cgroup full of
> file cache, the usage shrinks to about 350M:
> 
> + cat /cgroup/workingset/memory.current
> 841568256
> + echo 10M
> + cat /cgroup/workingset/memory.current
> 355729408
> 
> This isn't exactly what the user would expect to happen. Setting the
> value a few more times eventually whittles the usage down to what we
> are asking for:
> 
> + echo 10M
> + cat /cgroup/workingset/memory.current
> 104181760
> + echo 10M
> + cat /cgroup/workingset/memory.current
> 31801344
> + echo 10M
> + cat /cgroup/workingset/memory.current
> 10440704
> 
> To improve this, add reclaim retry loops to the memory.high write()
> callback, similar to what we do for memory.max, to make a reasonable
> effort that the usage meets the requested size after the call returns.

That suggests that the reclaim couldn't meet the given reclaim target
but later attempts just made it through. Is this due to the amount of dirty
pages, or what prevented the reclaim from doing its job?

While I am not against the reclaim retry loop I would like to understand
the underlying issue. Because if this is really about dirty memory then
we should probably be more pro-active in flushing it. Otherwise the
retry might not be of any help.

> Afterwards, a single write() to memory.high is enough in all but
> extreme cases:
> 
> + cat /cgroup/workingset/memory.current
> 841609216
> + echo 10M
> + cat /cgroup/workingset/memory.current
> 10182656
> 
> Signed-off-by: Johannes Weiner 
> ---
>  mm/memcontrol.c | 30 --
>  1 file changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ff90d4e7df37..8090b4c99ac7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6074,7 +6074,8 @@ static ssize_t memory_high_write(struct 
> kernfs_open_file *of,
>char *buf, size_t nbytes, loff_t off)
>  {
>   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> - unsigned long nr_pages;
> + unsigned int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + bool drained = false;
>   unsigned long high;
>   int err;
>  
> @@ -6085,12 +6086,29 @@ static ssize_t memory_high_write(struct 
> kernfs_open_file *of,
>  
>   memcg->high = high;
>  
> - nr_pages = page_counter_read(&memcg->memory);
> - if (nr_pages > high)
> - try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -  GFP_KERNEL, true);
> + for (;;) {
> + unsigned long nr_pages = page_counter_read(&memcg->memory);
> + unsigned long reclaimed;
> +
> + if (nr_pages <= high)
> + break;
> +
> + if (signal_pending(current))
> + break;
> +
> + if (!drained) {
> + drain_all_stock(memcg);
> + drained = true;
> + continue;
> + }
> +
> + reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +  GFP_KERNEL, true);
> +
> + if (!reclaimed && !nr_retries--)
> + break;
> + }
>  
> - memcg_wb_domain_size_changed(memcg);
>   return nbytes;
>  }
>  
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm: memcontrol: remove dead code from memory_max_write()

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 16:15:17, Johannes Weiner wrote:
> When the reclaim loop in memory_max_write() is ^C'd or similar, we set
> err to -EINTR. But we don't return err. Once the limit is set, we
> always return success (nbytes). Delete the dead code.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/memcontrol.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 055975b0b3a3..ff90d4e7df37 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6122,10 +6122,8 @@ static ssize_t memory_max_write(struct 
> kernfs_open_file *of,
>   if (nr_pages <= max)
>   break;
>  
> - if (signal_pending(current)) {
> - err = -EINTR;
> + if (signal_pending(current))
>   break;
> - }
>  
>   if (!drained) {
>   drain_all_stock(memcg);
> -- 
> 2.23.0
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> While upgrading from 4.16 to 5.2, we noticed these allocation errors
> in the log of the new kernel:
> 
> [ 8642.253395] SLUB: Unable to allocate memory on node -1, 
> gfp=0xa20(GFP_ATOMIC)
> [ 8642.269170]   cache: tw_sock_TCPv6(960:helper-logs), object size: 232, 
> buffer size: 240, default order: 1, min order: 0
> [ 8642.293009]   node 0: slabs: 5, objs: 170, free: 0
> 
> slab_out_of_memory+1
> ___slab_alloc+969
> __slab_alloc+14
> kmem_cache_alloc+346
> inet_twsk_alloc+60
> tcp_time_wait+46
> tcp_fin+206
> tcp_data_queue+2034
> tcp_rcv_state_process+784
> tcp_v6_do_rcv+405
> __release_sock+118
> tcp_close+385
> inet_release+46
> __sock_release+55
> sock_close+17
> __fput+170
> task_work_run+127
> exit_to_usermode_loop+191
> do_syscall_64+212
> entry_SYSCALL_64_after_hwframe+68
> 
> accompanied by an increase in machines going completely radio silent
> under memory pressure.

This is really worrying because that suggests that something depends on
GFP_ATOMIC allocation which is fragile and broken. 
 
> One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> sock objects to kmemcg"), which made these slab caches subject to
> cgroup memory accounting and control.
> 
> The problem with that is that cgroups, unlike the page allocator, do
> not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> its limit, atomic allocations - such as done during network rx - can
> fail consistently for extended periods of time. The kernel is not able
> to operate under these conditions.
> 
> We don't want to revert the culprit patch, because it indeed tracks a
> potentially substantial amount of memory used by a cgroup.
> 
> We also don't want to implement dedicated atomic reserves for cgroups.
> There is no point in keeping a fixed margin of unused bytes in the
> cgroup's memory budget to accommodate a consumer that is impossible to
> predict - we'd be wasting memory and get into configuration headaches,
> not unlike what we have going with min_free_kbytes. We do this for
> physical mem because we have to, but cgroups are an accounting game.
> 
> Instead, account these privileged allocations to the cgroup, but let
> them bypass the configured limit if they have to. This way, we get the
> benefits of accounting the consumed memory and have it exert pressure
> on the rest of the cgroup, but like with the page allocator, we shift
> the burden of reclaiming on behalf of atomic allocations onto the
> regular allocations that can block.

On the other hand this would allow breaking the isolation by an
unpredictable amount. Should we put a simple cap on how much we can go
over the limit? If the memcg limit reclaim is not able to keep up with
those overflows then even __GFP_ATOMIC allocations have to fail. What do
you think?
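
For illustration only, the cap could sit right on top of the hunk below,
reusing the mem_over_limit/force/nomem labels from the surrounding
try_charge() code. The 2MB allowance is a completely arbitrary placeholder,
not a recommendation:

	/*
	 * Sketch: let __GFP_ATOMIC charges overshoot the limit only by a
	 * bounded amount; beyond that even atomic allocations fail.
	 */
	if (gfp_mask & __GFP_ATOMIC) {
		unsigned long allowance = 2UL << (20 - PAGE_SHIFT);

		if (page_counter_read(&mem_over_limit->memory) <
		    READ_ONCE(mem_over_limit->memory.max) + allowance)
			goto force;
		goto nomem;
	}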

> Cc: sta...@kernel.org # 4.18+
> Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
> Signed-off-by: Johannes Weiner 
> ---
>  mm/memcontrol.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8090b4c99ac7..c7e3e758c165 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2528,6 +2528,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>   goto retry;
>   }
>  
> + /*
> +  * Memcg doesn't have a dedicated reserve for atomic
> +  * allocations. But like the global atomic pool, we need to
> +  * put the burden of reclaim on regular allocation requests
> +  * and let these go through as privileged allocations.
> +  */
> + if (gfp_mask & __GFP_ATOMIC)
> + goto force;
> +
>   /*
>* Unlike in global OOM situations, memcg is not in a physical
>* memory shortage.  Allow dying and OOM-killed tasks to
> -- 
> 2.23.0
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 14:59:02, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
[...]
> > -   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > -   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > -   pgdat->node_id,
> > -   zone->name,
> > -   migratetype_names[mtype]);
> > -   for (order = 0; order < MAX_ORDER; ++order) {
> > +   lockdep_assert_held(&zone->lock);
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   /*
> > +* MIGRATE_MOVABLE is usually the largest one in large memory
> > +* systems. We skip iterating that list. Instead, we compute it by
> > +* subtracting the total of the rests from free_area->nr_free.
> > +*/
> > +   for (order = 0; order < MAX_ORDER; ++order) {
> > +   unsigned long nr_total = 0;
> > +   struct free_area *area = &(zone->free_area[order]);
> > +
> > +   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > unsigned long freecount = 0;
> > -   struct free_area *area;
> > struct list_head *curr;
> >  
> > -   area = &(zone->free_area[order]);
> > -
> > +   if (mtype == MIGRATE_MOVABLE)
> > +   continue;
> > list_for_each(curr, &area->free_list[mtype])
> > freecount++;
> > -   seq_printf(m, "%6lu ", freecount);
> > +   nfree[order][mtype] = freecount;
> > +   nr_total += freecount;
> > }
> > +   nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > +
> > +   /*
> > +* If we have already iterated more than 64k of list
> > +* entries, we might have hold the zone lock for too long.
> > +* Temporarily release the lock and reschedule before
> > +* continuing so that other lock waiters have a chance
> > +* to run.
> > +*/
> > +   if (nr_total > (1 << 16)) {
> > +   spin_unlock_irq(&zone->lock);
> > +   cond_resched();
> > +   spin_lock_irq(&zone->lock);
> > +   }
> > +   }
> > +
> > +   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > +   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > +   pgdat->node_id,
> > +   zone->name,
> > +   migratetype_names[mtype]);
> > +   for (order = 0; order < MAX_ORDER; ++order)
> > +   seq_printf(m, "%6lu ", nfree[order][mtype]);
> > seq_putc(m, '\n');
> 
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.

Yes. It is the list_for_each over the free_list that needs the lock and
that is the actual problem here. This can be really large with a _lot_
of memory. And this is why I objected to the patch: it doesn't
really address this problem. I would like to hear from Mel and Vlastimil
how they would feel about making free_list fully migrate type aware
(including nr_free).
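
To spell out what that would mean, roughly something like the following on
top of the current free area helpers - just a sketch for discussion, and the
per-type counters and their cost in the fast paths are exactly what would
need measuring:

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;			/* sum over all types */
	unsigned long		nr_free_mt[MIGRATE_TYPES];	/* per migratetype */
};

static inline void add_to_free_area(struct page *page, struct free_area *area,
				    int migratetype)
{
	list_add(&page->lru, &area->free_list[migratetype]);
	area->nr_free++;
	area->nr_free_mt[migratetype]++;
}

/* del_page_from_free_area() and the move helpers would mirror this */

pagetypeinfo_showfree_print() could then print nr_free_mt[mtype] directly
without walking the free lists under zone->lock at all.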

> Why are we actually holding zone->lock so much?  Can we get away with
> holding it across the list_for_each() loop and nothing else?  If so,
> this still isn't a bulletproof fix.  Maybe just terminate the list
> walk if freecount reaches 1024.  Would anyone really care?
> 
> Sigh.  I wonder if anyone really uses this thing for anything
> important.  Can we just remove it all?

Vlastimil would know much better but I have seen this being used for
fragmentation related debugging. That should imply that 0400 would be
sufficient, and it is a quick and easily backportable fix for the most
pressing immediate problem.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: fix comments based on per-node memcg

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 15:06:18, Hao Lee wrote:
> These comments should be updated as memcg limit enforcement has been moved
> from zones to nodes.
> 
> Signed-off-by: Hao Lee 

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ae703ea3ef48..12c29f74c02a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -112,7 +112,7 @@ struct memcg_shrinker_map {
>  };
>  
>  /*
> - * per-zone information in memory controller.
> + * per-node information in memory controller.
>   */
>  struct mem_cgroup_per_node {
>   struct lruvec   lruvec;
> @@ -399,8 +399,7 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
>   * @memcg: memcg of the wanted lruvec
>   *
>   * Returns the lru list vector holding pages for a given @node or a given
> - * @memcg and @zone. This can be the node lruvec, if the memory controller
> - * is disabled.
> + * @memcg. This can be the node lruvec, if the memory controller is disabled.
>   */
>  static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
>   struct mem_cgroup *memcg)
> -- 
> 2.14.5

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Michal Hocko
he lock and reschedule before
> +  * continuing so that other lock waiters have a chance
> +  * to run.
> +  */
> + if (nr_total > (1 << 16)) {
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }
> + }
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> + pgdat->node_id,
> + zone->name,
> + migratetype_names[mtype]);
> + for (order = 0; order < MAX_ORDER; ++order)
> + seq_printf(m, "%6lu ", nfree[order][mtype]);
>   seq_putc(m, '\n');
>   }
>  }
> -- 
> 2.18.1

-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] memcg: add memcg lru for page reclaiming

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 21:30:50, Hillf Danton wrote:
> 
> On Mon, 21 Oct 2019 14:14:53 +0200 Michal Hocko wrote:
> > 
> > On Mon 21-10-19 19:56:54, Hillf Danton wrote:
> > > 
> > > Currently soft limit reclaim is frozen, see
> > > Documentation/admin-guide/cgroup-v2.rst for reasons.
> > > 
> > > Copying the page lru idea, memcg lru is added for selecting victim
> > > memcg to reclaim pages from under memory pressure. It now works in
> > > parallel to slr not only because the latter needs some time to reap
> > > but the coexistence facilitates it a lot to add the lru in a straight
> > > forward manner.
> > 
> > This doesn't explain what is the problem/feature you would like to
> > fix/achieve. It also doesn't explain the overall design. 
> 
> 1, memcg lru makes page reclaiming hierarchy aware

Is that a problem statement or a design goal?

> While doing the high work, memcgs are currently reclaimed one after
> another up through the hierarchy;

Which is the design because it is the memcg where the high limit got
hit. The hierarchical behavior ensures that the subtree of that memcg is
reclaimed and we try to spread the reclaim fairly over the tree.

> in this RFC after ripping pages off
> the first victim, the work finishes with the first ancestor of the victim
> added to lru.
> 
> Recaliming is defered until kswapd becomes active.

This is a wrong assumption because high limit might be configured way
before kswapd is woken up.

> 2, memcg lru tries much to avoid overreclaim

Again, is this a problem statement or a design goal?
 
> Only one memcg is picked off lru in FIFO mode under memory pressure,
> and MEMCG_CHARGE_BATCH pages are reclaimed one memcg at a time.

And why is this preferred over SWAP_CLUSTER_MAX and whole subtree
reclaim that we do currently? 

Please do not send another version until it is actually clear what you
want to achieve and why.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard

2019-10-22 Thread Michal Hocko
On Fri 18-10-19 07:54:20, Dave Hansen wrote:
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang@linux.alibaba.com
> 
> It's a _bit_ more tied to persistent memory and it appears a bit more
> tied to two tiers rather something arbitrarily deep.  They're pretty
> similar conceptually although there are quite a few differences.
> 
> For instance, what I posted has a static mapping for the migration path.
>  If node A is in reclaim, we always try to allocate pages on node B.
> There are no restrictions on what those nodes can be.  In Yang Shi's
> apporach, there's a dynamic search for a target migration node on each
> migration that follows the normal alloc fallback path.  This ends up
> making migration nodes special.

As we have discussed at LSFMM this year and there seemed to be a good
consensus on that, the resulting implementation should be as pmem
neutral as possible. After all node migration mode sounds like a
reasonable feature even without pmem. So I would be more inclined to the
normal alloc fallback path rather than a very specific and static
migration fallback path. If that turns out impractical then sure let's
come up with something more specific but I think there is quite a long
route there because we do not really have much of an experience with
this so far.

> There are also some different choices that are pretty arbitrary.  For
> instance, when you allocation a migration target page, should you cause
> memory pressure on the target?

Those are details to really sort out and they require some
experimentation to.

> To be honest, though, I don't see anything fatally flawed with it.  It's
> probably a useful exercise to factor out the common bits from the two
> sets and see what we can agree on being absolutely necessary.

Makes sense. What would that be? Is there a real consensus on having the
new node_reclaim mode be the configuration mechanism? Do we want to
support generic NUMA without any PMEM in place as well for starters?

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/16] The new slab memory controller

2019-10-22 Thread Michal Hocko
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and 
> changing
> memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.

What is the performance impact? Also what is the effect on the memory
reclaim side and the isolation. I would expect that mixing objects from
different cgroups would have a negative/unpredictable impact on the
memcg slab shrinking.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/16] The new slab memory controller

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> [...]
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> 
> How much of the memory are we talking about here?

Just to be more specific. Your cover mentions several hundreds of MBs
but there is no scale to the overall charged memory. How much of that is
the actual kmem accounted memory?

> Also is there any pattern for specific caches that tend to utilize
> much worse than others?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/16] The new slab memory controller

2019-10-22 Thread Michal Hocko
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
[...]
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.

How much of the memory are we talking about here? Also is there any
pattern for specific caches that tend to utilize much worse than others?
-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] mm: add page preemption

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> 
> On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
[...]
> > Why do we care and which workloads would benefit and how much.
> 
> Page preemption, disabled by default, should be turned on by those
> who wish that the performance of their workloads can survive memory
> pressure to certain extent.

I am sorry but this doesn't say anything to me. How come not all
workloads would fit that description?
 
> The number of pp users is supposed near the people who change the
> nice value of their apps either to -1 or higher at least once a week,
> less than vi users among UK's undergraduates.
> 
> > And last but not least why the existing infrastructure doesn't help
> > (e.g. if you have clearly defined workloads with different
> > memory consumption requirements then why don't you use memory cgroups to
> > reflect the priority).
> 
> Good question:)
> 
> Though pp is implemented by preventing any task from reclaiming as many
> pages as possible from other tasks that are higher on priority, it is
> trying to introduce prio into page reclaiming, to add a feature.
> 
> Page and memcg are different objects after all; pp is being added at
> the page granularity. It should be an option available in environments
> without memcg enabled.

So do you actually want to establish LRUs per priority? Why is using
memcgs not an option? This is the main facility to partition reclaimable
memory in the first place. You should really focus on explaining much
more thoroughly why a much more fine grained control is needed.

> What is way different from the protections offered by memory cgroup
> is that pages protected by memcg:min/low can't be reclaimed regardless
> of memory pressure. Such guarantee is not available under pp as it only
> suggests an extra factor to consider on deactivating lru pages.

Well, low limit can be breached if there is no eliglible memcg to be
reclaimed. That means that you can shape some sort of priority by
setting the low limit already.

[...]

> What was added on the reclaimer side is
> 
> 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
>and allocator in terms of prio, to the lowest value before taking
>a nap.
> 
> 2, any allocator is able to wake up the reclaimer because of the
>lowest prio, and it starts reclaiming pages using the waker's prio.
> 
> 3, allocator comes while kswapd is active, its prio is checked and
>no-op if kswapd is higher on prio; otherwise switch is updated
>with the higher prio.
> 
> 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
>it is checked if there is pending update of switch; and kswapd's
>prio steps up if there is a pending one, thus its prio never steps
>down. Nor prio inversion. 
> 
> 5, goto 1 when kswapd finishes its work.

What about the direct reclaim? What if pages of a lower priority are
hard to reclaim? Do you want a process of a higher priority to stall more
just because it has to wait for those lower priority pages?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0

2019-10-22 Thread Michal Hocko
On Fri 18-10-19 14:35:06, David Hildenbrand wrote:
> On 18.10.19 13:20, Michal Hocko wrote:
> > On Fri 18-10-19 10:50:24, David Hildenbrand wrote:
> > > On 18.10.19 10:15, Michal Hocko wrote:
[...]
> > > > for that - MEM_GOING_OFFLINE notification. This sounds like a good place
> > > > for the driver to decide whether it is safe to let the page go or not.
> > > 
> > > As I explained, this is too late and fragile. I post again what I posted
> > > before with some further explanations
> > > 
> > > __offline_pages() works like this:
> > > 
> > > 1) start_isolate_page_range()
> > > -> offline pages with a reference count of one will be detected as
> > > unmovable -> offlining aborted. (see below on the memory isolation 
> > > notifier)
> > 
> > I am assuming that has_unmovable_pages would skip over those pages. Your
> > patch already does that, no?
> 
> Yes, this works IFF the reference count is 0 (IOW, this patch). Not with a
> reference count of 1 (unless the pages are movable, like with balloon
> compaction).

I am pretty sure that has_unmovable_pages can special-case these pages
regardless of the reference count for the memory hotplug. We already do
that for HWPoison pages.
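
Something along these lines (completely untested) is what I have in
mind, mirroring the existing HWPoison special case in the pfn walk:

	/*
	 * The HWPoisoned page may be not in buddy system, and
	 * page_count() is not 0.
	 */
	if ((flags & SKIP_HWPOISON) && PageHWPoison(page))
		continue;

	/*
	 * PageOffline() pages are owned by a driver (balloon, virtio-mem
	 * and friends). They are not movable but they do not have to
	 * block memory offlining either, so just skip over them here
	 * and let the owner sort them out in MEM_GOING_OFFLINE.
	 */
	if (PageOffline(page))
		continue;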
 
> Please note that we have other users that use PG_offline + refcount >= 1
> (HyperV balloon, XEN). We should not affect these users (IOW,
> has_unmovable_pages() has to stop right there if we see one of these pages).

OK, this is exactly what I was worried about. I can see why you might
want to go an easier way and rule those users out but wouldn't it
actually be more reasonable to explicitly request PageOffline users to
implement MEM_GOING_OFFLINE and prepare their offlined pages for the
offlining operation, or fail right there if that is not possible?
If you fail right there during the isolation phase then there is no way
to allow the offlining to proceed from that context.
 
> > > 2) memory_notify(MEM_GOING_OFFLINE, &arg);
> > > -> Here, we could release all pages to the buddy, clearing PG_offline
> > > -> PF_offline must not be cleared so dumping tools will not touch
> > > these pages. There is a time where pages are !PageBuddy() and
> > > !PageOffline().
> > 
> > Well, this is fully under control of the driver, no? Reference count
> > shouldn't play any role here AFAIU.
> 
> Yes, this is more a PG_offline issue. The reference count is an issue of
> reaching this call :) If we want to go via the buddy:
> 
> 1. Clear PG_offline
> 2. Free page (gets set PG_buddy)
> 
> Between 1 and 2, a dumping tool could not exclude these pages and therefore
> try to read from these pages. That is an issue IFF we want to return the
> pages back to the buddy instead of doing what I propose here.

If the driver is going to free the page to the allocator then it would have
to claim the page back and so it is usable again. If it cannot free it
then it would simply set the reference count to 0. It can even keep the
PG_offline if necessary although I have to admit I am not really sure it
is necessary.
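
To illustrate, a completely untested sketch of the driver side (the
my_drv_* helpers are made up and the exact way the page is handed back
certainly needs more care):

static int my_drv_memory_notifier(struct notifier_block *nb,
				  unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;
	unsigned long pfn;

	if (action != MEM_GOING_OFFLINE)
		return NOTIFY_OK;

	for (pfn = mn->start_pfn; pfn < mn->start_pfn + mn->nr_pages; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (!page || !PageOffline(page) || !my_drv_owns_page(page))
			continue;

		if (my_drv_can_give_up_page(page)) {
			/* hand the page over so it is offlined as a free page */
			__ClearPageOffline(page);
			put_page(page);
		} else {
			/*
			 * Keep the page marked but drop the reference so
			 * that the offlining can proceed with it.
			 */
			set_page_count(page, 0);
		}
	}

	return NOTIFY_OK;
}

This would be registered via register_memory_notifier() like any other
memory hotplug notifier, and MEM_CANCEL_OFFLINE would take the reference
back (or reclaim the page from the buddy) to balance it.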

> > > 3) scan_movable_pages() ...
> 
> Please note that when we don't put the pages back to the buddy and don't
> implement something like I have in this patch, we'll loop/fail here.
> Especially if we have pages with PG_offline + refcount >= 1 .

You should have your reference count 0 at this stage as it is after
MEM_GOING_OFFLINE.
 
> > MEM_CANCEL_OFFLINE could gain the reference back to balance the
> > MEM_GOING_OFFLINE step.
> 
> The pages are already unisolated and could be used by the buddy. But again,
> I think you have an idea that tries to avoid putting pages to the buddy.

Yeah, set_page_count(page, 0) if you do not want to release that page
from the notifier context to reflect that the page is ok to be offlined
with the rest.
 
[...]

> > explicit control via the reference count which is the standard way to
> > control the struct page life cycle.
> > 
> > Anyway hooking into __put_page (which tends to be a hot path with
> > something that is barely used on most systems) doesn't sound nice to me.
> > This is the whole point which made me think about the whole reference
> > count approach in the first place.
> 
> Again, the race I think that is possible
> 
> somebody: get_page_unless_zero(page)
> virtio_mem: page_ref_dec(pfn_to_page(pfn)
> somebody: put_page() -> straight to the buddy

Who is that somebody? I thought that it is only the owner/driver that has
control over the page. Also the above is not possible as long as the
owner/driver keeps a reference to the PageOffline page throughout the
time it is marked that way.
 
[...]

Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 11:58:52, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well, the unrecoverable case would essentially be a forced migration
> > failure path.
> > 
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> > 
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
> 
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.
> 
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
>   Soft offline the pages in the range specified by addr and
>   length.  The memory of each page in the specified range is
>   preserved (i.e., when next accessed, the same content will be
>   visible, but in a new physical page frame), and the original
>   page is offlined (i.e., no longer used, and taken out of
>   normal memory management).  The effect of the
>   MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
>   change the semantics of) the calling process.
> 
>   This feature is intended for testing of memory error-handling
>   code; it is available only if the kernel was configured with
>   CONFIG_MEMORY_FAILURE.

I have missed that one somehow. Thanks for pointing out.

[...]

> AFAICS, for hard-offline case, a recovered event would be if:
> 
> - the page to shut down is already free
> - the page was unmapped
> 
> In some cases we need to kill the process if it holds dirty pages.

Yes, I would expect that the page table would be poisoned and the
process receive a SIGBUS when accessing that memory.

> But we never migrate contents in hard-offline path.
> I guess it is because we cannot really trust the contents anymore.

Yes, that makes perfect sense. What I am saying is that the migration
(aka trying to recover) is the main and only difference. The soft
offline should poison page tables when not able to migrate as well
IIUC.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/2] mm: Memory offlining + page isolation cleanups

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 11:17:24, David Hildenbrand wrote:
> On 22.10.19 11:14, Michal Hocko wrote:
> > On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
> > [...]
> > > E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()
> > 
> > Thanks for these references. I am not really familiar with kvm so I
> > cannot really comment on the specific code but I am wondering why
> > it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
> > about holes in RAM (from the early boot), those should be reserved
> > already AFAIR. So we are left with hotplugged memory with holes and
> > I am not really sure we should bother with this until there is a clear
> > usecase in sight.
> 
> Well, checking for ZONE_DEVICE is only possible if you have an initialized
> memmap. And that is not guaranteed when you start mapping random stuff into
> your guest via /dev/mem.

Yes, I can understand that part but checking PageReserved on an
uninitialized memmap is pointless as well. So if you can test for it you
can very well test for ZONE_DEVICE as well. PageReserved -> ZONE_DEVICE
is a terrible assumption.
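
IOW something like the following (untested) should be doable wherever
it is currently considered legal to look at PageReserved():

	/*
	 * Both tests require an initialized memmap, so if the
	 * PageReserved() check is legal here then so is this one.
	 */
	if (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))
		return true;	/* whatever the ZONE_DEVICE policy is */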

> I am reworking these patches right now and audit the whole kernel for
> PageReserved() checks that might affect ZONE_DEVICE. I'll send the
> collection of patches as RFC.

Thanks a lot!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 10:35:17, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 10:26:11AM +0200, Michal Hocko wrote:
> > On Tue 22-10-19 09:46:20, Oscar Salvador wrote:
> > [...]
> > > So, opposite to hard-offline, in soft-offline we do not fiddle with pages
> > > unless we are sure the page is not reachable anymore by any means.
> > 
> > I have to say I do not follow. Is there any _real_ reason for
> > soft-offline to behave differently from MCE (hard-offline)?
> 
> Yes.
> Do not take it as 100% true as I read that in some code/Documentation
> a while ago.
> 
> But I think that it boils down to:
> 
> soft-offline: "We have seen some erros in the underlying page, but
>it is still usable, so we have a chance to keep the
>the contents (via migration)"
> hard-offline: "The underlying page is dead, we cannot trust it, so
>we shut it down, killing whoever is holding it
>along the way".

Hmm, that might be a misunderstanding on my end. I thought that it is
the MCE handler to say whether the failure is recoverable or not. If yes
then we can touch the content of the memory (that would imply the
migration). Other than that both paths should be essentially the same,
no? Well, the unrecoverable case would essentially be a forced migration
failure path.

MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
: This feature is intended for testing of memory error-handling
: code; it is available only if the kernel was configured with
: CONFIG_MEMORY_FAILURE.

There is no explicit note about the type of the error that is injected
but I think it is reasonably safe to assume this is a recoverable one.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/2] mm: Memory offlining + page isolation cleanups

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 10:32:11, David Hildenbrand wrote:
[...]
> E.g., arch/x86/kvm/mmu.c:kvm_is_mmio_pfn()

Thanks for these references. I am not really familiar with kvm so I
cannot really comment on the specific code but I am wondering why
it simply doesn't check for ZONE_DEVICE explicitly? Also we do care
about holes in RAM (from the early boot), those should be reserved
already AFAIR. So we are left with hotplugged memory with holes and
I am not really sure we should bother with this until there is a clear
usecase in sight.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] mm/page_alloc.c: Don't set pages PageReserved() when offlining

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 10:23:37, David Hildenbrand wrote:
> On 22.10.19 10:20, Michal Hocko wrote:
> > On Mon 21-10-19 17:54:35, David Hildenbrand wrote:
> > > On 21.10.19 17:47, Michal Hocko wrote:
> > > > On Mon 21-10-19 17:39:36, David Hildenbrand wrote:
> > > > > On 21.10.19 16:43, Michal Hocko wrote:
> > > > [...]
> > > > > > We still set PageReserved before onlining pages and that one should 
> > > > > > be
> > > > > > good to go as well (memmap_init_zone).
> > > > > > Thanks!
> > > > > 
> > > > > memmap_init_zone() is called when onlining memory. There, set all 
> > > > > pages to
> > > > > reserved right now (on context == MEMMAP_HOTPLUG). We clear 
> > > > > PG_reserved when
> > > > > onlining a page to the buddy (e.g., generic_online_page). If we would 
> > > > > online
> > > > > a memory block with holes, we would want to keep all such pages
> > > > > (!pfn_valid()) set to reserved. Also, there might be other side 
> > > > > effects.
> > > > 
> > > > Isn't it sufficient to have those pages in a poisoned state? They are
> > > > not onlined so their state is basically undefined anyway. I do not see
> > > > how PageReserved makes this any better.
> > > 
> > > It is what people have been using for a long time. Memory hole ->
> > > PG_reserved. The memmap is valid, but people want to tell "this here is
> > > crap, don't look at it".
> > 
> > The page is poisoned, right? If yes then setting the reserved bit
> > doesn't make any sense.
> 
> No it's not poisoned AFAIK. It should be initialized

Dohh, it seems I still keep confusing myself. You are right the page is
initialized at this stage. A potential hole in RAM or ZONE_DEVICE memory
will just not hit the page allocator. Sorry about the noise.

> - and I remember that PG_reserved on memory holes is relevant to
> detect MMIO pages. (e.g., looking at KVM code ...)

I can see kvm_is_reserved_pfn() which checks both pfn_valid and
PageReserved. How does this help to detect memory holes though?
Any driver might be setting the page reserved.
 
> > > > Also is the hole inside a hotplugable memory something we really have to
> > > > care about. Has anybody actually seen a platform to require that?
> > > 
> > > That's what I was asking. I can see "support" for this was added basically
> > > right from the beginning. I'd say we rip that out and cleanup/simplify. I 
> > > am
> > > not aware of a platform that requires this. Especially, memory holes on
> > > DIMMs (detected during boot) seem like an unlikely thing.
> > 
> > The thing is that the hotplug development shows ad-hoc decisions
> > throughout the code. It is even worse that it is hard to guess whether
> > some kludges are a result of a careful design or ad-hoc trial and
> > failure approach on setups that never were production. Building on top
> > of that by preserving hacks is not going to improve the situation. So I
> > am perfectly fine to focus on making the most straightforward setups
> > work reliably. Even when there is a risk of breaking some odd setups. We
> > can fix them up later but we would have at least a specific example and
> > document it.
> > 
> 
> Alright, I'll prepare a simple patch that rejects offlining memory with

Is offlining an interesting path? I would expect onlining to be the much
more interesting one.

> memory holes. We can apply that and see if anybody screams out loud. If not,
> we can clean up that crap.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 11/16] mm,hwpoison: Rework soft offline for in-use pages

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 09:56:27, Oscar Salvador wrote:
> On Mon, Oct 21, 2019 at 04:06:19PM +0200, Michal Hocko wrote:
> > On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> > > We can only perform actions on LRU/Movable pages or hugetlb pages.
> > 
> > What would prevent other pages mapped via page tables from being handled
> > as well?
> 
> What kind of pages?

Any pages mapped to the userspace. E.g. driver memory which is not on
LRU.

> I mean, I guess it could be done, it was just not implemented, and I
> did not want to add more "features" as my main goal was to re-work
> the interface to be more deterministic.

Fair enough. One step at a time definitely sounds good.
 
> > > 1) we would need to hook in enqueue_hugetlb_page so the page is not 
> > > enqueued
> > >into hugetlb freelists
> > > 2) when trying to free a hugetlb page, we would need to do as we do for 
> > > gigantic
> > >pages now, and that is breaking down the pages into order-0 pages and 
> > > release
> > >them to the buddy (so the check in free_papges_prepare would skip the
> > >hwpoison page).
> > >Trying to handle a higher-order hwpoison page in free_pages_prepare is
> > >a bit complicated.
> > 
> > I am not sure I see the problem. If you dissolve the hugetlb page then
> > there is no hugetlb page anymore and so you make it a regular high-order
> > page.
> 
> Yes, but the problem comes when trying to work with a hwpoison high-order page
> in free_pages_prepare, it gets more complicated, and when I weigthed
> code vs benefits, I was not really sure to go down that road.
> 
> If we get a hwpoison high-order page in free_pages_prepare, we need to
> break it down to smaller pages, so we can skip the "bad" to not be sent
> into buddy allocator.

But we have destructors for compound pages. Can we do the heavy lifting
there?

> > If the page is free then it shouldn't pin the memcg or any other state.
> 
> Well, it is not really free, as it is not usable, is it?

Sorry I meant to say the page is free from the memcg POV - aka no task
from the memcg is holding a reference to it. The page is not usable for
anybody, that is true but no particular memcg should pay a price for
that. This would mean that random memcgs would end up pinned forever
without a good reason.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 09:46:20, Oscar Salvador wrote:
[...]
> So, opposite to hard-offline, in soft-offline we do not fiddle with pages
> unless we are sure the page is not reachable anymore by any means.

I have to say I do not follow. Is there any _real_ reason for
soft-offline to behave differently from MCE (hard-offline)?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/2] mm: Memory offlining + page isolation cleanups

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 10:15:07, David Hildenbrand wrote:
> On 22.10.19 10:08, Michal Hocko wrote:
> > On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
> > > On 21.10.19 19:23, David Hildenbrand wrote:
> > > > Two cleanups that popped up while working on (and discussing) 
> > > > virtio-mem:
> > > >https://lkml.org/lkml/2019/9/19/463
> > > > 
> > > > Tested with DIMMs on x86.
> > > > 
> > > > As discussed with michal in v1, I'll soon look into removing the use
> > > > of PG_reserved during memory onlining completely - most probably
> > > > disallowing to offline memory blocks with holes, cleaning up the
> > > > onlining+offlining code.
> > > 
> > > BTW, I remember that ZONE_DEVICE pages are still required to be set
> > > PG_reserved. That has to be sorted out first.
> > 
> > Do they?
> 
> Yes, especially KVM code :/

Details please?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] mm/page_alloc.c: Don't set pages PageReserved() when offlining

2019-10-22 Thread Michal Hocko
On Mon 21-10-19 17:54:35, David Hildenbrand wrote:
> On 21.10.19 17:47, Michal Hocko wrote:
> > On Mon 21-10-19 17:39:36, David Hildenbrand wrote:
> > > On 21.10.19 16:43, Michal Hocko wrote:
> > [...]
> > > > We still set PageReserved before onlining pages and that one should be
> > > > good to go as well (memmap_init_zone).
> > > > Thanks!
> > > 
> > > memmap_init_zone() is called when onlining memory. There, set all pages to
> > > reserved right now (on context == MEMMAP_HOTPLUG). We clear PG_reserved 
> > > when
> > > onlining a page to the buddy (e.g., generic_online_page). If we would 
> > > online
> > > a memory block with holes, we would want to keep all such pages
> > > (!pfn_valid()) set to reserved. Also, there might be other side effects.
> > 
> > Isn't it sufficient to have those pages in a poisoned state? They are
> > not onlined so their state is basically undefined anyway. I do not see
> > how PageReserved makes this any better.
> 
> It is what people have been using for a long time. Memory hole ->
> PG_reserved. The memmap is valid, but people want to tell "this here is
> crap, don't look at it".

The page is poisoned, right? If yes then setting the reserved bit
doesn't make any sense.

> > Also is the hole inside a hotplugable memory something we really have to
> > care about. Has anybody actually seen a platform to require that?
> 
> That's what I was asking. I can see "support" for this was added basically
> right from the beginning. I'd say we rip that out and cleanup/simplify. I am
> not aware of a platform that requires this. Especially, memory holes on
> DIMMs (detected during boot) seem like an unlikely thing.

The thing is that the hotplug development shows ad-hoc decisions
throughout the code. It is even worse that it is hard to guess whether
some kludges are a result of a careful design or ad-hoc trial and
failure approach on setups that never were production. Building on top
of that by preserving hacks is not going to improve the situation. So I
am perfectly fine to focus on making the most straightforward setups
work reliably. Even when there is a risk of breaking some odd setups. We
can fix them up later but we would have at least a specific example and
document it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/2] mm: Memory offlining + page isolation cleanups

2019-10-22 Thread Michal Hocko
On Tue 22-10-19 08:52:28, David Hildenbrand wrote:
> On 21.10.19 19:23, David Hildenbrand wrote:
> > Two cleanups that popped up while working on (and discussing) virtio-mem:
> >   https://lkml.org/lkml/2019/9/19/463
> > 
> > Tested with DIMMs on x86.
> > 
> > As discussed with michal in v1, I'll soon look into removing the use
> > of PG_reserved during memory onlining completely - most probably
> > disallowing to offline memory blocks with holes, cleaning up the
> > onlining+offlining code.
> 
> BTW, I remember that ZONE_DEVICE pages are still required to be set
> PG_reserved. That has to be sorted out first.

Do they?

> I remember that somebody was
> working on it a while ago but didn't hear about that again. Will look into
> that as well - should be as easy as adding a zone check (if there isn't a
> pfn_to_online_page() check already). But of course, there might be special
> cases 

I remember Alexander didn't want to change the PageReserved handling
because he was worried about unforeseeable side effects. I have a vague
recollection he (or maybe Dan) has promised some follow up clean ups
which didn't seem to materialize.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] mm/page_alloc.c: Don't set pages PageReserved() when offlining

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 17:39:36, David Hildenbrand wrote:
> On 21.10.19 16:43, Michal Hocko wrote:
[...]
> > We still set PageReserved before onlining pages and that one should be
> > good to go as well (memmap_init_zone).
> > Thanks!
> 
> memmap_init_zone() is called when onlining memory. There, set all pages to
> reserved right now (on context == MEMMAP_HOTPLUG). We clear PG_reserved when
> onlining a page to the buddy (e.g., generic_online_page). If we would online
> a memory block with holes, we would want to keep all such pages
> (!pfn_valid()) set to reserved. Also, there might be other side effects.

Isn't it sufficient to have those pages in a poisoned state? They are
not onlined so their state is basically undefined anyway. I do not see
how PageReserved makes this any better.

Also is the hole inside a hotplugable memory something we really have to
care about. Has anybody actually seen a platform to require that?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 14:58:49, Oscar Salvador wrote:
> On Fri, Oct 18, 2019 at 02:06:15PM +0200, Michal Hocko wrote:
> > On Thu 17-10-19 16:21:17, Oscar Salvador wrote:
> > [...]
> > > +bool take_page_off_buddy(struct page *page)
> > > + {
> > > + struct zone *zone = page_zone(page);
> > > + unsigned long pfn = page_to_pfn(page);
> > > + unsigned long flags;
> > > + unsigned int order;
> > > + bool ret = false;
> > > +
> > > + spin_lock_irqsave(&zone->lock, flags);
> > 
> > What prevents the page to be allocated in the meantime? Also what about
> > free pages on the pcp lists? Also the page could be gone by the time you
> > have reached here.
> 
> Nothing prevents the page to be allocated in the meantime.
> We would just bail out and return -EBUSY to userspace.
> Since we do not do __anything__ to the page until we are sure we took it off,
> and it is completely isolated from the memory, there is no danger.

Wouldn't it be better to simply check the PageBuddy state after the lock
has been taken?
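
IOW something like this untested sketch (handling of a pfn sitting
inside a larger free block and the freepage accounting are left out for
brevity):

bool take_page_off_buddy(struct page *page)
{
	struct zone *zone = page_zone(page);
	unsigned long flags;
	bool ret = false;

	spin_lock_irqsave(&zone->lock, flags);
	if (PageBuddy(page)) {
		unsigned int order = page_order(page);

		/* still free, nobody can allocate it from under us now */
		del_page_from_free_area(page, &zone->free_area[order]);
		ret = true;
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	return ret;
}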

> Since soft-offline is kinda "best effort" mode, it is something like:
> "Sorry, could not poison the page, try again".

Well, I would disagree here. While madvise is indeed a best effort
operation, please keep in mind that the sole purpose of this interface is
to allow real MCE behavior. And that operation should better try
_really_ hard to make sure we try to recover as gracefully as possible.

> Now, thinking about this a bit more, I guess we could be more clever here
> and call the routine that handles in-use pages if we see that the page
> was allocated by the time we reach take_page_off_buddy.
> 
> About pcp pages, you are right.
> I thought that we were already handling that case, and we do, but looking 
> closer the
> call to shake_page() (that among other things spills pcppages into buddy)
> is performed at a later stage.
> I think we need to adjust __get_any_page to recognize pcp pages as well.

Yeah, pcp pages are PITA. We cannot really recognize them now. Dropping
all pcp pages is certainly a way to go but we need to mark the page
before that happens.
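
IOW something like this ordering (untested, and assuming the
free/allocation paths learn to recognize the flag as discussed elsewhere
in this series):

	/* flag the page first so it can be recognized wherever it ends up */
	SetPageHWPoison(page);
	/* only then spill all pcp lists into the buddy */
	drain_all_pages(page_zone(page));
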
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 2/2] mm/page_isolation.c: Convert SKIP_HWPOISON to MEMORY_OFFLINE

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 16:19:26, David Hildenbrand wrote:
> We have two types of users of page isolation:
> 1. Memory offlining: Offline memory so it can be unplugged. Memory won't
>be touched.
> 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
> become the owner of the memory and make use of it.
> 
> For example, in case we want to offline memory, we can ignore (skip over)
> PageHWPoison() pages, as the memory won't get used. We can allow to
> offline memory. In contrast, we don't want to allow to allocate such
> memory.
> 
> Let's generalize the approach so we can special case other types of
> pages we want to skip over in case we offline memory. While at it, also
> pass the same flags to test_pages_isolated().
> 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Anshuman Khandual 
> Cc: David Hildenbrand 
> Cc: Pingfan Liu 
> Cc: Qian Cai 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Vlastimil Babka 
> Cc: Mel Gorman 
> Cc: Mike Rapoport 
> Cc: Alexander Duyck 
> Suggested-by: Michal Hocko 
> Signed-off-by: David Hildenbrand 

Yes, a high-level flag makes more sense than requesting specific types of
pages to skip over.

Acked-by: Michal Hocko 

Please make the code easier to follow ...
> ---
>  include/linux/page-isolation.h |  4 ++--
>  mm/memory_hotplug.c|  8 +---
>  mm/page_alloc.c|  4 ++--
>  mm/page_isolation.c| 12 ++--
>  4 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf6b21f02154..b44712c7fdd7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8270,7 +8270,7 @@ bool has_unmovable_pages(struct zone *zone, struct page 
> *page, int count,
>* The HWPoisoned page may be not in buddy system, and
>* page_count() is not 0.
>*/
> - if ((flags & SKIP_HWPOISON) && PageHWPoison(page))
> + if (flags & MEMORY_OFFLINE && PageHWPoison(page))
>   continue;
>  
>   if (__PageMovable(page))
[...]
> @@ -257,7 +258,7 @@ void undo_isolate_page_range(unsigned long start_pfn, 
> unsigned long end_pfn,
>   */
>  static unsigned long
>  __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> -   bool skip_hwpoisoned_pages)
> +   int flags)
>  {
>   struct page *page;
>  
> @@ -274,7 +275,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, 
> unsigned long end_pfn,
>* simple way to verify that as VM_BUG_ON(), though.
>*/
>   pfn += 1 << page_order(page);
> - else if (skip_hwpoisoned_pages && PageHWPoison(page))
> + else if (flags & MEMORY_OFFLINE && PageHWPoison(page))
>   /* A HWPoisoned page cannot be also PageBuddy */
>   pfn++;
>   else

.. and use parentheses for the flag check.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] mm/page_alloc.c: Don't set pages PageReserved() when offlining

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 16:19:25, David Hildenbrand wrote:
> We call __offline_isolated_pages() from __offline_pages() after all
> pages were isolated and are either free (PageBuddy()) or PageHWPoison.
> Nothing can stop us from offlining memory at this point.
> 
> In __offline_isolated_pages() we first set all affected memory sections
> offline (offline_mem_sections(pfn, end_pfn)), to mark the memmap as
> invalid (pfn_to_online_page() will no longer succeed), and then walk over
> all pages to pull the free pages from the free lists (to the isolated
> free lists, to be precise).
> 
> Note that re-onlining a memory block will result in the whole memmap
> getting reinitialized, overwriting any old state. We already poison the
> memmap when offlining is complete to find any access to
> stale/uninitialized memmaps.
> 
> So, setting the pages PageReserved() is not helpful. The memap is marked
> offline and all pageblocks are isolated. As soon as offline, the memmap
> is stale either way.
> 
> This looks like a leftover from ancient times where we initialized the
> memmap when adding memory and not when onlining it (the pages were set
> PageReserved so re-onling would work as expected).
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Cc: Oscar Salvador 
> Cc: Mel Gorman 
> Cc: Mike Rapoport 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Cc: Alexander Duyck 
> Cc: Anshuman Khandual 
> Cc: Pavel Tatashin 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

We still set PageReserved before onlining pages and that one should be
good to go as well (memmap_init_zone).
Thanks!

There is a comment above offline_isolated_pages_cb that should be
removed as well.

> ---
>  mm/page_alloc.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed8884dc0c47..bf6b21f02154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8667,7 +8667,7 @@ __offline_isolated_pages(unsigned long start_pfn, 
> unsigned long end_pfn)
>  {
>   struct page *page;
>   struct zone *zone;
> - unsigned int order, i;
> + unsigned int order;
>   unsigned long pfn;
>   unsigned long flags;
>   unsigned long offlined_pages = 0;
> @@ -8695,7 +8695,6 @@ __offline_isolated_pages(unsigned long start_pfn, 
> unsigned long end_pfn)
>*/
>   if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
>   pfn++;
> - SetPageReserved(page);
>   offlined_pages++;
>   continue;
>   }
> @@ -8709,8 +8708,6 @@ __offline_isolated_pages(unsigned long start_pfn, 
> unsigned long end_pfn)
>   pfn, 1 << order, end_pfn);
>  #endif
>   del_page_from_free_area(page, &zone->free_area[order]);
> -     for (i = 0; i < (1 << order); i++)
> - SetPageReserved((page+i));
>   pfn += (1 << order);
>   }
>   spin_unlock_irqrestore(&zone->lock, flags);
> -- 
> 2.21.0
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 10:01:24, Qian Cai wrote:
> 
> 
> > On Oct 21, 2019, at 5:48 AM, Mel Gorman  wrote:
[...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..f972076d0f6b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
> > /* Block until all are initialised */
> > wait_for_completion(&pgdat_init_all_done_comp);
> > 
> > +   /*
> > +* The number of managed pages has changed due to the initialisation
> > +* so the pcpu batch and high limits needs to be updated or the limits
> > +* will be artificially small.
> > +*/
> > +   for_each_populated_zone(zone)
> > +   zone_pcp_update(zone);
> > +
> > /*
> >  * We initialized the rest of the deferred pages.  Permanently disable
> >  * on-demand struct page initialization.
> > -- 
> > 2.16.4
> > 
> > 
> 
> Warnings from linux-next,
> 
> [   14.265911][  T659] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:935
> [   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 659, name: pgdatinit8
> [   14.266044][  T659] 1 lock held by pgdatinit8/659:
> [   14.266075][  T659]  #0: c000201ffca87b40 
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c

This is really surprising to say the least. I do not see any spinlock
held here. Besides that we do sleep in wait_for_completion already.
Is it possible that the patch has been misplaced? zone_pcp_update is
called from page_alloc_init_late which is a different context than
deferred_init_memmap which runs in a separate kthread.
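
Just to illustrate why the reported placement cannot work - the
backtrace shows __mutex_lock called from zone_pcp_update, i.e. it
sleeps, while deferred_init_memmap holds the node resize spinlock:

	/* hypothetical misplaced hunk in deferred_init_memmap() */
	pgdat_resize_lock(pgdat, &flags);	/* spinlock, irqs off */
	/* ... deferred struct page initialization ... */
	zone_pcp_update(zone);			/* sleeps -> the splat above */
	pgdat_resize_unlock(pgdat, &flags);

The intended place, page_alloc_init_late(), calls it without any lock
held as the diff above shows.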

> [   14.266160][  T659] irq event stamp: 26
> [   14.266194][  T659] hardirqs last  enabled at (25): [] 
> _raw_spin_unlock_irq+0x44/0x80
> [   14.266246][  T659] hardirqs last disabled at (26): [] 
> _raw_spin_lock_irqsave+0x3c/0xa0
> [   14.266299][  T659] softirqs last  enabled at (0): [] 
> copy_process+0x720/0x19b0
> [   14.266339][  T659] softirqs last disabled at (0): [<>] 0x0
> [   14.266400][  T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted 
> 5.4.0-rc4-next-20191021 #1
> [   14.266462][  T659] Call Trace:
> [   14.266494][  T659] [c0003d8efae0] [c0921cf4] 
> dump_stack+0xe8/0x164 (unreliable)
> [   14.266538][  T659] [c0003d8efb30] [c0157c54] 
> ___might_sleep+0x334/0x370
> [   14.266577][  T659] [c0003d8efbb0] [c094a784] 
> __mutex_lock+0x84/0xb20
> [   14.266627][  T659] [c0003d8efcc0] [c0954038] 
> zone_pcp_update+0x34/0x64
> [   14.266677][  T659] [c0003d8efcf0] [c0b9e6bc] 
> deferred_init_memmap+0x1b8/0x26c
> [   14.266740][  T659] [c0003d8efdb0] [c0149528] 
> kthread+0x1a8/0x1b0
> [   14.266792][  T659] [c0003d8efe20] [c000b748] 
> ret_from_kernel_thread+0x5c/0x74
> [   14.268288][  T659] node 8 initialised, 1879186 pages in 12200ms
> [   14.268527][  T659] pgdatinit8 (659) used greatest stack depth: 27984 
> bytes left
> [   15.589983][  T658] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:935
> [   15.590041][  T658] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
> 658, name: pgdatinit0
> [   15.590078][  T658] 1 lock held by pgdatinit0/658:
> [   15.590108][  T658]  #0: c01fff5c7b40 
> (&(&pgdat->node_size_lock)->rlock){}, at: deferred_init_memmap+0xc4/0x26c
> [   15.590192][  T658] irq event stamp: 18
> [   15.590224][  T658] hardirqs last  enabled at (17): [] 
> _raw_spin_unlock_irqrestore+0x94/0xd0
> [   15.590283][  T658] hardirqs last disabled at (18): [] 
> _raw_spin_lock_irqsave+0x3c/0xa0
> [   15.590332][  T658] softirqs last  enabled at (0): [] 
> copy_process+0x720/0x19b0
> [   15.590379][  T658] softirqs last disabled at (0): [<>] 0x0
> [   15.590414][  T658] CPU: 8 PID: 658 Comm: pgdatinit0 Tainted: GW   
>   5.4.0-rc4-next-20191021 #1
> [   15.590460][  T658] Call Trace:
> [   15.590491][  T658] [c0003d8cfae0] [c0921cf4] 
> dump_stack+0xe8/0x164 (unreliable)
> [   15.590541][  T658] [c0003d8cfb30] [c0157c54] 
> ___might_sleep+0x334/0x370
> [   15.590588][  T658] [c0003d8cfbb0] [c094a784] 
> __mutex_lock+0x84/0xb20
> [   15.590643][  T658] [c0003d8cfcc0] [c0954038] 
> zone_pcp_update+0x34/0x64
> [   15.590689][  T658] [c0003d8cfcf0] [c0b9e6bc] 
> deferred_init_memmap+0x1b8/0x26c
> [   15.590739][  T658] [c0003d8cfdb0] [c0149528] 
> kthread+0x1a8/0x1b0
> [   15.590790][  T658] [c0003d8cfe20] [c000b748] 
> ret_from_kernel_thread+0x5c/0x74

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 11/16] mm,hwpoison: Rework soft offline for in-use pages

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> On Fri, Oct 18, 2019 at 02:39:01PM +0200, Michal Hocko wrote:
> > 
> > I am sorry but I got lost in the above description and I cannot really
> > make much sense from the code either. Let me try to outline the way how
> > I think about this.
> > 
> > Say we have a pfn to hwpoison. We have effectivelly three possibilities
> > - page is poisoned already - done nothing to do
> > - page is managed by the buddy allocator - excavate from there
> > - page is in use
> > 
> > The last category is the most interesting one. There are essentially
> > three classes of pages
> > - freeable
> > - migrateable
> > - others
> > 
> > We cannot do really much about the last one, right? Do we mark them
> > HWPoison anyway?
> 
> We can only perform actions on LRU/Movable pages or hugetlb pages.

What would prevent other pages mapped via page tables from being handled
as well?

> So unless the page does not fall into those areas, we do not do anything
> with them.
> 
> > Freeable should be simply marked HWPoison and freed.
> > For all those migrateable, we simply do migrate and mark HWPoison.
> > Now the main question is how to handle HWPoison page when it is freed
> > - aka last reference is dropped. The main question is whether the last
> > reference is ever dropped. If yes then the free_pages_prepare should
> > never release it to the allocator (some compound destructors would have
> > to special case as well, e.g. hugetlb would have to hand over to the
> > allocator rather than a pool). If not then the page would be lingering
> > potentially with some state bound to it (e.g. memcg charge).  So I
> > suspect you want the former.
> 
> For non-hugetlb pages, we do not call put_page in the migration path,
> but we do it in page_handle_poison, after the page has been flagged as
> hwpoison.
> Then the check in free_papes_prepare will see that the page is hwpoison
> and will bail out, so the page is not released into the allocator/pcp lists.
> 
> Hugetlb pages follow a different methodology.
> They are dissolved, and then we split the higher-order page and take the
> page off the buddy.
> The problem is that while it is easy to hold a non-hugetlb page,
> doing the same for hugetlb pages is not that easy:
> 
> 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
>into hugetlb freelists
> 2) when trying to free a hugetlb page, we would need to do as we do for 
> gigantic
>pages now, and that is breaking down the pages into order-0 pages and 
> release
>them to the buddy (so the check in free_papges_prepare would skip the
>hwpoison page).
>Trying to handle a higher-order hwpoison page in free_pages_prepare is
>a bit complicated.

I am not sure I see the problem. If you dissolve the hugetlb page then
there is no hugetlb page anymore and so you make it a regular high-order
page.
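
IOW something like the following untested flow, reusing
take_page_off_buddy() from this very series for the resulting buddy
pages:

	if (PageHuge(page)) {
		/* turns the free hugetlb page into ordinary buddy memory */
		if (dissolve_free_huge_page(page))
			return -EBUSY;
	}

	/* from here on this is a plain free page like any other */
	if (!take_page_off_buddy(page))
		return -EBUSY;

	SetPageHWPoison(page);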

> There is one thing I was unsure though.
> Bailing out at the beginning of free_pages_prepare if the page is hwpoison
> means that the calls to
> 
> - __memcg_kmem_uncharge
> - page_cpupid_reset_last
> - reset_page_owner
> - ...
> 
> will not be performed.
> I thought this is right because the page is not really "free", it is just 
> unusable,
> so.. it should be still charged to the memcg?

If the page is free then it shouldn't pin the memcg or any other state.
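
IOW I would expect the result to look more like this (untested) - the
regular teardown including the memcg uncharge still happens and we
merely skip the final hand over to the pcp/buddy lists:

	/* e.g. in __free_pages_ok() */
	if (!free_pages_prepare(page, order, true))
		return;

	/*
	 * The page is fully torn down (uncharged, owner reset, ...),
	 * just never let it reach a free list again.
	 */
	if (unlikely(PageHWPoison(page)))
		return;
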
-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] mm: add page preemption

2019-10-21 Thread Michal Hocko
On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> 
> Unlike cpu preemption, page preemption would have been a two-edge
> option for quite a while. It is added by preventing tasks from
> reclaiming as many lru pages as possible from other tasks of
> higher priorities.

This really begs for more explanation. I would start with what page
preemption actually is. Why do we care and which workloads would benefit
and how much. And last but not least why the existing infrastructure
doesn't help (e.g. if you have clearly defined workloads with different
memory consumption requirements then why don't you use memory cgroups to
reflect the priority).
 
> Who need pp?
> Users who want to manage/control jitters in lru pages under memory
> pressure. Way in parallel to scheduling with task's memory footprint
> taken into account, pp makes task prio a part of page reclaiming.

How do you assign priority to generally shared pages?

> [Off topic: prio can also be defined and added in memory controller
> and then plays a role in memcg reclaiming, for example check prio
> when selecting victim memcg.]
> 
> First on the page side, page->prio that is used to mirror the prio
> of page owner tasks is added, and a couple of helpers for setting,
> copying and comparing page->prio to help to add pages to lru.
> 
> Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> the prio of tasks that wake up the kthread, and it is updated
> every time kswapd raises its reclaiming priority.

This sounds like a very bad design to me. You essentially hand over
to a completely detached context while you want to handle priority
inversion problems (or at least that is what I think you want).

> Finally priorities on both sides are compared when deactivating lru
> pages, and skip page if it is higher on prio.
> 
> V1 is based on next-20191018.
> 
> Changes since v0
> - s/page->nice/page->prio/
> - drop the role of kswapd's reclaiming prioirty in prio comparison
> - add pgdat->kswapd_prio
> 
> Cc: Michal Hocko 
> Cc: Johannes Weiner 
> Cc: Shakeel Butt 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Vladimir Davydov 
> Cc: Jan Kara 
> Signed-off-by: Hillf Danton 
> ---
> 
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -281,6 +281,15 @@ config VIRT_TO_BUS
> should probably not select this.
>  
>  
> +config PAGE_PREEMPTION
> + bool
> + help
> +   This makes a task unable to reclaim as many lru pages as
> +   possible from other tasks of higher priorities.
> +
> +   Say N if unsure.
> +
> +
>  config MMU_NOTIFIER
>   bool
>   select SRCU
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -197,6 +198,10 @@ struct page {
>   /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>   atomic_t _refcount;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> + int prio; /* mirror page owner task->prio */
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>   struct mem_cgroup *mem_cgroup;
>  #endif
> @@ -232,6 +237,54 @@ struct page {
>  #define page_private(page)   ((page)->private)
>  #define set_page_private(page, v)((page)->private = (v))
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +static inline bool page_prio_valid(struct page *p)
> +{
> + return p->prio > MAX_PRIO;
> +}
> +
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> + /* store page prio low enough to help khugepaged add lru pages */
> + if (!page_prio_valid(p))
> + p->prio = task_prio + MAX_PRIO + 1;
> +}
> +
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> + to->prio = from->prio;
> +}
> +
> +static inline int page_prio(struct page *p)
> +{
> + return p->prio - MAX_PRIO - 1;
> +}
> +
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> + return page_prio(p) < prio;
> +}
> +#else
> +static inline bool page_prio_valid(struct page *p)
> +{
> + return true;
> +}
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> +}
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> +}
> +static inline int page_prio(struct page *p)
> +{
> + return MAX_PRIO + 1;
> +}
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> + return false;
> +}
> +#endif
> +
>  struct page_frag_cache {
>   void * va;
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -402,6 +402,7 @@ st

Re: [RFC PATCH v2 02/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 07:02:55, Naoya Horiguchi wrote:
> On Fri, Oct 18, 2019 at 01:52:27PM +0200, Michal Hocko wrote:
> > On Thu 17-10-19 16:21:09, Oscar Salvador wrote:
> > > From: Naoya Horiguchi 
> > > 
> > > The call to get_user_pages_fast is only to get the pointer to a struct
> > > page of a given address, pinning it is memory-poisoning handler's job,
> > > so drop the refcount grabbed by get_user_pages_fast
> > > 
> > > Signed-off-by: Naoya Horiguchi 
> > > Signed-off-by: Oscar Salvador 
> > > ---
> > >  mm/madvise.c | 24 
> > >  1 file changed, 12 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 2be9f3fdb05e..89ed9a22ff4f 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -878,16 +878,24 @@ static int madvise_inject_error(int behavior,
> > >*/
> > >   order = compound_order(compound_head(page));
> > >  
> > > - if (PageHWPoison(page)) {
> > > - put_page(page);
> > > + /*
> > > +  * The get_user_pages_fast() is just to get the pfn of the
> > > +  * given address, and the refcount has nothing to do with
> > > +  * what we try to test, so it should be released immediately.
> > > +  * This is racy but it's intended because the real hardware
> > > +  * errors could happen at any moment and memory error handlers
> > > +  * must properly handle the race.
> > > +  */
> > > + put_page(page);
> > > +
> > > + if (PageHWPoison(page))
> > >   continue;
> > > - }
> > >  
> > >   if (behavior == MADV_SOFT_OFFLINE) {
> > >   pr_info("Soft offlining pfn %#lx at process virtual 
> > > address %#lx\n",
> > >   pfn, start);
> > >  
> > > - ret = soft_offline_page(page, MF_COUNT_INCREASED);
> > > +     ret = soft_offline_page(page, 0);
> > 
> > What does prevent this struct page to go away completely?
> 
> Nothing does it.  Memory error handler tries to pin by itself and
> then determines what state the page is in now.

OK, but the page is not pinned by this context so it can go away at any
time, right? Or do I miss your point? Who would be the Error handler
context in this case?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 01/16] mm,hwpoison: cleanup unused PageHuge() check

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 07:00:46, Naoya Horiguchi wrote:
> On Fri, Oct 18, 2019 at 01:48:32PM +0200, Michal Hocko wrote:
> > On Thu 17-10-19 16:21:08, Oscar Salvador wrote:
> > > From: Naoya Horiguchi 
> > > 
> > > Drop the PageHuge check since memory_failure forks into 
> > > memory_failure_hugetlb()
> > > for hugetlb pages.
> > > 
> > > Signed-off-by: Oscar Salvador 
> > > Signed-off-by: Naoya Horiguchi 
> > 
> > s-o-b chain is reversed.
> > 
> > The code is a bit confusing. Doesn't this check aim for THP?
> 
> No, PageHuge() is false for thp, so this if branch is just dead code.
> 
> > AFAICS
> > PageTransHuge(hpage) will split the THP or fail so PageTransHuge
> > shouldn't be possible anymore, right?
> 
> Yes, that's right.
> 
> > But why does hwpoison_user_mappings
> > still work with hpage then?
> 
> hwpoison_user_mappings() is called both from memory_failure() and
> from memory_failure_hugetlb(), so it need handle both cases.

Thanks for the clarification. Maybe the changelog could be more explicit
because the code is quite confusing.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v1] memcg: add memcg lru for page reclaiming

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 19:56:54, Hillf Danton wrote:
> 
> Currently soft limit reclaim is frozen, see
> Documentation/admin-guide/cgroup-v2.rst for reasons.
> 
> Copying the page lru idea, memcg lru is added for selecting victim
> memcg to reclaim pages from under memory pressure. It now works in
> parallel to slr not only because the latter needs some time to reap
> but the coexistence facilitates it a lot to add the lru in a straight
> forward manner.

This doesn't explain what problem/feature you would like to fix/achieve.
It also doesn't explain the overall design.

> An LRU list paired with a spin lock is added, relying on the existing
> memcg high_work for the rest of the machinery, along with a couple of
> helpers to add a memcg to the LRU and to pick a victim from it.
> 
> V1 is based on 5.4-rc3.
> 
> Changes since v0
> - add MEMCG_LRU in init/Kconfig
> - drop changes in mm/vmscan.c
> - make memcg lru work in parallel to slr
> 
> Cc: Chris Down 
> Cc: Tejun Heo 
> Cc: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Johannes Weiner 
> Cc: Shakeel Butt 
> Cc: Matthew Wilcox 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Signed-off-by: Hillf Danton 
> ---
> 
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -843,6 +843,14 @@ config MEMCG
>   help
> Provides control over the memory footprint of tasks in a cgroup.
>  
> +config MEMCG_LRU
> + bool
> + depends on MEMCG
> + help
> +   Select victim memcg on lru for page reclaiming.
> +
> +   Say N if unsure.
> +
>  config MEMCG_SWAP
>   bool "Swap controller"
>   depends on MEMCG && SWAP
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -223,6 +223,10 @@ struct mem_cgroup {
>   /* Upper bound of normal memory consumption range */
>   unsigned long high;
>  
> +#ifdef CONFIG_MEMCG_LRU
> + struct list_head lru_node;
> +#endif
> +
>   /* Range enforcement for interrupt charges */
>   struct work_struct high_work;
>  
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2338,14 +2338,54 @@ static int memcg_hotplug_cpu_dead(unsign
>   return 0;
>  }
>  
> +#ifdef CONFIG_MEMCG_LRU
> +static DEFINE_SPINLOCK(memcg_lru_lock);
> +static LIST_HEAD(memcg_lru); /* a copy of page lru */
> +
> +static void memcg_add_lru(struct mem_cgroup *memcg)
> +{
> + spin_lock_irq(&memcg_lru_lock);
> + if (list_empty(&memcg->lru_node))
> + list_add_tail(&memcg->lru_node, &memcg_lru);
> + spin_unlock_irq(&memcg_lru_lock);
> +}
> +
> +static struct mem_cgroup *memcg_pick_lru(void)
> +{
> + struct mem_cgroup *memcg, *next;
> +
> + spin_lock_irq(&memcg_lru_lock);
> +
> + list_for_each_entry_safe(memcg, next, &memcg_lru, lru_node) {
> + list_del_init(&memcg->lru_node);
> +
> + if (page_counter_read(&memcg->memory) > memcg->high) {
> + spin_unlock_irq(&memcg_lru_lock);
> + return memcg;
> + }
> + }
> + spin_unlock_irq(&memcg_lru_lock);
> +
> + return NULL;
> +}
> +#endif
> +
>  static void reclaim_high(struct mem_cgroup *memcg,
>unsigned int nr_pages,
>gfp_t gfp_mask)
>  {
> +#ifdef CONFIG_MEMCG_LRU
> + struct mem_cgroup *start = memcg;
> +#endif
>   do {
>   if (page_counter_read(&memcg->memory) <= memcg->high)
>   continue;
>   memcg_memory_event(memcg, MEMCG_HIGH);
> + if (IS_ENABLED(CONFIG_MEMCG_LRU))
> + if (start != memcg) {
> + memcg_add_lru(memcg);
> + return;
> + }
>   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
>   } while ((memcg = parent_mem_cgroup(memcg)));
>  }
> @@ -3158,6 +3198,13 @@ unsigned long mem_cgroup_soft_limit_recl
>   unsigned long excess;
>   unsigned long nr_scanned;
>  
> + if (IS_ENABLED(CONFIG_MEMCG_LRU)) {
> + struct mem_cgroup *memcg = memcg_pick_lru();
> + if (memcg)
> + schedule_work(&memcg->high_work);
> +     return 0;
> + }
> +
>   if (order > 0)
>   return 0;
>  
> @@ -5068,6 +5115,8 @@ static struct mem_cgroup *mem_cgroup_all
>   if (memcg_wb_domain_init(memcg, GFP_KERNEL))
>   goto fail;
>  
> + if (IS_ENABLED(CONFIG_MEMCG_LRU))
> + INIT_LIST_HEAD(&memcg->lru_node);
>   INIT_WORK(&memcg->high_work, high_work_func);
>   memcg->last_scanned_node = MAX_NUMNODES;
>   INIT_LIST_HEAD(&memcg->oom_notify);
> --
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 10:48:06, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                          5.4.0-rc3              5.4.0-rc3
>                            vanilla           resetpcpu-v2
> Amean  user-256     13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean  syst-256     14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean  elsp-256       162.42 (   0.00%)      119.13 *  26.65%*
> Stddev user-256        42.97 (   0.00%)       19.15 (  55.43%)
> Stddev syst-256       336.87 (   0.00%)        6.71 (  98.01%)
> Stddev elsp-256         2.46 (   0.00%)        0.39 (  84.03%)
> 
>                      5.4.0-rc3    5.4.0-rc3
>                        vanilla resetpcpu-v2
> Duration User         39766.24     49221.79
> Duration System       44298.10     13361.67
> Duration Elapsed        519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before the patch, the breakdown of batch and high values over all zones was:
> 
> 256   batch: 1
> 256   batch: 63
> 512   batch: 7
> 256   high:  0
> 256   high:  378
> 512   high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the 
> patch
> 
> 256   batch: 1
> 768       batch: 63
> 256   high:  0
> 768   high:  378
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman 

Acked-by: Michal Hocko 
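
For anybody wondering where the batch/high numbers in the changelog come
from, below is a rough userspace model of the calculation. It mirrors what
zone_batchsize() and pageset_set_batch() did around v5.4, simplified, with
PAGE_SIZE assumed to be 4K and the boot pageset special cases left out, so
treat it as an illustration rather than the kernel code.

#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

/* roughly 1/1000th of the zone, capped at 1MB worth of pages */
static long batch_for(unsigned long managed_pages)
{
	long batch = managed_pages / 1024;

	if (batch * PAGE_SIZE > 1024 * 1024)
		batch = (1024 * 1024) / PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	/* clamp to 2^n - 1 to avoid power-of-two cache aliasing */
	return rounddown_pow_of_two(batch + batch / 2) - 1;
}

int main(void)
{
	/* managed_pages of a zone before deferred init finishes vs. after */
	unsigned long samples[] = { 32768, 4UL << 20 };

	for (int i = 0; i < 2; i++) {
		long batch = batch_for(samples[i]);

		/* high is simply 6 * batch in pageset_set_batch() */
		printf("managed=%lu batch=%ld high=%ld\n",
		       samples[i], batch, 6 * batch);
	}
	return 0;
}

With 32768 managed pages (a zone still waiting for deferred init) this
prints batch=7 high=42, and with 4M managed pages it prints batch=63
high=378, i.e. the before/after values quoted above.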

> ---
>  mm/page_alloc.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>   /* Block until all are initialised */
>   wait_for_completion(&pgdat_init_all_done_comp);
>  
> + /*
> +  * The number of managed pages has changed due to the initialisation
> +  * so the pcpu batch and high limits needs to be updated or the limits
> +  * will be artificially small.
> +  */
> + for_each_populated_zone(zone)
> + zone_pcp_update(zone);
> +
>   /*
>* We initialized the rest of the deferred pages.  Permanently disable
>* on-demand struct page initialization.
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs


Re: [patch 07/26] mm/memunmap: don't access uninitialized memmap in memunmap_pages()

2019-10-21 Thread Michal Hocko
On Mon 21-10-19 10:28:16, David Hildenbrand wrote:
> On 21.10.19 10:26, Michal Hocko wrote:
> > Has this been properly reviewed? I do not see any Acks nor Reviewed-bys.
> > 
> 
> As I modified this patch while carrying it along, it at least has my
> implicit Ack/RB.

OK, thanks!
-- 
Michal Hocko
SUSE Labs


Re: [patch 06/26] mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()

2019-10-21 Thread Michal Hocko
Has this been reviewed properly? I do not see any Acks nor Reviewed-bys.
Did Aneesh give it some testing?

On Fri 18-10-19 20:19:33, Andrew Morton wrote:
> From: David Hildenbrand 
> Subject: mm/memory_hotplug: don't access uninitialized memmaps in 
> shrink_pgdat_span()
> 
> We might use the nid of memmaps that were never initialized.  For example,
> if the memmap was poisoned, we will crash the kernel in pfn_to_nid() right
> now.  Let's use the calculated boundaries of the separate zones instead. 
> This now also avoids having to iterate over a whole bunch of subsections
> again, after shrinking one zone.
> 
> Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug"),
> the memmap was initialized to 0 and the node was set to the right value. 
> After that commit, the node might be garbage.
> 
> We'll have to fix shrink_zone_span() next.
> 
> Link: http://lkml.kernel.org/r/20191006085646.5768-4-da...@redhat.com
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
> zones until online")[d0dc12e86b319]
> Signed-off-by: David Hildenbrand 
> Reported-by: Aneesh Kumar K.V 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Cc: Alexander Duyck 
> Cc: Alexander Potapenko 
> Cc: Andy Lutomirski 
> Cc: Anshuman Khandual 
> Cc: Benjamin Herrenschmidt 
> Cc: Borislav Petkov 
> Cc: Catalin Marinas 
> Cc: Christian Borntraeger 
> Cc: Christophe Leroy 
> Cc: Damian Tometzki 
> Cc: Dave Hansen 
> Cc: Fenghua Yu 
> Cc: Gerald Schaefer 
> Cc: Greg Kroah-Hartman 
> Cc: Halil Pasic 
> Cc: Heiko Carstens 
> Cc: "H. Peter Anvin" 
> Cc: Ingo Molnar 
> Cc: Ira Weiny 
> Cc: Jason Gunthorpe 
> Cc: Jun Yao 
> Cc: Logan Gunthorpe 
> Cc: Mark Rutland 
> Cc: Masahiro Yamada 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Mel Gorman 
> Cc: Michael Ellerman 
> Cc: Mike Rapoport 
> Cc: Pankaj Gupta 
> Cc: Paul Mackerras 
> Cc: Pavel Tatashin 
> Cc: Peter Zijlstra 
> Cc: Qian Cai 
> Cc: Rich Felker 
> Cc: Robin Murphy 
> Cc: Steve Capper 
> Cc: Thomas Gleixner 
> Cc: Tom Lendacky 
> Cc: Tony Luck 
> Cc: Vasily Gorbik 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Will Deacon 
> Cc: Yoshinori Sato 
> Cc: Yu Zhao 
> Cc:   [4.13+]
> Signed-off-by: Andrew Morton 
> ---
> 
>  mm/memory_hotplug.c |   74 +-
>  1 file changed, 16 insertions(+), 58 deletions(-)
> 
> --- 
> a/mm/memory_hotplug.c~mm-memory_hotplug-dont-access-uninitialized-memmaps-in-shrink_pgdat_span
> +++ a/mm/memory_hotplug.c
> @@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone
>   zone_span_writeunlock(zone);
>  }
>  
> -static void shrink_pgdat_span(struct pglist_data *pgdat,
> -   unsigned long start_pfn, unsigned long end_pfn)
> +static void update_pgdat_span(struct pglist_data *pgdat)
>  {
> - unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
> - unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace 
> clash */
> - unsigned long pgdat_end_pfn = p;
> - unsigned long pfn;
> - int nid = pgdat->node_id;
> -
> - if (pgdat_start_pfn == start_pfn) {
> - /*
> -  * If the section is smallest section in the pgdat, it need
> -  * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
> -  * In this case, we find second smallest valid mem_section
> -  * for shrinking zone.
> -  */
> - pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
> - pgdat_end_pfn);
> - if (pfn) {
> - pgdat->node_start_pfn = pfn;
> - pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
> - }
> - } else if (pgdat_end_pfn == end_pfn) {
> - /*
> -  * If the section is biggest section in the pgdat, it need
> -  * shrink pgdat->node_spanned_pages.
> -  * In this case, we find second biggest valid mem_section for
> -  * shrinking zone.
> -  */
> - pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
> -start_pfn);
> - if (pfn)
> - pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
> - }
> -
> - /*
> -  * If the section is not biggest or smallest mem_section in the pgdat,
> -  * it only creates a hole in the pgdat. So in this case, we need not
>
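
[ The quoted diff is cut off at this point. For reference, the replacement
  update_pgdat_span() boils down to roughly the following -- a sketch
  reconstructed from the description above, not the literal hunk: derive
  the node span from the zone spans that are already tracked, instead of
  walking possibly uninitialized memmaps. ]

#include <linux/mmzone.h>

static void update_pgdat_span_sketch(struct pglist_data *pgdat)
{
	unsigned long node_start_pfn = 0, node_end_pfn = 0;
	struct zone *zone;

	for (zone = pgdat->node_zones;
	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
		unsigned long end_pfn = zone_end_pfn(zone);

		/* empty zones do not contribute to the node span */
		if (!zone->spanned_pages)
			continue;

		if (!node_end_pfn) {
			/* the first populated zone seeds the span */
			node_start_pfn = zone->zone_start_pfn;
			node_end_pfn = end_pfn;
			continue;
		}

		if (end_pfn > node_end_pfn)
			node_end_pfn = end_pfn;
		if (zone->zone_start_pfn < node_start_pfn)
			node_start_pfn = zone->zone_start_pfn;
	}

	pgdat->node_start_pfn = node_start_pfn;
	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
}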

Re: [patch 07/26] mm/memunmap: don't access uninitialized memmap in memunmap_pages()

2019-10-21 Thread Michal Hocko
Has this been properly reviewed? I do not see any Acks nor Reviewed-bys.

On Fri 18-10-19 20:19:39, Andrew Morton wrote:
> From: "Aneesh Kumar K.V" 
> Subject: mm/memunmap: don't access uninitialized memmap in memunmap_pages()
> 
> Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
> 
> This series fixes the access of uninitialized memmaps when shrinking
> zones/nodes and when removing memory.  Also, it contains all fixes for
> crashes that can be triggered when removing certain namespaces using
> memunmap_pages() - ZONE_DEVICE, as reported by Aneesh.
> 
> We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
> more involved (we don't have SECTION_IS_ONLINE as an indicator), and
> shrinking is only of limited use (set_zone_contiguous() cannot detect the
> ZONE_DEVICE as contiguous).
> 
> We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
> code to a minimum.  Shrinking is especially necessary to keep
> zone->contiguous set where possible, especially, on memory unplug of DIMMs
> at zone boundaries.
> 
> --
> 
> Zones are now properly shrunk when offlining memory blocks or when
> onlining failed.  This allows to properly shrink zones on memory unplug
> even if the separate memory blocks of a DIMM were onlined to different
> zones or re-onlined to a different zone after offlining.
> 
> Example:
> 
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
> spanned  0
> present  0
> managed  0
> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
> spanned  98304
> present  65536
> managed  65536
> :/# echo 0 > /sys/devices/system/memory/memory43/online
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
> spanned  32768
> present  32768
> managed  32768
> :/# echo 0 > /sys/devices/system/memory/memory41/online
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
> spanned  0
> present  0
> managed  0
> 
> 
> This patch (of 10):
> 
> With an altmap, the memmap pages falling into the reserved altmap space
> are not initialized and, therefore, contain a garbage NID and a garbage
> zone.
> Make sure to read the NID/zone from a memmap that was initialized.
> 
> This fixes a kernel crash that is observed when destroying a namespace:
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
> pc: c04b9728: memunmap_pages+0x238/0x340
> lr: c04b9724: memunmap_pages+0x234/0x340
> ...
> pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c00274087ba0] c09e3500 devm_action_release+0x30/0x50
> [c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
> [c00274087c30] c09dd144 device_release_driver_internal+0x174/0x240
> [c00274087c70] c09d9dfc unbind_store+0x13c/0x190
> [c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
> [c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
> [c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
> [c00274087d60] c04be45c __vfs_write+0x3c/0x70
> [c00274087d80] c04c26e4 vfs_write+0xe4/0x200
> [c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
> [c00274087e20] c000bbd0 system_call+0x5c/0x68
> 
> The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm,
> devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
> think we will never have driver reserved memory with
> MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
> 
> [da...@redhat.com: minimze code changes, rephrase description]
> Link: http://lkml.kernel.org/r/20191006085646.5768-2-da...@redhat.com
> Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to 
> arch_remove_memory")
> Signed-off-by: Aneesh Kumar K.V 
> Signed-off-by: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Jason Gunthorpe 
> Cc: Logan Gunthorpe 
> Cc: Ira Weiny 
> Cc: Damian Tometzki 
> Cc: Alexander Duyck 
> Cc: Alexander Potapenko 
> Cc: Andy Lutomirski 
> Cc: Anshuman Khandual 
> Cc: Benjamin Herrenschmidt 
> Cc: Borislav Petkov 
> Cc: Catalin Marinas 
> Cc: Christian Borntraeger 
> Cc: Christophe Leroy 
> Cc: Dave Hansen 
> Cc: Fenghua Yu 
> Cc: Gerald Schaefer 
> Cc: Greg Kroah-Hartman 
> Cc: Halil Pasic 
> Cc: Heiko Carstens 
> Cc: "H. Peter 

Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

2019-10-18 Thread Michal Hocko
It's been some time since I posted these results. The hugetlb issue got
resolved, but I would still like to hear back about these findings because
they suggest that the current bail-out strategy doesn't produce very good
results. Essentially it doesn't really help THP locality (on moderately
filled-up nodes) and it introduces a strong dependency on kswapd, which is
not a source of high-order pages. Also, the overall THP success rate is
decreased on a pretty standard "RAM is used for page cache" workload.

That makes me think that the only workload that might really benefit from
this heuristic is a THP-demanding one on a heavily fragmented node with a
lot of free memory, while the other nodes are unfragmented and also have
plenty of free memory. If that is the case, is this something to optimize
for?

I am keeping all the results for the reference in a condensed form
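
(The per-node percentages quoted below are simply Nx / (N0 + N1) from the
numa_maps output. For example, a trivial and purely illustrative helper:)

#include <stdio.h>

int main(void)
{
	/* sample N0/N1 counts taken from the first base-page run below */
	unsigned long n0 = 41963, n1 = 60437;

	/* the mapping says prefer:1, so node 1 is the preferred one */
	printf("preferred node share: %.1f%%\n",
	       100.0 * n1 / (n0 + n1));
	return 0;
}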

On Tue 01-10-19 10:37:43, Michal Hocko wrote:
> I have split out my kvm machine into two nodes to get at least some
> idea how these patches behave
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2
> node 0 size: 475 MB
> node 0 free: 432 MB
> node 1 cpus: 1 3
> node 1 size: 503 MB
> node 1 free: 458 MB
> 
> First run with 5.3 and without THP
> $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
> root@test1:~# sh thp_test.sh 
> 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 
> kernelpagesize_kB=4
> 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 
> kernelpagesize_kB=4
> 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 
> kernelpagesize_kB=4
> 
> So we get around 56-60% pages to the preferred node
> 
> Now let's enable THPs
> AnonHugePages:407552 kB
> 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 
> kernelpagesize_kB=4
> Few more runs
> AnonHugePages:407552 kB
> 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 
> kernelpagesize_kB=4
> AnonHugePages:407552 kB
> 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 
> kernelpagesize_kB=4
> 
> The utilization is again almost 100% and the preferred node usage
> varied a lot between 47-91%.
> 
> Now with 5.3 + all 4 patches this time:
> AnonHugePages:401408 kB
> 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 
> kernelpagesize_kB=4
> AnonHugePages:376832 kB
> 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 
> kernelpagesize_kB=4
> AnonHugePages:372736 kB
> 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 
> kernelpagesize_kB=4
> 
> The THP utilization varies again and the locality is higher on average,
> 76+%, which is even higher than the base page case. I was really
> wondering what the reason for that is, so I've added a simple debugging
> patch:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8caab1f81a52..565f667f6dcb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2140,9 +2140,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>* to prefer hugepage backing, retry allowing remote
>* memory as well.
>*/
> - if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> + if (!page && (gfp & __GFP_DIRECT_RECLAIM)) {
>   page = __alloc_pages_node(hpage_node,
>   gfp | __GFP_NORETRY, order);
> + printk("XXX: %s\n", !page ? "fail" : hpage_node 
> == page_to_nid(page)?"match":"fallback");
> + }
>  
>   goto out;
>   }
> 
> Cases like
> AnonHugePages:407552 kB
> 7f3ab2ebf000 prefer:1 anon=102400 dirty=102400 active=77503 N0=43520 N1=58880 
> kernelpagesize_kB=4
>  85 XXX: fallback
> a very successful ones on the other hand
> AnonHugePages:280576 kB
> 7feffd9c2000 prefer:1 anon=102400 dirty=102400 active=52563 N0=3131 N1=99269 
> kernelpagesize_kB=4
>  62 XXX: fail
>   3 XXX: fallback
> 
> Note that 62 failing THPs is 31744 base pages (62 * 512), which I suspect
> also explains the much higher locality (we simply allocate small pages
> from the preferred node).
> 
> This is just a very trivial test and I still have to grasp the outcome.
> My current feeling is that the whole thing is very timing specific and
> the higher local utilization depends on somebody else doing a reclaim
> on our behalf with patches applied.
> 
> 5.3 w

Re: [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm

2019-10-18 Thread Michal Hocko
On Fri 18-10-19 11:56:06, Mel Gorman wrote:
> Memory hotplug needs to be able to reset and reinit the pcpu allocator
> batch and high limits but this action is internal to the VM. Move
> the declaration to internal.h
> 
> Signed-off-by: Mel Gorman 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 3 ---
>  mm/internal.h  | 3 +++
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..22d6104f2341 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2219,9 +2219,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, 
> const char *fmt, ...);
>  
>  extern void setup_per_cpu_pageset(void);
>  
> -extern void zone_pcp_update(struct zone *zone);
> -extern void zone_pcp_reset(struct zone *zone);
> -
>  /* page_alloc.c */
>  extern int min_free_kbytes;
>  extern int watermark_boost_factor;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0d5f720c75ab..0a3d41c7b3c5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -165,6 +165,9 @@ extern void post_alloc_hook(struct page *page, unsigned 
> int order,
>   gfp_t gfp_flags);
>  extern int user_min_free_kbytes;
>  
> +extern void zone_pcp_update(struct zone *zone);
> +extern void zone_pcp_reset(struct zone *zone);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
>  /*
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs

