Re: [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-12-20 Thread Michal Hocko
On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
> On 30.11.18 18:59, David Hildenbrand wrote:
> > This is the second approach, introducing more meaningful memory block
> > types and not changing online behavior in the kernel. It is based on
> > latest linux-next.
> > 
> > As we found out during the discussion, user space should always handle
> > onlining of memory, in any case. However, in order to make smart decisions
> > in user space about if and how to online memory, we have to export more
> > information about memory blocks. This way, we can formulate rules in user space.
> > 
> > One such piece of information is the type of memory block we are talking about.
> > This helps to answer some questions like:
> > - Does this memory block belong to a DIMM?
> > - Can this DIMM theoretically ever be unplugged again?
> > - Was this memory added by a balloon driver that will rely on balloon
> >   inflation to remove chunks of that memory again? Which zone is advised?
> > - Is this special standby memory on s390x that is usually not automatically
> >   onlined?
> > 
> > And in short it helps to answer, to some extent (excluding zone imbalances):
> > - Should I online this memory block?
> > - To which zone should I online this memory block?
> > ... of course special use cases will result in different answers. But that's
> > why user space has control of onlining memory.
> > 
> > More details can be found in Patch 1 and Patch 3.
> > Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
> > 
> > 
> > Example:
> > $ udevadm info -q all -a /sys/devices/system/memory/memory0
> > KERNEL=="memory0"
> > SUBSYSTEM=="memory"
> > DRIVER==""
> > ATTR{online}=="1"
> > ATTR{phys_device}=="0"
> > ATTR{phys_index}==""
> > ATTR{removable}=="0"
> > ATTR{state}=="online"
> > ATTR{type}=="boot"
> > ATTR{valid_zones}=="none"
> > $ udevadm info -q all -a /sys/devices/system/memory/memory90
> > KERNEL=="memory90"
> > SUBSYSTEM=="memory"
> > DRIVER==""
> > ATTR{online}=="1"
> > ATTR{phys_device}=="0"
> > ATTR{phys_index}=="005a"
> > ATTR{removable}=="1"
> > ATTR{state}=="online"
> > ATTR{type}=="dimm"
> > ATTR{valid_zones}=="Normal"
> > 
> > 
> > RFC -> RFCv2:
> > - Now also taking care of PPC (somehow missed it :/ )
> > - Split the series up to some degree (some ideas on how to split up patch 3
> >   would be very welcome)
> > - Introduce more memory block types. Turns out abstracting too much was
> >   rather confusing and not helpful. Properly document them.
> > 
> > Notes:
> > - I wanted to convert the enum of types into a named enum but this
> >   provoked all kinds of different errors. For now, I am doing it just like
> >   the other types (e.g. online_type) we are using in that context.
> > - The "removable" property should never have been named like that. It
> >   should have been "offlinable". Can we still rename that? E.g. boot memory
> >   is sometimes marked as removable ...
> > 
> 
> 
> Any feedback regarding the suggested block types would be very much
> appreciated!

I still do not like this much, to be honest. I just didn't get to think
this through properly. My fear is that this is conflating an actual API
with the current implementation and as such will cause problems in the
future. But I haven't really looked into your patches closely, so I might
be wrong. Anyway, I won't be able to look into it by the end of the year.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Michal Hocko
he calculated kill timeout jiffies for frequent reuse */
> + kill_expires = msecs_to_jiffies(CONFIG_ANDROID_SIMPLE_LMK_KILL_TIMEOUT);
> + atomic_set(&simple_lmk_state, READY);
> + return 0;
> +}
> +
> +static const struct kernel_param_ops simple_lmk_init_ops = {
> + .set = simple_lmk_init_set
> +};
> +
> +/* Needed to prevent Android from thinking there's no LMK and thus rebooting */
> +#undef MODULE_PARAM_PREFIX
> +#define MODULE_PARAM_PREFIX "lowmemorykiller."
> +module_param_cb(minfree, &simple_lmk_init_ops, NULL, 0200);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1549584a1..d290f9ece 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1199,6 +1199,9 @@ struct task_struct {
>   unsigned long   lowest_stack;
>   unsigned long   prev_lowest_stack;
>  #endif
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + bool slmk_sigkill_sent;
> +#endif
>  
>   /*
>* New fields for task_struct should be added above here, so that
> diff --git a/include/linux/simple_lmk.h b/include/linux/simple_lmk.h
> new file mode 100644
> index 0..64c26368a
> --- /dev/null
> +++ b/include/linux/simple_lmk.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright (C) 2019 Sultan Alsawaf .
> + */
> +#ifndef _SIMPLE_LMK_H_
> +#define _SIMPLE_LMK_H_
> +
> +struct page *simple_lmk_oom_alloc(unsigned int order, int migratetype);
> +bool simple_lmk_page_in(struct page *page, unsigned int order, int migratetype);
> +
> +#endif /* _SIMPLE_LMK_H_ */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9dcd18aa2..162c45392 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1881,6 +1881,9 @@ static __latent_entropy struct task_struct *copy_process(
>   p->sequential_io= 0;
>   p->sequential_io_avg= 0;
>  #endif
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + p->slmk_sigkill_sent = false;
> +#endif
>  
>   /* Perform scheduler related setup. Assign this task to a CPU. */
>   retval = sched_fork(clone_flags, p);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3eb01dedf..fd0d697c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -67,6 +67,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/simple_lmk.h>
>  
>  #include 
>  #include 
> @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page,
>   }
>   }
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + if (simple_lmk_page_in(page, order, migratetype))
> + return;
> +#endif
> +
>   list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
>  out:
>   zone->free_area[order].nr_free++;
> @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>   goto nopage;
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + page = simple_lmk_oom_alloc(order, ac->migratetype);
> + if (page)
> + prep_new_page(page, order, gfp_mask, alloc_flags);
> + goto got_pg;
> +#endif
> +
>   if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>did_some_progress > 0, &no_progress_loops))
>   goto retry;
> -- 
> 2.21.0

-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> On Mon, Mar 11, 2019 at 1:46 PM Sultan Alsawaf  wrote:
> >
> > On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > > The idea seems interesting although I need to think about this a bit
> > > more. Killing processes based on failed page allocation might backfire
> > > during transient spikes in memory usage.
> >
> > This issue could be alleviated if tasks could be killed and have their pages
> > reaped faster. Currently, Linux takes a _very_ long time to free a task's 
> > memory
> > after an initial privileged SIGKILL is sent to a task, even with the task's
> > priority being set to the highest possible (so unwanted scheduler preemption
> > starving dying tasks of CPU time is not the issue at play here). I've
> > frequently measured the difference in time between when a SIGKILL is sent 
> > for a
> > task and when free_task() is called for that task to be hundreds of
> > milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD
> > suffers from as well, and perhaps any OOM killer implementation in Linux, 
> > since
> > you cannot evaluate effect you've had on memory pressure by killing a 
> > process
> > for at least several tens of milliseconds.
> 
> Yeah, killing speed is a well-known problem which we are considering
> in LMKD. For example the recent LMKD change to assign process being
> killed to a cpuset cgroup containing big cores cuts the kill time
> considerably. This is not ideal and we are thinking about better ways
> to expedite the cleanup process.

If your design relies on the speed of killing then it is fundamentally
flawed AFAICT. You cannot assume anything about how quickly a task dies.
It might be blocked in an uninterruptible sleep or performing an
operation which takes some time. Sure, oom_reaper might help here but
still.

The only way to control the OOM behavior pro-actively is to throttle
allocation speed. We have the memcg high limit for that purpose. Along with
PSI, I can imagine reasonably working user space early OOM notifications
and reasonable action being taken upon them.
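
As a rough illustration of that last point, a minimal user space sketch
(assuming a kernel with PSI enabled, so /proc/pressure/memory exists) of
where such an early notification daemon could start; thresholds and the
actual reaction are policy and deliberately left out:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char line[256];

        for (;;) {
                FILE *f = fopen("/proc/pressure/memory", "r");

                if (f) {
                        /* lines look like "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
                        while (fgets(line, sizeof(line), f))
                                fputs(line, stdout);
                        fclose(f);
                }
                sleep(1);
        }
        return 0;
}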
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > Yeah, killing speed is a well-known problem which we are considering
> > > in LMKD. For example the recent LMKD change to assign process being
> > > killed to a cpuset cgroup containing big cores cuts the kill time
> > > considerably. This is not ideal and we are thinking about better ways
> > > to expedite the cleanup process.
> > 
> > If your design relies on the speed of killing then it is fundamentally
> > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > It might be blocked in an uninterruptible sleep or performing an
> > operation which takes some time. Sure, oom_reaper might help here but
> > still.
> 
> Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps.  It just
> needs someone to do the work.

They can and should as much as possible. No question about that. But not
all of them can and that is why nobody should be relying on that. That
is the whole point of having the oom_reaper and async oom victim tear
down.

-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 16:33:15, Michal Hocko wrote:
> On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > > Yeah, killing speed is a well-known problem which we are considering
> > > > in LMKD. For example the recent LMKD change to assign process being
> > > > killed to a cpuset cgroup containing big cores cuts the kill time
> > > > considerably. This is not ideal and we are thinking about better ways
> > > > to expedite the cleanup process.
> > > 
> > > If your design relies on the speed of killing then it is fundamentally
> > > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > > It might be blocked in an uninterruptible sleep or performing an
> > > operation which takes some time. Sure, oom_reaper might help here but
> > > still.
> > 
> > Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps.  It just
> > needs someone to do the work.
> 
> They can and should as much as possible. No question about that. But not
> all of them can and that is why nobody should be relying on that. That
> is the whole point of having the oom_reaper and async oom victim tear
> down.

Let me clarify a bit. LMK obviously doesn't need any guarantee like the
core oom killer because it is more of a pro-active measure than the last
resort. I merely wanted to say that relying on a design which assumes
anything about the time a victim needs to exit is flawed and it will fail
under different workloads. On the other hand, this might work well enough
on very specific workloads to be usable. I am not questioning that. The
point is that this is not generic enough to be accepted to the upstream
kernel.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > The only way to control the OOM behavior pro-actively is to throttle
> > allocation speed. We have memcg high limit for that purpose. Along with
> > PSI, I can imagine a reasonably working user space early oom
> > notifications and reasonable acting upon that.
> 
> The issue with pro-active memory management that prompted me to create this 
> was
> poor memory utilization. All of the alternative means of reclaiming pages in 
> the
> page allocator's slow path turn out to be very useful for maximizing memory
> utilization, which is something that we would have to forgo by relying on a
> purely pro-active solution. I have not had a chance to look at PSI yet, but
> unless a PSI-enabled solution allows allocations to reach the same point as 
> when
> the OOM killer is invoked (which is contradictory to what it sets out to do),
> then it cannot take advantage of all of the alternative memory-reclaim means
> employed in the slowpath, and will result in killing a process before it is
> _really_ necessary.

If you really want to reach the real OOM situation then you can very
well rely on the in-kernel OOM killer. The only reason you want a
customized oom killer is task classification. And that is a
different story. User space hints on victim selection have been a
topic for quite a while. It never got to any conclusion as interested
parties have always lost interest because it got hairy quickly.

> > If your design relies on the speed of killing then it is fundamentally
> > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > It might be blocked in an uninterruptible sleep or performing an
> > operation which takes some time. Sure, oom_reaper might help here but
> > still.
> 
> In theory we could instantly zap any process that is not trapped in the kernel
> at the time that the OOM killer is invoked without any consequences though, 
> no?

No, this is not so simple. Have a look at the oom_reaper and the hoops it has
to go through.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> I have not had a chance to look at PSI yet, but
> unless a PSI-enabled solution allows allocations to reach the same point as 
> when
> the OOM killer is invoked (which is contradictory to what it sets out to do),
> then it cannot take advantage of all of the alternative memory-reclaim means
> employed in the slowpath, and will result in killing a process before it is
> _really_ necessary.

One more note. The above is true, but you can also hit one of the
thrashing reclaim behaviors and reclaim the last few pages again and again
with the whole system being really sluggish. That is what PSI is trying to
help with.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] memory_hotplug: Free pages as higher order

2018-09-25 Thread Michal Hocko
On Tue 25-09-18 11:59:09, Vlastimil Babka wrote:
[...]
> This seems like almost complete copy of __free_pages_boot_core(), could
> you do some code reuse instead? I think Michal Hocko also suggested that.

Yes, please try to reuse as much code as possible.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread Michal Hocko
On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
[...]

I haven't read the patch itself but I just wanted to note one thing
about this part

> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).

I have proposed (but haven't finished this due to other stuff) a
solution for this. Newly added memory can host memmaps itself and then
you do not have the problem in the first place. For vmemmap it would
have an advantage that you do not really have to beg for 2MB pages to
back the whole section but you would get it for free because the initial
part of the section is by definition properly aligned and unused.
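(As a rough worked example: a 128MB section has 32768 base pages of 4kB; at
64 bytes of struct page each that is exactly 2MB of vmemmap, i.e. a single
properly aligned 2MB page at the start of the section.)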

I have yet to think through the whole proposal, but I am missing the most
important part: _who_ is going to use the newly exported information and
for what purpose? You said that distributions have a hard time
distinguishing different types of onlining policies, but isn't this
something that is inherently use-case specific?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-02 Thread Michal Hocko
On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
> On 01/10/2018 10:40, Michal Hocko wrote:
> > On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
> > [...]
> > 
> > I haven't read the patch itself but I just wanted to note one thing
> > about this part
> > 
> >> For paravirtualized devices it is relevant that memory is onlined as
> >> quickly as possible after adding - and that it is added to the NORMAL
> >> zone. Otherwise, it could happen that too much memory in a row is added
> >> (but not onlined), resulting in out-of-memory conditions due to the
> >> additional memory for "struct pages" and friends. MOVABLE zone as well
> >> as delays might be very problematic and lead to crashes (e.g. zone
> >> imbalance).
> > 
> > I have proposed (but haven't finished this due to other stuff) a
> > solution for this. Newly added memory can host memmaps itself and then
> > you do not have the problem in the first place. For vmemmap it would
> > have an advantage that you do not really have to beg for 2MB pages to
> > back the whole section but you would get it for free because the initial
> > part of the section is by definition properly aligned and unused.
> 
> So the plan is to "host metadata for new memory on the memory itself".
> Just want to note that this is basically impossible for s390x with the
> current mechanisms. (added memory is dead, until onlining notifies the
> hypervisor and memory is allocated). It will also be problematic for
> paravirtualized memory devices (e.g. XEN's "not backed by the
> hypervisor" hacks).

OK, I understand that not all use cases can use self-hosted memmaps; the
others do not have much choice left though. You have to allocate from
somewhere. Well, an alternative would be to have no memmap until
onlining, but I am not sure how much work that would be.

> This would only be possible for memory DIMMs, memory that is completely
> accessible as far as I can see. Or at least, some specified "first part"
> is accessible.
> 
> Other problems are other metadata like extended struct pages and friends.

I wouldn't really worry about extended struct pages. Those should be
used for debugging purposes mostly. Or at least that was the case the last
time I checked.

> (I really like the idea of adding memory without allocating memory in
> the hypervisor in the first place, please keep me tuned).
> 
> And please note: This solves some problematic part ("adding too much
> memory to the movable zone or not onlining it"), but not the issue of
> zone imbalance in the first place. And not one issue I try to tackle
> here: don't add paravirtualized memory to the movable zone.

Zone imbalance is an inherent problem of the highmem zone. It is
essentially the highmem zone we all loved so much back in 32b days.
Yes the movable zone doesn't have any addressing limitations so it is a
bit more relaxed but considering the hotplug scenarios I have seen so
far people just want to have full NUMA nodes movable to allow replacing
DIMMs. And then we are back to square one and the zone imbalance issue.
You have those regardless where memmaps are allocated from.

> > I have yet to think through the whole proposal, but I am missing the most
> > important part: _who_ is going to use the newly exported information and
> > for what purpose? You said that distributions have a hard time
> > distinguishing different types of onlining policies, but isn't this
> > something that is inherently use-case specific?
> > 
> 
> Let's think about a distribution. We have a clash of use cases here
> (just what you describe). What I propose solves one part of it ("handle
> what you know how to handle right in the kernel").
> 
> 1. Users of DIMMs usually expect that they can be unplugged again. That
> is why you want to control how to online memory in user space (== add it
> to the movable zone).

Which is only true if you really want to hot-remove them. I am not going
to say how much I believe in this use case, but the movable policy is not
generally applicable here.

> 2. Users of standby memory (s390) expect that memory will never be
> onlined automatically. It will be onlined manually.

yeah

> 3. Users of paravirtualized devices (esp. Hyper-V) don't care about
> memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
> add a whole bunch of memory and expect that everything works fine. So
> that memory is onlined immediately and that memory is added to the
> NORMAL zone. Users never want the MOVABLE zone.

Then the immediate question would be why to use memory hotplug for that
at all? Why don't you simply start with a huge pre-allocated physical
address space and balloon memory in and out on demand? Why do you want
to inject new memory during the runtime?

Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
> David Hildenbrand  writes:
> 
> > On 02/10/2018 15:47, Michal Hocko wrote:
> ...
> >> 
> >> Why do you need a generic hotplug rule in the first place? Why don't you
> >> simply provide different set of rules for different usecases? Let users
> >> decide which usecase they prefer rather than try to be clever which
> >> almost always hits weird corner cases.
> >> 
> >
> > Memory hotplug has to work as reliable as we can out of the box. Letting
> > the user make simple decisions like "oh, I am on hyper-V, I want to
> > online memory to the normal zone" does not feel right. But yes, we
> > should definitely allow to make modifications.
> 
> Last time I was thinking about the imperfectness of the auto-online
> solution we have and any other solution we're able to suggest an idea
> came to my mind - what if we add an eBPF attach point to the
> auto-onlining mechanism, effectively offloading decision-making to
> userspace. We'll of course need to provide all required data (e.g. how
> memory blocks are aligned with physical DIMMs as it makes no sense to
> online part of DIMM as normal and the rest as movable as it's going to
> be impossible to unplug such DIMM anyways).

And how does that differ from the notification mechanism we have? Just
by not relying on the process scheduling? If yes then this revolves
around the implementation detail that you care about time-to-hot-add
vs. time-to-online. And that is a solveable problem - just allocate
memmaps from the hot-added memory.

As David said some of the memory cannot be onlined without further steps
(e.g. when it is standby, as David called it) and then I fail to see how
eBPF helps in any way.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
> On 02/10/2018 15:47, Michal Hocko wrote:
[...]
> > Zone imbalance is an inherent problem of the highmem zone. It is
> > essentially the highmem zone we all loved so much back in 32b days.
> > Yes the movable zone doesn't have any addressing limitations so it is a
> > bit more relaxed but considering the hotplug scenarios I have seen so
> > far people just want to have full NUMA nodes movable to allow replacing
> > DIMMs. And then we are back to square one and the zone imbalance issue.
> > You have those regardless where memmaps are allocated from.
> 
> Unfortunately yes. And things get more complicated as you are adding a
> whole DIMMs and get notifications in the granularity of memory blocks.
> Usually you are not interested in onlining any memory block of that DIMM
> as MOVABLE as soon as you would have to online one memory block of that
> DIMM as NORMAL - because that can already block the whole DIMM.

For the purpose of hotremove, yes. But as Dave has noted, people are
(ab)using zone movable for other purposes - e.g. large pages.
 
[...]
> > Then the immediate question would be why to use memory hotplug for that
> > at all? Why don't you simply start with a huge pre-allocated physical
> > address space and balloon memory in and out on demand? Why do you want
> > to inject new memory during the runtime?
> 
> Let's assume you have a guest with 20GB size and eventually want to
> allow to grow it to 4TB. You would have to allocate metadata for 4TB
> right from the beginning. That's definitely not what we want. That is
> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
> hypervisor even tells you at which places additional memory has been
> made available.

Then you have to live with the fact that your hot-added memory will be
self-hosted and find a way for ballooning to work with that. The price
would be that some part of the memory is not really balloonable in the
end.

> >> 1. is a reason why distributions usually don't configure
> >> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
> >> MOVABLE zone. That however implies, that e.g. for x86, you have to
> >> handle all new memory in user space, especially also HyperV memory.
> >> There, you then have to check for things like "isHyperV()" to decide
> >> "oh, yes, this should definitely not go to the MOVABLE zone".
> > 
> > Why do you need a generic hotplug rule in the first place? Why don't you
> > simply provide different set of rules for different usecases? Let users
> > decide which usecase they prefer rather than try to be clever which
> > almost always hits weird corner cases.
> > 
> 
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right.

Users usually know what their use case is, and then it is just a matter of
plumbing (e.g. the distribution can provide proper tools to deploy those
use cases) to choose the right (and, for the user, obscure) way to make it work.

> But yes, we
> should definitely allow to make modifications. So some sane default rule
> + possible modification is usually a good idea.
> 
> I think Dave has a point with using MOVABLE for huge page use cases. And
> there might be other corner cases as you correctly state.
> 
> I wonder if this patch itself minus modifying online/offline might make
> sense. We can then implement simple rules in user space
> 
> if (normal) {
>   /* customers expect hotplugged DIMMs to be unpluggable */
>   online_movable();
> } else if (paravirt) {
>   /* paravirt memory should as default always go to the NORMAL */
>   online();
> } else {
>   /* standby memory will never get onlined automatically */
> }
> 
> Compared to having to guess what is to be done (isKVM(), isHyperV,
> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
> and ACPI support for x86 KVM).

I am worried that exporting a type will just push us even further into the
corner. The current design is really simple and two-stage, and that is good
because it allows for very different use cases. The more specific the API
becomes, the more likely we are going to hit "I haven't even dreamed somebody
would be using hotplug for this thing". And I would bet this will happen
sooner or later.

Just look at how the whole auto-onlining screwed the API to work around
an implementation detail. It has created a single-purpose behavior that
doesn't suit many use cases. Yet we have to live with that because
somebody really relies on it. Let's not repeat the same errors.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
[...]
> > As David said some of the memory cannot be onlined without further steps
> > (e.g. when it is standby, as David called it) and then I fail to see how
> > eBPF helps in any way.
> 
> and also, we can fight till the end of days here trying to come up with
> an onlining solution which would work for everyone and eBPF would move
> this decision to distro level.

The point is that there is _no_ general onlining solution. This is
basically a policy which belongs to userspace.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> > Dave Hansen  writes:
> > 
> >> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> >>> It is more than just memmaps (e.g. forking udev process doing memory
> >>> onlining also needs memory) but yes, the main idea is to make the
> >>> onlining synchronous with hotplug.
> >>
> >> That's a good theoretical concern.
> >>
> >> But, is it a problem we need to solve in practice?
> > 
> > Yes, unfortunately. It was previously discovered that when we try to
> > hotplug tons of memory to a low memory system (this is a common scenario
> > with VMs) we end up with OOM because for all new memory blocks we need
> > to allocate page tables, struct pages, ... and we need memory to do
> > that. The userspace program doing memory onlining also needs memory to
> > run and in case it prefers to fork to handle hundreds of notifications
> > ... well, it may get OOMkilled before it manages to online anything.
> > 
> > Allocating all kernel objects from the newly hotplugged blocks would
> > definitely help to manage the situation but as I said this won't solve
> > the 'forking udev' problem completely (it will likely remain in
> > 'extreme' cases only. We can probably work around it by onlining with a
> > dedicated process which doesn't do memory allocation).
> > 
> 
> I guess the problem is even worse. We always have two phases
> 
> 1. add memory - requires memory allocation
> 2. online memory - might require memory allocations e.g. for slab/slub
> 
> So if we just added memory but don't have sufficient memory to start a
> user space process to trigger onlining, then we most likely also don't
> have sufficient memory to online the memory right away (in some scenarios).
> 
> We would have to allocate all new memory for 1 and 2 from the memory to
> be onlined. I guess the latter part is less trivial.
> 
> So while onlining the memory from the kernel might make things a little
> more robust, we would still have the chance for OOM / onlining failing.

Yes, _theoretically_. Is this a practical problem for reasonable
configurations though? I mean, this will never be perfect and we simply
cannot support all possible configurations. We should focus on a
reasonable subset of them. From my practical experience, the vast
majority of the memory consumed by hotplug itself goes to memmaps
(roughly 1.5% of the added memory). That is not a lot, but I agree that
allocating that from the Normal zone, and off-node, is not great.
Especially the second part, which is noticeable for whole-node hotplug.
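(For reference: with a 64-byte struct page per 4kB base page that is
64/4096, i.e. ~1.56%, or about 2MB of memmap for each 128MB section.)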

I have a feeling that arguing about fork not being able to proceed, or
about OOMing during memory hotplug, is a bit of a stretch and a sign of a
misconfiguration.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 19:00:29, David Hildenbrand wrote:
[...]
> Let me rephrase: You state that user space has to make the decision and
> that user should be able to set/reconfigure rules. That is perfectly fine.
> 
> But then we should give user space access to sufficient information to
> make a decision. This might be the type of memory as we learned (what
> some part of this patch proposes), but maybe later more, e.g. to which
> physical device memory belongs (e.g. to hotplug it all movable or all
> normal) ...

I am pretty sure that the user knows he/she wants to use ballooning in
Hyper-V or Xen, or that memory hotplug should be used as a "RAS"
feature to allow adding and removing DIMMs for reliability. Why shouldn't we
have a package to deploy an appropriate set of udev rules for each of
those use cases? I am pretty sure you need some other plumbing to enable
them anyway (e.g. RAS would require the movable_node kernel
parameter, ballooning a kernel module, etc.).

Really, one udev script to rule them all will simply never work.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] memory_hotplug: Free pages as higher order

2018-10-04 Thread Michal Hocko
On Wed 03-10-18 19:09:39, Arun KS wrote:
[...]
> +static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
> +{
> + unsigned long end = start + nr_pages;
> + int order, ret, onlined_pages = 0;
> +
> + while (start < end) {
> + order = min(MAX_ORDER - 1UL, __ffs(start));
> +
> + while (start + (1UL << order) > end)
> + order--;

this really made me scratch my head. Wouldn't it be much simpler to do
the following?
order = min(MAX_ORDER - 1, get_order(end - start))?

> +
> + ret = (*online_page_callback)(pfn_to_page(start), order);
> + if (!ret)
> + onlined_pages += (1UL << order);
> + else if (ret > 0)
> + onlined_pages += ret;
> +
> + start += (1UL << order);
> + }
> + return onlined_pages;
>  }
[...]
> -static void __init __free_pages_boot_core(struct page *page, unsigned int order)
> +void __free_pages_core(struct page *page, unsigned int order)
>  {
>   unsigned int nr_pages = 1 << order;
>   struct page *p = page;
>   unsigned int loop;
>  
> - prefetchw(p);
> - for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> - prefetchw(p + 1);
> + for (loop = 0; loop < nr_pages; loop++, p++) {
>   __ClearPageReserved(p);
>   set_page_count(p, 0);
>   }
> - __ClearPageReserved(p);
> - set_page_count(p, 0);
>  
>   page_zone(page)->managed_pages += nr_pages;
>   set_page_refcounted(page);

I think this is worth a separate patch as it is unrelated to the rest of this one.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-09 Thread Michal Hocko
On Fri 05-10-18 13:40:05, Arun KS wrote:
> When free pages are done with higher order, time spend on
> coalescing pages by buddy allocator can be reduced. With
> section size of 256MB, hot add latency of a single section
> shows improvement from 50-60 ms to less than 1 ms, hence
> improving the hot add latency by 60%. Modify external
> providers of online callback to align with the change.

Acked-by: Michal Hocko 

Thanks for your patience with all the resubmission.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 2/2] mm/page_alloc: remove software prefetching in __free_pages_core

2018-10-09 Thread Michal Hocko
On Fri 05-10-18 13:40:06, Arun KS wrote:
> They not only increase the code footprint, they actually make things
> slower rather than faster. Remove them as contemporary hardware doesn't
> need any hint.

I agree with the change but it is much better to add some numbers
whenever arguing about performance impact.

> 
> Suggested-by: Dan Williams 
> Signed-off-by: Arun KS 
> ---
>  mm/page_alloc.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7ab5274..90db431 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1258,14 +1258,10 @@ void __free_pages_core(struct page *page, unsigned int order)
>   struct page *p = page;
>   unsigned int loop;
>  
> - prefetchw(p);
> - for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> - prefetchw(p + 1);
> + for (loop = 0; loop < nr_pages ; loop++, p++) {
>   __ClearPageReserved(p);
>   set_page_count(p, 0);
>   }
> - __ClearPageReserved(p);
> - set_page_count(p, 0);
>  
>   page_zone(page)->managed_pages += nr_pages;
>   set_page_refcounted(page);
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 15:24:13, Arun KS wrote:
> On 2018-10-09 14:59, Michal Hocko wrote:
> > On Fri 05-10-18 13:40:05, Arun KS wrote:
> > > When free pages are done with higher order, time spend on
> > > coalescing pages by buddy allocator can be reduced. With
> > > section size of 256MB, hot add latency of a single section
> > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > improving the hot add latency by 60%. Modify external
> > > providers of online callback to align with the change.
> > 
> > Acked-by: Michal Hocko 
> > 
> > Thanks for your patience with all the resubmission.
> 
> Hello Michal,
> 
> I got the below email few days back. Do I still need to resubmit the patch?
> or is it already on the way to merge?

It is sitting in the queue for now. Andrew collects acks and
reviewed-bys and adds them to the patch.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 22:26:41, Arun KS wrote:
> On 2018-10-10 21:00, Vlastimil Babka wrote:
> > On 10/5/18 10:10 AM, Arun KS wrote:
> > > When free pages are done with higher order, time spend on
> > > coalescing pages by buddy allocator can be reduced. With
> > > section size of 256MB, hot add latency of a single section
> > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > improving the hot add latency by 60%. Modify external
> > > providers of online callback to align with the change.
> > > 
> > > Signed-off-by: Arun KS 
> > 
> > [...]
> > 
> > > @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> > >  }
> > >  EXPORT_SYMBOL_GPL(__online_page_free);
> > > 
> > > -static void generic_online_page(struct page *page)
> > > +static int generic_online_page(struct page *page, unsigned int order)
> > >  {
> > > - __online_page_set_limits(page);
> > 
> > This is now not called anymore, although the xen/hv variants still do
> > it. The function seems empty these days, maybe remove it as a followup
> > cleanup?
> > 
> > > - __online_page_increment_counters(page);
> > > - __online_page_free(page);
> > > + __free_pages_core(page, order);
> > > + totalram_pages += (1UL << order);
> > > +#ifdef CONFIG_HIGHMEM
> > > + if (PageHighMem(page))
> > > + totalhigh_pages += (1UL << order);
> > > +#endif
> > 
> > __online_page_increment_counters() would have used
> > adjust_managed_page_count() which would do the changes under
> > managed_page_count_lock. Are we safe without the lock? If yes, there
> > should perhaps be a comment explaining why.
> 
> Looks unsafe without managed_page_count_lock.

Why does it matter actually? We cannot online/offline memory in
parallel. This is not the case for the boot where we initialize memory
in parallel on multiple nodes. So this seems to be safe currently unless
I am missing something. A comment explaining that would be helpful
though.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-11 Thread Michal Hocko
On Thu 11-10-18 07:59:32, Arun KS wrote:
> On 2018-10-10 23:03, Michal Hocko wrote:
> > On Wed 10-10-18 22:26:41, Arun KS wrote:
> > > On 2018-10-10 21:00, Vlastimil Babka wrote:
> > > > On 10/5/18 10:10 AM, Arun KS wrote:
> > > > > When free pages are done with higher order, time spend on
> > > > > coalescing pages by buddy allocator can be reduced. With
> > > > > section size of 256MB, hot add latency of a single section
> > > > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > > > improving the hot add latency by 60%. Modify external
> > > > > providers of online callback to align with the change.
> > > > >
> > > > > Signed-off-by: Arun KS 
> > > >
> > > > [...]
> > > >
> > > > > @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(__online_page_free);
> > > > >
> > > > > -static void generic_online_page(struct page *page)
> > > > > +static int generic_online_page(struct page *page, unsigned int order)
> > > > >  {
> > > > > - __online_page_set_limits(page);
> > > >
> > > > This is now not called anymore, although the xen/hv variants still do
> > > > it. The function seems empty these days, maybe remove it as a followup
> > > > cleanup?
> > > >
> > > > > - __online_page_increment_counters(page);
> > > > > - __online_page_free(page);
> > > > > + __free_pages_core(page, order);
> > > > > + totalram_pages += (1UL << order);
> > > > > +#ifdef CONFIG_HIGHMEM
> > > > > + if (PageHighMem(page))
> > > > > + totalhigh_pages += (1UL << order);
> > > > > +#endif
> > > >
> > > > __online_page_increment_counters() would have used
> > > > adjust_managed_page_count() which would do the changes under
> > > > managed_page_count_lock. Are we safe without the lock? If yes, there
> > > > should perhaps be a comment explaining why.
> > > 
> > > Looks unsafe without managed_page_count_lock.
> > 
> > Why does it matter actually? We cannot online/offline memory in
> > parallel. This is not the case for the boot where we initialize memory
> > in parallel on multiple nodes. So this seems to be safe currently unless
> > I am missing something. A comment explaining that would be helpful
> > though.
> 
> Other main callers of adjust_managed_page_count(),
> 
> static inline void free_reserved_page(struct page *page)
> {
> __free_reserved_page(page);
> adjust_managed_page_count(page, 1);
> }
> 
> static inline void mark_page_reserved(struct page *page)
> {
> SetPageReserved(page);
> adjust_managed_page_count(page, -1);
> }
> 
> Won't they race with memory hotplug?
> 
> Few more,
> ./drivers/xen/balloon.c:519:adjust_managed_page_count(page, -1);
> ./drivers/virtio/virtio_balloon.c:175:  adjust_managed_page_count(page, -1);
> ./drivers/virtio/virtio_balloon.c:196:  adjust_managed_page_count(page, 1);
> ./mm/hugetlb.c:2158:adjust_managed_page_count(page, 1 <<
> h->order);

They can, and I have missed those.
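
For context, this is roughly what adjust_managed_page_count() looks like
in this kernel generation (paraphrased from mm/page_alloc.c, so treat the
exact shape as an approximation); the open-coded counter updates in the
new generic_online_page() bypass exactly this lock:

void adjust_managed_page_count(struct page *page, long count)
{
        spin_lock(&managed_page_count_lock);
        page_zone(page)->managed_pages += count;
        totalram_pages += count;
#ifdef CONFIG_HIGHMEM
        if (PageHighMem(page))
                totalhigh_pages += count;
#endif
        spin_unlock(&managed_page_count_lock);
}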

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-11 Thread Michal Hocko
On Thu 11-10-18 10:07:02, Vlastimil Babka wrote:
> On 10/10/18 6:56 PM, Arun KS wrote:
> > On 2018-10-10 21:00, Vlastimil Babka wrote:
> >> On 10/5/18 10:10 AM, Arun KS wrote:
> >>> When free pages are done with higher order, time spend on
> >>> coalescing pages by buddy allocator can be reduced. With
> >>> section size of 256MB, hot add latency of a single section
> >>> shows improvement from 50-60 ms to less than 1 ms, hence
> >>> improving the hot add latency by 60%. Modify external
> >>> providers of online callback to align with the change.
> >>>
> >>> Signed-off-by: Arun KS 
> >>
> >> [...]
> >>
> >>> @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(__online_page_free);
> >>>
> >>> -static void generic_online_page(struct page *page)
> >>> +static int generic_online_page(struct page *page, unsigned int order)
> >>>  {
> >>> - __online_page_set_limits(page);
> >>
> >> This is now not called anymore, although the xen/hv variants still do
> >> it. The function seems empty these days, maybe remove it as a followup
> >> cleanup?
> >>
> >>> - __online_page_increment_counters(page);
> >>> - __online_page_free(page);
> >>> + __free_pages_core(page, order);
> >>> + totalram_pages += (1UL << order);
> >>> +#ifdef CONFIG_HIGHMEM
> >>> + if (PageHighMem(page))
> >>> + totalhigh_pages += (1UL << order);
> >>> +#endif
> >>
> >> __online_page_increment_counters() would have used
> >> adjust_managed_page_count() which would do the changes under
> >> managed_page_count_lock. Are we safe without the lock? If yes, there
> >> should perhaps be a comment explaining why.
> > 
> > Looks unsafe without managed_page_count_lock. I think better have a 
> > similar implementation of free_boot_core() in memory_hotplug.c like we 
> > had in version 1 of patch. And use adjust_managed_page_count() instead 
> > of page_zone(page)->managed_pages += nr_pages;
> > 
> > https://lore.kernel.org/patchwork/patch/989445/
> 
> Looks like deferred_free_range() has the same problem calling
> __free_pages_core() to adjust zone->managed_pages.

deferred initialization has one thread per node AFAIR so we cannot race
on managed_pages updates. Well, unless some of the mentioned callers can run
that early, which I don't know.

> __free_pages_bootmem() is OK because at that point the system is still
> single-threaded?
> Could be solved by moving that out of __free_pages_core().
> 
> But do we care about readers potentially seeing a store tear? If yes
> then maybe these counters should be converted to atomics...

I wanted to suggest that already but I have no idea whether the lock
instructions would cause more overhead.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-19 Thread Michal Hocko
On Thu 18-10-18 19:18:25, Andrew Morton wrote:
[...]
> So this patch needs more work, yes?

Yes, I've talked to Arun (he is offline until next week) offlist and he
will play with this some more.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 12:52:13, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> > Sorry to say, but that is the current practice without which
> > makedumpfile would not be able to work at all. (exclude user pages,
> > exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> > here. This is how dumping works forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.
> 
> > I don't see how there should be "set of pages which do not have
> > PG_offline".
> 
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel, causes all kinds of havoc like machine checks. etc.
> We've had and still have one issue like that.
> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
> 
> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

I am not familiar enough with kexec to judge this particular patch, but we
cannot simply define a range for these pages (same as for the hwpoison
ones) because they can be almost anywhere in the available memory range.
And there can be countless of them. There is no other way to rule them
out but to check the page state.

I am not really sure what the concern here is exactly. Kdump is so
closely tied to the specific kernel version that an API exported
specifically for its purpose cannot seriously be considered an ABI.
Kdump has to adapt all the time.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-15 Thread Michal Hocko
[Cc Konstantin - the patch is http://lkml.kernel.org/r/20181114211704.6381-3-da...@redhat.com]

On Thu 15-11-18 10:21:13, David Hildenbrand wrote:
> On 15.11.18 03:07, Mike Rapoport wrote:
> > On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
> >> On 14.11.18 23:23, Matthew Wilcox wrote:
> >>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
> >>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
> >>>> logically offline, the content stale and that it should not be touched
> >>>> (e.g. a hypervisor would have to allocate backing storage in order for 
> >>>> the
> >>>> guest to dump an unused page).  We can then e.g. exclude such pages from
> >>>> dumps.
> >>>>
> >>>> In following patches, we will make use of this bit also in other balloon
> >>>> drivers.  While at it, document PGTABLE.
> >>>
> >>> Thank you for documenting PGTABLE.  I didn't realise I also had this
> >>> document to update when I added PGTABLE.
> >>
> >> Thank you for looking into this :)
> >>
> >>>
> >>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
> >>>>  23. BALLOON
> >>>>  24. ZERO_PAGE
> >>>>  25. IDLE
> >>>> +26. PGTABLE
> >>>> +27. OFFLINE
> >>>
> >>> So the offline *user* bit is new ... even though the *kernel* bit
> >>> just renames the balloon bit.  I'm not sure how I feel about this.
> >>> I'm going to think about it some more.  Could you share your decision
> >>> process with us?
> >>
> >> BALLOON was/is documented as
> >>
> >> "23 - BALLOON
> >> balloon compaction page
> >> "
> >>
> >> and only includes all virtio-balloon pages after the non-lru migration
> >> feature has been implemented for ballooned pages. Since then, this flag
> >> basically no longer stands for what it was actually supposed to do.
> > 
> > Perhaps I am missing something, but how should the user interpret "23" when he
> > reads /proc/kpageflags?
> 
> Looking at the history in more detail:
> 
> commit 09316c09dde33aae14f34489d9e3d243ec0d5938
> Author: Konstantin Khlebnikov 
> Date:   Thu Oct 9 15:29:32 2014 -0700
> 
> mm/balloon_compaction: add vmstat counters and kpageflags bit
> 
> Always mark pages with PageBalloon even if balloon compaction is
> disabled
> and expose this mark in /proc/kpageflags as KPF_BALLOON.
> 
> 
> So KPF_BALLOON was exposed when virtio-balloon pages were always marked
> with PG_balloon. So the documentation is actually wrong ("balloon page"
> vs. "balloon compaction page").
> 
> I have no idea who actually used that information. I suspect this was
> just some debugging aid.
> 
> > 
> >> To not break uapi I decided to not rename it but instead to add a new flag.
> > 
> > I've got a feeling that uapi was anyway changed for the BALLOON flag
> > meaning.
> 
> Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE
> 
> a) Some applications might no longer compile (I guess that's ok)
> b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
> should at least for virtio-balloon usage until now be fine - it is just
> more generic)

I do not think any compilation could break. If the semantics of the flag
are preserved then everything should work as expected.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages

2018-11-15 Thread Michal Hocko
On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
[...]
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index b0308a2c6000..01db1d13481a 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
>   BUG_ON(!PageHighMem(page));
>  
>   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
> - PageReserved(page))
> + PageReserved(page) || PageOffline(page))
>   return NULL;
>  
>   if (page_is_guard(page))
> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
>   if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>   return NULL;
>  
> + if (PageOffline(page))
> + return NULL;
> +
>   if (PageReserved(page)
>   && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>   return NULL;

Btw. now that you are touching this file could you also make
s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
offline pfn ranges in general. A separate patch for that of course.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 11:16:15, David Hildenbrand wrote:
> Let's use pfn_to_online_page() instead of pfn_to_page() when checking
> for saveable pages to not save/restore offline memory sections.
> 
> Cc: "Rafael J. Wysocki" 
> Cc: Pavel Machek 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Michal Hocko 
> Cc: "Michael S. Tsirkin" 
> Suggested-by: Michal Hocko 
> Signed-off-by: David Hildenbrand 

I have only a very vague understanding of this specific code but I do
not really see any real reason for checking offlined ranges. Also,
offline pfn ranges might have uninitialized struct pages, so making
any decisions based on the struct page is basically undefined behavior.

Acked-by: Michal Hocko 

> ---
>  kernel/power/snapshot.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 640b2034edd6..87e6dd57819f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
>   if (!pfn_valid(pfn))
>   return NULL;
>  
> - page = pfn_to_page(pfn);
> - if (page_zone(page) != zone)
> + page = pfn_to_online_page(pfn);
> + if (!page || page_zone(page) != zone)
>   return NULL;
>  
>   BUG_ON(!PageHighMem(page));
> @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
>   if (!pfn_valid(pfn))
>   return NULL;
>  
> - page = pfn_to_page(pfn);
> - if (page_zone(page) != zone)
> + page = pfn_to_online_page(pfn);
> + if (!page || page_zone(page) != zone)
>   return NULL;
>  
>   BUG_ON(PageHighMem(page));
> -- 
> 2.17.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] Drivers: hv: balloon: Fix the deadlock issue in the memory hot-add code

2014-12-08 Thread Michal Hocko
On Fri 05-12-14 16:41:38, K. Y. Srinivasan wrote:
> Andy Whitcroft  initially saw this deadlock. We
> have seen this as well. Here is the original description of the
> problem (and a potential solution) from Andy:
> 
> https://lkml.org/lkml/2014/3/14/451
> 
> Here is an excerpt from that mail:
> 
> "We are seeing machines lockup with what appears to be an ABBA
> deadlock in the memory hotplug system.  These are from the 3.13.6 based 
> Ubuntu kernels.
> The hv_balloon driver is adding memory using add_memory() which takes
> the hotplug lock

Do you mean mem_hotplug_begin?

> and then emits a udev event, and then attempts to
> lock the sysfs device.  In response to the udev event udev opens the
> sysfs device and locks it, then attempts to grab the hotplug lock to online 
> the memory.

Cannot we simply teach online_pages to fail with EBUSY while the memory
hotplug is under way? We shouldn't try to online something that is not
initialized yet, no? The memory hotplug lock is global, so we can get
false positives, but that should be easier to deal with than exporting
lock_device_hotplug and adding yet another lock dependency.
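
A rough sketch of that idea, purely for illustration (mem_hotplug_trylock(),
mem_hotplug_unlock() and do_online_pages() are hypothetical names standing in
for the global memory hotplug lock and the real onlining work):

	/* Illustrative only: bail out with EBUSY instead of blocking while a
	 * hot-add is still initializing the range. udev would then have to
	 * retry its "echo online > state" write. */
	static int online_pages_nonblocking(unsigned long pfn, unsigned long nr_pages)
	{
		int ret;

		if (!mem_hotplug_trylock())
			return -EBUSY;

		ret = do_online_pages(pfn, nr_pages);
		mem_hotplug_unlock();

		return ret;
	}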

> This seems to be inverted nesting in the two cases, leading to the hangs 
> below:
> 
> [  240.608612] INFO: task kworker/0:2:861 blocked for more than 120 seconds.
> [  240.608705] INFO: task systemd-udevd:1906 blocked for more than 120 
> seconds.
> 
> I note that the device hotplug locking allows complete retries (via
> ERESTARTSYS) and if we could detect this at the online stage it could
> be used to get us out.

I am not sure I understand this but it suggests EBUSY above?

> But before I go down this road I wanted to
> make sure I am reading this right.  Or indeed if the hv_balloon driver
> is just doing this wrong."
> 
> This patch is based on the suggestion from
> Yasuaki Ishimatsu 

This changelog doesn't explain much. And boy, this whole thing is so
convoluted. E.g. I have a hard time seeing why ACPI hotplug works
correctly. My trail got lost at the acpi_memory_device_add level, which is
a callback invoked while acpi_device_hotplug is holding lock_device_hotplug, but
then again the rest is hidden by callbacks. I cannot seem to find any
documentation which would explain all the locking here.

So why don't other callers of add_memory need the same treatment, and if
they do, why don't we take the lock at the add_memory level?
 
> Signed-off-by: K. Y. Srinivasan 

Nak to this without proper explanation and I really think that it should
be the onlining code which should deal with the parallel add_memory and
back off until the full initialization is done.

> ---
>  drivers/hv/hv_balloon.c |4 
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index afdb0d5..f525a62 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -649,8 +650,11 @@ static void hv_mem_hot_add(unsigned long start, unsigned 
> long size,
>  
>   release_region_mutex(false);
>   nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
> +
> + lock_device_hotplug();
>   ret = add_memory(nid, PFN_PHYS((start_pfn)),
>   (HA_CHUNK << PAGE_SHIFT));
> + unlock_device_hotplug();
>  
>   if (ret) {
>   pr_info("hot_add memory failed error is %d\n", ret);
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] Drivers: hv: balloon: Fix the deadlock issue in the memory hot-add code

2014-12-09 Thread Michal Hocko
On Tue 09-12-14 10:23:51, Yasuaki Ishimatsu wrote:
> (2014/12/09 0:04), Michal Hocko wrote:
> >On Fri 05-12-14 16:41:38, K. Y. Srinivasan wrote:
> >>Andy Whitcroft  initially saw this deadlock. We
> >>have seen this as well. Here is the original description of the
> >>problem (and a potential solution) from Andy:
> >>
> >>https://lkml.org/lkml/2014/3/14/451
> >>
> >>Here is an excerpt from that mail:
> >>
> >>"We are seeing machines lockup with what appears to be an ABBA
> >>deadlock in the memory hotplug system.  These are from the 3.13.6 based 
> >>Ubuntu kernels.
> >>The hv_balloon driver is adding memory using add_memory() which takes
> >>the hotplug lock
> >
> >Do you mean mem_hotplug_begin?
> >
> 
> >>and then emits a udev event, and then attempts to
> >>lock the sysfs device.  In response to the udev event udev opens the
> >>sysfs device and locks it, then attempts to grab the hotplug lock to online 
> >>the memory.
> >
> >Cannot we simply teach online_pages to fail with EBUSY when the memory
> >hotplug is on the way.  We shouldn't try to online something that is not
> >initialized yet, no?
> 
> Yes. Memory online shouldn't be tried before the memory is initialized. But then
> memory online should wait for the initialization rather than fail easily. Generally,
> the kernel sends the memory ONLINE event to userland via kobject_uevent() while the
> memory is being initialized, and udev onlines the memory after catching the event.
> So onlining memory via udev almost always runs while the memory is still being
> initialized.

I guess this is because the event is sent after a mem section is
initialized while the overall hotplug operation is still not completed.

> So if memory online fails easily, onlining memory via udev will almost
> always fail.

Doesn't udev retry the operation if it gets EBUSY or EAGAIN?

> >The memory hotplug log is global so we can get
> >false positives but that should be easier to deal with than exporting
> >lock_device_hotplug and adding yet another lock dependency.
> >
> >>This seems to be inverted nesting in the two cases, leading to the hangs 
> >>below:
> >>
> >>[  240.608612] INFO: task kworker/0:2:861 blocked for more than 120 seconds.
> >>[  240.608705] INFO: task systemd-udevd:1906 blocked for more than 120 
> >>seconds.
> >>
> >>I note that the device hotplug locking allows complete retries (via
> >>ERESTARTSYS) and if we could detect this at the online stage it could
> >>be used to get us out.
> >
> >I am not sure I understand this but it suggests EBUSY above?
> >
> >>But before I go down this road I wanted to
> >>make sure I am reading this right.  Or indeed if the hv_balloon driver
> >>is just doing this wrong."
> >>
> >>This patch is based on the suggestion from
> >>Yasuaki Ishimatsu 
> >
> >This changelog doesn't explain us much. And boy this whole thing is so
> >convoluted. E.g. I have hard time to see why ACPI hotplug is working
> >correctly. My trail got lost at acpi_memory_device_add level which is
> >a callback while acpi_device_hotplug is holding lock_device_hotplug but
> >then again the rest is hidden by callbacks.
> 
> Commit 0f1cfe9d0d06 (mm/hotplug: remove stop_machine() from 
> try_offline_node()) said:
> 
>   ---
> lock_device_hotplug() serializes hotplug & online/offline operations.  The
> lock is held in common sysfs online/offline interfaces and ACPI hotplug
> code paths.
> 
> And here are the code paths:
> 
> - CPU & Mem online/offline via sysfs online
> store_online()->lock_device_hotplug()
> 
> - Mem online via sysfs state:
> store_mem_state()->lock_device_hotplug()
> 
> - ACPI CPU & Mem hot-add:
> acpi_scan_bus_device_check()->lock_device_hotplug()
> 
> - ACPI CPU & Mem hot-delete:
> acpi_scan_hot_remove()->lock_device_hotplug()
>   ---
> 
> CPU & Memory online/offline/hotplug are serialized by lock_device_hotplug().

OK, so this patch aimed at complete node hotplug. I am not familiar enough
with the code to tell whether this is really needed, but it sounds
like overkill when we are interested only in memory hotplug. Why
would we need stop_machine or anything like it for memory that is guaranteed
not to be used at the time of both online and offline?
And again, why cannot we simply make the onlining fail, or try_lock and
retry internally, if the event consumer cannot cope with errors?
And even if that is not possible, then do not export lock_device_hotplug
but rather export memory hotplug functions which use it properly, so that
other consumers (the xen balloon seems to rely on add_memory as well) do not
need the same change as well.
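
One possible shape of such a wrapper, as an illustration only (add_memory()
and lock_device_hotplug()/unlock_device_hotplug() exist; add_memory_locked()
is a hypothetical name):

	/* Hypothetical helper: take the device hotplug lock internally so
	 * that callers like hv_balloon or the xen balloon never have to
	 * know about lock_device_hotplug() at all. */
	int add_memory_locked(int nid, u64 start, u64 size)
	{
		int ret;

		lock_device_hotplug();
		ret = add_memory(nid, start, size);
		unlock_device_hotplug();

		return ret;
	}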

> >I cannot seem to find any documentation which would explain all the
> >locking here.
> 
> Yes. I need the documentation.

Yes please! This code is incredibly hard to follow, and deducing all the
hidden requirements and dependencies is even harder.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] Drivers: hv: balloon: Fix the deadlock issue in the memory hot-add code

2014-12-09 Thread Michal Hocko
On Tue 09-12-14 19:25:50, Yasuaki Ishimatsu wrote:
> (2014/12/09 18:08), Michal Hocko wrote:
[...]
> >Doesn't udev retry the operation if it gets EBUSY or EAGAIN?
> 
> It depend on implementation of udev.rules. So we can retry online/offline
> operation in udev.rules.
[...]

# Memory hotadd request
SUBSYSTEM=="memory", ACTION=="add", 
DEVPATH=="/devices/system/memory/memory*[0-9]", TEST=="/sys$devpath/state", 
RUN+="/bin/sh -c 'echo online > /sys$devpath/state'"

OK, so this is not prepared for temporary failures and retries.

> >And again, why cannot we simply make the onlining fail or try_lock and
> >retry internally if the event consumer cannot cope with errors?
> 
> Did you mean the following Srinivasan's first patch looks good to you?
>   https://lkml.org/lkml/2014/12/2/662

Heh, I was just about to post this because I hadn't noticed the
previous patch yet. Yeah, something like that. Except that I would
expect EAGAIN or EBUSY rather than ERESTARTSYS, which should never leak
into userspace. And that would happen here AFAICS because signal_pending
will usually not be true.

So there are two options. Either make the udev rule more robust and
retry within the RUN section, or do the retry within online_pages (try_lock
and go into an interruptible sleep which gets woken up by the finished
add_memory()). The latter option is safer wrt. userspace because the
operation wouldn't fail unexpectedly.
Another option would be creating the sysfs file only after all the internal
initialization is done, and doing that outside of the memory hotplug lock.
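
The retry-within-online_pages variant could look roughly like this (a sketch
only; mem_hotplug_waitq, mem_hotplug_in_progress() and do_online_pages() are
made-up names):

	/* Sleep interruptibly until the in-flight add_memory() has finished
	 * its initialization and woken mem_hotplug_waitq, then online. */
	static int online_pages_wait(unsigned long pfn, unsigned long nr_pages)
	{
		int ret;

		ret = wait_event_interruptible(mem_hotplug_waitq,
					       !mem_hotplug_in_progress());
		if (ret)
			return ret;	/* interrupted by a signal */

		return do_online_pages(pfn, nr_pages);
	}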

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] Drivers: hv: balloon: Fix the deadlock issue in the memory hot-add code

2014-12-11 Thread Michal Hocko
On Thu 11-12-14 00:21:09, KY Srinivasan wrote:
> 
> 
> > -Original Message-
> > From: Michal Hocko [mailto:mho...@suse.cz]
> > Sent: Tuesday, December 9, 2014 2:56 AM
> > To: Yasuaki Ishimatsu
> > Cc: KY Srinivasan; gre...@linuxfoundation.org; linux-
> > ker...@vger.kernel.org; de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; linux...@kvack.org
> > Subject: Re: [PATCH 2/2] Drivers: hv: balloon: Fix the deadlock issue in the
> > memory hot-add code
> > 
> > On Tue 09-12-14 19:25:50, Yasuaki Ishimatsu wrote:
> > > (2014/12/09 18:08), Michal Hocko wrote:
> > [...]
> > > >Doesn't udev retry the operation if it gets EBUSY or EAGAIN?
> > >
> > > It depend on implementation of udev.rules. So we can retry
> > > online/offline operation in udev.rules.
> > [...]
> > 
> > # Memory hotadd request
> > SUBSYSTEM=="memory", ACTION=="add",
> > DEVPATH=="/devices/system/memory/memory*[0-9]",
> > TEST=="/sys$devpath/state", RUN+="/bin/sh -c 'echo online >
> > /sys$devpath/state'"
> > 
> > OK so this is not prepared for a temporary failures and retries.
> > 
> > > >And again, why cannot we simply make the onlining fail or try_lock
> > > >and retry internally if the event consumer cannot cope with errors?
> > >
> > > Did you mean the following Srinivasan's first patch looks good to you?
> > >   https://lkml.org/lkml/2014/12/2/662
> > 
> > Heh, I was just about to post this. Because I haven't noticed the previous
> > patch yet. Yeah, Something like that. Except that I would expect EAGAIN or
> > EBUSY rather than ERESTARTSYS which should never leak into userspace. And
> > that would happen here AFAICS because signal_pending will not be true
> > usually.
> Michal,
> 
> I agree that the fix to this problem must be outside the clients
> of add_memory() and that is the reason I had sent that patch:
> https://lkml.org/lkml/2014/12/2/662. Let me know if you want me to
> resend this patch with the correct return value.

Please think about the other suggested options as well.
 
> Regards,
> 
> K. Y
> > 
> > So there are two options. Either make the udev rule more robust and retry
> > within RUN section or do the retry withing online_pages (try_lock and go 
> > into
> > interruptible sleep which gets signaled by finished add_memory()). The later
> > option is safer wrt. the userspace because the operation wouldn't fail
> > unexpectedly.
> > Another option would be generating the sysfs file after all the internal
> > initialization is done and call it outside of the memory hotplug lock.
> > 
> > --
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] lowmemorykiller: Avoid excessive/redundant calling of LMK

2015-01-15 Thread Michal Hocko
On Mon 12-01-15 21:49:14, Chintan Pandya wrote:
> The global shrinker will invoke lowmem_shrink in a loop.
> The loop will be run (total_scan_pages/batch_size) times.
> The default batch_size will be 128 which will make
> shrinker invoking 100s of times. LMK does meaningful
> work only during first 2-3 times and then rest of the
> invocations are just CPU cycle waste. Fix that by returning
> to the shrinker with SHRINK_STOP when LMK doesn't find any
> more work to do. The deciding factor here is, no process
> found in the selected LMK bucket or memory conditions are
> sane.

The lowmemory killer is broken by design and this is one of the examples which
show why. It simply doesn't fit into the shrinker concept.

The count_objects callback simply lies and tells the core that all
the reclaimable LRU pages are scannable, and gives it this as the number
which the core uses for total_scan. The scan_objects callback then happily
ignores nr_to_reclaim and does its one-time job where it iterates over
_all_ tasks, picks up a victim and returns its rss as the return
value. This is of course just a subset of the LRU pages, so the core
continues looping until total_scan finally goes down to 0.

If this really has to be a shrinker, then shouldn't it evaluate the OOM
situation in the count callback and return non-zero only if OOM, while
the scan callback would kill and return nr_to_reclaim?
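
To make that shape concrete, something along these lines (a sketch, not a
proposal; lmk_under_pressure(), lmk_freeable_estimate() and
lmk_kill_one_victim() are hypothetical helpers):

	static unsigned long lmk_count(struct shrinker *s, struct shrink_control *sc)
	{
		/* Report work to the shrinker core only when a kill is actually
		 * warranted, instead of always claiming all LRU pages. */
		return lmk_under_pressure() ? lmk_freeable_estimate() : 0;
	}

	static unsigned long lmk_scan(struct shrinker *s, struct shrink_control *sc)
	{
		/* Kill a single victim and report the requested amount as done
		 * so the core does not keep looping over us. */
		if (!lmk_kill_one_victim())
			return SHRINK_STOP;

		return sc->nr_to_scan;
	}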

Or wouldn't it be even better to use vmpressure to wake
up a kernel module which would simply check the situation and kill
something?

Please do not put merely cosmetic changes on top of a broken concept; try
to think about a proper solution instead. That is what staging is for, AFAIU.

The code has been in this state for quite some time and I would really hate it
if it got merged just because it has been in staging for too long and it is
used out there.

> Signed-off-by: Chintan Pandya 
> ---
>  drivers/staging/android/lowmemorykiller.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> index b545d3d..5bf483f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -110,7 +110,7 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>   if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
>   lowmem_print(5, "lowmem_scan %lu, %x, return 0\n",
>sc->nr_to_scan, sc->gfp_mask);
> - return 0;
> + return SHRINK_STOP;
>   }
>  
>   selected_oom_score_adj = min_score_adj;
> @@ -163,6 +163,9 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>   set_tsk_thread_flag(selected, TIF_MEMDIE);
>   send_sig(SIGKILL, selected, 0);
>   rem += selected_tasksize;
> + } else {
> + rcu_read_unlock();
> + return SHRINK_STOP;
>   }
>  
>   lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Michal Hocko
On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
[...]
> 2) Convert all of the call sites for get_user_pages*(), to
> invoke put_user_page*(), instead of put_page(). This involves dozens of
> call sites, and will take some time.

How do we make sure this is the case and that it will remain the case in the
future? There must be some automagic to enforce/check that. It is simply
not manageable to do it every now and then, because then 3) will simply
never be safe.

Have you considered coccinelle or some other scripted way to do the
transition? I have no idea how to deal with future changes that would
break the balance though.
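
For the record, the conversion itself is mechanical at each call site; roughly
(illustrative snippet only, assuming pages/nr_pinned locals at a typical GUP
user):

	/* Before: pages obtained via get_user_pages() released one by one. */
	for (i = 0; i < nr_pinned; i++)
		put_page(pages[i]);

	/* After: the same release goes through the helpers this series adds. */
	put_user_pages(pages, nr_pinned);
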
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-07 Thread Michal Hocko
On Wed 07-08-19 10:37:26, Jan Kara wrote:
> On Fri 02-08-19 12:14:09, John Hubbard wrote:
> > On 8/2/19 7:52 AM, Jan Kara wrote:
> > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > > > > On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
> > > > > > [...]
> > > > > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > > > > invoke put_user_page*(), instead of put_page(). This involves 
> > > > > > > dozens of
> > > > > > > call sites, and will take some time.
> > > > > > 
> > > > > > How do we make sure this is the case and it will remain the case in 
> > > > > > the
> > > > > > future? There must be some automagic to enforce/check that. It is 
> > > > > > simply
> > > > > > not manageable to do it every now and then because then 3) will 
> > > > > > simply
> > > > > > be never safe.
> > > > > > 
> > > > > > Have you considered coccinele or some other scripted way to do the
> > > > > > transition? I have no idea how to deal with future changes that 
> > > > > > would
> > > > > > break the balance though.
> > 
> > Hi Michal,
> > 
> > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart
> > enough to know which put_page()'s to convert). However, there is a debug
> > option planned: a yet-to-be-posted commit [1] uses struct page extensions
> > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add
> > a redundant counter. That allows:
> > 
> > void __put_page(struct page *page)
> > {
> > ...
> > /* Someone called put_page() instead of put_user_page() */
> > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0);
> > 
> > > > > 
> > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to 
> > > > > create
> > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > > > > references got converted by using this wrapper instead of gup. The
> > > > > counterpart would then be more logically named as unpin_page() or 
> > > > > whatever
> > > > > instead of put_user_page().  Sure this is not completely foolproof 
> > > > > (you can
> > > > > create new callsite using vaddr_pin_pages() and then just drop refs 
> > > > > using
> > > > > put_page()) but I suppose it would be a high enough barrier for missed
> > > > > conversions... Thoughts?
> > 
> > The debug option above is still a bit simplistic in its implementation
> > (and maybe not taking full advantage of the data it has), but I think
> > it's preferable, because it monitors the "core" and WARNs.
> > 
> > Instead of the wrapper, I'm thinking: documentation and the passage of
> > time, plus the debug option (perhaps enhanced--probably once I post it
> > someone will notice opportunities), yes?
> 
> So I think your debug option and my suggested renaming serve a bit
> different purposes (and thus both make sense). If you do the renaming, you
> can just grep to see unconverted sites. Also when someone merges new GUP
> user (unaware of the new rules) while you switch GUP to use pins instead of
> ordinary references, you'll get compilation error in case of renaming
> instead of hard to debug refcount leak without the renaming. And such
> conflict is almost bound to happen given the size of GUP patch set... Also
> the renaming serves against the "coding inertia" - i.e., GUP is around for
> ages so people just use it without checking any documentation or comments.
> After switching how GUP works, what used to be correct isn't anymore so
> renaming the function serves as a warning that something has really
> changed.

Fully agreed!

> Your refcount debug patches are good to catch bugs in the conversions done
> but that requires you to be able to exercise the code path in the first
> place which may require particular HW or so, and you also have to enable
> the debug option which means you already aim at verifying the GUP
> references are treated properly.
> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging:wlan-ng: use memdup_user instead of kmalloc/copy_from_user

2021-02-15 Thread Michal Hocko
On Sat 13-02-21 15:05:28, Ivan Safonov wrote:
> memdup_user() is a shorter and safer equivalent
> of the kmalloc/copy_from_user pair.
> 
> Signed-off-by: Ivan Safonov 
> ---
>  drivers/staging/wlan-ng/p80211netdev.c | 28 --
>  1 file changed, 13 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/staging/wlan-ng/p80211netdev.c 
> b/drivers/staging/wlan-ng/p80211netdev.c
> index a15abb2c8f54..6f9666dc0277 100644
> --- a/drivers/staging/wlan-ng/p80211netdev.c
> +++ b/drivers/staging/wlan-ng/p80211netdev.c
> @@ -569,24 +569,22 @@ static int p80211knetdev_do_ioctl(struct net_device 
> *dev,
>   goto bail;
>   }
>  
> - /* Allocate a buf of size req->len */
> - msgbuf = kmalloc(req->len, GFP_KERNEL);
> - if (msgbuf) {
> - if (copy_from_user(msgbuf, (void __user *)req->data, req->len))
> - result = -EFAULT;
> - else
> - result = p80211req_dorequest(wlandev, msgbuf);
> + msgbuf = memdup_user(req->data, req->len);

The move to memdup_user is definitely a step in the right direction. What is
the range of req->len though? If it can be larger than PAGE_SIZE then
vmemdup_user would be a better alternative.
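
In case req->len can indeed exceed a page, the same conversion with
vmemdup_user() would look roughly like this (illustrative only; the buffer
must then be freed with kvfree() instead of kfree()):

	msgbuf = vmemdup_user((void __user *)req->data, req->len);
	if (IS_ERR(msgbuf)) {
		result = PTR_ERR(msgbuf);
		goto bail;
	}
	result = p80211req_dorequest(wlandev, msgbuf);
	kvfree(msgbuf);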

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Michal Hocko
On Mon 06-05-19 19:16:22, Sultan Alsawaf wrote:
> This is a complete low memory killer solution for Android that is small
> and simple. Processes are killed according to the priorities that
> Android gives them, so that the least important processes are always
> killed first. Processes are killed until memory deficits are satisfied,
> as observed from kswapd struggling to free up pages. Simple LMK stops
> killing processes when kswapd finally goes back to sleep.
> 
> The only tunables are the desired amount of memory to be freed per
> reclaim event and desired frequency of reclaim events. Simple LMK tries
> to free at least the desired amount of memory per reclaim and waits
> until all of its victims' memory is freed before proceeding to kill more
> processes.

Why do we need something like that in the kernel? I really do not like
the idea of having two OOM killer implementations in the kernel. As
already pointed out, newer kernels can use PSI and older kernels can live
with out-of-tree code to achieve what they need. I do not see why we
really need this code in the upstream kernel.
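
Just to illustrate the PSI route: on a kernel with CONFIG_PSI, a userspace
low-memory daemon can register a trigger and sleep until memory pressure
actually builds up, roughly like this (minimal sketch; the threshold and
window are arbitrary example values):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char trig[] = "some 150000 1000000"; /* 150ms stall per 1s window */
		struct pollfd pfd;
		int fd;

		fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
		if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0) {
			perror("psi trigger");
			return 1;
		}
		pfd.fd = fd;
		pfd.events = POLLPRI;

		while (poll(&pfd, 1, -1) > 0) {
			if (pfd.revents & POLLERR)
				break;
			if (pfd.revents & POLLPRI)
				printf("memory pressure event: pick a victim here\n");
		}
		close(fd);
		return 0;
	}
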
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-14 Thread Michal Hocko
On Mon 13-04-20 17:15:49, Waiman Long wrote:
> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.
> 
>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
> 
> Suggested-by: Joe Perches 
> Signed-off-by: Waiman Long 

Makes sense. I haven't checked all the conversions and will rely on the
script doing the right thing. The core MM part is correct.

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


[PATCH] staging, android: remove lowmemory killer from the tree

2017-02-22 Thread Michal Hocko
From: Michal Hocko 

The lowmemory killer has been sitting in the staging tree since 2008 without any
serious interest in fixing the issues brought up by the MM folks. The main
objection is that the implementation is basically broken by design:
- it hooks into the slab shrinker API, which is not suitable for this
  purpose. The lowmem_count implementation shows this nicely:
  there is no scaling based on the memory pressure and no
  feedback to the generic shrinker infrastructure.
  Moreover, lowmem_scan is called way too often for the heavy
  work it performs.
- it is not reclaim context aware - no NUMA and/or memcg
  awareness.

As the code stands right now it just adds maintenance overhead, because
core MM changes have to update lowmemorykiller.c as well. It also seems
that the alternative LMK implementation will live solely in userspace,
so this code seems to have no future. The staging tree is supposed
to be for code which needs to be put into shape before it can be merged,
which is obviously not the case here.

Signed-off-by: Michal Hocko 
---
 drivers/staging/android/Kconfig   |  10 --
 drivers/staging/android/Makefile  |   1 -
 drivers/staging/android/lowmemorykiller.c | 212 --
 include/linux/sched.h |   4 -
 4 files changed, 227 deletions(-)
 delete mode 100644 drivers/staging/android/lowmemorykiller.c

diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
index 6c00d6f765c6..71a50b99caff 100644
--- a/drivers/staging/android/Kconfig
+++ b/drivers/staging/android/Kconfig
@@ -14,16 +14,6 @@ config ASHMEM
  It is, in theory, a good memory allocator for low-memory devices,
  because it can discard shared memory units when under memory pressure.
 
-config ANDROID_LOW_MEMORY_KILLER
-   bool "Android Low Memory Killer"
-   ---help---
- Registers processes to be killed when low memory conditions, this is 
useful
- as there is no particular swap space on android.
-
- The registered process will kill according to the priorities in 
android init
- scripts (/init.rc), and it defines priority values with minimum free 
memory size
- for each priority.
-
 source "drivers/staging/android/ion/Kconfig"
 
 endif # if ANDROID
diff --git a/drivers/staging/android/Makefile b/drivers/staging/android/Makefile
index 7ed1be798909..7cf1564a49a5 100644
--- a/drivers/staging/android/Makefile
+++ b/drivers/staging/android/Makefile
@@ -3,4 +3,3 @@ ccflags-y += -I$(src)   # needed for trace 
events
 obj-y  += ion/
 
 obj-$(CONFIG_ASHMEM)   += ashmem.o
-obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller.o
diff --git a/drivers/staging/android/lowmemorykiller.c 
b/drivers/staging/android/lowmemorykiller.c
deleted file mode 100644
index ec3b66561412..
--- a/drivers/staging/android/lowmemorykiller.c
+++ /dev/null
@@ -1,212 +0,0 @@
-/* drivers/misc/lowmemorykiller.c
- *
- * The lowmemorykiller driver lets user-space specify a set of memory 
thresholds
- * where processes with a range of oom_score_adj values will get killed. 
Specify
- * the minimum oom_score_adj values in
- * /sys/module/lowmemorykiller/parameters/adj and the number of free pages in
- * /sys/module/lowmemorykiller/parameters/minfree. Both files take a comma
- * separated list of numbers in ascending order.
- *
- * For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
- * "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill
- * processes with a oom_score_adj value of 8 or higher when the free memory
- * drops below 4096 pages and kill processes with a oom_score_adj value of 0 or
- * higher when the free memory drops below 1024 pages.
- *
- * The driver considers memory used for caches to be free, but if a large
- * percentage of the cached memory is locked this can be very inaccurate
- * and processes may not get killed until the normal oom killer is triggered.
- *
- * Copyright (C) 2007-2008 Google, Inc.
- *
- * This software is licensed under the terms of the GNU General Public
- * License version 2, as published by the Free Software Foundation, and
- * may be copied, distributed, and modified under those terms.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-static u32 lowmem_debug_level = 1;
-static short lowmem_adj[6] = {
-   0,
-   1,
-   6,
-   12,
-};
-
-static int lowmem_adj_size = 4;
-static int lo

Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Thu 23-02-17 21:36:00, Martijn Coenen wrote:
> On Thu, Feb 23, 2017 at 9:24 PM, John Stultz  wrote:
[...]
> > This is reportedly because while the mempressure notifiers provide a
> > the signal to userspace, the work the deamon then has to do to look up
> > per process memory usage, in order to figure out who is best to kill
> > at that point was too costly and resulted in poor device performance.
> 
> In particular, mempressure requires memory cgroups to function, and we
> saw performance regressions due to the accounting done in mem cgroups.
> At the time we didn't have enough time left to solve this before the
> release, and we reverted back to kernel lmkd.

I would be more than interested to hear details. We used to have some
visible charge path performance footprint but this should be gone now.

[...]
> > It would be great however to get a discussion going here on what the
> > ulmkd needs from the kernel in order to efficiently determine who best
> > to kill, and how we might best implement that.
> 
> The two main issues I think we need to address are:
> 1) Getting the right granularity of events from the kernel; I once
> tried to submit a patch upstream to address this:
> https://lkml.org/lkml/2016/2/24/582

Not only that, the vmpressure implementation needs some serious
rethinking as well. The current one can hit critical events
unexpectedly. The calculation also doesn't consider slab reclaim
sensibly.

> 2) Find out where exactly the memory cgroup overhead is coming from,
> and how to reduce it or work around it to acceptable levels for
> Android. This was also on 3.10, and maybe this has long been fixed or
> improved in more recent kernel versions.

3e32cb2e0a12 ("mm: memcontrol: lockless page counters") has improved
the situation a lot, as all of the charging has been lockless since then (3.19).
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Thu 23-02-17 12:24:57, John Stultz wrote:
> On Wed, Feb 22, 2017 at 4:01 AM, Michal Hocko  wrote:
> > From: Michal Hocko 
> >
> > Lowmemory killer is sitting in the staging tree since 2008 without any
> > serious interest for fixing issues brought up by the MM folks. The main
> > objection is that the implementation is basically broken by design:
> > - it hooks into slab shrinker API which is not suitable for this
> >   purpose. lowmem_count implementation just shows this nicely.
> >   There is no scaling based on the memory pressure and no
> >   feedback to the generic shrinker infrastructure.
> >   Moreover lowmem_scan is called way too often for the heavy
> >   work it performs.
> > - it is not reclaim context aware - no NUMA and/or memcg
> >   awareness.
> >
> > As the code stands right now it just adds a maintenance overhead when
> > core MM changes have to update lowmemorykiller.c as well. It also seems
> > that the alternative LMK implementation will be solely in the userspace
> > so this code has no perspective it seems. The staging tree is supposed
> > to be for a code which needs to be put in shape before it can be merged
> > which is not the case here obviously.
> 
> So, just for context, Android does have a userland LMK daemon (using
> the mempressure notifiers) as you mentioned, but unfortunately I'm
> unaware of any devices that ship with that implementation.
> 
> This is reportedly because while the mempressure notifiers provide a
> the signal to userspace, the work the deamon then has to do to look up
> per process memory usage, in order to figure out who is best to kill
> at that point was too costly and resulted in poor device performance.

What was the expensive part?

> So for shipping Android devices, the LMK is still needed. However, its
> not critical for basic android development, as the system will
> function without it. Additionally I believe most vendors heavily
> customize the LMK in their vendor tree, so the value of having it in
> staging might be relatively low.

This is an even stronger reason to drop it from the tree. We do not want
to maintain code which is not actually used.
 
> It would be great however to get a discussion going here on what the
> ulmkd needs from the kernel in order to efficiently determine who best
> to kill, and how we might best implement that.

I would really like to see this happen and, to be honest, it should have
happened quite some time ago (around the time when the lmk was merged to
the staging tree).
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Fri 24-02-17 13:19:46, peter enderborg wrote:
> On 02/23/2017 09:36 PM, Martijn Coenen wrote:
> > On Thu, Feb 23, 2017 at 9:24 PM, John Stultz  wrote:
> >> So, just for context, Android does have a userland LMK daemon (using
> >> the mempressure notifiers) as you mentioned, but unfortunately I'm
> >> unaware of any devices that ship with that implementation.
> > I've previously worked on enabling userspace lmkd for a previous
> > release, but ran into some issues there (see below).
> >
> >> This is reportedly because while the mempressure notifiers provide a
> >> the signal to userspace, the work the deamon then has to do to look up
> >> per process memory usage, in order to figure out who is best to kill
> >> at that point was too costly and resulted in poor device performance.
> > In particular, mempressure requires memory cgroups to function, and we
> > saw performance regressions due to the accounting done in mem cgroups.
> > At the time we didn't have enough time left to solve this before the
> > release, and we reverted back to kernel lmkd.
> >
> >> So for shipping Android devices, the LMK is still needed. However, its
> >> not critical for basic android development, as the system will
> >> function without it.
> > It will function, but it most likely will perform horribly (as the
> > page cache will be trashed to such a level that the system will be
> > unusable).
> >
> >> Additionally I believe most vendors heavily
> >> customize the LMK in their vendor tree, so the value of having it in
> >> staging might be relatively low.
> >>
> >> It would be great however to get a discussion going here on what the
> >> ulmkd needs from the kernel in order to efficiently determine who best
> >> to kill, and how we might best implement that.
> > The two main issues I think we need to address are:
> > 1) Getting the right granularity of events from the kernel; I once
> > tried to submit a patch upstream to address this:
> > https://lkml.org/lkml/2016/2/24/582
> > 2) Find out where exactly the memory cgroup overhead is coming from,
> > and how to reduce it or work around it to acceptable levels for
> > Android. This was also on 3.10, and maybe this has long been fixed or
> > improved in more recent kernel versions.
> >
> > I don't have cycles to work on this now, but I'm happy to talk to
> > whoever picks this up on the Android side.
> I sent some patches that take a different approach. It still uses shrinkers,
> but it has a kernel part that does the kill part better than the old one,
> and it does it the android way. The future for this is to get it triggered
> by paths other than the slab shrinker. But we will not continue unless
> we get google-android to be part of it. Hocko objected heavily to
> the patches but seems not to see that we need something to
> do the job before we can disconnect from the shrinker.

Yeah, I strongly believe that the chosen approach is completely wrong.
Both in abusing the shrinker interface and abusing oom_score_adj as the
only criterion for the oom victim selection.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Fri 24-02-17 14:16:34, peter enderborg wrote:
> On 02/24/2017 01:28 PM, Michal Hocko wrote:
[...]
> > Yeah, I strongly believe that the chosen approach is completely wrong.
> > Both in abusing the shrinker interface and abusing oom_score_adj as the
> > only criterion for the oom victim selection.
> 
> No one is arguing that shrinker is not problematic. And would be great
> if it is removed from lmk.  The oom_score_adj is the way user-space
> tells the kernel what the user-space has as prio. And android is using
> that very much. It's a core part.

Is there any documentation which describes how this is done?

> I have never seen it be used on
> other linux system so what is the intended usage of oom_score_adj? Is
> this really abusing?

oom_score_adj is used to _adjust_ the calculated oom score. It is not a
criterion on its own, well, except for the extreme ends of the range,
which are defined to enforce or disallow selecting the task, respectively. The
global oom killer calculates the oom score as a function of the memory
consumption. Your patch simply ignores the memory consumption (and uses
pids to sort tasks with the same oom score, which is just mind boggling)
and that is what I call the abuse. The oom score calculation might
change in the future, of course, but all consumers of the oom_score_adj
really have to agree on the base which is adjusted by this tunable,
otherwise you can see a lot of unexpected behavior.
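
For reference, the base that is being adjusted looks roughly like this (a
simplified sketch of the idea behind mm/oom_kill.c, not the exact code):

	static long oom_badness_sketch(struct task_struct *p, unsigned long totalpages)
	{
		long points;

		/* The base score is the task's memory footprint ... */
		points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS);

		/* ... and oom_score_adj merely shifts that base, scaled to
		 * the amount of memory usable for the allocation. */
		points += (long)p->signal->oom_score_adj * totalpages / 1000;

		return points > 0 ? points : 0;
	}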

I would even argue that nobody outside of mm/oom_kill.c should really
have any business with this tunable. You can of course tweak the value
from userspace and help to choose a better oom victim this way but
that is it.

Anyway, I guess we are getting quite off-topic here.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Fri 24-02-17 15:42:49, peter enderborg wrote:
> On 02/24/2017 03:11 PM, Michal Hocko wrote:
> > On Fri 24-02-17 14:16:34, peter enderborg wrote:
> >> On 02/24/2017 01:28 PM, Michal Hocko wrote:
> > [...]
> >>> Yeah, I strongly believe that the chosen approach is completely wrong.
> >>> Both in abusing the shrinker interface and abusing oom_score_adj as the
> >>> only criterion for the oom victim selection.
> >> No one is arguing that shrinker is not problematic. And would be great
> >> if it is removed from lmk.  The oom_score_adj is the way user-space
> >> tells the kernel what the user-space has as prio. And android is using
> >> that very much. It's a core part.
> > Is there any documentation which describes how this is done?
> >
> >> I have never seen it be used on
> >> other linux system so what is the intended usage of oom_score_adj? Is
> >> this really abusing?
> > oom_score_adj is used to _adjust_ the calculated oom score. It is not a
> > criterion on its own, well, except for the extreme sides of the range
> > which are defined to enforce resp. disallow selecting the task. The
> > global oom killer calculates the oom score as a function of the memory
> > consumption. Your patch simply ignores the memory consumption (and uses
> > pids to sort tasks with the same oom score which is just mind boggling)
>
> How much it uses is of very little importance for android.

But it is relevant for the global oom killer which is the main consumer of
the oom_score_adj.

> The scores
> used are only for apps and their services. System related processes are not
> touched by the android lmk. The pid is only there as a unique key, to be
> able to look entries up quickly within an rbtree. One idea was to use task_pid
> to get a strict age of the process for a round robin, but I skipped that idea
> since it does not matter.

Pid will not tell you anything about the age. Pids do wrap around.

> > and that is what I call the abuse. The oom score calculation might
> > change in future, of course, but all consumers of the oom_score_adj
> > really have to agree on the base which is adjusted by this tunable
> > otherwise you can see a lot of unexpected behavior.
>
> Then can we just define a range that is strictly for user-space?

This is already well defined. The whole range OOM_SCORE_ADJ_{MIN,MAX}
is usable.

> > I would even argue that nobody outside of mm/oom_kill.c should really
> > have any business with this tunable.  You can of course tweak the value
> > from the userspace and help to chose a better oom victim this way but
> > that is it.
>
> Why only help? If userspace can give an exact order to the kernel, that
> must be a good thing; otherwise the kernel has to guess, and when
> can that be better?

Because userspace doesn't know who the best victim is in 99% of cases.
Android might be different, although I am a bit skeptical - especially
after hearing quite a few complaints about random applications being
killed... If you do believe that you know better then, by all means,
implement your custom user space LMK and choose the oom victim on a
different basis, but try to understand that the global OOM killer is the
last resort measure to make the system usable again. There is a good
reason why the kernel uses the current badness calculation. The previous
implementation, which considered the process age and other things, was just
too random to have understandable behavior.

In any case, playing nasty games with the oom killer tunables might and
will lead, well, to unexpected behavior.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-02-24 Thread Michal Hocko
On Fri 24-02-17 16:40:13, peter enderborg wrote:
> On 02/24/2017 04:03 PM, Michal Hocko wrote:
> > On Fri 24-02-17 15:42:49, peter enderborg wrote:
> >> On 02/24/2017 03:11 PM, Michal Hocko wrote:
> >>> On Fri 24-02-17 14:16:34, peter enderborg wrote:
> >>>> On 02/24/2017 01:28 PM, Michal Hocko wrote:
> >>> [...]
> >>>>> Yeah, I strongly believe that the chosen approach is completely wrong.
> >>>>> Both in abusing the shrinker interface and abusing oom_score_adj as the
> >>>>> only criterion for the oom victim selection.
> >>>> No one is arguing that shrinker is not problematic. And would be great
> >>>> if it is removed from lmk.  The oom_score_adj is the way user-space
> >>>> tells the kernel what the user-space has as prio. And android is using
> >>>> that very much. It's a core part.
> >>> Is there any documentation which describes how this is done?
> >>>
> >>>> I have never seen it be used on
> >>>> other linux system so what is the intended usage of oom_score_adj? Is
> >>>> this really abusing?
> >>> oom_score_adj is used to _adjust_ the calculated oom score. It is not a
> >>> criterion on its own, well, except for the extreme sides of the range
> >>> which are defined to enforce resp. disallow selecting the task. The
> >>> global oom killer calculates the oom score as a function of the memory
> >>> consumption. Your patch simply ignores the memory consumption (and uses
> >>> pids to sort tasks with the same oom score which is just mind boggling)
> >> How much it uses is of very little importance for android.
> > But it is relevant for the global oom killer which is the main consumer of
> > the oom_score_adj.
> >
> >> The score
> >> used are only for apps and their services. System related are not
> >> touched by android lmk. The pid is only to have a unique key to be
> >> able to have it fast within a rbtree.  One idea was to use task_pid to
> >> get a strict age of process to get a round robin but since it does not
> >> matter i skipped that idea since it does not matter.
> > Pid will not tell you anything about the age. Pids do wrap around.
> >
> >>> and that is what I call the abuse. The oom score calculation might
> >>> change in future, of course, but all consumers of the oom_score_adj
> >>> really have to agree on the base which is adjusted by this tunable
> >>> otherwise you can see a lot of unexpected behavior.
> >> Then can we just define a range that is strictly for user-space?
> > This is already well defined. The whole range OOM_SCORE_ADJ_{MIN,MAX}
> > is usable.
>
> So we use them in userspace and kernel space but where is the abuse then?

I believe I have already answered that.

> >>> I would even argue that nobody outside of mm/oom_kill.c should really
> >>> have any business with this tunable.  You can of course tweak the value
> >>> from the userspace and help to chose a better oom victim this way but
> >>> that is it.
> >> Why only help? If userspace can give an exact order to kernel that
> >> must be a good thing; other wise kernel have to guess and when
> >> can that be better? 
> > Because userspace doesn't know who is the best victim in 99% cases.
>
> If user-space does not tell the kernel what to do, it has to guess; android
> user-space does tell it, and maybe others should too.

I believe they would if this were so simple. This is not as easy to
answer as you might think. If you ask, everybody will consider their
task too important to be killed.

[...]

> > In any case playing nasty games with the oom killer tunables might and
> > will lead, well, to unexpected behavior.
> 
> I don't follow. If we only use values  OOM_SCORE_ADJ_{MIN,MAX} can
> we then be "safe"?

I didn't say that. I was just trying to say that you cannot give a
single knob two different semantics depending on who is using it. That
simply won't work. So if you believe that your LMK knows better, then try
to use something else for your heuristics.

Anyway, this is still offtopic I believe.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH 00/12] Ion cleanup in preparation for moving out of staging

2017-03-03 Thread Michal Hocko
On Thu 02-03-17 13:44:32, Laura Abbott wrote:
> Hi,
> 
> There's been some recent discussions[1] about Ion-like frameworks. There's
> apparently interest in just keeping Ion since it works reasonablly well.
> This series does what should be the final clean ups for it to possibly be
> moved out of staging.
> 
> This includes the following:
> - Some general clean up and removal of features that never got a lot of use
>   as far as I can tell.
> - Fixing up the caching. This is the series I proposed back in December[2]
>   but never heard any feedback on. It will certainly break existing
>   applications that rely on the implicit caching. I'd rather make an effort
>   to move to a model that isn't going directly against the establishement
>   though.
> - Fixing up the platform support. The devicetree approach was never well
>   recieved by DT maintainers. The proposal here is to think of Ion less as
>   specifying requirements and more of a framework for exposing memory to
>   userspace.
> - CMA allocations now happen without the need of a dummy device structure.
>   This fixes a bunch of the reasons why I attempted to add devicetree
>   support before.
> 
> I've had problems getting feedback in the past so if I don't hear any major
> objections I'm going to send out with the RFC dropped to be picked up.
> The only reason there isn't a patch to come out of staging is to discuss any
> other changes to the ABI people might want. Once this comes out of staging,
> I really don't want to mess with the ABI.

Could you recapitulate the concerns that prevented the code from being merged
normally rather than through the staging tree, and how they were
addressed?
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH 00/12] Ion cleanup in preparation for moving out of staging

2017-03-05 Thread Michal Hocko
On Fri 03-03-17 09:37:55, Laura Abbott wrote:
> On 03/03/2017 05:29 AM, Michal Hocko wrote:
> > On Thu 02-03-17 13:44:32, Laura Abbott wrote:
> >> Hi,
> >>
> >> There's been some recent discussions[1] about Ion-like frameworks. There's
> >> apparently interest in just keeping Ion since it works reasonablly well.
> >> This series does what should be the final clean ups for it to possibly be
> >> moved out of staging.
> >>
> >> This includes the following:
> >> - Some general clean up and removal of features that never got a lot of use
> >>   as far as I can tell.
> >> - Fixing up the caching. This is the series I proposed back in December[2]
> >>   but never heard any feedback on. It will certainly break existing
> >>   applications that rely on the implicit caching. I'd rather make an effort
> >>   to move to a model that isn't going directly against the establishement
> >>   though.
> >> - Fixing up the platform support. The devicetree approach was never well
> >>   recieved by DT maintainers. The proposal here is to think of Ion less as
> >>   specifying requirements and more of a framework for exposing memory to
> >>   userspace.
> >> - CMA allocations now happen without the need of a dummy device structure.
> >>   This fixes a bunch of the reasons why I attempted to add devicetree
> >>   support before.
> >>
> >> I've had problems getting feedback in the past so if I don't hear any major
> >> objections I'm going to send out with the RFC dropped to be picked up.
> >> The only reason there isn't a patch to come out of staging is to discuss 
> >> any
> >> other changes to the ABI people might want. Once this comes out of staging,
> >> I really don't want to mess with the ABI.
> > 
> > Could you recapitulate concerns preventing the code being merged
> > normally rather than through the staging tree and how they were
> > addressed?
> > 
> 
> Sorry, I'm really not understanding your question here, can you
> clarify?

There must have been a reason why this code ended up in the staging
tree, right? So my question is what those reasons were and how they have
been addressed in order to move the code out of the staging subtree.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH 00/12] Ion cleanup in preparation for moving out of staging

2017-03-06 Thread Michal Hocko
On Mon 06-03-17 11:40:41, Daniel Vetter wrote:
> On Mon, Mar 06, 2017 at 08:42:59AM +0100, Michal Hocko wrote:
> > On Fri 03-03-17 09:37:55, Laura Abbott wrote:
> > > On 03/03/2017 05:29 AM, Michal Hocko wrote:
> > > > On Thu 02-03-17 13:44:32, Laura Abbott wrote:
> > > >> Hi,
> > > >>
> > > >> There's been some recent discussions[1] about Ion-like frameworks. 
> > > >> There's
> > > >> apparently interest in just keeping Ion since it works reasonablly 
> > > >> well.
> > > >> This series does what should be the final clean ups for it to possibly 
> > > >> be
> > > >> moved out of staging.
> > > >>
> > > >> This includes the following:
> > > >> - Some general clean up and removal of features that never got a lot 
> > > >> of use
> > > >>   as far as I can tell.
> > > >> - Fixing up the caching. This is the series I proposed back in 
> > > >> December[2]
> > > >>   but never heard any feedback on. It will certainly break existing
> > > >>   applications that rely on the implicit caching. I'd rather make an 
> > > >> effort
> > > >>   to move to a model that isn't going directly against the 
> > > >> establishement
> > > >>   though.
> > > >> - Fixing up the platform support. The devicetree approach was never 
> > > >> well
> > > >>   recieved by DT maintainers. The proposal here is to think of Ion 
> > > >> less as
> > > >>   specifying requirements and more of a framework for exposing memory 
> > > >> to
> > > >>   userspace.
> > > >> - CMA allocations now happen without the need of a dummy device 
> > > >> structure.
> > > >>   This fixes a bunch of the reasons why I attempted to add devicetree
> > > >>   support before.
> > > >>
> > > >> I've had problems getting feedback in the past so if I don't hear any 
> > > >> major
> > > >> objections I'm going to send out with the RFC dropped to be picked up.
> > > >> The only reason there isn't a patch to come out of staging is to 
> > > >> discuss any
> > > >> other changes to the ABI people might want. Once this comes out of 
> > > >> staging,
> > > >> I really don't want to mess with the ABI.
> > > > 
> > > > Could you recapitulate concerns preventing the code being merged
> > > > normally rather than through the staging tree and how they were
> > > > addressed?
> > > > 
> > > 
> > > Sorry, I'm really not understanding your question here, can you
> > > clarify?
> > 
> > There must have been a reason why this code ended up in the staging
> > tree, right? So my question is what those reasons were and how they were
> > handled in order to move the code from the staging subtree.
> 
> No one gave a thing about android in upstream, so Greg KH just dumped it
> all into staging/android/. We've discussed ION a bunch of times, recorded
> anything we'd like to fix in staging/android/TODO, and Laura's patch
> series here addresses a big chunk of that.

Thanks for the TODO reference. I was looking exactly for something like
that in drivers/staging/android/ion/. Too bad I didn't look one directory
up.

Thanks for the clarification!

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-03-09 Thread Michal Hocko
Greg, do you see any obstacle to having this merged? The discussion so far
has shown that a) vendors are not using the code as is and b) there seems to
be agreement that something other than what we have in the kernel is really
needed.

On Wed 22-02-17 13:01:21, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Lowmemory killer has been sitting in the staging tree since 2008 without any
> serious interest in fixing the issues brought up by the MM folks. The main
> objection is that the implementation is basically broken by design:
>   - it hooks into slab shrinker API which is not suitable for this
> purpose. lowmem_count implementation just shows this nicely.
> There is no scaling based on the memory pressure and no
> feedback to the generic shrinker infrastructure.
> Moreover lowmem_scan is called way too often for the heavy
> work it performs.
>   - it is not reclaim context aware - no NUMA and/or memcg
> awareness.
> 
> As the code stands right now it just adds maintenance overhead whenever
> core MM changes have to update lowmemorykiller.c as well. It also seems
> that the alternative LMK implementation will live solely in userspace,
> so this code has no future. The staging tree is supposed
> to be for code which needs to be put in shape before it can be merged,
> which is obviously not the case here.
> 
> Signed-off-by: Michal Hocko 
> ---
>  drivers/staging/android/Kconfig   |  10 --
>  drivers/staging/android/Makefile  |   1 -
>  drivers/staging/android/lowmemorykiller.c | 212 
> --
>  include/linux/sched.h |   4 -
>  4 files changed, 227 deletions(-)
>  delete mode 100644 drivers/staging/android/lowmemorykiller.c
> 
> diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
> index 6c00d6f765c6..71a50b99caff 100644
> --- a/drivers/staging/android/Kconfig
> +++ b/drivers/staging/android/Kconfig
> @@ -14,16 +14,6 @@ config ASHMEM
> It is, in theory, a good memory allocator for low-memory devices,
> because it can discard shared memory units when under memory pressure.
>  
> -config ANDROID_LOW_MEMORY_KILLER
> - bool "Android Low Memory Killer"
> - ---help---
> -   Registers processes to be killed when low memory conditions, this is 
> useful
> -   as there is no particular swap space on android.
> -
> -   The registered process will kill according to the priorities in 
> android init
> -   scripts (/init.rc), and it defines priority values with minimum free 
> memory size
> -   for each priority.
> -
>  source "drivers/staging/android/ion/Kconfig"
>  
>  endif # if ANDROID
> diff --git a/drivers/staging/android/Makefile 
> b/drivers/staging/android/Makefile
> index 7ed1be798909..7cf1564a49a5 100644
> --- a/drivers/staging/android/Makefile
> +++ b/drivers/staging/android/Makefile
> @@ -3,4 +3,3 @@ ccflags-y += -I$(src) # needed for trace 
> events
>  obj-y+= ion/
>  
>  obj-$(CONFIG_ASHMEM) += ashmem.o
> -obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)  += lowmemorykiller.o
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> deleted file mode 100644
> index ec3b66561412..
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ /dev/null
> @@ -1,212 +0,0 @@
> -/* drivers/misc/lowmemorykiller.c
> - *
> - * The lowmemorykiller driver lets user-space specify a set of memory 
> thresholds
> - * where processes with a range of oom_score_adj values will get killed. 
> Specify
> - * the minimum oom_score_adj values in
> - * /sys/module/lowmemorykiller/parameters/adj and the number of free pages in
> - * /sys/module/lowmemorykiller/parameters/minfree. Both files take a comma
> - * separated list of numbers in ascending order.
> - *
> - * For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
> - * "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill
> - * processes with a oom_score_adj value of 8 or higher when the free memory
> - * drops below 4096 pages and kill processes with a oom_score_adj value of 0 
> or
> - * higher when the free memory drops below 1024 pages.
> - *
> - * The driver considers memory used for caches to be free, but if a large
> - * percentage of the cached memory is locked this can be very inaccurate
> - * and processes may not get killed until the normal oom killer is triggered.
> - *
> - * Copyright (C) 2007-2008 Google, Inc.
> - *
> - * This software is licensed under the terms of the GNU General Public
> - * 

Re: [PATCH] staging, android: remove lowmemory killer from the tree

2017-03-09 Thread Michal Hocko
On Thu 09-03-17 10:30:28, Greg KH wrote:
> On Thu, Mar 09, 2017 at 10:15:13AM +0100, Michal Hocko wrote:
> > Greg, do you see any obstacle to have this merged. The discussion so far
> > shown that a) vendors are not using the code as is b) there seems to be
> > an agreement that something else than we have in the kernel is really
> > needed.
> 
> Well, some vendors are using the code as-is, just not Sony...
> 
> I think the ideas that Tim wrote about are the best way forward for this.
> I'd prefer to leave the code in the kernel until that solution is
> integrated, as dropping support entirely isn't very nice.
> 
> But, given that almost no Android system is running mainline at the
> moment, I will queue this patch up for 4.12-rc1, which will give the
> Google people a bit more of an incentive to get their solution
> implemented and working and merged :)
> 
> Sound reasonable?

Sounds good to me.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH] binder: Don't require the binder lock when killed in binder_thread_read()

2017-04-04 Thread Michal Hocko
On Mon 03-04-17 11:09:32, Doug Anderson wrote:
[...]
> Maybe +Michal Hocko would have some opinions of which OOM Reaper
> patches would be good for picking into linux stable?

I am lacking context here, but please note that the OOM reaper patches
alone are not sufficient to make the OOM handling lockup free. There
were quite a few changes in the core OOM killer handling to make this
possible. This has been ongoing work throughout the last year and it will
be really non-trivial to backport to stable trees.

So the primary question is what are you trying to achieve?
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [RFC PATCH] binder: Don't require the binder lock when killed in binder_thread_read()

2017-04-04 Thread Michal Hocko
On Tue 04-04-17 08:10:03, Doug Anderson wrote:
> Hi,
> 
> On Tue, Apr 4, 2017 at 12:40 AM, Michal Hocko  wrote:
[...]
> > So the primary question is what are you trying to achieve?
> 
> Ideally it would be nice to bring back patches to make the OOM
> handling lockup free.  However, presuming that's not feasible then at
> least it would be nice if we could get some minimal set of patches
> that:
> 
> - Isn't too scary to backport.
> - Handles the low hanging fruit.
> - Is fairly self contained.

The oom_reaper as introduced by aac4536355496 and the 4 follow-up patches
should be a good, yet not too scary, place to start.

> 1. It would be nice if other users of linuxstable could benefit.

Well, most of the oom fixes are rather subtle and it took quite some time
to do them properly, so I am not really sure we want to take the risk. I
would rather handle such issues as they show up and backport specific fixes
for them.
 
> 2. I know the above patches are not as ideal as the work that has
> happened upstream, so of course I'd prefer to get the upstream
> solution.
> 
> 3. I always appreciate being closer to the upstream solution which
> means we get more people looking at the code and more people testing
> the code.

I can hardly help you more than to offer using the upstream code. I fully
realize that this is hard and I am facing similar issues with the enterprise
kernel @Suse, but so far I have seen OOMs behaving mostly OK except for
extreme cases, which usually do not happen often enough to justify the risk
of backporting non-trivial changes to the stable trees.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/3 staging-next] oom: Add notification for oom_score_adj

2017-02-09 Thread Michal Hocko
On Thu 09-02-17 14:21:49, peter enderborg wrote:
> This adds subscription for changes in oom_score_adj; this
> value is important to android systems.

Why? Who is user of this API?
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/3 staging-next] android: Collect statistics from lowmemorykiller

2017-02-09 Thread Michal Hocko
On Thu 09-02-17 14:21:45, peter enderborg wrote:
> This collects stats for shrinker calls and how much
> wasted work we do within the lowmemorykiller.

This doesn't explain why we need this information and who is going to
use it. Not to mention that it exports it in /proc, which is considered a
stable user API. This is a no-go, especially for something that is still
lingering in the staging tree without any actual effort to make it a
fully supported MM feature. I am actually strongly inclined to simply
drop lmk from the tree completely.

> Signed-off-by: Peter Enderborg 

Nacked-by: Michal Hocko 

> ---
>  drivers/staging/android/Kconfig | 11 
>  drivers/staging/android/Makefile|  1 +
>  drivers/staging/android/lowmemorykiller.c   |  9 ++-
>  drivers/staging/android/lowmemorykiller_stats.c | 85 
> +
>  drivers/staging/android/lowmemorykiller_stats.h | 29 +
>  5 files changed, 134 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/staging/android/lowmemorykiller_stats.c
>  create mode 100644 drivers/staging/android/lowmemorykiller_stats.h
> 
> diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
> index 6c00d6f..96e86c7 100644
> --- a/drivers/staging/android/Kconfig
> +++ b/drivers/staging/android/Kconfig
> @@ -24,6 +24,17 @@ config ANDROID_LOW_MEMORY_KILLER
>scripts (/init.rc), and it defines priority values with minimum free 
> memory size
>for each priority.
> 
> +config ANDROID_LOW_MEMORY_KILLER_STATS
> +bool "Android Low Memory Killer: collect statistics"
> +depends on ANDROID_LOW_MEMORY_KILLER
> +default n
> +help
> +  Create a file in /proc/lmkstats that includes
> +  collected statistics about kills, scans and counts
> +  and  interaction with the shrinker. Its content
> +  will be different depeding on lmk implementation used.
> +
> +
>  source "drivers/staging/android/ion/Kconfig"
> 
>  endif # if ANDROID
> diff --git a/drivers/staging/android/Makefile 
> b/drivers/staging/android/Makefile
> index 7ed1be7..d710eb2 100644
> --- a/drivers/staging/android/Makefile
> +++ b/drivers/staging/android/Makefile
> @@ -4,3 +4,4 @@ obj-y+= ion/
> 
>  obj-$(CONFIG_ASHMEM)+= ashmem.o
>  obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller.o
> +obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER_STATS)+= lowmemorykiller_stats.o
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> index ec3b665..15c1b38 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include "lowmemorykiller_stats.h"
> 
>  static u32 lowmem_debug_level = 1;
>  static short lowmem_adj[6] = {
> @@ -72,6 +73,7 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>struct shrink_control *sc)
>  {
> +lmk_inc_stats(LMK_COUNT);
>  return global_node_page_state(NR_ACTIVE_ANON) +
>  global_node_page_state(NR_ACTIVE_FILE) +
>  global_node_page_state(NR_INACTIVE_ANON) +
> @@ -95,6 +97,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct 
> shrink_control *sc)
>  global_node_page_state(NR_SHMEM) -
>  total_swapcache_pages();
> 
> +lmk_inc_stats(LMK_SCAN);
>  if (lowmem_adj_size < array_size)
>  array_size = lowmem_adj_size;
>  if (lowmem_minfree_size < array_size)
> @@ -134,6 +137,7 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>  if (task_lmk_waiting(p) &&
>  time_before_eq(jiffies, lowmem_deathpending_timeout)) {
>  task_unlock(p);
> +lmk_inc_stats(LMK_TIMEOUT);
>  rcu_read_unlock();
>  return 0;
>  }
> @@ -179,7 +183,9 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>   other_free * (long)(PAGE_SIZE / 1024));
>  lowmem_deathpending_timeout = jiffies + HZ;
>  rem += selected_tasksize;
> -}
> +lmk_inc_stats(LMK_KILL);
> +} else
> +lmk_inc_stats(LMK_WASTE);
> 
>  lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
>   sc->nr_to_scan, sc->gfp_mask, rem);
> @@ -196,6 +202,7 @@ static struct shrinker lowmem_shrinker = {
>  static int __init lowmem_init(void)
>  {
>  register_shrinker(&lowmem_shrinker);
> +init_procfs_lmk();
>  return 0;
> 

Re: [PATCH 3/3 staging-next] mm: Remove RCU and tasklocks from lmk

2017-02-09 Thread Michal Hocko
On Thu 09-02-17 14:21:52, peter enderborg wrote:
> Fundamental changes:
> 1 Does NOT take any RCU lock in shrinker functions.
> 2 It returns the same result for scan and count, so the shrinker
>   will know when it is pointless to call scan.
> 3 It does not lock any other process than the one that is
>   going to be killed.
> 
> Background.
> The low memory killer scans for processes that can be killed to free
> memory. This can be cpu consuming when there is a high demand for
> memory. This can be seen by analysing the kswapd0 task work.
> The stats function added in an earlier patch adds a counter for wasted work.
> 
> How it works.
> This patch creates a structure within the lowmemorykiller that caches
> the user-space processes that it might kill. It is done with a
> sorted rbtree so we can very easily find the candidate to be killed
> and know its properties such as memory usage; it is sorted by oom_score_adj
> to look up the task with the highest oom_score_adj. To achieve
> this it uses oom_score_notify events.
> 
> This patch also has another effect: we are now free to do other
> lowmemorykiller configurations. Without the patch there is a need
> for a tradeoff between freed memory and task and rcu locks. This
> is no longer a concern for tuning lmk. This patch is not intended
> to do any calculation changes other than that we use the cache to
> calculate the count values, and that makes kswapd0 shrink other
> areas.

I have to admit I really do not understand a big part of the above
paragraph, nor how this all is supposed to work. A quick glance over the
implementation: __lmk_task_insert seems to be called only from
the oom_score notifier context. If nobody updates the value then no task
will get into the tree. Or am I missing something really obvious here?
Moreover, oom scores tend to be mostly the same across tasks. That means
that your sorted tree will effectively be sorted by pid in most cases. I do
not see any sorting based on rss, nor any updates that would reflect changes
of rss. How can this possibly work?
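
To make the concern concrete: the in-tree lmk picks the task with the
highest oom_score_adj and, among those, the one with the largest rss. A
cache keyed on oom_score_adj alone cannot reproduce that. A comparator
preserving the current semantics would have to look roughly like the
hypothetical sketch below (struct lmk_node and its fields are made up for
illustration), and the rss component would need to be refreshed as tasks
allocate and free memory, which an oom_score_adj notifier alone never
triggers:

/* Illustration only -- not from the posted series. */
struct lmk_node {
	struct rb_node node;
	short oom_score_adj;
	unsigned long rss_pages;	/* would need refreshing as rss changes */
};

/* True if @a should be picked before @b, matching current lmk semantics. */
static bool lmk_node_before(const struct lmk_node *a, const struct lmk_node *b)
{
	if (a->oom_score_adj != b->oom_score_adj)
		return a->oom_score_adj > b->oom_score_adj;	/* worst adj first */
	return a->rss_pages > b->rss_pages;			/* then largest rss */
}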
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/3 staging-next] android: Collect statistics from lowmemorykiller

2017-02-09 Thread Michal Hocko
On Thu 09-02-17 21:07:37, Greg KH wrote:
> On Thu, Feb 09, 2017 at 08:26:41PM +0100, Michal Hocko wrote:
> > On Thu 09-02-17 14:21:45, peter enderborg wrote:
> > > This collects stats for shrinker calls and how much
> > > waste work we do within the lowmemorykiller.
> > 
> > This doesn't explain why do we need this information and who is going to
> > use it. Not to mention it exports it in /proc which is considered a
> > stable user API. This is a no-go, especially for something that is still
> > lingering in the staging tree without any actuall effort to make it
> > fully supported MM feature. I am actually strongly inclined to simply
> > drop lmk from the tree completely.
> 
> I thought that someone was working to get the "native" mm features to
> work properly with the lmk "feature".  Do you recall if that work got
> rejected, or just never happened?

Never happened AFAIR. There were some attempts to tune the current
behavior, which have been rejected for one reason or another, but I am not
really aware of anybody working on moving the code out of the staging area.

I already have this in the to-send queue, just didn't get to post it yet
because I planned to polish the reasoning some more.
---
>From 9f871b54a387e0a7cdfaf0fa256d1440093e427c Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 1 Feb 2017 10:37:30 +0100
Subject: [PATCH] staging, android: remove lowmemory killer from the tree

Lowmemory killer has been sitting in the staging tree since 2008 without any
serious interest in fixing the issues brought up by the MM folks. The main
objection is that the implementation is basically broken by design:
- it hooks into slab shrinker API which is not suitable for this
  purpose. lowmem_count implementation just shows this nicely.
  There is no scaling based on the memory pressure and no
  feedback to the generic shrinker infrastructure.
- it is not reclaim context aware - no NUMA and/or memcg
  awareness.

As the code stands right now it just adds maintenance overhead whenever
core MM changes have to update lowmemorykiller.c as well.

Signed-off-by: Michal Hocko 
---
 drivers/staging/android/Kconfig   |  10 --
 drivers/staging/android/Makefile  |   1 -
 drivers/staging/android/lowmemorykiller.c | 212 --
 include/linux/sched.h |   4 -
 4 files changed, 227 deletions(-)
 delete mode 100644 drivers/staging/android/lowmemorykiller.c

diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
index 6c00d6f765c6..71a50b99caff 100644
--- a/drivers/staging/android/Kconfig
+++ b/drivers/staging/android/Kconfig
@@ -14,16 +14,6 @@ config ASHMEM
  It is, in theory, a good memory allocator for low-memory devices,
  because it can discard shared memory units when under memory pressure.
 
-config ANDROID_LOW_MEMORY_KILLER
-   bool "Android Low Memory Killer"
-   ---help---
- Registers processes to be killed when low memory conditions, this is 
useful
- as there is no particular swap space on android.
-
- The registered process will kill according to the priorities in 
android init
- scripts (/init.rc), and it defines priority values with minimum free 
memory size
- for each priority.
-
 source "drivers/staging/android/ion/Kconfig"
 
 endif # if ANDROID
diff --git a/drivers/staging/android/Makefile b/drivers/staging/android/Makefile
index 7ed1be798909..7cf1564a49a5 100644
--- a/drivers/staging/android/Makefile
+++ b/drivers/staging/android/Makefile
@@ -3,4 +3,3 @@ ccflags-y += -I$(src)   # needed for trace 
events
 obj-y  += ion/
 
 obj-$(CONFIG_ASHMEM)   += ashmem.o
-obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller.o
diff --git a/drivers/staging/android/lowmemorykiller.c 
b/drivers/staging/android/lowmemorykiller.c
deleted file mode 100644
index ec3b66561412..
--- a/drivers/staging/android/lowmemorykiller.c
+++ /dev/null
@@ -1,212 +0,0 @@
-/* drivers/misc/lowmemorykiller.c
- *
- * The lowmemorykiller driver lets user-space specify a set of memory 
thresholds
- * where processes with a range of oom_score_adj values will get killed. 
Specify
- * the minimum oom_score_adj values in
- * /sys/module/lowmemorykiller/parameters/adj and the number of free pages in
- * /sys/module/lowmemorykiller/parameters/minfree. Both files take a comma
- * separated list of numbers in ascending order.
- *
- * For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
- * "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill
- * processes with a oom_score_adj value of 8 or higher when the free memory
- * drops below 4096 pages and kill processes with a oom_score_adj value of 0 or
- * highe

Re: [PATCH 3/3 staging-next] mm: Remove RCU and tasklocks from lmk

2017-02-09 Thread Michal Hocko
On Fri 10-02-17 08:39:11, peter enderborg wrote:
> On 02/09/2017 09:05 PM, Michal Hocko wrote:
> > On Thu 09-02-17 14:21:52, peter enderborg wrote:
> >> Fundamental changes:
> >> 1 Does NOT take any RCU lock in shrinker functions.
> >> 2 It returns same result for scan and counts, so  we dont need to do
> >>   shinker will know when it is pointless to call scan.
> >> 3 It does not lock any other process than the one that is
> >>   going to be killed.
> >>
> >> Background.
> >> The low memory killer scans for process that can be killed to free
> >> memory. This can be cpu consuming when there is a high demand for
> >> memory. This can be seen by analysing the kswapd0 task work.
> >> The stats function added in earler patch adds a counter for waste work.
> >>
> >> How it works.
> >> This patch create a structure within the lowmemory killer that caches
> >> the user spaces processes that it might kill. It is done with a
> >> sorted rbtree so we can very easy find the candidate to be killed,
> >> and knows its properies as memory usage and sorted by oom_score_adj
> >> to look up the task with highest oom_score_adj. To be able to achive
> >> this it uses oom_score_notify events.
> >>
> >> This patch also as a other effect, we are now free to do other
> >> lowmemorykiller configurations.  Without the patch there is a need
> >> for a tradeoff between freed memory and task and rcu locks. This
> >> is no longer a concern for tuning lmk. This patch is not intended
> >> to do any calculation changes other than we do use the cache for
> >> calculate the count values and that makes kswapd0 to shrink other
> >> areas.
> > I have to admit I really do not understand big part of the above
> > paragraph as well as how this all is supposed to work. A quick glance
> > over the implementation. __lmk_task_insert seems to be only called from
> > the oom_score notifier context. If nobody updates the value then no task
> > will get into the tree. Or am I missing something really obvious here?
> > Moreover oom scores tend to be mostly same for tasks. That means that
> > your sorted tree will become sorted by pids in most cases. I do not see
> > any sorting based on the rss nor any updates that would reflect updates
> > of rss. How can this possibly work?
> 
> The task tree nodes are created, updated or removed from the notifier when
> there is a relevant oom_score_adj change. If no one creates a task that
> is in the range for the lowmemorykiller the tree will be empty. This is
> an android feature so the score will be updated very often. It is
> part of the activity manager's job to prioritise tasks.  Why should we
> sort on rss?
 
Because the current lmk selects tasks based on rss, and the patch
doesn't explain why this is no longer suitable and why a different metric
should be used. If you also consider that the range of oom_score_adj is
quite small, collisions are frequent and you effectively end up sorting by
pid, which is more than questionable. I really fail to see how this can
work reasonably and why the change of the lmk semantics is even acceptable.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/3 staging-next] android: Collect statistics from lowmemorykiller

2017-02-10 Thread Michal Hocko
On Fri 10-02-17 08:51:49, Greg KH wrote:
> On Fri, Feb 10, 2017 at 08:21:32AM +0100, peter enderborg wrote:
[...]
> > Until then we have to polish this version as good as we can. It is
> > essential for android as it is now.
> 
> But if no one is willing to do the work to fix the reported issues, why
> should it remain?  Can you do the work here?  You're already working on
> fixing some of the issues in a differnt way, why not do the "real work"
> here instead for everyone to benifit from?

Well, to be honest, I do not think that the current code is easily
fixable. The approach was wrong from day 1. Abusing the slab shrinker API
is just a bad place to stick this logic. This all belongs to userspace.
For that we need a proper mm pressure notification, which is supposed to
be vmpressure, but that one also doesn't seem to work all that great. So
rather than trying to fix the unfixable I would strongly suggest focusing
on making vmpressure work reliably.
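
For contrast, this is roughly the contract a shrinker is expected to
honour (a minimal kernel-style sketch, not taken from any real driver;
demo_nr_cached, demo_lock and demo_evict_one() are made up): count_objects
must be a cheap estimate of how many objects could be reclaimed, and
scan_objects must actually free up to sc->nr_to_scan of them, returning the
number freed or SHRINK_STOP when no progress is possible. lowmem_count and
lowmem_scan satisfy neither expectation, which is why the lmk gets called
so often to so little effect.

static atomic_long_t demo_nr_cached;	/* hypothetical count of freeable objects */
static DEFINE_SPINLOCK(demo_lock);

static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	/* cheap, lock-free estimate of reclaimable objects */
	return atomic_long_read(&demo_nr_cached);
}

static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	unsigned long freed = 0;

	if (!spin_trylock(&demo_lock))
		return SHRINK_STOP;	/* cannot make progress right now */
	while (freed < sc->nr_to_scan && demo_evict_one())	/* made-up eviction helper */
		freed++;
	spin_unlock(&demo_lock);
	return freed;			/* objects actually reclaimed */
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
};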
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/3 staging-next] android: Collect statistics from lowmemorykiller

2017-02-10 Thread Michal Hocko
On Fri 10-02-17 10:05:34, peter enderborg wrote:
> On 02/10/2017 08:59 AM, Michal Hocko wrote:
[...]
> > The approach was wrong from the day 1. Abusing slab shrinkers
> > is just a bad place to stick this logic. This all belongs to the
> > userspace.
>
> But now it is there and we have to stick with it.

It is also adding maintenance cost. Just have a look at the git log and
check how many patches were merely the result of core changes which
needed a sync.

I seriously doubt that any of the android devices can run natively on
the vanilla kernel, so insisting on keeping this code in staging doesn't
make much sense to me.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 0/3 staging-next] android: Lowmemmorykiller task tree

2017-02-10 Thread Michal Hocko
[I have only now see this cover - it answers some of the questions I've
 had to specific patches. It would be really great if you could use git
 send-email to post patch series - it just does the right thing(tm)]

On Thu 09-02-17 14:21:40, peter enderborg wrote:
> Lowmemorykiller efficiency problem and a solution.
> 
> Lowmemorykiller in android has a severe efficiency problem. The basic
> problem is that the registered shrinker gets called very often without
>  anything actually happening.

Which is an inherent problem, because the lmk doesn't belong in the
shrinker infrastructure.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 0/3 staging-next] android: Lowmemmorykiller task tree

2017-02-20 Thread Michal Hocko
On Mon 13-02-17 16:42:42, peter enderborg wrote:
> On 02/10/2017 11:27 AM, Michal Hocko wrote:
> > [I have only now see this cover - it answers some of the questions I've
> >  had to specific patches. It would be really great if you could use git
> >  send-email to post patch series - it just does the right thing(tm)]
> >
> > On Thu 09-02-17 14:21:40, peter enderborg wrote:
> >> Lowmemorykiller efficiency problem and a solution.
> >>
> >> Lowmemorykiller in android has a severe efficiency problem. The basic
> >> problem is that the registered shrinker gets called very often without
> >>  anything actually happening.
> > Which is an inherent problem because lkml doesn't belong to shrinkers
> > infrastructure.
> 
> Not really what this patch addresses. I see it as a problem with the
> shrinker API that there is no slow-path-free scan/count split where this
> would belong. This patch addresses a specific problem where a lot of cpu
> is wasted in low memory conditions.

Let me repeat. The specific problem you are trying to solve is
_inherent_ to how the lmk is designed. Full stop.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-23 Thread Michal Hocko
On Thu 19-07-18 23:17:53, Baoquan He wrote:
> Kexec has been a formal feature in our distro, and customers owning
> those kinds of very large machines can make use of this feature to speed
> up the reboot process. On uefi machines, the kexec_file loading will
> search for a place to put the kernel under 4G, from top down. As we know,
> the 1st 4G space is the DMA32 ZONE; dma, pci mmcfg, bios etc. all try to
> consume it, so it is possible that no usable space can be found there for
> kernel/initrd. Searching top-down over the whole memory space, we don't
> have this worry.

I do not have the full context here, but let me note that you should be
careful when doing top-down reservations because you can easily get into
hotpluggable memory and break the hotremove use case. We even warn when
this is done. See memblock_find_in_range_node().
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Michal Hocko
On Wed 25-07-18 14:48:13, Baoquan He wrote:
> On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > Kexec has been a formal feature in our distro, and customers owning
> > > those kind of very large machine can make use of this feature to speed
> > > up the reboot process. On uefi machine, the kexec_file loading will
> > > search place to put kernel under 4G from top to down. As we know, the
> > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > > it. It may have possibility to not be able to find a usable space for
> > > kernel/initrd. From the top down of the whole memory space, we don't
> > > have this worry. 
> > 
> > I do not have the full context here but let me note that you should be
> > careful when doing top-down reservation because you can easily get into
> > hotplugable memory and break the hotremove usecase. We even warn when
> > this is done. See memblock_find_in_range_node
> 
> Kexec read kernel/initrd file into buffer, just search usable positions
> for them to do the later copying. You can see below struct kexec_segment, 
> for the old kexec_load, kernel/initrd are read into user space buffer,
> the @buf stores the user space buffer address, @mem stores the position
> where kernel/initrd will be put. In kernel, it calls
> kimage_load_normal_segment() to copy user space buffer to intermediate
> pages which are allocated with flag GFP_KERNEL. These intermediate pages
> are recorded as entries, later when user execute "kexec -e" to trigger
> kexec jumping, it will do the final copying from the intermediate pages
> to the real destination pages which @mem pointed. Because we can't touch
> the existed data in 1st kernel when do kexec kernel loading. With my
> understanding, GFP_KERNEL will make those intermediate pages be
> allocated inside immovable area, it won't impact hotplugging. But the
> @mem we searched in the whole system RAM might be lost along with
> hotplug. Hence we need do kexec kernel again when hotplug event is
> detected.

I am not sure I am following. If @mem is placed in a movable node then the
memory hotremove simply won't work, because we see reserved pages
and do not know what to do about them. They are not migratable.
Allocating intermediate pages from other nodes doesn't really help.

The memblock code warns exactly for that reason.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Michal Hocko
On Thu 26-07-18 21:09:04, Baoquan He wrote:
> On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > those kind of very large machine can make use of this feature to speed
> > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > search place to put kernel under 4G from top to down. As we know, the
> > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to 
> > > > > consume
> > > > > it. It may have possibility to not be able to find a usable space for
> > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > have this worry. 
> > > > 
> > > > I do not have the full context here but let me note that you should be
> > > > careful when doing top-down reservation because you can easily get into
> > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > this is done. See memblock_find_in_range_node
> > > 
> > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > for them to do the later copying. You can see below struct kexec_segment, 
> > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > the @buf stores the user space buffer address, @mem stores the position
> > > where kernel/initrd will be put. In kernel, it calls
> > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > kexec jumping, it will do the final copying from the intermediate pages
> > > to the real destination pages which @mem pointed. Because we can't touch
> > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > understanding, GFP_KERNEL will make those intermediate pages be
> > > allocated inside immovable area, it won't impact hotplugging. But the
> > > @mem we searched in the whole system RAM might be lost along with
> > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > detected.
> > 
> > I am not sure I am following. If @mem is placed at movable node then the
> > memory hotremove simply won't work, because we are seeing reserved pages
> > and do not know what to do about them. They are not migrateable.
> > Allocating intermediate pages from other nodes doesn't really help.
> 
> OK, I forgot the 2nd kernel which kexec jumps into. It won't impact hotremove
> in the 1st kernel; it does impact the kernel which kexec jumps into if the
> kernel is at the top of system RAM and the top RAM is in a movable node.

It will affect the 1st kernel (which does the memblock allocation
top-down) as well. For reasons mentioned above.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Michal Hocko
On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > Kexec has been a formal feature in our distro, and customers owning
> > > > > > those kind of very large machine can make use of this feature to 
> > > > > > speed
> > > > > > up the reboot process. On uefi machine, the kexec_file loading will
> > > > > > search place to put kernel under 4G from top to down. As we know, 
> > > > > > the
> > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to 
> > > > > > consume
> > > > > > it. It may have possibility to not be able to find a usable space 
> > > > > > for
> > > > > > kernel/initrd. From the top down of the whole memory space, we don't
> > > > > > have this worry. 
> > > > > 
> > > > > I do not have the full context here but let me note that you should be
> > > > > careful when doing top-down reservation because you can easily get 
> > > > > into
> > > > > hotplugable memory and break the hotremove usecase. We even warn when
> > > > > this is done. See memblock_find_in_range_node
> > > > 
> > > > Kexec read kernel/initrd file into buffer, just search usable positions
> > > > for them to do the later copying. You can see below struct 
> > > > kexec_segment, 
> > > > for the old kexec_load, kernel/initrd are read into user space buffer,
> > > > the @buf stores the user space buffer address, @mem stores the position
> > > > where kernel/initrd will be put. In kernel, it calls
> > > > kimage_load_normal_segment() to copy user space buffer to intermediate
> > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages
> > > > are recorded as entries, later when user execute "kexec -e" to trigger
> > > > kexec jumping, it will do the final copying from the intermediate pages
> > > > to the real destination pages which @mem pointed. Because we can't touch
> > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > allocated inside immovable area, it won't impact hotplugging. But the
> > > > @mem we searched in the whole system RAM might be lost along with
> > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > detected.
> > > 
> > > I am not sure I am following. If @mem is placed at movable node then the
> > > memory hotremove simply won't work, because we are seeing reserved pages
> > > and do not know what to do about them. They are not migrateable.
> > > Allocating intermediate pages from other nodes doesn't really help.
> > 
> > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove
> > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > is at top of system RAM and the top RAM is in movable node.
> 
> It will affect the 1st kernel (which does the memblock allocation
> top-down) as well. For reasons mentioned above.

And btw., in an ideal world we would restrict the top-down memblock
allocation to the non-movable nodes. But I do not think we have that
information ready at the time when the reservation is done.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-26 Thread Michal Hocko
On Thu 26-07-18 21:37:05, Baoquan He wrote:
> On 07/26/18 at 03:14pm, Michal Hocko wrote:
> > On Thu 26-07-18 15:12:42, Michal Hocko wrote:
> > > On Thu 26-07-18 21:09:04, Baoquan He wrote:
> > > > On 07/26/18 at 02:59pm, Michal Hocko wrote:
> > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote:
> > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote:
> > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > > > > > > > Kexec has been a formal feature in our distro, and customers 
> > > > > > > > owning
> > > > > > > > those kind of very large machine can make use of this feature 
> > > > > > > > to speed
> > > > > > > > up the reboot process. On uefi machine, the kexec_file loading 
> > > > > > > > will
> > > > > > > > search place to put kernel under 4G from top to down. As we 
> > > > > > > > know, the
> > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to 
> > > > > > > > consume
> > > > > > > > it. It may have possibility to not be able to find a usable 
> > > > > > > > space for
> > > > > > > > kernel/initrd. From the top down of the whole memory space, we 
> > > > > > > > don't
> > > > > > > > have this worry. 
> > > > > > > 
> > > > > > > I do not have the full context here but let me note that you 
> > > > > > > should be
> > > > > > > careful when doing top-down reservation because you can easily 
> > > > > > > get into
> > > > > > > hotplugable memory and break the hotremove usecase. We even warn 
> > > > > > > when
> > > > > > > this is done. See memblock_find_in_range_node
> > > > > > 
> > > > > > Kexec read kernel/initrd file into buffer, just search usable 
> > > > > > positions
> > > > > > for them to do the later copying. You can see below struct 
> > > > > > kexec_segment, 
> > > > > > for the old kexec_load, kernel/initrd are read into user space 
> > > > > > buffer,
> > > > > > the @buf stores the user space buffer address, @mem stores the 
> > > > > > position
> > > > > > where kernel/initrd will be put. In kernel, it calls
> > > > > > kimage_load_normal_segment() to copy user space buffer to 
> > > > > > intermediate
> > > > > > pages which are allocated with flag GFP_KERNEL. These intermediate 
> > > > > > pages
> > > > > > are recorded as entries, later when user execute "kexec -e" to 
> > > > > > trigger
> > > > > > kexec jumping, it will do the final copying from the intermediate 
> > > > > > pages
> > > > > > to the real destination pages which @mem pointed. Because we can't 
> > > > > > touch
> > > > > > the existed data in 1st kernel when do kexec kernel loading. With my
> > > > > > understanding, GFP_KERNEL will make those intermediate pages be
> > > > > > allocated inside immovable area, it won't impact hotplugging. But 
> > > > > > the
> > > > > > @mem we searched in the whole system RAM might be lost along with
> > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is
> > > > > > detected.
> > > > > 
> > > > > I am not sure I am following. If @mem is placed at movable node then 
> > > > > the
> > > > > memory hotremove simply won't work, because we are seeing reserved 
> > > > > pages
> > > > > and do not know what to do about them. They are not migrateable.
> > > > > Allocating intermediate pages from other nodes doesn't really help.
> > > > 
> > > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact 
> > > > hotremove
> > > > in 1st kernel, it does impact the kernel which kexec jump into if kernel
> > > > is at top of system RAM and the top RAM is in movable node.
> > > 
> > > It will affect the 1st kernel (which does the memblock allocation
> > > top-down) as well. For reasons mentioned above.
> > 
> > And btw. in the ideal world, we would restrict the memblock allocation
> > top-down from the non-movable nodes. But I do not think we have that
> > information ready at the time when the reservation is done.
> 
> Oh, you might be mixing up kexec loading with kdump kernel loading. For the
> kdump kernel, we need to reserve a memory region during bootup with the
> memblock allocator. For kexec loading, we just operate after the system is
> up, and do not need to reserve any memory region. The memory used to load
> them is handled in quite a different way.

I didn't know about that. I thought both use the same underlying
reservation mechanism. My bad and sorry for the noise.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] android,lowmemorykiller: Don't abuse TIF_MEMDIE.

2016-03-08 Thread Michal Hocko
On Tue 08-03-16 20:01:32, Tetsuo Handa wrote:
> Currently, lowmemorykiller (LMK) is using TIF_MEMDIE for two purposes.
> One is to remember processes killed by LMK, and the other is to
> accelerate termination of processes killed by LMK.
> 
> But since LMK is invoked as a memory shrinker function, there still
> should be some memory available. It is very likely that memory
> allocations by processes killed by LMK will succeed without using
> ALLOC_NO_WATERMARKS via TIF_MEMDIE. Even if their allocations cannot
> escape from memory allocation loop unless they use ALLOC_NO_WATERMARKS,
> lowmem_deathpending_timeout can guarantee forward progress by choosing
> next victim process.
> 
> On the other hand, mark_oom_victim() assumes that it must be called with
> oom_lock held and it must not be called after oom_killer_disable() was
> called. But LMK is calling it without holding oom_lock and checking
> oom_killer_disabled. It is possible that LMK calls mark_oom_victim()
> due to allocation requests by kernel threads after current thread
> returned from oom_killer_disabled(). This will break synchronization
> for PM/suspend.
> 
> This patch introduces a per-task_struct flag for remembering processes
> killed by LMK, and replaces TIF_MEMDIE with that flag. By applying this
> patch, the assumption made by mark_oom_victim() becomes true.

Thanks for looking into this. A separate flag sounds like a better way
to go (assuming that the flags are not a scarce resource, which doesn't
seem to be a problem here).
 
The LMK cannot kill the frozen tasks now but this shouldn't be a big deal
because this is not strictly necessary for the system to move on. We are
not OOM.

> Signed-off-by: Tetsuo Handa 
> Cc: Michal Hocko 
> Cc: Greg Kroah-Hartman 
> Cc: Arve Hjonnevag 
> Cc: Riley Andrews 

Acked-by: Michal Hocko 

> ---
>  drivers/staging/android/lowmemorykiller.c | 9 ++---
>  include/linux/sched.h | 4 
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> index 4b8a56c..a1dd798 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -129,7 +129,7 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>   if (!p)
>   continue;
>  
> - if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> + if (task_lmk_waiting(p) && p->mm &&
>   time_before_eq(jiffies, lowmem_deathpending_timeout)) {
>   task_unlock(p);
>   rcu_read_unlock();
> @@ -160,13 +160,8 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>   if (selected) {
>   task_lock(selected);
>   send_sig(SIGKILL, selected, 0);
> - /*
> -  * FIXME: lowmemorykiller shouldn't abuse global OOM killer
> -  * infrastructure. There is no real reason why the selected
> -  * task should have access to the memory reserves.
> -  */
>   if (selected->mm)
> - mark_oom_victim(selected);
> + task_set_lmk_waiting(selected);
>   task_unlock(selected);
>   lowmem_print(1, "Killing '%s' (%d), adj %hd,\n"
>"   to free %ldkB on behalf of '%s' (%d) 
> because\n"
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0b44fbc..de9ced9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2187,6 +2187,7 @@ static inline void memalloc_noio_restore(unsigned int 
> flags)
>  #define PFA_NO_NEW_PRIVS 0   /* May not gain new privileges. */
>  #define PFA_SPREAD_PAGE  1  /* Spread page cache over cpuset */
>  #define PFA_SPREAD_SLAB  2  /* Spread some slab caches over cpuset */
> +#define PFA_LMK_WAITING  3  /* Lowmemorykiller is waiting */
>  
>  
>  #define TASK_PFA_TEST(name, func)        \
> @@ -2210,6 +2211,9 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab)
>  TASK_PFA_SET(SPREAD_SLAB, spread_slab)
>  TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
>  
> +TASK_PFA_TEST(LMK_WAITING, lmk_waiting)
> +TASK_PFA_SET(LMK_WAITING, lmk_waiting)
> +
>  /*
>   * task->jobctl flags
>   */
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RESEND 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining

2015-03-06 Thread Michal Hocko
 then tries to do mem_hotplug_begin()
> while add_memory() is between mem_hotplug_begin() and mem_hotplug_done() and 
> it
> tries grabbing device lock.
> 
> To my understanding ACPI memory hotplug doesn't have the same issue as
> device_hotplug_lock is being grabbed when the ACPI device is added.
> 
> Solve the issue by grabbing device_hotplug_lock before doing add_memory(). If
> we do that, lock_device_hotplug_sysfs() will cause syscall retry which will
> eventually succeed. To support the change we need to export 
> lock_device_hotplug/
> unlock_device_hotplug. This approach can be completely wrong though.
> 
> Vitaly Kuznetsov (3):
>   driver core: export lock_device_hotplug/unlock_device_hotplug
>   memory_hotplug: add note about holding device_hotplug_lock and
> add_memory()
>   Drivers: hv: balloon: fix deadlock between memory adding and onlining
> 
>  drivers/base/core.c |  2 ++
>  drivers/hv/hv_balloon.c | 10 ++
>  mm/memory_hotplug.c |  6 +-
>  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> -- 
> 1.9.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ 
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH RESEND 0/3] memory_hotplug: hyperv: fix deadlock between memory adding and onlining

2015-03-12 Thread Michal Hocko
On Mon 09-03-15 09:40:43, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > [Sorry for the late response]
> >
> > This is basically the same code posted by KY Srinivasan posted late last
> > year (http://marc.info/?l=linux-mm&m=141782228129426&w=2). I had
> > objections to the implementation
> > http://marc.info/?l=linux-mm&m=141805109216700&w=2
> 
> Np, David's alternative fix is already in -mm:
> 
> https://lkml.org/lkml/2015/2/12/655

Thanks for the pointer. I had missed this one. This is definitely a
better approach than exporting the device lock and cluttering things up.
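
(For reference, the pattern the original series was after was essentially
the hypothetical sketch below inside the balloon driver's hot-add path,
taking device_hotplug_lock around add_memory() so the onlining path cannot
deadlock against it; nid/start/size are placeholders and this is not
David's fix:)

static int balloon_hot_add_sketch(int nid, u64 start, u64 size)
{
	int ret;

	lock_device_hotplug();
	ret = add_memory(nid, start, size);
	unlock_device_hotplug();
	return ret;
}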
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] android, lmk: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.

2015-08-27 Thread Michal Hocko
On Thu 27-08-15 00:34:47, Tetsuo Handa wrote:
> Reposting updated version as it turned out that we can call do_send_sig_info()
> with task_lock held. ;-) ( http://marc.info/?l=linux-mm&m=144059948628905&w=2 
> )
> 
> Tetsuo Handa wrote:
> > Should selected_tasksize be added to rem even when TIF_MEMDIE was not set?
> Commit e1099a69a624 "android, lmk: avoid setting TIF_MEMDIE if process
> has already exited" changed not to add selected_tasksize to rem. But I
> noticed that rem is initialized to 0 and there is only one addition
> (i.e. rem += selected_tasksize means rem = selected_tasksize) since
> commit 7dc19d5affd7 "drivers: convert shrinkers to new count/scan API".
> I don't know what values we should return, but this patch restores
> pre commit e1099a69a624 because omitting a call to mark_oom_victim()
> due to race will not prevent from reclaiming memory because we already
> sent SIGKILL.
> 
> >From b7075abd3a1e903e88f1755c68adc017d2125b0d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Thu, 27 Aug 2015 00:13:57 +0900
> Subject: [PATCH 2/2] android, lmk: Send SIGKILL before setting TIF_MEMDIE.
> 
> It was observed that setting TIF_MEMDIE before sending SIGKILL at
> oom_kill_process() allows memory reserves to be depleted by allocations
> which are not needed for terminating the OOM victim.
> 
> This patch reverts commit 6bc2b856bb7c ("staging: android: lowmemorykiller:
> set TIF_MEMDIE before send kill sig"), for oom_kill_process() was updated
> to send SIGKILL before setting TIF_MEMDIE.
> 
> Signed-off-by: Tetsuo Handa 

Acked-by: Michal Hocko 

> ---
>  drivers/staging/android/lowmemorykiller.c | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> index d5d25e4..af604cf 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -156,26 +156,21 @@ next:
>   }
>   if (selected) {
>   task_lock(selected);
> - if (!selected->mm) {
> - /* Already exited, cannot do mark_tsk_oom_victim() */
> - task_unlock(selected);
> - goto out;
> - }
> + lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
> +  selected->pid, selected->comm,
> +  selected_oom_score_adj, selected_tasksize);
> + send_sig(SIGKILL, selected, 0);
>   /*
>* FIXME: lowmemorykiller shouldn't abuse global OOM killer
>* infrastructure. There is no real reason why the selected
>* task should have access to the memory reserves.
>*/
> - mark_oom_victim(selected);
> - lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
> -  selected->pid, selected->comm,
> -  selected_oom_score_adj, selected_tasksize);
> + if (selected->mm)
> + mark_oom_victim(selected);
>   task_unlock(selected);
>   lowmem_deathpending_timeout = jiffies + HZ;
> - send_sig(SIGKILL, selected, 0);
>   rem += selected_tasksize;
>   }
> -out:
>   lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
>sc->nr_to_scan, sc->gfp_mask, rem);
>   rcu_read_unlock();
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 2/2] android, lmk: Reverse the order of setting TIF_MEMDIE and sending SIGKILL.

2015-09-04 Thread Michal Hocko
On Wed 02-09-15 18:06:20, Greg KH wrote:
[...]
> And if we aren't taking patch 1/2, I guess this one isn't needed either?

Unlike patch 1, which was pretty much cosmetic, this one fixes a real
issue.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-22 Thread Michal Hocko
On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> The current machinery for hot-adding memory requires having udev
> rules to bring the memory segments online. Export the necessary functionality
> to bring the memory segment online without involving user space code.

Why? Who is going to use it and for what purpose?
If you need to do it from the kernel, can't you use a usermode helper
thread?
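
(Hypothetical sketch only, to illustrate the usermode helper route rather
than exporting sysfs internals; online_block_via_helper() and its command
string are made up, and "memory90"-style names are just examples:)

static int online_block_via_helper(const char *memdev)
{
	char cmd[96];
	char *argv[] = { "/bin/sh", "-c", cmd, NULL };
	static char *envp[] = { "HOME=/",
				"PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

	snprintf(cmd, sizeof(cmd),
		 "echo online > /sys/devices/system/memory/%s/state", memdev);
	return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}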

Besides that, this is far from complete. memory_block_change_state
seems to depend on device_hotplug_lock and find_memory_block is
currently called with mem_sysfs_mutex held. None of them is exported
AFAICS.

> Signed-off-by: K. Y. Srinivasan 
> ---
>  drivers/base/memory.c  |5 -
>  include/linux/memory.h |4 
>  2 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 2b7813e..a8204ac 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct 
> memory_block *mem,
>   return ret;
>  }
>  
> -static int memory_block_change_state(struct memory_block *mem,
> +int memory_block_change_state(struct memory_block *mem,
>   unsigned long to_state, unsigned long from_state_req,
>   int online_type)
>  {
> @@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block 
> *mem,
>  
>   return ret;
>  }
> +EXPORT_SYMBOL(memory_block_change_state);
> +
>  static ssize_t
>  store_mem_state(struct device *dev,
>   struct device_attribute *attr, const char *buf, size_t count)
> @@ -540,6 +542,7 @@ struct memory_block *find_memory_block(struct mem_section 
> *section)
>  {
>   return find_memory_block_hinted(section, NULL);
>  }
> +EXPORT_SYMBOL(find_memory_block);
>  
>  static struct attribute *memory_memblk_attrs[] = {
>   &dev_attr_phys_index.attr,
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 85c31a8..8e3ede5 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -115,6 +115,10 @@ extern void unregister_memory_notifier(struct 
> notifier_block *nb);
>  extern int register_memory_isolate_notifier(struct notifier_block *nb);
>  extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
>  extern int register_new_memory(int, struct mem_section *);
> +extern int memory_block_change_state(struct memory_block *mem,
> + unsigned long to_state, unsigned long from_state_req,
> + int online_type);
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern int unregister_memory_section(struct mem_section *);
>  #endif
> -- 
> 1.7.4.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-23 Thread Michal Hocko
On Tue 23-07-13 14:52:36, KY Srinivasan wrote:
> 
> 
> > -Original Message-
> > From: Michal Hocko [mailto:mho...@suse.cz]
> > Sent: Monday, July 22, 2013 8:37 AM
> > To: KY Srinivasan
> > Cc: gre...@linuxfoundation.org; linux-ker...@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> > a...@firstfloor.org; a...@linux-foundation.org; linux...@kvack.org;
> > kamezawa.hiroy...@gmail.com; han...@cmpxchg.org; ying...@google.com;
> > jasow...@redhat.com; k...@vrfy.org
> > Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> > memory blocks
> > 
> > On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> > > The current machinery for hot-adding memory requires having udev
> > > rules to bring the memory segments online. Export the necessary
> > > functionality to bring the memory segment online without involving
> > > user space code.
> > 
> > Why? Who is going to use it and for what purpose?
> > If you need to do it from the kernel, can't you use a usermode helper
> > thread?
> > 
> > Besides that this is far from being complete. memory_block_change_state
> > seems to depend on device_hotplug_lock and find_memory_block is
> > currently called with mem_sysfs_mutex held. None of them is exported
> > AFAICS.
> 
> You are right; not all of the required symbols are exported (yet). Let
> me answer your other questions first:
>
> The Hyper-V balloon driver can use this functionality. I have
> prototyped the in-kernel "onlining" of hot added memory without
> requiring any help from user level code, and it performs significantly
> better than having user level code involved in the hot add process.

What does significantly better mean here?

> With this change, I am able to successfully hot-add and online the
> hot-added memory even under extreme memory pressure which is what you
> would want given that we are hot-adding memory to alleviate memory
> pressure. The current scheme of involving user level code to close
> this loop obviously does not perform well under high memory pressure.

Hmm, this is really unexpected. Why does high memory pressure matter
here? Userspace only needs to access a sysfs file and echo a simple
string into it. The rest is the same regardless of whether you do it
from userspace.

> I can, if you prefer, export all of the necessary functionality in one
> patch.

If this really turns out to be a valid use case then I would prefer
exporting a high-level function which would hide all the locking and
direct manipulation of memory blocks.
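
To make the suggestion concrete, here is a hedged sketch of such a
wrapper; the function name, the error handling and the exact lock/unlock
helpers are assumptions for illustration, not an existing exported API:

/* Sketch only: online one memory section while hiding the locking that
 * find_memory_block() and memory_block_change_state() rely on. */
int online_memory_section(struct mem_section *section)
{
        struct memory_block *mem;
        int ret;

        lock_device_hotplug();              /* assumed device_hotplug_lock wrapper */
        mem = find_memory_block(section);   /* takes a device reference */
        if (!mem) {
                unlock_device_hotplug();
                return -ENODEV;
        }
        ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE,
                                        ONLINE_KEEP);
        put_device(&mem->dev);
        unlock_device_hotplug();
        return ret;
}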

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] zcache: fix "zcache=" kernel parameter

2013-07-24 Thread Michal Hocko
I have posted a similar fix quite some time ago and I guess Greg should
already have it.

Greg?

On Fri 19-07-13 16:46:41, Piotr Sarna wrote:
> Commit 835f2f5 ("staging: zcache: enable zcache to be built/loaded as
> a module") introduced an incorrect handling of "zcache=" parameter.
> 
> Inside zcache_comp_init() function, zcache_comp_name variable is
> checked for being empty. If not empty, the above variable is tested
> for being compatible with the Crypto API. Unfortunately, after that the
> function ends unconditionally (by the "goto out" directive) and returns:
> - non-zero value if verification succeeded, wrongly indicating an error
> - zero value if verification failed, falsely informing that function
>   zcache_comp_init() ended properly.
> 
> A solution to this problem is as follows:
> 1. Move the "goto out" directive inside the "if (!ret)" statement
> 2. In case that crypto_has_comp() returned 0, change the value of ret
>    to non-zero before "goto out" to indicate an error.
> 
> Signed-off-by: Piotr Sarna 
> Acked-by: Bartlomiej Zolnierkiewicz 
> Signed-off-by: Kyungmin Park 
> ---
>  drivers/staging/zcache/zcache-main.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index dcceed2..81972fa 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c
> @@ -1811,10 +1811,12 @@ static int zcache_comp_init(void)
>  #else
>   if (*zcache_comp_name != '\0') {
>   ret = crypto_has_comp(zcache_comp_name, 0, 0);
> - if (!ret)
> + if (!ret) {
>   pr_info("zcache: %s not supported\n",
>   zcache_comp_name);
> - goto out;
> + ret = 1;
> + goto out;
> + }
>   }
>   if (!ret)
>   strcpy(zcache_comp_name, "lzo");
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] zcache: fix "zcache=" kernel parameter

2013-07-24 Thread Michal Hocko
On Wed 24-07-13 12:02:35, Michal Hocko wrote:
> I have posted a similar fix quite some time ago and I guess Greg should
> already have it.

For reference https://lkml.org/lkml/2013/6/26/410
 
> Greg?
> 
> On Fri 19-07-13 16:46:41, Piotr Sarna wrote:
> > Commit 835f2f5 ("staging: zcache: enable zcache to be built/loaded as
> > a module") introduced an incorrect handling of "zcache=" parameter.
> > 
> > Inside zcache_comp_init() function, zcache_comp_name variable is
> > checked for being empty. If not empty, the above variable is tested
> > for being compatible with the Crypto API. Unfortunately, after that the
> > function ends unconditionally (by the "goto out" directive) and returns:
> > - non-zero value if verification succeeded, wrongly indicating an error
> > - zero value if verification failed, falsely informing that function
> >   zcache_comp_init() ended properly.
> > 
> > A solution to this problem is as follows:
> > 1. Move the "goto out" directive inside the "if (!ret)" statement
> > 2. In case that crypto_has_comp() returned 0, change the value of ret
> >    to non-zero before "goto out" to indicate an error.
> > 
> > Signed-off-by: Piotr Sarna 
> > Acked-by: Bartlomiej Zolnierkiewicz 
> > Signed-off-by: Kyungmin Park 
> > ---
> >  drivers/staging/zcache/zcache-main.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> > index dcceed2..81972fa 100644
> > --- a/drivers/staging/zcache/zcache-main.c
> > +++ b/drivers/staging/zcache/zcache-main.c
> > @@ -1811,10 +1811,12 @@ static int zcache_comp_init(void)
> >  #else
> > if (*zcache_comp_name != '\0') {
> > ret = crypto_has_comp(zcache_comp_name, 0, 0);
> > -   if (!ret)
> > +   if (!ret) {
> > pr_info("zcache: %s not supported\n",
> > zcache_comp_name);
> > -   goto out;
> > +   ret = 1;
> > +   goto out;
> > +   }
> > }
> > if (!ret)
> >     strcpy(zcache_comp_name, "lzo");
> > -- 
> > 1.7.9.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: Re: [PATCH] zcache: fix "zcache=" kernel parameter

2013-07-24 Thread Michal Hocko
On Wed 24-07-13 12:32:32, Bartlomiej Zolnierkiewicz wrote:
> 
> Hi,
> 
> On Wednesday, July 24, 2013 12:04:41 PM Michal Hocko wrote:
> > On Wed 24-07-13 12:02:35, Michal Hocko wrote:
> > > I have posted a similar fix quite some time ago and I guess Greg should
> > > already have it.
> > 
> > For reference https://lkml.org/lkml/2013/6/26/410
> 
> There was a reply from Greg:
> 
>   https://lkml.org/lkml/2013/6/26/410

I guess you meant https://lkml.org/lkml/2013/6/26/569

> and it seems that Greg's request has not been fulfilled.

Bob Liu has resent the patch (I cannot find the email in the archive).

> Anyway Piotr's patch seems to be more complete:
> - it also fixes the case in which an invalid "zcache=" option is given
>   (returns an error value 1, while with your patch the code still
>   erroneously returns 0)
> - it prints information about compressor type being used when "zcache="
>   option is used (your patch skips it due to addition of "goto out_alloc")

I do not care which one will make it; I just pointed out that there is
another patch dealing with the same issue and it is a question of how
far that one went.
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

2013-07-25 Thread Michal Hocko
On Wed 24-07-13 14:02:32, Dave Hansen wrote:
> On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> > All I am saying is that I see two classes of failures: (a) our
> > inability to allocate memory to manage the memory that is being hot added
> > and (b) our inability to bring the hot added memory online within a
> > reasonable amount of time. I am not sure of the cause for (b) and I was
> > just speculating that this could be memory related. What is interesting
> > is that I have seen failures related to our inability to online the
> > memory after having succeeded in hot adding the memory.
> 
> I think we should hold off on this patch and others like it until we've
> been sufficiently able to explain how (b) happens.

Agreed.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH staging] zcache: fix "zcache=" kernel parameter

2013-07-25 Thread Michal Hocko
On Wed 24-07-13 19:04:14, Bartlomiej Zolnierkiewicz wrote:
> From: Piotr Sarna 
> Subject: [PATCH] zcache: fix "zcache=" kernel parameter
> 
> Commit 835f2f5 ("staging: zcache: enable zcache to be built/loaded as
> a module") introduced an incorrect handling of "zcache=" parameter.
> 
> Inside zcache_comp_init() function, zcache_comp_name variable is
> checked for being empty. If not empty, the above variable is tested
> for being compatible with the Crypto API. Unfortunately, after that the
> function ends unconditionally (by the "goto out" directive) and returns:
> - non-zero value if verification succeeded, wrongly indicating an error
> - zero value if verification failed, falsely informing that function
>   zcache_comp_init() ended properly.
> 
> A solution to this problem is as follows:
> 1. Move the "goto out" directive inside the "if (!ret)" statement
> 2. In case that crypto_has_comp() returned 0, change the value of ret
>    to non-zero before "goto out" to indicate an error.
> 
> This patch replaces an earlier one from Michal Hocko (based on report
> from Cristian Rodríguez):
> 
>   http://permalink.gmane.org/gmane.linux.kernel.mm/102484
> 
> It also addressed the same issue but didn't fix the zcache_comp_init()
> for the case when the compressor data passed to the "zcache=" option was invalid
> or unsupported.
> 
> Signed-off-by: Piotr Sarna 
> [bzolnier: updated patch description]
> Acked-by: Bartlomiej Zolnierkiewicz 
> Signed-off-by: Kyungmin Park 
> Acked-by: Konrad Rzeszutek Wilk 
> Cc: Cristian Rodríguez 
> Cc: Michal Hocko 
> Cc: Bob Liu 

Acked-by: Michal Hocko 

Thanks!

> ---
>  drivers/staging/zcache/zcache-main.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index dcceed2..81972fa 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c
> @@ -1811,10 +1811,12 @@ static int zcache_comp_init(void)
>  #else
>   if (*zcache_comp_name != '\0') {
>   ret = crypto_has_comp(zcache_comp_name, 0, 0);
> - if (!ret)
> + if (!ret) {
>   pr_info("zcache: %s not supported\n",
>       zcache_comp_name);
> - goto out;
> + ret = 1;
> + goto out;
> + }
>   }
>   if (!ret)
>   strcpy(zcache_comp_name, "lzo");
> -- 
> 1.7.9.5
> 
> 
> 

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] mm: add MM_SWAPENTS and page table when calculate tasksize in lowmem_scan()

2016-02-17 Thread Michal Hocko
On Tue 16-02-16 16:35:39, David Rientjes wrote:
> On Tue, 16 Feb 2016, Greg Kroah-Hartman wrote:
> 
> > On Tue, Feb 16, 2016 at 05:37:05PM +0800, Xishi Qiu wrote:
> > > Currently tasksize in lowmem_scan() only calculates rss, and does not
> > > include swap. But usually smart phones enable zram, so swap space
> > > actually uses RAM.
> > 
> > Yes, but does that matter for this type of calculation?  I need an ack
> > from the android team before I could ever take such a core change to
> > this code...
> > 
> 
> The calculation proposed in this patch is the same as the generic oom
> killer's: it is an estimate of the amount of memory that will be freed if
> the process is killed and can exit.  This is better than simply get_mm_rss().
> 
> However, I think we seriously need to re-consider the implementation of 
> the lowmem killer entirely.  It currently abuses the use of TIF_MEMDIE, 
> which should ideally only be set for one thread on the system since it 
> allows unbounded access to global memory reserves.
> 
> It also abuses the user-visible /proc/self/oom_score_adj tunable: this 
> tunable is used by the generic oom killer to bias or discount a proportion 
> of memory from a process's usage.  This is the only supported semantic of 
> the tunable.  The lowmem killer uses it as a strict prioritization, so any 
> process with oom_score_adj higher than another process is preferred for 
> kill, REGARDLESS of memory usage.  This leads to priority inversion, the 
> user is unable to always define the same process to be killed by the 
> generic oom killer and the lowmem killer.  This is what happens when a 
> tunable with a very clear and defined purpose is used for other reasons.
> 
> I'd seriously consider not accepting any additional hacks on top of this 
> code until the implementation is rewritten.

Fully agreed!
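
For reference, the oom-killer-style estimate being discussed is roughly
resident pages plus swap entries plus page-table pages; a simplified
sketch (the helper name is made up and the field names are only
approximate for that kernel era):

static unsigned long task_memory_footprint(struct mm_struct *mm)
{
        /* rss + swap entries + page-table pages as a rough size estimate */
        return get_mm_rss(mm) +
               get_mm_counter(mm, MM_SWAPENTS) +
               atomic_long_read(&mm->nr_ptes);
}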

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()

2020-06-15 Thread Michal Hocko
On Mon 15-06-20 21:57:16, Waiman Long wrote:
> The kzfree() function is normally used to clear some sensitive
> information, like encryption keys, in the buffer before freeing it back
> to the pool. Memset() is currently used for the buffer clearing. However,
> it is entirely possible that the compiler may choose to optimize away the
> memory clearing especially if LTO is being used. To make sure that this
> optimization will not happen, memzero_explicit(), which is introduced
> in v3.18, is now used in kzfree() to do the clearing.
> 
> Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Waiman Long 

Acked-by: Michal Hocko 

Although I am not really sure this is stable material. Is there any
known instance where the memset was optimized out of kzfree?
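
For background on why a plain memset() can disappear here, a minimal
illustration (the struct and helpers are hypothetical, not from the
patch): the compiler may see that the buffer is never read again before
kfree() and drop the store, while memzero_explicit() follows the memset
with a data barrier so the store cannot be elided.

struct secret { u8 key[32]; };                  /* hypothetical */

static void drop_secret(struct secret *s)
{
        memset(s, 0, sizeof(*s));               /* dead store, may be elided */
        kfree(s);
}

static void drop_secret_safely(struct secret *s)
{
        memzero_explicit(s, sizeof(*s));        /* memset + barrier_data(), kept */
        kfree(s);
}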

> ---
>  mm/slab_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 9e72ba224175..37d48a56431d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1726,7 +1726,7 @@ void kzfree(const void *p)
>   if (unlikely(ZERO_OR_NULL_PTR(mem)))
>   return;
>   ks = ksize(mem);
> - memset(mem, 0, ks);
> + memzero_explicit(mem, ks);
>   kfree(mem);
>  }
>  EXPORT_SYMBOL(kzfree);
> -- 
> 2.18.1
> 

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Tue 16-06-20 17:37:11, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 01:01:30AM +0200, David Sterba wrote:
> > On Tue, Jun 16, 2020 at 11:53:50AM -0700, Joe Perches wrote:
> > > On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote:
> > > >  v4:
> > > >   - Break out the memzero_explicit() change as suggested by Dan Carpenter
> > > >     so that it can be backported to stable.
> > > >   - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
> > > >     now as there can be a bit more discussion on what is best. It will
> > > >     be introduced as a separate patch later on after this one is merged.
> > > 
> > > To this larger audience and last week without reply:
> > > https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/
> > > 
> > > Are there _any_ fastpath uses of kfree or vfree?
> > 
> > I'd consider kfree performance critical for cases where it is called
> > under locks. If possible the kfree is moved outside of the critical
> > section, but we have rbtrees or lists that get deleted under locks and
> > restructuring the code to do eg. splice and free it outside of the lock
> > is not always possible.
> 
> Not just performance critical, but correctness critical.  Since kvfree()
> may allocate from the vmalloc allocator, I really think that kvfree()
> should assert that it's !in_atomic().  Otherwise we can get into trouble
> if we end up calling vfree() and have to take the mutex.

FWIW __vfree already checks for atomic context and puts the work into a
deferred context. So this should be safe. It should be used as a last
resort, though.
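
Roughly, the check being referred to looks like the sketch below (a
paraphrase of mm/vmalloc.c at the time, not a verbatim copy): when called
from interrupt context the actual unmap is pushed to a deferred worker
because it may sleep.

static void __vfree(const void *addr)
{
        if (unlikely(in_interrupt()))
                __vfree_deferred(addr);         /* llist + schedule_work() */
        else
                __vunmap(addr, 1);
}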

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 04:08:20, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 09:12:12AM +0200, Michal Hocko wrote:
> > On Tue 16-06-20 17:37:11, Matthew Wilcox wrote:
> > > Not just performance critical, but correctness critical.  Since kvfree()
> > > may allocate from the vmalloc allocator, I really think that kvfree()
> > > should assert that it's !in_atomic().  Otherwise we can get into trouble
> > > if we end up calling vfree() and have to take the mutex.
> > 
> > FWIW __vfree already checks for atomic context and put the work into a
> > deferred context. So this should be safe. It should be used as a last
> > resort, though.
> 
> Actually, it only checks for in_interrupt().

You are right. I have misremembered. You have made me look (thanks) ...

> If you call vfree() under
> a spinlock, you're in trouble.  in_atomic() only knows if we hold a
> spinlock for CONFIG_PREEMPT, so it's not safe to check for in_atomic()
> in __vfree().  So we need the warning in order that preempt people can
> tell those without that there is a bug here.

... Unless I am missing something, in_interrupt() depends on preempt_count()
as well, so neither of the two is reliable without PREEMPT_COUNT configured.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 05:23:21, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 01:31:57PM +0200, Michal Hocko wrote:
> > On Wed 17-06-20 04:08:20, Matthew Wilcox wrote:
> > > If you call vfree() under
> > > a spinlock, you're in trouble.  in_atomic() only knows if we hold a
> > > spinlock for CONFIG_PREEMPT, so it's not safe to check for in_atomic()
> > > in __vfree().  So we need the warning in order that preempt people can
> > > tell those without that there is a bug here.
> > 
> > ... Unless I am missing something in_interrupt depends on preempt_count() as
> > well so neither of the two is reliable without PREEMPT_COUNT configured.
> 
> preempt_count() always tracks whether we're in interrupt context,
> regardless of CONFIG_PREEMPT.  The difference is that CONFIG_PREEMPT
> will track spinlock acquisitions as well.

Right you are! Thanks for the clarification. I find the situation
around preempt_count quite confusing TBH. Looking at existing users
of in_atomic() (e.g. a random one, zd_usb_iowrite16v_async, which checks
in_atomic() and then does a GFP_KERNEL allocation which would obviously be
broken on !PREEMPT if the function can be called from an atomic
context), I am wondering whether it would make sense to track atomic
context also for !PREEMPT. This check is just terribly error prone.
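
A sketch of the error-prone pattern being referred to (the function is
made up for illustration): without PREEMPT_COUNT, holding a spinlock does
not show up in preempt_count(), so in_atomic() can return false and the
sleeping allocation below can run while a spinlock is actually held.

static void *grab_buffer(size_t size)
{
        if (in_atomic())                        /* unreliable on !PREEMPT_COUNT */
                return kmalloc(size, GFP_ATOMIC);
        return kmalloc(size, GFP_KERNEL);       /* may sleep */
}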

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: possible deadlock in shmem_fallocate (4)

2020-07-14 Thread Michal Hocko
On Tue 14-07-20 13:32:05, Hillf Danton wrote:
> 
> On Mon, 13 Jul 2020 20:41:11 -0700 Eric Biggers wrote:
> > On Tue, Jul 14, 2020 at 11:32:52AM +0800, Hillf Danton wrote:
> > > 
> > > Add FALLOC_FL_NOBLOCK and on the shmem side try to lock inode upon the
> > > new flag. And the overall upside is to keep the current gfp either in
> > > the khugepaged context or not.
> > > 
> > > --- a/include/uapi/linux/falloc.h
> > > +++ b/include/uapi/linux/falloc.h
> > > @@ -77,4 +77,6 @@
> > >   */
> > >  #define FALLOC_FL_UNSHARE_RANGE  0x40
> > >  
> > > +#define FALLOC_FL_NOBLOCK  0x80
> > > +
> > 
> > You can't add a new UAPI flag to fix a kernel-internal problem like this.
> 
> Sounds fair, see below.
> 
> What the report indicates is a missing PF_MEMALLOC_NOFS; it is
> checked on the ashmem side and added as an exception before going
> into the filesystem. On the shmem side, no more than a best effort
> is paid to the intended exception.
> 
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -437,6 +437,7 @@ static unsigned long
>  ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   unsigned long freed = 0;
> + bool nofs;
>  
>   /* We might recurse into filesystem code, so bail out if necessary */
>   if (!(sc->gfp_mask & __GFP_FS))
> @@ -445,6 +446,11 @@ ashmem_shrink_scan(struct shrinker *shri
>   if (!mutex_trylock(&ashmem_mutex))
>   return -1;
>  
> + /* enter filesystem with caution: nonblock on locking */
> + nofs = current->flags & PF_MEMALLOC_NOFS;
> + if (!nofs)
> + current->flags |= PF_MEMALLOC_NOFS;
> +
>   while (!list_empty(&ashmem_lru_list)) {
>   struct ashmem_range *range =
>   list_first_entry(&ashmem_lru_list, typeof(*range), lru);

I do not think this is an appropriate fix. First of all is this a real
deadlock or a lockdep false positive? Is it possible that ashmem just
needs to properly annotate its shmem inodes? Or is it possible that
the internal backing shmem file is visible to the userspace so the write
path would be possible?

If this is a real problem then the proper fix would be to set the internal
shmem mapping's gfp_mask to drop __GFP_FS.
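
A minimal sketch of that direction, assuming the file in question is the
internal shmem backing file ashmem sets up (the helper name is made up;
the real change would live wherever ashmem creates that file):

/* e.g. after something like: file = shmem_file_setup(name, size, vmflags); */
static void ashmem_drop_gfp_fs(struct file *file)
{
        struct address_space *mapping = file->f_mapping;

        mapping_set_gfp_mask(mapping,
                             mapping_gfp_mask(mapping) & ~__GFP_FS);
}
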
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: possible deadlock in shmem_fallocate (4)

2020-07-14 Thread Michal Hocko
On Tue 14-07-20 22:08:59, Hillf Danton wrote:
> 
> On Tue, 14 Jul 2020 10:26:29 +0200 Michal Hocko wrote:
> > On Tue 14-07-20 13:32:05, Hillf Danton wrote:
> > > 
> > > On Mon, 13 Jul 2020 20:41:11 -0700 Eric Biggers wrote:
> > > > On Tue, Jul 14, 2020 at 11:32:52AM +0800, Hillf Danton wrote:
> > > > > 
> > > > > Add FALLOC_FL_NOBLOCK and on the shmem side try to lock inode upon the
> > > > > new flag. And the overall upside is to keep the current gfp either in
> > > > > the khugepaged context or not.
> > > > > 
> > > > > --- a/include/uapi/linux/falloc.h
> > > > > +++ b/include/uapi/linux/falloc.h
> > > > > @@ -77,4 +77,6 @@
> > > > >   */
> > > > >  #define FALLOC_FL_UNSHARE_RANGE  0x40
> > > > >  
> > > > > +#define FALLOC_FL_NOBLOCK  0x80
> > > > > +
> > > > 
> > > > You can't add a new UAPI flag to fix a kernel-internal problem like 
> > > > this.
> > > 
> > > Sounds fair, see below.
> > > 
> > > What the report indicates is a missing PF_MEMALLOC_NOFS; it is
> > > checked on the ashmem side and added as an exception before going
> > > into the filesystem. On the shmem side, no more than a best effort
> > > is paid to the intended exception.
> > > 
> > > --- a/drivers/staging/android/ashmem.c
> > > +++ b/drivers/staging/android/ashmem.c
> > > @@ -437,6 +437,7 @@ static unsigned long
> > >  ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > >  {
> > >   unsigned long freed = 0;
> > > + bool nofs;
> > >  
> > >   /* We might recurse into filesystem code, so bail out if necessary */
> > >   if (!(sc->gfp_mask & __GFP_FS))
> > > @@ -445,6 +446,11 @@ ashmem_shrink_scan(struct shrinker *shri
> > >   if (!mutex_trylock(&ashmem_mutex))
> > >   return -1;
> > >  
> > > + /* enter filesystem with caution: nonblock on locking */
> > > + nofs = current->flags & PF_MEMALLOC_NOFS;
> > > + if (!nofs)
> > > + current->flags |= PF_MEMALLOC_NOFS;
> > > +
> > >   while (!list_empty(&ashmem_lru_list)) {
> > >   struct ashmem_range *range =
> > >   list_first_entry(&ashmem_lru_list, typeof(*range), lru);
> > 
> > I do not think this is an appropriate fix. First of all is this a real
> > deadlock or a lockdep false positive? Is it possible that ashmem just
> 
> The warning matters and we can do something to quiesce it.

The underlying issue should be fixed rather than _something_ done to
silence it.
 
> > needs to properly annotate its shmem inodes? Or is it possible that
> > the internal backing shmem file is visible to the userspace so the write
> > path would be possible?
> > 
> > If this is a real problem then the proper fix would be to set the internal
> > shmem mapping's gfp_mask to drop __GFP_FS.
> 
> Thanks for the tip, see below.
> 
> Can you expand a bit on how it helps direct reclaimers like khugepaged
> in the syzbot report wrt deadlock?

I do not understand your question.

> TBH I have a difficult time following
> up after staring at the chart below for quite a while.

Yes, lockdep reports are quite hard to follow and they tend to confuse
the hell out of me. But this one says that there is a reclaim dependency
between the shmem inode lock and the reclaim context.

> Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
> 
>   lock(fs_reclaim);
>                                lock(&sb->s_type->i_mutex_key#15);
>                                lock(fs_reclaim);
>   lock(&sb->s_type->i_mutex_key#15);

Please refrain from proposing fixes until the actual problem is
understood. I suspect that this might be just a false positive because
lockdep cannot tell the backing shmem, which is internal to ashmem(?),
apart from any general shmem. But somebody really familiar with the
ashmem code should have a look, I believe.

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: possible deadlock in shmem_fallocate (4)

2020-07-14 Thread Michal Hocko
On Tue 14-07-20 10:32:20, Suren Baghdasaryan wrote:
[...]
> I'm not sure how we can annotate the fact that the inode_lock in
> generic_file_write_iter and in shmem_fallocate always operate on
> different inodes. Ideas?

I believe that the standard way is to use lockdep_set_class on your
backing inode. But asking lockdep experts would be better than relying
on my vague recollection.
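
A hedged sketch of that annotation, assuming the inode in question is the
ashmem-internal backing shmem inode (the key and helper names are made
up):

static struct lock_class_key ashmem_backing_inode_key;

static void ashmem_annotate_backing_inode(struct file *backing_file)
{
        /* give the backing inode its own lockdep class so it is not
         * conflated with ordinary shmem/tmpfs inodes */
        lockdep_set_class(&file_inode(backing_file)->i_rwsem,
                          &ashmem_backing_inode_key);
}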

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] binder: Don't use mmput() from shrinker function.

2020-07-16 Thread Michal Hocko
On Thu 16-07-20 08:36:52, Tetsuo Handa wrote:
> syzbot is reporting that mmput() from a shrinker function has a risk of
> deadlock [1]. Don't start synchronous teardown of an mm when called from
> a shrinker function.

Please add the actual lock dependency to the changelog.

Anyway, is this deadlock real? Maybe I have missed some details but the
call graph points to these two paths.
uprobe_mmap                             do_shrink_slab
  uprobes_mmap_hash #lock
  install_breakpoint                      binder_shrink_scan
    set_swbp                                binder_alloc_free_page
      uprobe_write_opcode                     __mmput
        update_ref_ctr                          uprobe_clear_state
                                                  mutex_lock(&delayed_uprobe_lock)
          mutex_lock(&delayed_uprobe_lock);
          allocation -> reclaim

But in order for this to happen the shrinker would have to do the last
put on the mm. But mm cannot go away from under uprobe_mmap so those two
paths cannot race with each other.

Unless I am missing something, this is a false positive. I do not mind
using mmput_async from the shrinker as a workaround, but the changelog
should be explicit about that fact.
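
For context, the difference between the two is roughly as below (a
paraphrase of kernel/fork.c, not a verbatim copy): mmput() may run the
whole __mmput() teardown synchronously when the last reference drops,
while mmput_async() pushes it to a workqueue.

void mmput_async(struct mm_struct *mm)
{
        if (atomic_dec_and_test(&mm->mm_users)) {
                INIT_WORK(&mm->async_put_work, mmput_async_fn);
                schedule_work(&mm->async_put_work);
        }
}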

> [1] 
> https://syzkaller.appspot.com/bug?id=bc9e7303f537c41b2b0cc2dfcea3fc42964c2d45
> 
> Reported-by: syzbot 
> Reported-by: syzbot 
> Signed-off-by: Tetsuo Handa 
> ---
>  drivers/android/binder_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index 42c672f1584e..cbe6aa77d50d 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -947,7 +947,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
>   trace_binder_unmap_user_end(alloc, index);
>   }
>   mmap_read_unlock(mm);
> - mmput(mm);
> + mmput_async(mm);
>  
>   trace_binder_unmap_kernel_start(alloc, index);
>  
> -- 
> 2.18.4
> 

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH] binder: Don't use mmput() from shrinker function.

2020-07-16 Thread Michal Hocko
On Thu 16-07-20 22:41:14, Tetsuo Handa wrote:
> On 2020/07/16 17:35, Michal Hocko wrote:
[...]
> > But in order for this to happen the shrinker would have to do the last
> > put on the mm. But mm cannot go away from under uprobe_mmap so those two
> > paths cannot race with each other.
> 
> and mm1 != mm2 is possible, isn't it?

OK, I have missed that information. You are right. Can you make this
into the changelog please?
-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel


Re: [PATCH v2] binder: Don't use mmput() from shrinker function.

2020-07-16 Thread Michal Hocko
On Fri 17-07-20 00:12:15, Tetsuo Handa wrote:
> syzbot is reporting that mmput() from a shrinker function has a risk of
> deadlock [1], because delayed_uprobe_add() from update_ref_ctr() calls
> kzalloc(GFP_KERNEL) with delayed_uprobe_lock held, and
> uprobe_clear_state() from __mmput() also holds delayed_uprobe_lock.
> 
> Commit a1b2289cef92ef0e ("android: binder: drop lru lock in isolate
> callback") replaced mmput() with mmput_async() in order to avoid sleeping
> with spinlock held. But this patch replaces mmput() with mmput_async() in
> order not to start __mmput() from shrinker context.
> 
> [1] 
> https://syzkaller.appspot.com/bug?id=bc9e7303f537c41b2b0cc2dfcea3fc42964c2d45
> 
> Reported-by: syzbot 
> Reported-by: syzbot 
> Signed-off-by: Tetsuo Handa 

Reviewed-by: Michal Hocko 

Thanks!

> ---
>  drivers/android/binder_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index 42c672f1584e..cbe6aa77d50d 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -947,7 +947,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
>   trace_binder_unmap_user_end(alloc, index);
>   }
>   mmap_read_unlock(mm);
> - mmput(mm);
> +     mmput_async(mm);
>  
>   trace_binder_unmap_kernel_start(alloc, index);
>  
> -- 
> 2.18.4

-- 
Michal Hocko
SUSE Labs
___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel