Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-09-03 Thread Michal Hocko
On Thu 29-08-24 10:33:22, Charlie Jenkins wrote:
> On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
> > On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses allocated
> > > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> > > address returned by mmap to be less than the 48-bit address space,
> > > unless the hint address uses more than 47 bits (the 48th bit is reserved
> > > for the kernel address space).
> > > 
> > > The riscv architecture needs a way to similarly restrict the virtual
> > > address space. On the riscv port of OpenJDK an error is thrown if
> > > attempted to run on the 57-bit address space, called sv57 [1].  golang
> > > has a comment that sv57 support is not complete, but there are some
> > > workarounds to get it to mostly work [2].
> > > 
> > > These applications work on x86 because x86 does an implicit 47-bit
> > > restriction of mmap() address that contain a hint address that is less
> > > than 48 bits.
> > > 
> > > Instead of implicitly restricting the address space on riscv (or any
> > > current/future architecture), a flag would allow users to opt-in to this
> > > behavior rather than opt-out as is done on other architectures. This is
> > > desirable because it is a small class of applications that do pointer
> > > masking.
> > 
> > IIRC this was discussed at length when 5-level page table support
> > was proposed for x86. Sorry, I do not have a link handy, but lore
> > should help you. Linus was not really convinced and in the end vetoed it,
> > preferring that those few applications that benefit from a greater address
> > space do that explicitly rather than the other way around.
> 
> I believe I found the conversation you were referring to. Ingo Molnar
> recommended a flag similar to what I have proposed [1]. Catalin
> recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up
> MPX [3].
> 
> However these conversations are tangential to what I am proposing. arm64
> and x86 decided to have the default address space be 48 bits. However
> this was done on a per-architecture basis with no way for applications
> to have guarantees between architectures. Even this behavior to restrict
> to 48 bits does not even appear in the man pages, so would require
> reading the kernel source code to understand that this feature is
> available. Then to opt-in to larger address spaces, applications have to
> know to provide a hint address that is greater than 47 bits, mmap() will
> then return an address that contains up to 56 bits on x86 and 52 bits on
> arm64. This difference of 4 bits causes inconsistency and is part of the
> problem I am trying to solve with this flag.

Yes, I guess I do understand where you are heading. Our existing model
assumes that anybody requiring more address space knows what they are
doing and deals with the reality. This is the way Linus has pushed this
and I am not really convinced it is the right way TBH. On the other hand
it is true that this allows a safe(r) transition to larger address
spaces.
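
For reference, this is what the existing explicit opt-in looks like from
userspace (a minimal sketch, not from the thread; it only assumes the
hint-address behaviour the cover letter describes for x86-64/arm64):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;

	/* Default: NULL hint, the returned address stays below the 47-bit boundary. */
	void *lo = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/*
	 * Explicit opt-in: a hint above 47 bits allows the larger address
	 * space (57-bit VA on x86 with 5-level paging, 52-bit on arm64).
	 */
	void *hint = (void *)(1UL << 48);
	void *hi = mmap(hint, len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("default hint: %p, high hint: %p\n", lo, hi);
	return 0;
}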

> I am not proposing to change x86 and arm64 away from using their opt-out
> feature, I am instead proposing a standard ABI for applications that
> need some guarantees of the bits used in pointers.

Right, but this is not really different from earlier attempts to achieve
this IIRC. Extending mmap for that purpose seems quite tricky, as already
pointed out in other sub-threads. Quite honestly I am not really sure
what the right and backwards compatible way is. I just wanted to make
you aware this has been discussed at length in the past.
-- 
Michal Hocko
SUSE Labs



Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

2024-08-29 Thread Michal Hocko
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
> Some applications rely on placing data in free bits addresses allocated
> by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the
> address returned by mmap to be less than the 48-bit address space,
> unless the hint address uses more than 47 bits (the 48th bit is reserved
> for the kernel address space).
> 
> The riscv architecture needs a way to similarly restrict the virtual
> address space. On the riscv port of OpenJDK an error is thrown if
> attempted to run on the 57-bit address space, called sv57 [1].  golang
> has a comment that sv57 support is not complete, but there are some
> workarounds to get it to mostly work [2].
> 
> These applications work on x86 because x86 does an implicit 47-bit
> restriction of mmap() address that contain a hint address that is less
> than 48 bits.
> 
> Instead of implicitly restricting the address space on riscv (or any
> current/future architecture), a flag would allow users to opt-in to this
> behavior rather than opt-out as is done on other architectures. This is
> desirable because it is a small class of applications that do pointer
> masking.

IIRC this was discussed at length when 5-level page table support
was proposed for x86. Sorry, I do not have a link handy, but lore
should help you. Linus was not really convinced and in the end vetoed it,
preferring that those few applications that benefit from a greater address
space do that explicitly rather than the other way around.
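
To make the pointer-masking pattern from the cover letter concrete, a
userspace sketch (not from the thread; it assumes a 64-bit process whose
addresses fit in the low 48 bits):

#include <stdint.h>
#include <stdio.h>

#define TAG_SHIFT 48	/* assumes userspace addresses fit in the low 48 bits */

static inline void *pack(void *p, uint16_t tag)
{
	/* stash metadata in the otherwise-unused upper pointer bits */
	return (void *)((uintptr_t)p | ((uintptr_t)tag << TAG_SHIFT));
}

static inline void *unpack(void *tagged)
{
	return (void *)((uintptr_t)tagged & (((uintptr_t)1 << TAG_SHIFT) - 1));
}

int main(void)
{
	int x = 42;
	void *t = pack(&x, 0x7);

	/*
	 * Works only while &x stays below 2^48; an sv57/LA57 address would
	 * be corrupted by the mask above.
	 */
	printf("%d\n", *(int *)unpack(t));
	return 0;
}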

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Michal Hocko
On Wed 31-01-24 11:16:01, David Hildenbrand wrote:
[...]
> This 1 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

AFAIR at the time of this patch this was mostly just to put some cap on
the number of batches to collect and free at once. If there is a lot of
free memory and a large process exiting, this could grow really high. Now
that those pages^Wfolios can represent larger memory chunks, it could
mean more physical memory being freed, which might make the operation
take longer but still keep it far from triggering a soft lockup.

Now latency might suck on !PREEMPT kernels with too many pages to
free in a single batch, but I guess this is somewhat expected for this
preemption model. The soft lockup has to be avoided because it can
panic the machine in some configurations.
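
As a rough illustration of why the cap bounds the worst-case latency
between reschedule points (a userspace toy with made-up names standing in
for the mmu_gather internals, not kernel code):

#include <stddef.h>

#define MAX_GATHER_BATCH	512	/* made-up cap, stand-in for mmu_gather's */

static void free_one(void *p)		/* stand-in for freeing one page/folio */
{
	(void)p;
}

static void reschedule_point(void)	/* stand-in for cond_resched() */
{
}

static void free_all(void **pages, size_t nr)
{
	size_t done = 0;

	while (done < nr) {
		size_t batch = nr - done;

		if (batch > MAX_GATHER_BATCH)
			batch = MAX_GATHER_BATCH;

		for (size_t i = 0; i < batch; i++)
			free_one(pages[done + i]);
		done += batch;

		/*
		 * Without a bounded batch a huge exit could hog the CPU long
		 * enough for the soft lockup detector to fire; with the cap
		 * the work between reschedule points stays constant.
		 */
		reschedule_point();
	}
}

int main(void)
{
	void *dummy[1024] = { 0 };

	free_all(dummy, 1024);
	return 0;
}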

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Michal Hocko
On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
> IIRC there is an option to zero memory when it is freed back to the buddy? So
> that could be a place where time is proportional to size rather than
> proportional to folio count? But I think that option is intended for debug 
> only?
> So perhaps not a problem in practice?

init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels, but by no means should it be
considered a low-latency guarantee feature. A lot has changed since
the limit was introduced and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-07 Thread Michal Hocko
On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
[...]
> Do you see a need for firmware-managed memory to be hotplugged in with
> different memory block sizes?

In short: yes. Slightly longer: a fixed-size memory block semantic is
just standing in the way and I would even argue it is actively harmful.
Just have a look at the ridiculously small memory blocks on ppc. I do
understand that it makes some sense to be aligned to the memory model
(so sparsemem section aligned). In an ideal world, a memory hotplug v2
interface (if we ever go that path) should be physical memory range based.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 1/2] nmi_backtrace: Allow excluding an arbitrary CPU

2023-08-07 Thread Michal Hocko
On Fri 04-08-23 09:06:07, Doug Anderson wrote:
> Hi,
> 
> On Fri, Aug 4, 2023 at 8:02 AM Michal Hocko  wrote:
> >
> > > > It would have been slightly safer to modify 
> > > > arch_trigger_cpumask_backtrace
> > > > by switching arguments so that some leftovers are captured easier.
> > >
> > > I'm not sure I understand. Oh, you're saying make the prototype of
> > > arch_trigger_cpumask_backtrace() incompatible so that if someone is
> > > directly calling it then it'll be a compile-time error?
> >
> > exactly. bool to int promotion would be too easy to miss while the
> > pointer to int would complain loudly.
> >
> > > I guess the
> > > hope is that nobody is calling that directly and they're calling
> > > through the trigger_...() functions.
> >
> > Hope is one thing, being preventive another.
> >
> > > For now I'm going to leave this alone.
> >
> > If you are going to send another version then please consider this. Not
> > a hard requirement but better.
> 
> If I do send another version, do you have any suggestions for how to
> change this to make it incompatible?

I would swap parameters as this seems simplest.

> I guess swapping the order of the
> parameters would be best? I considered doing that for v4 but I felt
> like long term the current order of the parameters was better.

Yes, the current ordering is better, but having it the other way around is
not really horrendous either.

> I also
> considered a rename, but that different problems. ;-) If I rename both
> the #define and the function then if someone has an out-of-tree patch
> adding arch_trigger_cpumask_backtrace() for another architecture, like
> say arm64, then there would be no compile-time failure indicating that
> the out-of-tree patch needs updating. I could rename the functions but
> _not_ the #define, I guess?

I think that swapping would be simplest as the type mismatch should
also catch any pending out-of-tree implementations.
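
A tiny userspace illustration of that hazard (the names are reused here
only for the sketch, it is not the kernel code): with the bool parameter
merely retyped to int, an unconverted caller keeps compiling and silently
passes 1 as a CPU id, whereas swapping the parameters would turn it into a
type error.

#include <stdbool.h>
#include <stdio.h>

struct cpumask { unsigned long bits[4]; };	/* simplified stand-in */

/* New prototype: the bool exclude_self parameter became int exclude_cpu. */
static void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
					   int exclude_cpu)
{
	(void)mask;
	printf("excluding CPU %d\n", exclude_cpu);
}

int main(void)
{
	struct cpumask mask = { { 0 } };
	bool exclude_self = true;

	/*
	 * An unconverted caller still compiles: the bool promotes to int, so
	 * "exclude myself" silently turns into "exclude CPU 1". With the two
	 * parameters swapped, passing the cpumask pointer where an int is
	 * expected would be a loud compile-time error instead.
	 */
	arch_trigger_cpumask_backtrace(&mask, exclude_self);
	return 0;
}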

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 1/2] nmi_backtrace: Allow excluding an arbitrary CPU

2023-08-04 Thread Michal Hocko
On Fri 04-08-23 06:56:51, Doug Anderson wrote:
> Hi,
> 
> On Fri, Aug 4, 2023 at 12:50 AM Michal Hocko  wrote:
> >
> > On Thu 03-08-23 16:07:57, Douglas Anderson wrote:
> > > The APIs that allow backtracing across CPUs have always had a way to
> > > exclude the current CPU. This convenience means callers didn't need to
> > > find a place to allocate a CPU mask just to handle the common case.
> > >
> > > Let's extend the API to take a CPU ID to exclude instead of just a
> > > boolean. This isn't any more complex for the API to handle and allows
> > > the hardlockup detector to exclude a different CPU (the one it already
> > > did a trace for) without needing to find space for a CPU mask.
> > >
> > > Arguably, this new API also encourages safer behavior. Specifically if
> > > the caller wants to avoid tracing the current CPU (maybe because they
> > > already traced the current CPU) this makes it more obvious to the
> > > caller that they need to make sure that the current CPU ID can't
> > > change.
> >
> > Yes, this looks like the best way forward.
> >
> > It would have been slightly safer to modify arch_trigger_cpumask_backtrace
> > by switching arguments so that some leftovers are captured easier.
> 
> I'm not sure I understand. Oh, you're saying make the prototype of
> arch_trigger_cpumask_backtrace() incompatible so that if someone is
> directly calling it then it'll be a compile-time error? 

Exactly. A bool to int promotion would be too easy to miss, while a
pointer to int would complain loudly.

> I guess the
> hope is that nobody is calling that directly and they're calling
> through the trigger_...() functions.

Hope is one thing, being preventive another.

> For now I'm going to leave this alone.

If you are going to send another version then please consider this. Not
a hard requirement but better.
 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 1/2] nmi_backtrace: Allow excluding an arbitrary CPU

2023-08-04 Thread Michal Hocko
On Thu 03-08-23 16:07:57, Douglas Anderson wrote:
> The APIs that allow backtracing across CPUs have always had a way to
> exclude the current CPU. This convenience means callers didn't need to
> find a place to allocate a CPU mask just to handle the common case.
> 
> Let's extend the API to take a CPU ID to exclude instead of just a
> boolean. This isn't any more complex for the API to handle and allows
> the hardlockup detector to exclude a different CPU (the one it already
> did a trace for) without needing to find space for a CPU mask.
> 
> Arguably, this new API also encourages safer behavior. Specifically if
> the caller wants to avoid tracing the current CPU (maybe because they
> already traced the current CPU) this makes it more obvious to the
> caller that they need to make sure that the current CPU ID can't
> change.

Yes, this looks like the best way forward.

It would have been slightly safer to modify arch_trigger_cpumask_backtrace
by switching arguments so that some leftovers are captured easier.

You also have this leftover
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 00982b133dc1..9f1743ee2b28 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
 {
return false;
 }
-static inline bool trigger_allbutself_cpu_backtrace(void)
-{
-   return false;
-}
 static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
 {
return false;

> Signed-off-by: Douglas Anderson 

Anyway
Acked-by: Michal Hocko 

> ---
> 
> Changes in v3:
> - ("nmi_backtrace: Allow excluding an arbitrary CPU") new for v3.
> 
>  arch/arm/include/asm/irq.h   |  2 +-
>  arch/arm/kernel/smp.c|  4 ++--
>  arch/loongarch/include/asm/irq.h |  2 +-
>  arch/loongarch/kernel/process.c  |  4 ++--
>  arch/mips/include/asm/irq.h  |  2 +-
>  arch/mips/kernel/process.c   |  4 ++--
>  arch/powerpc/include/asm/irq.h   |  2 +-
>  arch/powerpc/kernel/stacktrace.c |  4 ++--
>  arch/powerpc/kernel/watchdog.c   |  4 ++--
>  arch/sparc/include/asm/irq_64.h  |  2 +-
>  arch/sparc/kernel/process_64.c   |  6 +++---
>  arch/x86/include/asm/irq.h   |  2 +-
>  arch/x86/kernel/apic/hw_nmi.c|  4 ++--
>  include/linux/nmi.h  | 12 ++--
>  kernel/watchdog.c|  2 +-
>  lib/nmi_backtrace.c  |  6 +++---
>  16 files changed, 31 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/arm/include/asm/irq.h b/arch/arm/include/asm/irq.h
> index 18605f1b3580..26c1d2ced4ce 100644
> --- a/arch/arm/include/asm/irq.h
> +++ b/arch/arm/include/asm/irq.h
> @@ -32,7 +32,7 @@ void handle_IRQ(unsigned int, struct pt_regs *);
>  #include 
>  
>  extern void arch_trigger_cpumask_backtrace(const cpumask_t *mask,
> -bool exclude_self);
> +int exclude_cpu);
>  #define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
>  #endif
>  
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 6756203e45f3..3431c0553f45 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -846,7 +846,7 @@ static void raise_nmi(cpumask_t *mask)
>   __ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
>  }
>  
> -void arch_trigger_cpumask_backtrace(const cpumask_t *mask, bool exclude_self)
> +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
>  {
> - nmi_trigger_cpumask_backtrace(mask, exclude_self, raise_nmi);
> + nmi_trigger_cpumask_backtrace(mask, exclude_cpu, raise_nmi);
>  }
> diff --git a/arch/loongarch/include/asm/irq.h 
> b/arch/loongarch/include/asm/irq.h
> index a115e8999c69..218b4da0ea90 100644
> --- a/arch/loongarch/include/asm/irq.h
> +++ b/arch/loongarch/include/asm/irq.h
> @@ -40,7 +40,7 @@ void spurious_interrupt(void);
>  #define NR_IRQS_LEGACY 16
>  
>  #define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
> -void arch_trigger_cpumask_backtrace(const struct cpumask *mask, bool 
> exclude_self);
> +void arch_trigger_cpumask_backtrace(const struct cpumask *mask, int 
> exclude_cpu);
>  
>  #define MAX_IO_PICS 2
>  #define NR_IRQS  (64 + (256 * MAX_IO_PICS))
> diff --git a/arch/loongarch/kernel/process.c b/arch/loongarch/kernel/process.c
> index 2e04eb07abb6..778e8d09953e 100644
> --- a/arch/loongarch/kernel/process.c
> +++ b/arch/loongarch/kernel/process.c
> @@ -345,9 +345,9 @@ static void raise_backtrace(cpumask_t *mask)
>   }
>  }
>  
> -void arch_trigger_cpumask_backtrace(const cpumask_t *mask, bool exclude_self)
> +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
>  {
> - nmi_trigge

Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-03 Thread Michal Hocko
On Wed 02-08-23 18:59:04, Michal Hocko wrote:
> On Wed 02-08-23 17:54:04, David Hildenbrand wrote:
> > On 02.08.23 17:50, Michal Hocko wrote:
> > > On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> > > > On 8/1/23 4:20 PM, Michal Hocko wrote:
> > > > > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> > > > > > On 8/1/23 2:28 PM, Michal Hocko wrote:
> > > > > > > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> > > > > > > > Allow updating memmap_on_memory mode after the kernel boot. 
> > > > > > > > Memory
> > > > > > > > hotplug done after the mode update will use the new 
> > > > > > > > mmemap_on_memory
> > > > > > > > value.
> > > > > > > 
> > > > > > > Well, this is a user space kABI extension and as such you should 
> > > > > > > spend
> > > > > > > more words about the usecase. Why we could live with this static 
> > > > > > > and now
> > > > > > > need dynamic?
> > > > > > > 
> > > > > > 
> > > > > > This enables easy testing of memmap_on_memory feature without a 
> > > > > > kernel reboot.
> > > > > 
> > > > > Testing alone is rather weak argument to be honest.
> > > > > 
> > > > > > I also expect people wanting to use that when they find dax kmem 
> > > > > > memory online
> > > > > > failing because of struct page allocation failures[1]. User could 
> > > > > > reboot back with
> > > > > > memmap_on_memory=y kernel parameter. But being able to enable it 
> > > > > > via sysfs makes
> > > > > > the feature much more useful.
> > > > > 
> > > > > Sure it can be useful but that holds for any feature, right. The main
> > > > > question is whether this is worth maintaing. The current 
> > > > > implementation
> > > > > seems rather trivial which is an argument to have it but are there any
> > > > > risks long term? Have you evaluated a potential long term maintenance
> > > > > cost? There is no easy way to go back and disable it later on without
> > > > > breaking some userspace.
> > > > > 
> > > > > All that should be in the changelog!
> > > > 
> > > > I updated it as below.
> > > > 
> > > > mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> > > > 
> > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > value.
> > > > 
> > > > It is now possible to test the memmap_on_memory feature easily without
> > > > the need for a kernel reboot. Additionally, for those encountering
> > > > struct page allocation failures while using dax kmem memory online, this
> > > > feature may prove useful. Instead of rebooting with the
> > > > memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> > > > which greatly enhances its usefulness.
> > > 
> > > 
> > > I do not really see a solid argument why rebooting is really a problem
> > > TBH. Also is the global policy knob really the right fit for existing
> > > hotplug usecases? In other words, if we really want to make
> > > memmap_on_memory more flexible would it make more sense to have it per
> > > memory block property instead (the global knob being the default if
> > > admin doesn't specify it differently).
> > 
> > Per memory block isn't possible, due to the creation order.
> 
> I am not sure I follow. Could you elaborate more?

Must have been a tired brain. Now I see what you mean, of course. The vmemmap
is allocated at the time the range is registered, before the memory
block sysfs entry is created. So you are right that it is too late in some sense,
but I still believe that this would be doable. The memmap_on_memory file
would be readable only when the block is offline and it would reallocate the
vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
walkers?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-02 Thread Michal Hocko
On Wed 02-08-23 17:54:04, David Hildenbrand wrote:
> On 02.08.23 17:50, Michal Hocko wrote:
> > On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> > > On 8/1/23 4:20 PM, Michal Hocko wrote:
> > > > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> > > > > On 8/1/23 2:28 PM, Michal Hocko wrote:
> > > > > > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> > > > > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > > > > hotplug done after the mode update will use the new 
> > > > > > > mmemap_on_memory
> > > > > > > value.
> > > > > > 
> > > > > > Well, this is a user space kABI extension and as such you should 
> > > > > > spend
> > > > > > more words about the usecase. Why we could live with this static 
> > > > > > and now
> > > > > > need dynamic?
> > > > > > 
> > > > > 
> > > > > This enables easy testing of memmap_on_memory feature without a 
> > > > > kernel reboot.
> > > > 
> > > > Testing alone is rather weak argument to be honest.
> > > > 
> > > > > I also expect people wanting to use that when they find dax kmem 
> > > > > memory online
> > > > > failing because of struct page allocation failures[1]. User could 
> > > > > reboot back with
> > > > > memmap_on_memory=y kernel parameter. But being able to enable it via 
> > > > > sysfs makes
> > > > > the feature much more useful.
> > > > 
> > > > Sure it can be useful but that holds for any feature, right. The main
> > > > question is whether this is worth maintaing. The current implementation
> > > > seems rather trivial which is an argument to have it but are there any
> > > > risks long term? Have you evaluated a potential long term maintenance
> > > > cost? There is no easy way to go back and disable it later on without
> > > > breaking some userspace.
> > > > 
> > > > All that should be in the changelog!
> > > 
> > > I updated it as below.
> > > 
> > > mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> > > 
> > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > hotplug done after the mode update will use the new mmemap_on_memory
> > > value.
> > > 
> > > It is now possible to test the memmap_on_memory feature easily without
> > > the need for a kernel reboot. Additionally, for those encountering
> > > struct page allocation failures while using dax kmem memory online, this
> > > feature may prove useful. Instead of rebooting with the
> > > memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> > > which greatly enhances its usefulness.
> > 
> > 
> > I do not really see a solid argument why rebooting is really a problem
> > TBH. Also is the global policy knob really the right fit for existing
> > hotplug usecases? In other words, if we really want to make
> > memmap_on_memory more flexible would it make more sense to have it per
> > memory block property instead (the global knob being the default if
> > admin doesn't specify it differently).
> 
> Per memory block isn't possible, due to the creation order.

I am not sure I follow. Could you elaborate more?

> Also, I think it's not the right approach.
> 
> I thought about driver toggles. At least for dax/kmem people are looking
> into that:
> 
> https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5...@intel.com
> 
> Where that can also be toggled per device.

A per-device flag makes sense to me as well. But what should we do about
hotplugged memory with a backing device (like regular RAM)? I can see
some reason to have large hotplugged memory ranges be vmemmap self-hosted
while smaller blocks are backed by external vmemmaps.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-02 Thread Michal Hocko
On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> On 8/1/23 4:20 PM, Michal Hocko wrote:
> > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> >> On 8/1/23 2:28 PM, Michal Hocko wrote:
> >>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> >>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
> >>>> hotplug done after the mode update will use the new mmemap_on_memory
> >>>> value.
> >>>
> >>> Well, this is a user space kABI extension and as such you should spend
> >>> more words about the usecase. Why we could live with this static and now
> >>> need dynamic?
> >>>
> >>
> >> This enables easy testing of memmap_on_memory feature without a kernel 
> >> reboot.
> > 
> > Testing alone is rather weak argument to be honest.
> > 
> >> I also expect people wanting to use that when they find dax kmem memory 
> >> online
> >> failing because of struct page allocation failures[1]. User could reboot 
> >> back with
> >> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs 
> >> makes
> >> the feature much more useful.
> > 
> > Sure it can be useful but that holds for any feature, right. The main
> > question is whether this is worth maintaing. The current implementation
> > seems rather trivial which is an argument to have it but are there any
> > risks long term? Have you evaluated a potential long term maintenance
> > cost? There is no easy way to go back and disable it later on without
> > breaking some userspace.
> > 
> > All that should be in the changelog!
> 
> I updated it as below. 
> 
> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> 
> Allow updating memmap_on_memory mode after the kernel boot. Memory
> hotplug done after the mode update will use the new mmemap_on_memory
> value.
> 
> It is now possible to test the memmap_on_memory feature easily without
> the need for a kernel reboot. Additionally, for those encountering
> struct page allocation failures while using dax kmem memory online, this
> feature may prove useful. Instead of rebooting with the
> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> which greatly enhances its usefulness.


I do not really see a solid argument why rebooting is really a problem,
TBH. Also, is the global policy knob really the right fit for existing
hotplug usecases? In other words, if we really want to make
memmap_on_memory more flexible, would it make more sense to have it as a
per-memory-block property instead (the global knob being the default if
the admin doesn't specify it differently)?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-01 Thread Michal Hocko
On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> On 8/1/23 2:28 PM, Michal Hocko wrote:
> > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> >> Allow updating memmap_on_memory mode after the kernel boot. Memory
> >> hotplug done after the mode update will use the new mmemap_on_memory
> >> value.
> > 
> > Well, this is a user space kABI extension and as such you should spend
> > more words about the usecase. Why we could live with this static and now
> > need dynamic?
> > 
> 
> This enables easy testing of memmap_on_memory feature without a kernel reboot.

Testing alone is a rather weak argument, to be honest.

> I also expect people wanting to use that when they find dax kmem memory online
> failing because of struct page allocation failures[1]. User could reboot back 
> with
> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs 
> makes
> the feature much more useful.

Sure, it can be useful, but that holds for any feature, right? The main
question is whether this is worth maintaining. The current implementation
seems rather trivial, which is an argument to have it, but are there any
risks long term? Have you evaluated the potential long-term maintenance
cost? There is no easy way to go back and disable it later on without
breaking some userspace.

All that should be in the changelog!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 0/7] Add support for memmap on memory feature on ppc64

2023-08-01 Thread Michal Hocko
I cannot really judge the ppc-specific part but the other patches seem
reasonable. Patch 4 could print more useful information about the
wastage but this is nothing really earth-shattering. I am not sure about
the last patch which makes the memmap_on_memory property dynamic. This needs
much more justification and use case description IMHO.

That being said for patches 1 - 4 and 6 feel free to add
Acked-by: Michal Hocko 

On Tue 01-08-23 10:11:09, Aneesh Kumar K.V wrote:
> This patch series update memmap on memory feature to fall back to
> memmap allocation outside the memory block if the alignment rules are
> not met. This makes the feature more useful on architectures like
> ppc64 where alignment rules are different with 64K page size.
> 
> This patch series is dependent on dax vmemmap optimization series
> posted here
> https://lore.kernel.org/linux-mm/20230718022934.90447-1-aneesh.ku...@linux.ibm.com/
> 
> Changes from v6:
> * Update comments in the code
> * Update commit message for patch 7
> 
> Changes from v5:
> * Update commit message
> * Move memory alloc/free to the callers in patch 6
> * Address review feedback w.r.t patch 4
> 
> Changes from v4:
> * Use altmap.free instead of altmap.reserve
> * Address review feedback
> 
> Changes from v3:
> * Extend the module parameter memmap_on_memory to force allocation even
>   though we can waste hotplug memory.
> 
> Changes from v2:
> * Rebase to latest linus tree
> * Redo the series based on review feedback. Multiple changes to the patchset.
> 
> Changes from v1:
> * update the memblock to store vmemmap_altmap details. This is required
> so that when we remove the memory we can find the altmap details which
> is needed on some architectures.
> * rebase to latest linus tree
> 
> 
> 
> Aneesh Kumar K.V (7):
>   mm/memory_hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
>   mm/memory_hotplug: Allow memmap on memory hotplug request to fallback
>   mm/memory_hotplug: Allow architecture to override memmap on memory
> support check
>   mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned
> to pageblocks
>   powerpc/book3s64/memhotplug: Enable memmap on memory for radix
>   mm/memory_hotplug: Embed vmem_altmap details in memory block
>   mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> 
>  .../admin-guide/mm/memory-hotplug.rst |  12 +
>  arch/arm64/Kconfig|   4 +-
>  arch/powerpc/Kconfig  |   1 +
>  arch/powerpc/include/asm/pgtable.h|  21 ++
>  .../platforms/pseries/hotplug-memory.c|   2 +-
>  arch/x86/Kconfig  |   4 +-
>  drivers/acpi/acpi_memhotplug.c|   3 +-
>  drivers/base/memory.c |  27 ++-
>  include/linux/memory.h|   8 +-
>  include/linux/memory_hotplug.h|   3 +-
>  mm/Kconfig        |   3 +
>  mm/memory_hotplug.c   | 205 ++
>  12 files changed, 220 insertions(+), 73 deletions(-)
> 
> -- 
> 2.41.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 4/7] mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

2023-08-01 Thread Michal Hocko
On Tue 01-08-23 10:11:13, Aneesh Kumar K.V wrote:
[...]
> + if (mode == MEMMAP_ON_MEMORY_FORCE) {
> + unsigned long memmap_pages = 
> memory_block_memmap_on_memory_pages();
	unsigned long wastage = memmap_pages - PFN_UP(memory_block_memmap_size());

	if (wastage)
		pr_info_once("memmap_on_memory=force will waste %ld pages in each memory block %ld\n",
			     wastage, memory_block_size_bytes() >> PAGE_SHIFT);

would be more useful IMHO.

> +
> + pr_info_once("Memory hotplug will waste %ld pages in each 
> memory block\n",
> +  memmap_pages - PFN_UP(memory_block_memmap_size()));
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-08-01 Thread Michal Hocko
On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> Allow updating memmap_on_memory mode after the kernel boot. Memory
> hotplug done after the mode update will use the new mmemap_on_memory
> value.

Well, this is a user space kABI extension and as such you should spend
more words about the usecase. Why we could live with this static and now
need dynamic?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 6/7] mm/memory_hotplug: Embed vmem_altmap details in memory block

2023-07-27 Thread Michal Hocko
On Thu 27-07-23 15:02:12, Aneesh Kumar K V wrote:
> On 7/27/23 2:55 PM, Michal Hocko wrote:
> > On Thu 27-07-23 13:32:31, Aneesh Kumar K.V wrote:
> >> With memmap on memory, some architecture needs more details w.r.t altmap
> >> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
> >> computing them again when we remove a memory block, embed vmem_altmap
> >> details in struct memory_block if we are using memmap on memory block
> >> feature.
> >>
> >> No functional change in this patch
> >>
> >> Signed-off-by: Aneesh Kumar K.V 
> >> ---
> >>  drivers/base/memory.c  | 25 +++---
> >>  include/linux/memory.h |  8 ++
> >>  mm/memory_hotplug.c| 58 +++---
> >>  3 files changed, 55 insertions(+), 36 deletions(-)
> >>
> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >> index b456ac213610..57ed61212277 100644
> >> --- a/drivers/base/memory.c
> >> +++ b/drivers/base/memory.c
> >> @@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev)
> >>  {
> >>struct memory_block *mem = to_memory_block(dev);
> >>  
> >> +  WARN_ON(mem->altmap);
> > 
> > What is this supposed to catch? A comment would be handy so that we know
> > what to look at should it ever trigger.
> > 
> 
> I did add a comment where we clear the altmap in try_remove_memory(). I will 
> also add
> more details here.
> 
> +  * Mark altmap NULL so that we can add a debug
> +  * check on memblock free.
>*/
> 
> WARN_ON is an indication of memory leak because if we have mem->altmap != NULL
> then the allocated altmap is not freed . It also indicate that memblock got 
> freed
> without going through the try_remove_memory(). 

I think it would be better to be explicit here (who should have freed it
up but hasn't).

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 4/7] mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

2023-07-27 Thread Michal Hocko
On Thu 27-07-23 14:57:17, Aneesh Kumar K V wrote:
> On 7/27/23 2:53 PM, Michal Hocko wrote:
> > On Thu 27-07-23 13:32:29, Aneesh Kumar K.V wrote:
> > [...]
> >> +  if (mode == MEMMAP_ON_MEMORY_FORCE) {
> >> +  unsigned long memmap_pages = 
> >> memory_block_memmap_on_memory_pages();
> >> +
> >> +  pr_info_once("Memory hotplug will reserve %ld pages in each 
> >> memory block\n",
> >> +   memmap_pages - PFN_UP(memory_block_memmap_size()));
> >> +  }
> >> +  return 0;
> >> +}
> > 
> > Why should we print this only for the forced case? Isn't that
> > interesting for any on memory memmap? Also is this the above sufficient
> > on its own? the size depends on the block size and that can vary.
> > I think it would make more sense to print the block size and the vmemmap
> > reservation and for the force case also any wasted amount on top (if
> > any).
> > 
> 
> For the other cases the space is completely used by for struct page 
> allocation. What
> the information is indicating here is that for each memblock we add we are 
> loosing/wasting so many pages. 
> May be I should have used the term "waste" instead of "reserve" ?

OK, so I have clearly misread it, which just confirms this would benefit
from a clarification. In any case I still think that it would be
beneficial to also report how much of the memory is used for the vmemmap on
the hotplugged memory. Maybe as a separate patch.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 7/7] mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

2023-07-27 Thread Michal Hocko
ENOCHANGELOG. Considering this is a user-visible change, a changelog is
really due.

On Thu 27-07-23 13:32:32, Aneesh Kumar K.V wrote:
> Acked-by: David Hildenbrand 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/memory_hotplug.c | 35 +++
>  1 file changed, 19 insertions(+), 16 deletions(-)

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 6/7] mm/memory_hotplug: Embed vmem_altmap details in memory block

2023-07-27 Thread Michal Hocko
On Thu 27-07-23 13:32:31, Aneesh Kumar K.V wrote:
> With memmap on memory, some architecture needs more details w.r.t altmap
> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
> computing them again when we remove a memory block, embed vmem_altmap
> details in struct memory_block if we are using memmap on memory block
> feature.
> 
> No functional change in this patch
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  drivers/base/memory.c  | 25 +++---
>  include/linux/memory.h |  8 ++
>  mm/memory_hotplug.c| 58 +++---
>  3 files changed, 55 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index b456ac213610..57ed61212277 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev)
>  {
>   struct memory_block *mem = to_memory_block(dev);
>  
> + WARN_ON(mem->altmap);

What is this supposed to catch? A comment would be handy so that we know
what to look at should it ever trigger.

>   kfree(mem);
>  }
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 4/7] mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks

2023-07-27 Thread Michal Hocko
On Thu 27-07-23 13:32:29, Aneesh Kumar K.V wrote:
[...]
> + if (mode == MEMMAP_ON_MEMORY_FORCE) {
> + unsigned long memmap_pages = 
> memory_block_memmap_on_memory_pages();
> +
> + pr_info_once("Memory hotplug will reserve %ld pages in each 
> memory block\n",
> +  memmap_pages - PFN_UP(memory_block_memmap_size()));
> + }
> + return 0;
> +}

Why should we print this only for the forced case? Isn't that
interesting for any on-memory memmap? Also, is the above sufficient
on its own? The size depends on the block size and that can vary.
I think it would make more sense to print the block size and the vmemmap
reservation, and for the force case also any wasted amount on top (if
any).

-- 
Michal Hocko
SUSE Labs


Re: [next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)

2023-07-07 Thread Michal Hocko
SAFE_READ_MEMINFO("Cached:") - init_cached);
> > 92   
> > 93 print_cgmem("memory.current");
> > 94 print_cgmem("memory.swap.current");
> > 95 print_cgmem("memory.kmem.usage_in_bytes");  <<== this line.
> > 96 }
> > 
> > If I comment line 95 from the testcase, it completes successfully.
> 
> The handling for _KMEM was removed from mem_cgroup_read_u64()
> incorrectly.
> It is used by the still existing kmem.*usage*_in_bytes in addition to
> the now removed kmem.*limit*_in_bytes.
> (And kmem.max_usage_in_bytes, kmem.failcnt)
> 
> The testcase seems to be fine, it actually did its job.

Correct. The updated patch has already been posted:
http://lkml.kernel.org/r/zke5wxdbvpi5c...@dhcp22.suse.cz

Thanks for the report!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/1] mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates

2023-01-31 Thread Michal Hocko
On Tue 31-01-23 16:01:16, Suren Baghdasaryan wrote:
> Provide vm_flags_reset_once() and replace the vm_flags updates which used
> WRITE_ONCE() to prevent compiler optimizations.
> 
> Fixes: 0cce31a0aa0e ("mm: replace vma->vm_flags direct modifications with 
> modifier calls")
> Reported-by: Hyeonggon Yoo <42.hye...@gmail.com>
> Signed-off-by: Suren Baghdasaryan 

This would have been better folded into the vm_flags modification patch
because it would be a more obvious change. Hugh has provided a very nice
comment in mlock_vma_pages_range but the git blame would be more visible
when the conversion is from WRITE_ONCE.

One way or the other
Acked-by: Michal Hocko 

> ---
> Notes:
> - The patch applies cleanly over mm-unstable
> - The SHA in Fixes: line is from mm-unstable, so is... unstable
> 
>  include/linux/mm.h | 7 +++
>  mm/mlock.c | 4 ++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5bf0ad48faaa..23ce04f6e91e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -648,6 +648,13 @@ static inline void vm_flags_reset(struct vm_area_struct 
> *vma,
>   vm_flags_init(vma, flags);
>  }
>  
> +static inline void vm_flags_reset_once(struct vm_area_struct *vma,
> +vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
> +}
> +
>  static inline void vm_flags_set(struct vm_area_struct *vma,
>   vm_flags_t flags)
>  {
> diff --git a/mm/mlock.c b/mm/mlock.c
> index ed49459e343e..617469fce96d 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -380,7 +380,7 @@ static void mlock_vma_pages_range(struct vm_area_struct 
> *vma,
>*/
>   if (newflags & VM_LOCKED)
>   newflags |= VM_IO;
> - vm_flags_reset(vma, newflags);
> + vm_flags_reset_once(vma, newflags);
>  
>   lru_add_drain();
>   walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
> @@ -388,7 +388,7 @@ static void mlock_vma_pages_range(struct vm_area_struct 
> *vma,
>  
>   if (newflags & VM_IO) {
>   newflags &= ~VM_IO;
> - vm_flags_reset(vma, newflags);
> + vm_flags_reset_once(vma, newflags);
>   }
>  }
>  
> -- 
> 2.39.1.456.gfc5497dd1b-goog

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 6/7] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-26 Thread Michal Hocko
On Wed 25-01-23 15:35:53, Suren Baghdasaryan wrote:
> In cases when VMA flags are modified after VMA was isolated and mmap_lock
> was downgraded, flags modifications would result in an assertion because
> mmap write lock is not held.
> Introduce mod_vm_flags_nolock to be used in such situation, when VMA is
> not part of VMA tree and locking it is not required.
> Pass a hint to untrack_pfn to conditionally use mod_vm_flags_nolock for
> flags modification and to avoid assertion.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 
Thanks!

> ---
>  arch/x86/mm/pat/memtype.c | 10 +++---
>  include/linux/mm.h| 16 +---
>  include/linux/pgtable.h   |  5 +++--
>  mm/memory.c   | 13 +++--
>  mm/memremap.c |  4 ++--
>  mm/mmap.c | 16 ++--
>  6 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index ae9645c900fa..d8adc0b42cf2 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1046,7 +1046,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
> pgprot_t *prot, pfn_t pfn)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -  unsigned long size)
> +  unsigned long size, bool mm_wr_locked)
>  {
>   resource_size_t paddr;
>   unsigned long prot;
> @@ -1065,8 +1065,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned 
> long pfn,
>   size = vma->vm_end - vma->vm_start;
>   }
>   free_pfn_range(paddr, size);
> - if (vma)
> - clear_vm_flags(vma, VM_PAT);
> + if (vma) {
> + if (mm_wr_locked)
> + clear_vm_flags(vma, VM_PAT);
> + else
> + mod_vm_flags_nolock(vma, 0, VM_PAT);
> + }
>  }
>  
>  /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ab5f73360f2..86bf043136f3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -656,12 +656,22 @@ static inline void clear_vm_flags(struct vm_area_struct 
> *vma,
>   ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
>  }
>  
> +/*
> + * Use only if VMA has been previously isolated, is not part of the VMA tree
> + * and therefore needs no locking.
> + */
> +static inline void mod_vm_flags_nolock(struct vm_area_struct *vma,
> +vm_flags_t set, vm_flags_t clear)
> +{
> + ACCESS_PRIVATE(vma, __vm_flags) |= set;
> + ACCESS_PRIVATE(vma, __vm_flags) &= ~clear;
> +}
> +
>  static inline void mod_vm_flags(struct vm_area_struct *vma,
>   vm_flags_t set, vm_flags_t clear)
>  {
>   mmap_assert_write_locked(vma->vm_mm);
> - ACCESS_PRIVATE(vma, __vm_flags) |= set;
> - ACCESS_PRIVATE(vma, __vm_flags) &= ~clear;
> + mod_vm_flags_nolock(vma, set, clear);
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> @@ -2087,7 +2097,7 @@ static inline void zap_vma_pages(struct vm_area_struct 
> *vma)
>  }
>  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>   struct vm_area_struct *start_vma, unsigned long start,
> - unsigned long end);
> + unsigned long end, bool mm_wr_locked);
>  
>  struct mmu_notifier_range;
>  
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5fd45454c073..c63cd44777ec 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1185,7 +1185,8 @@ static inline int track_pfn_copy(struct vm_area_struct 
> *vma)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  static inline void untrack_pfn(struct vm_area_struct *vma,
> -unsigned long pfn, unsigned long size)
> +unsigned long pfn, unsigned long size,
> +bool mm_wr_locked)
>  {
>  }
>  
> @@ -1203,7 +1204,7 @@ extern void track_pfn_insert(struct vm_area_struct 
> *vma, pgprot_t *prot,
>pfn_t pfn);
>  extern int track_pfn_copy(struct vm_area_struct *vma);
>  extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> - unsigned long size);
> + unsigned long size, bool mm_wr_locked);
>  extern void untrack_pfn_moved(struct vm_area_struct *vma);
>  #endif
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index d6902065e558..5b11b50e2c4a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1613,7 +1613,7 @@ void unmap_page_range(struct 

Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions

2023-01-26 Thread Michal Hocko
On Wed 25-01-23 15:35:49, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 

with or without the proposed renaming (it seems we are not very consistent
on that even in the core kernel - e.g. init_rwsem vs. mutex_init)

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h   | 37 +
>  include/linux/mm_types.h | 10 +-
>  2 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c2f62bdce134..bf16ddd544a5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  vm_flags_t flags)
> +{
> + ACCESS_PRIVATE(vma, __vm_flags) = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination 
> */
> +static inline void reset_vm_flags(struct vm_area_struct *vma,
> +   vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline void set_vm_flags(struct vm_area_struct *vma,
> + vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + ACCESS_PRIVATE(vma, __vm_flags) |= flags;
> +}
> +
> +static inline void clear_vm_flags(struct vm_area_struct *vma,
> +   vm_flags_t flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
> +}
> +
> +static inline void mod_vm_flags(struct vm_area_struct *vma,
> + vm_flags_t set, vm_flags_t clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + ACCESS_PRIVATE(vma, __vm_flags) |= set;
> + ACCESS_PRIVATE(vma, __vm_flags) &= ~clear;
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..bccbd5896850 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,15 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * To modify use {init|reset|set|clear|mod}_vm_flags() functions.
> +  */
> + union {
> + const vm_flags_t vm_flags;
> + vm_flags_t __private __vm_flags;
> + };
>  
>   /*
>* For areas with an address space and backing store,
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions

2023-01-26 Thread Michal Hocko
On Wed 25-01-23 16:56:17, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 4:28 PM Andrew Morton  
> wrote:
> >
> > On Wed, 25 Jan 2023 15:35:49 -0800 Suren Baghdasaryan  
> > wrote:
> >
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -491,7 +491,15 @@ struct vm_area_struct {
> > >* See vmf_insert_mixed_prot() for discussion.
> > >*/
> > >   pgprot_t vm_page_prot;
> > > - unsigned long vm_flags; /* Flags, see mm.h. */
> > > +
> > > + /*
> > > +  * Flags, see mm.h.
> > > +  * To modify use {init|reset|set|clear|mod}_vm_flags() functions.
> > > +  */
> > > + union {
> > > + const vm_flags_t vm_flags;
> > > + vm_flags_t __private __vm_flags;
> > > + };
> >
> > Typically when making a change like this we'll rename the affected
> > field/variable/function/etc.  This will reliably and deliberately break
> > unconverted usage sites.
> >
> > This const trick will get us partway there, by breaking setters.  But
> > renaming it will break both setters and getters.
> 
> My intent here is to break setters but to allow getters to keep
> reading vma->vm_flags directly. We could provide get_vm_flags() and
> convert all getters as well but it would introduce a huge additional
> churn (800+ hits) with no obvious benefits, I think. Does that clarify
> the intent of this trick?

I think that makes sense at this stage. The conversion patch is quite
large already. Maybe the final renaming could be done on top of
everything, with the patch generated by coccinelle. The const union is a neat
trick but a potential lockdep assert is a nice plus as well. I wouldn't
see it as a top priority though.
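
For reference, a self-contained userspace sketch of the const-union trick
under discussion (the real patch additionally hides the writable member
behind ACCESS_PRIVATE and asserts the mmap_lock in the wrappers; both are
omitted here, and the helper name is made up for the sketch):

#include <stdio.h>

typedef unsigned long vm_flags_t;

struct vma_sketch {
	union {
		const vm_flags_t vm_flags;	/* readers keep using vma.vm_flags */
		vm_flags_t __vm_flags;		/* writable, wrappers only */
	};
};

static inline void sketch_set_vm_flags(struct vma_sketch *vma, vm_flags_t flags)
{
	/* the real helpers assert here that mmap_lock is held for write */
	vma->__vm_flags |= flags;
}

int main(void)
{
	struct vma_sketch vma = { .__vm_flags = 0 };

	sketch_set_vm_flags(&vma, 0x1);
	/* vma.vm_flags |= 0x2;  <- rejected: assignment to a const member */
	printf("flags: %lx\n", vma.vm_flags);
	return 0;
}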
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 08:57:48, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 1:38 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> > > Replace indirect modifications to vma->vm_flags with calls to modifier
> > > functions to be able to track flag changes and to keep vma locking
> > > correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> > > vm_flags modification attempts.
> >
> > Those BUG_ONs scream too much IMHO. KSM is MM-internal code so I
> > guess we should be willing to trust it.
> 
> Yes, but I really want to prevent an indirect misuse since it was not
> easy to find these. If you feel strongly about it I will remove them
> or if you have a better suggestion I'm all for it.

You can avoid that by making the flags inaccessible to direct modification, right?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 6/6] mm: export dump_mm()

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:51, Suren Baghdasaryan wrote:
> mmap_assert_write_locked() is used in vm_flags modifiers. Because
> mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
> modified from from inside a module, it's necessary to export
> dump_mm() function.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  mm/debug.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index 9d3d893dc7f4..96d594e16292 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
>   mm->def_flags, &mm->def_flags
>   );
>  }
> +EXPORT_SYMBOL(dump_mm);
>  
>  static bool page_init_poisoning __read_mostly = true;
>  
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-25 Thread Michal Hocko
ST_USER_ADDRESS,
>next ? next->vm_start : USER_PGTABLES_CEILING);
>   tlb_finish_mmu(&tlb);
> @@ -2391,7 +2391,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct 
> vm_area_struct *vma,
>   mmap_write_downgrade(mm);
>   }
>  
> - unmap_region(mm, &mt_detach, vma, prev, next, start, end);
> + /*
> +  * We can free page tables without write-locking mmap_lock because VMAs
> +  * were isolated before we downgraded mmap_lock.
> +  */
> + unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
>   /* Statistics and freeing VMAs */
>   mas_set(&mas_detach, start);
>   remove_mt(mm, &mas_detach);
> @@ -2704,7 +2708,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>  
>   /* Undo any partial mapping done by a device driver. */
>   unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
> -  vma->vm_end);
> +  vma->vm_end, true);
>   }
>   if (file && (vm_flags & VM_SHARED))
>   mapping_unmap_writable(file->f_mapping);
> @@ -3031,7 +3035,7 @@ void exit_mmap(struct mm_struct *mm)
>   tlb_gather_mmu_fullmm(&tlb, mm);
>   /* update_hiwater_rss(mm) here? but nobody should be looking */
>   /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> - unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> + unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
>   mmap_read_unlock(mm);
>  
>   /*
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> Replace indirect modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> vm_flags modification attempts.

Those BUG_ONs scream too much IMHO. KSM is MM-internal code so I
guess we should be willing to trust it.

> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/6] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:47, Suren Baghdasaryan wrote:
> To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
> replace it with VM_LOCKED_MASK bitmask and convert all users.
>
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 4 ++--
>  kernel/fork.c  | 2 +-
>  mm/hugetlb.c   | 4 ++--
>  mm/mlock.c | 6 +++---
>  mm/mmap.c  | 6 +++---
>  mm/mremap.c| 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b71f2809caac..da62bdd627bf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
>  /* This mask defines which mm->def_flags a process can inherit its parent */
>  #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>  
> -/* This mask is used to clear all the VMA flags used by mlock */
> -#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
> +/* This mask represents all the VMA flag bits used by mlock */
> +#define VM_LOCKED_MASK   (VM_LOCKED | VM_LOCKONFAULT)
>  
>  /* Arch-specific flags to clear when updating VM flags on protection change 
> */
>  #ifndef VM_ARCH_CLEAR
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6683c1b0f460..03d472051236 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   tmp->anon_vma = NULL;
>   } else if (anon_vma_fork(tmp, mpnt))
>   goto fail_nomem_anon_vma_fork;
> - tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> + clear_vm_flags(tmp, VM_LOCKED_MASK);
>   file = tmp->vm_file;
>   if (file) {
>   struct address_space *mapping = file->f_mapping;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d20c8b09890e..4ecdbad9a451 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct 
> vm_area_struct *svma,
>   unsigned long s_end = sbase + PUD_SIZE;
>  
>   /* Allow segments to share if only one is marked locked */
> - unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> - unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
> + unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
>  
>   /*
>* match the virtual addresses, permission and the alignment of the
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0336f52e03d7..5c4fff93cd6b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, 
> size_t len,
>   if (vma->vm_start != tmp)
>   return -ENOMEM;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= flags;
>   /* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
>   tmp = vma->vm_end;
> @@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
>   struct vm_area_struct *vma, *prev = NULL;
>   vm_flags_t to_add = 0;
>  
> - current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
> + current->mm->def_flags &= ~VM_LOCKED_MASK;
>   if (flags & MCL_FUTURE) {
>   current->mm->def_flags |= VM_LOCKED;
>  
> @@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
>   for_each_vma(vmi, vma) {
>   vm_flags_t newflags;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= to_add;
>  
>   /* Ignore errors */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d4abc6feced1..323bd253b25a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>   if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>   is_vm_hugetlb_page(vma) ||
>   vma == get_gate_vma(current->mm))
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + clear_vm_flags(vma, VM_LOCKED_MASK);
>   else
>   mm->locked_vm += (len >> PAGE_SHIFT);
>   }
> @@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_special_mapping(

Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:46, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h   | 37 +
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c2f62bdce134..b71f2809caac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)
> +{
> + vma->vm_flags = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination 
> */
> +static inline void reset_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline void set_vm_flags(struct vm_area_struct *vma,
> + unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline void clear_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline void mod_vm_flags(struct vm_area_struct *vma,
> + unsigned long set, unsigned long clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..6c7c70bf50dd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,13 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;
>  
>   /*
>* For areas with an address space and backing store,
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > > 
> > > I object.  We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA.  If Michal is saying that somebody could do:
> > > 
> > >   mmget(mm);
> > >   vma = find_vma(mm);
> > >   lock_vma(vma);
> > >   mmput(mm);
> > >   vma->a = b;
> > >   unlock_vma(mm, vma);
> > > 
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> > 
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
> 
> We shouldn't have two ways of mass VMA freeing.  Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().

This requires special casing remove_vma for those two different paths
(exit_mmap and remove_mt). If you ask me, that sounds like suboptimal
code just to avoid handling a potentially large munmap, which might very
well be a rare thing as you say. But haven't we learned that sooner or
later we will find out there is somebody who cares after all? Anyway,
this is not something I care about all that much. It is just weird to
special case exit_mmap. Up to Suren to decide which way he wants to
go. I just really didn't like the initial implementation of batching
based on a completely arbitrary batch limit and lazy freeing.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
[...]
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
> 
> I object.  We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA.  If Michal is saying that somebody could do:
> 
>   mmget(mm);
>   vma = find_vma(mm);
>   lock_vma(vma);
>   mmput(mm);
>   vma->a = b;
>   unlock_vma(mm, vma);
> 
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.

I am not claiming this is possible now. I do not think we want to have
something like that in the future either but that is really hard to
envision. I am claiming that it is subtle and potentially error prone to
have two different ways of mass vma freeing wrt. locking. Also, don't we
have a very similar situation during last munmaps?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko  wrote:
> >
> > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko  wrote:
> > > >
> > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox  
> > > > > wrote:
> > > > [...]
> > > > > > The page fault handler (or whatever other reader -- ptrace, proc, 
> > > > > > etc)
> > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > trying to free VMAs.  Right?
> > > > >
> > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > this is ok?
> > > >
> > > > Don't we need RCU procetions for the vma life time assurance? Jann has
> > > > already shown how rwsem is not safe wrt to unlock and free without RCU.
> > >
> > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > write lock waiting for the vma real lock to be released by a page
> > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > turn suggests that there are no racing page fault handlers and no new
> > > page fault handlers will appear. Is that a correct assumption? If so,
> > > then races with page fault handlers can't happen while in exit_mmap().
> > > Any other path (other than page fault handlers), accesses vma->lock
> > > under protection of mmap_lock (for read or write, does not matter).
> > > One exception is when we operate on an isolated VMA, then we don't
> > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > protection of mmap_lock in write mode, so races with anything other
> > > than page fault handler should be safe as they are today.
> >
> > I do not see you talking about #PF (RCU + vma read lock protected) with
> > munmap. It is my understanding that the latter will synchronize over per
> > vma lock (along with mmap_lock exclusive locking). But then we are back
> > to the lifetime guarantees, or do I miss anything.
> 
> munmap() or any VMA-freeing operation other than exit_mmap() will free
> using call_rcu(), as implemented today. The suggestion is to free VMAs
> directly, without RCU grace period only when done from exit_mmap().

OK, I have clearly missed that. This makes more sense but it also adds
some more complexity and assumptions - harder to maintain code in the
end. Whoever wants to touch this scheme in the future would have to
re-evaluate all of them. So I would just avoid that special casing if
that is feasible.

The flood of call_rcu during exit_mmap is a trivial thing to deal with
as proposed elsewhere (just batch all of them in a single run). This
will surely add some more code but at least the locking would be
consistent.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko  wrote:
> >
> > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox  
> > > wrote:
> > [...]
> > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > trying to free VMAs.  Right?
> > >
> > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > this is ok?
> >
> > Don't we need RCU procetions for the vma life time assurance? Jann has
> > already shown how rwsem is not safe wrt to unlock and free without RCU.
> 
> Jann's case requires a thread freeing the VMA to be blocked on vma
> write lock waiting for the vma real lock to be released by a page
> fault handler. However exit_mmap() means mm->mm_users==0, which in
> turn suggests that there are no racing page fault handlers and no new
> page fault handlers will appear. Is that a correct assumption? If so,
> then races with page fault handlers can't happen while in exit_mmap().
> Any other path (other than page fault handlers), accesses vma->lock
> under protection of mmap_lock (for read or write, does not matter).
> One exception is when we operate on an isolated VMA, then we don't
> need mmap_lock protection, but exit_mmap() does not deal with isolated
> VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> protection of mmap_lock in write mode, so races with anything other
> than page fault handler should be safe as they are today.

I do not see you talking about #PF (RCU + vma read lock protected) with
munmap. It is my understanding that the latter will synchronize over per
vma lock (along with mmap_lock exclusive locking). But then we are back
to the lifetime guarantees, or do I miss anything.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox  wrote:
[...]
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?

Don't we need RCU protection for the vma lifetime assurance? Jann has
already shown how rwsem is not safe wrt unlock and free without RCU.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-23 Thread Michal Hocko
On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko  wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path 
> > > > > when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs 
> > > > > into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > in housands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
> 
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.

I am still not sure we are on the same page here. No, vmas shouldn't lie
around until the process exits. I am really suggesting queuing only for
the remove_vma paths. You can have a different rcu callback than the one
used for the trivial single vma removal paths.
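
Roughly, that could look like the following sketch. The vm_free_list
member, the vm_rcu head and all the helper names are made up here just
to illustrate the idea, and __vm_area_free stands for the final freeing
of a single vma:

static void vma_free_batch_cb(struct rcu_head *rcu)
{
        struct vm_area_struct *first, *vma, *next;

        first = container_of(rcu, struct vm_area_struct, vm_rcu);
        list_for_each_entry_safe(vma, next, &first->vm_free_list, vm_free_list)
                __vm_area_free(vma);
        __vm_area_free(first);
}

/* called once at the very end of remove_mt() resp. exit_mmap() */
static void vma_free_batch(struct list_head *batch)
{
        struct vm_area_struct *first;

        if (list_empty(batch))
                return;

        /* reuse the rcu head of the first vma to carry the whole batch */
        first = list_first_entry(batch, struct vm_area_struct, vm_free_list);
        list_del_init(&first->vm_free_list);
        if (!list_empty(batch))
                list_replace_init(batch, &first->vm_free_list);
        call_rcu(&first->vm_rcu, vma_free_batch_cb);
}

remove_vma would then merely do list_add_tail(&vma->vm_free_list, batch)
while the trivial single vma removal paths keep their current callback.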

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Michal Hocko
On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  
> > > wrote:
> > [...]
> > > > There are a couple of possibilities here.
> > > >
> > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > and invocation of the corresponding callback was taking multiple 
> > > > seconds,
> > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > order to save power by batching RCU work over multiple call_rcu()
> > > > invocations.  If this is causing a problem for a given call site, the
> > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > to the old-school non-laziness, but can of course consume more power.
> > > 
> > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > at the time I was profiling this issue.
> > > Laxy RCU would be a great option to replace this patch but
> > > unfortunately it's not the default behavior, so I would still have to
> > > implement this batching in case lazy RCU is not enabled.
> > > 
> > > >
> > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > and the invocation of the corresponding callback in kernels built with
> > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > the added delay.  The reason for this delay is the use of a separate
> > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > kernels built without either CONFIG_NO_HZ_FULL=y or 
> > > > CONFIG_RCU_NOCB_CPU=y
> > > > (including most datacenter kernels) because in that case the callbacks
> > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > that there is no need for locks.
> > > 
> > > I believe this is the reason in my profiled case.
> > > 
> > > >
> > > > Third, if you are instead seeing multiple milliseconds of CPU consumed 
> > > > by
> > > > call_rcu() in the common case (for example, without the aid of 
> > > > interrupts,
> > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > 
> > > I don't think I've seen such a case.
> > > Thanks for clarifications, Paul!
> > 
> > Thanks for the explanation Paul. I have to say this has caught me as a
> > surprise. There are just not enough details about the benchmark to
> > understand what is going on but I find it rather surprising that
> > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > which is the callback. My naive understanding has been that call_rcu is
> > really fast way to defer the execution to the RCU safe context to do the
> > final cleanup.
> 
> If I am following along correctly (ha!), then your "induce a higher
> overhead" should be something like "induce a higher to-kfree() latency".

Yes, this is expected.

> Of course, there already is a higher latency-to-kfree via call_rcu()
> than via a direct call to kfree(), and callback-offload CPUs that are
> being flooded with callbacks raise that latency a jiffy or so more in
> order to avoid lock contention.
> 
> If this becomes a problem, the callback-offloading code can be a bit
> smarter about avoiding lock contention, but need to see a real problem
> before I make that change.  But if there is a real problem I will of
> course fix it.

I believe Suren claims that the call_rcu itself is really visible in the
exit_mmap case. The time to free the actual vmas shouldn't really be
material for that path. If that happens much later there could be some
side effects from increased memory consumption but that should be
marginal. How fast exit_mmap really is should only depend on the direct
calls from that path.

But I guess we need some specific numbers from Suren to be sure what is
going on here.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-20 Thread Michal Hocko
On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in housands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> 
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either.

I suspect you have missed my idea. I do not really want to keep the list
around or any shrinker. It is dead simple. Collect all vmas in
remove_vma and then call_rcu the whole list at once after the whole list
(be it from exit_mmap or remove_mt). See?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Michal Hocko
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

After some more clarification I can understand how call_rcu might not be
super happy about thousands of callbacks to be invoked and I do agree
that this is not really optimal.

On the other hand I do not like this solution much either.
VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
much with processes with a huge number of vmas either. There would still
be thousands of callbacks scheduled without a good reason.

Instead, are there any other cases than remove_vma that need this
batching? We could easily just link all the vmas into a linked list and
use a single call_rcu instead, no? This would both simplify the
implementation and remove the scaling issue, and we do not have to
argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-19 Thread Michal Hocko
On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  wrote:
[...]
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Laxy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> >
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.
> Thanks for clarifications, Paul!

Thanks for the explanation, Paul. I have to say this has caught me by
surprise. There are just not enough details about the benchmark to
understand what is going on, but I find it rather surprising that
call_rcu can induce a higher overhead than the actual kmem_cache_free
which is the callback. My naive understanding has been that call_rcu is
a really fast way to defer the execution to an RCU safe context to do
the final cleanup.
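
For context, the pattern being discussed boils down to something like
the following sketch (the rcu head and the callback name are purely
illustrative here):

static void vm_area_free_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma;

        vma = container_of(head, struct vm_area_struct, vm_rcu);
        /* the actual cleanup is just the slab free */
        kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
        /* defer the free past a grace period instead of freeing directly */
        call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
}

so the surprising part is that queueing the callback can cost more than
the kmem_cache_free it defers.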
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-19 Thread Michal Hocko
On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko  wrote:
[...]
> > So it will become:
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
> >
> > which is effectivelly saying
> > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > be done under VMA write-lock protection so move VMA flag modifications
> > before so all of them are covered by the same write protection.
> >
> > right?
> 
> Yes, and the wording in the latter version is simpler to understand
> IMO, so I would like to adopt it. Do you agree?

of course.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko  wrote:
> >
> > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > >
> > > > Does this changelog assumes per vma locking in the #PF?
> > >
> > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > but the changelog already talks about that. Maybe I should change it
> > > to simply:
> > > ```
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked.
> >
> > Because 
> 
> because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.

So it will become:
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.

which is effectively saying
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done under VMA write-lock protection so move VMA flag modifications
before so all of them are covered by the same write protection.

right?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > +locking maintainers
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > > > wrote:
> > > > > > Introduce a per-VMA rw_semaphore to be used during page fault 
> > > > > > handling
> > > > > > instead of mmap_lock. Because there are cases when multiple VMAs 
> > > > > > need
> > > > > > to be exclusively locked during VMA tree modifications, instead of 
> > > > > > the
> > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA 
> > > > > > lock
> > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > > When
> > > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > > mmap_lock,
> > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs 
> > > > > > marked as
> > > > > > locked.
> > > > > [...]
> > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +   up_read(&vma->lock);
> > > > > > +}
> > > > >
> > > > > One thing that might be gnarly here is that I think you might not be
> > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > from what I remember, I think that up_read() (unlike something like
> > > > > spin_unlock()) can access the lock object after it's already been
> > > > > acquired by someone else.
> > > >
> > > > Yes, I think you are right. From a look into the code it seems that
> > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > That doesn't make it less of a problem though.
> > > >
> > > > > So if you want to protect against concurrent
> > > > > deletion, this might have to be something like:
> > > > >
> > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > up_read(&vma->lock);
> > > > > rcu_read_unlock();
> > > > >
> > > > > But I'm not entirely sure about that, the locking folks might know 
> > > > > better.
> > > >
> > > > I am not a locking expert but to me it looks like this should work
> > > > because the final cleanup would have to happen rcu_read_unlock.
> > > >
> > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > into the code.
> > > >
> > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > make more sense to have a rcu style synchronization point in
> > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > uncontended down_write of course.
> > >
> > > Something along those lines might be a good idea, but I think that
> > > rather than synchronizing the removal, it should maybe be something
> > > that splats (and bails out?) if it detects pending readers. If we get
> > > to vm_area_free() on a VMA that has pending readers, we might already
> > > be in a lot of trouble because the concurrent readers might have been
> > > traversing page tables while we were tearing them down or fun stuff
> > > like that.
> > >
> > > I think maybe Suren was already talking about something like that in
> > > another part of this patch series but I don't remember...
> >
> > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?
> 
> Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> VMAs and that VMAs are freed or isolated under VMA write lock
> protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> Michal mentioned gets hit then it's a bug in my design and I'll have
> to fix it. But please, let's not add synchronize_rcu() in the
> vm_area_free().

Just to clarify. I didn't suggest adding synchronize_rcu into
vm_area_free. What I really meant was a synchronize_rcu-like primitive to
effectively synchronize with any potential pending read locker (so
something like vma_write_lock, or whatever it ends up being called). The
point is that vma freeing is an event all readers should be notified
about. This can be done explicitly for each and every vma before
vm_area_free is called, but this is just hard to review and easy to
break over time. See my point?
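
In code terms the explicit variant means that every freeing site has to
carry something like the following (just a sketch, using the naming from
this series):

        /* make the removal visible to any concurrent per-VMA readers */
        vma_write_lock(vma);
        vm_area_free(vma);

as opposed to a single synchronization point hidden inside vm_area_free
itself.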

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 14:23:32, Jann Horn wrote:
> On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > +locking maintainers
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +   up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else.
> >
> > Yes, I think you are right. From a look into the code it seems that
> > the UAF is quite unlikely as there is a ton of work to be done between
> > vma_write_lock used to prepare vma for removal and actual removal.
> > That doesn't make it less of a problem though.
> >
> > > So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> > >
> > > But I'm not entirely sure about that, the locking folks might know better.
> >
> > I am not a locking expert but to me it looks like this should work
> > because the final cleanup would have to happen rcu_read_unlock.
> >
> > Thanks, I have completely missed this aspect of the locking when looking
> > into the code.
> >
> > Btw. looking at this again I have fully realized how hard it is actually
> > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > vma manipulation functions like __adjust_vma make my head spin. Would it
> > make more sense to have a rcu style synchronization point in
> > vm_area_free directly before call_rcu? This would add an overhead of
> > uncontended down_write of course.
> 
> Something along those lines might be a good idea, but I think that
> rather than synchronizing the removal, it should maybe be something
> that splats (and bails out?) if it detects pending readers. If we get
> to vm_area_free() on a VMA that has pending readers, we might already
> be in a lot of trouble because the concurrent readers might have been
> traversing page tables while we were tearing them down or fun stuff
> like that.
> 
> I think maybe Suren was already talking about something like that in
> another part of this patch series but I don't remember...

This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 19:02:55, Jann Horn wrote:
> +locking maintainers
> 
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +   up_read(&vma->lock);
> > +}
> 
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else.

Yes, I think you are right. From a look into the code it seems that
the UAF is quite unlikely as there is a ton of work to be done between
vma_write_lock used to prepare vma for removal and actual removal.
That doesn't make it less of a problem though.

> So if you want to protect against concurrent
> deletion, this might have to be something like:
> 
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();
> 
> But I'm not entirely sure about that, the locking folks might know better.

I am not a locking expert but to me it looks like this should work
because the final cleanup would have to happen after rcu_read_unlock.

Thanks, I have completely missed this aspect of the locking when looking
into the code.

Btw. looking at this again I have fully realized how hard it is actually
to see that vm_area_free is guaranteed to sync up with ongoing readers.
vma manipulation functions like __adjust_vma make my head spin. Would it
make more sense to have an rcu-style synchronization point in
vm_area_free directly before call_rcu? This would add the overhead of an
uncontended down_write of course.
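
Something along these lines (a sketch only, the callback name is made
up):

void vm_area_free(struct vm_area_struct *vma)
{
        /*
         * Uncontended in the common case; if a reader still holds
         * vma->lock this waits for it before the free is queued.
         */
        down_write(&vma->lock);
        up_write(&vma->lock);
        call_rcu(&vma->vm_rcu, __vm_area_free_cb);
}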
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed.
> >
> > What kind of regressions.
> >
> > > To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > Please add some data to justify this additional complexity.
> 
> Sorry, should have done that in the first place. A 4.3% regression was
> noticed when running execl test from unixbench suite. spawn test also
> showed 1.6% regression. Profiling revealed that vma freeing was taking
> longer due to call_rcu() which is slow when RCU callback offloading is
> enabled.

Could you be more specific? vma freeing is async with RCU, so how
come this has resulted in a regression? Is there any heavy
synchronize_rcu in the exec path? That would be interesting
information.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > Assert there are no holders of VMA lock for reading when it is about to be
> > > destroyed.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  include/linux/mm.h | 8 
> > >  kernel/fork.c  | 2 ++
> > >  2 files changed, 10 insertions(+)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 594e835bad9c..c464fc8a514c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> > > vm_area_struct *vma)
> > >   VM_BUG_ON_VMA(vma->vm_lock_seq != 
> > > READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > >  }
> > >
> > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > +{
> > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > +   vma->vm_lock_seq != 
> > > READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > +   vma);
> >
> > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > us something is wrong on its own, no? This could be somebody racing with
> > the vma destruction and using the write lock. Unlikely but I do not see
> > why to narrow debugging scope.
> 
> I wanted to ensure there are no page fault handlers (read-lockers)
> when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> could trigger if someone is concurrently calling vma_write_lock(). But
> I don't think we expect someone to be write-locking the VMA while we

That would be UAF, no?

> are destroying it, so you are right, I'm overcomplicating things here.
> I think I can get rid of vma_assert_no_reader() and add
> VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> __vm_area_free(). WDYT?

Yes, that adds some debugging. Not sure it is really necessary but it
is a VM_BUG_ON so why not.
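
That would be something as simple as the following sketch, assuming
__vm_area_free is the final freeing helper mentioned above:

static void __vm_area_free(struct vm_area_struct *vma)
{
        /* neither a reader nor a racing writer may still hold the lock */
        VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
        free_anon_vma_name(vma);
        kmem_cache_free(vm_area_cachep, vma);
}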
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 21:28:06, Jann Horn wrote:
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
> 
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
> 
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
> 
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().

OK, I see. This would be a great addendum to the changelog.
 
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.

This, right?
#PF                                     retract_page_tables
vma_read_lock
                                        i_mmap_lock_write
i_mmap_lock_read
                                        vma_write_lock


I might be missing something but I have only found huge_pmd_share to be
called from the #PF path. That one should be safe as it cannot be a
target for THP. Not that it would matter much because such a dependency
chain would be really subtle.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > anon_vma_lock_write to match the locking order of page fault handler.
> >
> > Does this changelog assumes per vma locking in the #PF?
> 
> Hmm, you are right. Page fault handlers do not use per-vma locks yet
> but the changelog already talks about that. Maybe I should change it
> to simply:
> ```
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked.

Because 

Without that additional reasoning it is not really clear why that is
needed and it seems arbitrary.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 21:54:58, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA 
> > > > > lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > When
> > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked 
> > > > > as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the the
> > > >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >   differ. If they match then there must be a vma exclusive lock
> > > >   held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >   vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >   held exclusively and the current mm counter is noted to the 
> > > > vma
> > > >   side. This will allow multiple vmas to be locked under a 
> > > > single
> > > >   mmap_lock write lock (e.g. during vma merging). The vma 
> > > > counter
> > > >   is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> > 
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
> > If other names are preferable I'm open to changing them.
> 
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
> 
>   vma_start_read()
>   vma_end_read()
>   vma_start_write()
>   vma_end_write()
>   vma_downgrade_write()
> 
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?

Agreed!

> Would that reduce people's confusion?

Yes I believe that naming it less like a locking primitive will clarify
it. vma_{start,end}_[try]read is better indeed. I am wondering about the
write side of things because that is where things get confusing. There
is no explicit write lock nor unlock. vma_start_write sounds better than
the vma_write_lock but it still lacks that pairing with vma_end_write
which is never the right thing to call. Wouldn't vma_mark_modified and
vma_publish_changes describe the scheme better?

The downgrade case is probably the least interesting one because that is
just a one-off thing that can be completely hidden from any code besides
mmap_write_downgrade, so I wouldn't be too concerned about that one.

But as you say, no need to bikeshed this too much. Great naming is hard
and if the scheme is documented properly we can live with a suboptimal
naming as well.
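
With that naming the scheme described earlier in the thread would read
roughly like the following sketch (based on the description, not on the
actual patch):

static inline bool vma_start_read(struct vm_area_struct *vma)
{
        /* matching sequence counters mean the vma is write locked */
        if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
                return false;

        if (!down_read_trylock(&vma->lock))
                return false;

        /* recheck: a writer might have marked the vma in the meantime */
        if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
                up_read(&vma->lock);
                return false;
        }
        return true;
}

static inline void vma_end_read(struct vm_area_struct *vma)
{
        up_read(&vma->lock);
}

/* can be called repeatedly under a single mmap_write_lock */
static inline void vma_start_write(struct vm_area_struct *vma)
{
        mmap_assert_write_locked(vma->vm_mm);
        if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
                return;
        down_write(&vma->lock);
        vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
        up_write(&vma->lock);
}

The write side "unlock" is then the mmap_write_unlock incrementing
mm_lock_seq, which is exactly why an explicit vma_end_write never has a
sensible caller.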
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
[...]
> > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > sem was the original motivation for making rwsems non-starvable, so
> > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > suggest that a waiting writer set the top bit of the counter.  That way,
> > all new readers will back off without needing to check a second variable
> > and old readers will know that they *may* need to do the wakeup when
> > atomic_sub_return_release() is negative.
> >
> > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > that far; the important point is that the waiting writer indicates its
> > presence in the count field so that readers can modify their behaviour)
> 
> Got it. Ok, I think we can figure something out to check if there are
> waiting write-lockers and prevent new readers from taking the lock.

Reinventing locking primitives is a ticket to weird bugs. I would stick
with the rwsem and deal with any performance fallout after it is clear
that the core idea is generally acceptable and based on actual real life
numbers. This whole thing is quite big enough that we do not have to go
through the "is this new synchronization primitive correct and behaving
reasonably" exercise.
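
Just to illustrate what the quoted scheme would amount to (not a
proposal; the atomic counter and the waitqueue names are invented here):

static bool vma_read_trylock(struct vm_area_struct *vma)
{
        int cnt = atomic_read(&vma->vm_lock_count);

        /* a waiting writer sets the sign bit, new readers back off */
        if (cnt < 0)
                return false;
        return atomic_try_cmpxchg_acquire(&vma->vm_lock_count, &cnt, cnt + 1);
}

static void vma_read_unlock(struct vm_area_struct *vma)
{
        /* a negative result means a writer is waiting and may need a wakeup */
        if (atomic_sub_return_release(1, &vma->vm_lock_count) < 0)
                wake_up(&vma->vm_mm->vma_writer_wq);
}

static void vma_write_lock(struct vm_area_struct *vma)
{
        /* announce ourselves, then wait for the reader count to drain */
        atomic_or(INT_MIN, &vma->vm_lock_count);
        wait_event(vma->vm_mm->vma_writer_wq,
                   atomic_read(&vma->vm_lock_count) == INT_MIN);
}

i.e. exactly the kind of hand-rolled primitive that is easy to get
subtly wrong.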

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.

What kind of regressions.

> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

Please add some data to justify this additional complexity.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.

Could you describe why only anonymous vmas are handled at this stage and
what (roughly) has to be done to support other vmas? lock_vma_under_rcu
doesn't seem to have any anonymous vma specific requirements AFAICS.

Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
the naming is really the most important part but the rcu locking is
internal to the function so why should we spread this implementation
detail to the world...

> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c| 51 ++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c464fc8a514c..d0fddf6a1de9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
> vm_area_struct *vma)
> vma);
>  }
>  
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +   unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ece18548db1..a658e26d965d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +/*
> + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to 
> be
> + * stable and not isolated. If the VMA is not found or is being modified the
> + * function returns NULL.
> + */
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +   unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma, *validate;
> +
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> +retry:
> + if (!vma)
> + goto inval;
> +
> + /* Only anonymous vmas are supported for now */
> + if (!vma_is_anonymous(vma))
> + goto inval;
> +
> + if (!vma_read_trylock(vma))
> + goto inval;
> +
> + /* Check since vm_start/vm_end might change before we lock the VMA */
> + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> + vma_read_unlock(vma);
> + goto inval;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + mas.index = address;
> + validate = mas_walk(&mas);
> + if (validate != vma) {
> + vma_read_unlock(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one. */
> + vma = validate;
> + goto retry;
> + }
> +
> + rcu_read_unlock();
> + return vma;
> +inval:
> + rcu_read_unlock();
> + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> + return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  #ifndef __PAGETABLE_P4D_FOLDED
>  /*
>   * Allocate p4d page table.
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> Assert there are no holders of VMA lock for reading when it is about to be
> destroyed.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h | 8 
>  kernel/fork.c  | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 594e835bad9c..c464fc8a514c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> vm_area_struct *vma)
>   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
> vma);
>  }
>  
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> +   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> +   vma);

Do we really need to check vm_lock_seq? rwsem_is_locked should tell
us something is wrong on its own, no? This could be somebody racing with
the vma destruction and using the write lock. Unlikely, but I do not see
why we should narrow the debugging scope.
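
IOW something as simple as this should be enough for debugging purposes
(untested sketch):

static inline void vma_assert_no_reader(struct vm_area_struct *vma)
{
	/* any holder of vma->lock at this point is unexpected */
	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
}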
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> Protect VMA from concurrent page fault handler while collapsing a huge
> page. Page fault handler needs a stable PMD to use PTL and relies on
> per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> not be detected by a page fault handler without proper locking.

I am struggling with this changelog, maybe because my recollection of
the THP collapsing subtleties is weak. But aren't you just trying to say
that the current #PF handling and THP collapsing need to be mutually
exclusive, so in order to keep that assumption you have to mark the vma
write locked?

Also it is not really clear to me how that handles other vmas which can
share the same thp?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> Move VMA flag modification (which now implies VMA locking) before
> anon_vma_lock_write to match the locking order of page fault handler.

Does this changelog assume per-vma locking in the #PF?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > To keep vma locking correctness when vm_flags are modified, add modifier
> > functions to be used whenever flags are updated.
> > 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h   | 38 ++
> >  include/linux/mm_types.h |  8 +++-
> >  2 files changed, 45 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ec2c4c227d51..35cf0a6cbcc2 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct 
> > *vma, struct mm_struct *mm)
> > vma_init_lock(vma);
> >  }
> >  
> > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > +static inline
> > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   WRITE_ONCE(vma->vm_flags, flags);
> > +}
> 
> Why do we need WRITE_ONCE here? Isn't vma invisible during its
> initialization?
> 
> > +
> > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > +static inline
> > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   init_vm_flags(vma, flags);
> > +}
> > +
> > +static inline
> > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags |= flags;
> > +}
> > +
> > +static inline
> > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags &= ~flags;
> > +}
> > +
> > +static inline
> > +void mod_vm_flags(struct vm_area_struct *vma,
> > + unsigned long set, unsigned long clear)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags |= set;
> > +   vma->vm_flags &= ~clear;
> > +}
> > +
> 
> This is a rather unusual pattern. There is no note about the locking
> involved in the naming, and also why is the locking part of this
> interface in the first place? I can see a reason for access functions to
> actually check lock asserts.

OK, it took me a while but it is clear to me now. The confusion comes
from the naming: vma_write_lock is not a lock in the usual sense. It is
more of a vma_mark_modified with side effects on read locking, which is
a real lock. With that it makes more sense to have this done in these
helpers rather than requiring all users to keep this subtlety in mind.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> 
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use it if you consider this to be an improvement).
> 
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the
>   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>   differ. If they match then there must be a vma exclusive lock
>   held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
>   vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
>   held exclusively and the current mm counter is noted to the vma
>   side. This will allow multiple vmas to be locked under a single
>   mmap_lock write lock (e.g. during vma merging). The vma counter
>   is modified under exclusive vma lock.

Didn't realize one more thing.
Unlike a standard write lock, this implementation allows vma_write_lock
to be called multiple times under a single mmap_lock. In a sense it is
more of a mark_vma_potentially_modified than a lock.

> - write unlock - (vma_write_unlock_mm) is a batch release of all
>   vma locks held. It doesn't pair with a specific
>   vma_write_lock! It is done before exclusive mmap_lock is
>   released by incrementing mm sequence counter (mm_lock_seq).
>   - write downgrade - if the mmap_lock is downgraded to the read
> lock all vma write locks are released as well (effectively
> same as write unlock).
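
To make the sequence counter dance more concrete, this is roughly how I
read the patch (simplified sketch, not the actual code):

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	if (!down_read_trylock(&vma->lock))
		return false;
	/* a matching sequence counter means somebody write locked the vma */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		up_read(&vma->lock);
		return false;
	}
	return true;
}

static inline void vma_write_lock(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	mmap_assert_write_locked(vma->vm_mm);
	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
	/* already marked during this mmap_write_lock cycle */
	if (vma->vm_lock_seq == mm_lock_seq)
		return;
	down_write(&vma->lock);
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->lock);
}

and the batched unlock is then just mmap_write_unlock resp.
mmap_write_downgrade bumping mm->mm_lock_seq so that all previously
marked vmas stop matching at once.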
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> To keep vma locking correctness when vm_flags are modified, add modifier
> functions to be used whenever flags are updated.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h   | 38 ++
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ec2c4c227d51..35cf0a6cbcc2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   vma_init_lock(vma);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline
> +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + WRITE_ONCE(vma->vm_flags, flags);
> +}

Why do we need WRITE_ONCE here? Isn't vma invisible during its
initialization?

> +
> +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> +static inline
> +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline
> +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline
> +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline
> +void mod_vm_flags(struct vm_area_struct *vma,
> +   unsigned long set, unsigned long clear)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +

This is a rather unusual pattern. There is no note about the locking
involved in the naming, and also why is the locking part of this
interface in the first place? I can see a reason for access functions to
actually check lock asserts.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
> *orig)
>*/
>   *new = data_race(*orig);
>   INIT_LIST_HEAD(&new->anon_vma_chain);
> + vma_init_lock(new);
>   dup_anon_vma_name(orig, new);
>   }
>   return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   seqcount_init(&mm->write_protect_seq);
>   mmap_init_lock(mm);
>   INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> + WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif

The mm shouldn't be visible so why WRITE_ONCE?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
idates all existing locks.
> +  */
> + if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> + up_read(&vma->lock);
> + return false;
> + }
> + return true;
> +}
> +
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> + up_read(&vma->lock);
> +}
> +
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + /*
> +  * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +  * mm->mm_lock_seq can't be concurrently modified.
> +  */
> + VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
> vma);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> + { return false; }
> +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>  {
>   static const struct vm_operations_struct dummy_vm_ops = {};
> @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   vma->vm_mm = mm;
>   vma->vm_ops = &dummy_vm_ops;
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
> + vma_init_lock(vma);
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d5cdec1314fe..5f7c5ca89931 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -555,6 +555,11 @@ struct vm_area_struct {
>   pgprot_t vm_page_prot;
>   unsigned long vm_flags; /* Flags, see mm.h. */
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> + int vm_lock_seq;
> + struct rw_semaphore lock;
> +#endif
> +
>   /*
>* For areas with an address space and backing store,
>* linkage into the address_space->i_mmap interval tree.
> @@ -680,6 +685,9 @@ struct mm_struct {
> * init_mm.mmlist, and are protected
> * by mmlist_lock
> */
> +#ifdef CONFIG_PER_VMA_LOCK
> + int mm_lock_seq;
> +#endif
>  
>  
>   unsigned long hiwater_rss; /* High-watermark of RSS usage */
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index e49ba91bb1f0..40facd4c398b 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct 
> mm_struct *mm)
>   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>  }
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> +{
> + mmap_assert_write_locked(mm);
> + /* No races during update due to exclusive mmap_lock being held */
> + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +}
> +#else
> +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> +#endif
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>   init_rwsem(&mm->mmap_lock);
> @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct 
> *mm)
>  static inline void mmap_write_unlock(struct mm_struct *mm)
>  {
>   __mmap_lock_trace_released(mm, true);
> + vma_write_unlock_mm(mm);
>   up_write(&mm->mmap_lock);
>  }
>  
>  static inline void mmap_write_downgrade(struct mm_struct *mm)
>  {
>   __mmap_lock_trace_acquire_returned(mm, false, true);
> + vma_write_unlock_mm(mm);
>   downgrade_write(&mm->mmap_lock);
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
> *orig)
>*/
>   *new = data_race(*orig);
>   INIT_LIST_HEAD(&new->anon_vma_chain);
> + vma_init_lock(new);
>   dup_anon_vma_name(orig, new);
>   }
>   return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   seqcount_init(&mm->write_protect_seq);
>   mmap_init_lock(mm);
>   INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> + WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif
>   mm_pgtables_bytes_init(mm);
>   mm->map_count = 0;
>   mm->locked_vm = 0;
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index c9327abb771c..33269314e060 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
>   .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>   .arg_lock   =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
>   .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> +#ifdef CONFIG_PER_VMA_LOCK
> + .mm_lock_seq= 0,
> +#endif
>   .user_ns= &init_user_ns,
>   .cpu_bitmap = CPU_BITS_NONE,
>  #ifdef CONFIG_IOMMU_SVA
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 09/41] mm: rcu safe VMA freeing

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
[...]
>  void vm_area_free(struct vm_area_struct *vma)
>  {
>   free_anon_vma_name(vma);
> +#ifdef CONFIG_PER_VMA_LOCK
> + call_rcu(&vma->vm_rcu, __vm_area_free);
> +#else
>   kmem_cache_free(vm_area_cachep, vma);
> +#endif

Is it safe to have a vma with an already freed vma_name? I suspect this
is safe because of mmap_lock, but is there any reason to split the
freeing process and leave this potential UAF lurking?
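
IOW why not do the freeing in the rcu callback itself? Something like
this (just a sketch, assuming __vm_area_free is the callback used above):

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* keep the anon vma name alive until the vma is really freed */
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}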

>  }
>  
>  static void account_kernel_stack(struct task_struct *tsk, int account)
> -- 
> 2.39.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Michal Hocko
On Wed 11-01-23 09:49:08, Suren Baghdasaryan wrote:
> On Wed, Jan 11, 2023 at 9:37 AM Michal Hocko  wrote:
> >
> > On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> > > On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Anyhow. Sounds like the overhead of the current design is small enough
> > > > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > > > support?
> > > >
> > > > Yes. Further optimizations can be done on top. Let's not over optimize
> > > > at this stage.
> > >
> > > Sure, I won't optimize any further.
> > > Just to expand on your question. Original design would be problematic
> > > for embedded systems like Android. It notoriously has a high number of
> > > VMAs due to anonymous VMAs being named, which prevents them from
> > > merging.
> >
> > What is the usual number of VMAs in that environment?
> 
> I've seen some games which had over 4000 VMAs but that's on the upper
> side. In my calculations I used 4 VMAs as a ballpark number and
> rough calculations before size optimization would increase memory
> consumption by ~2M (depending on the lock placement in vm_area_struct
> it would vary a bit). In Android, the performance team flags any
> change that exceeds 500KB, so it would raise questions.

Thanks, that is useful information! This is just slightly off-topic but
I am wondering how much memory those vma names consume. Are there that
many unique names, or do they just happen to alternate so that
neighboring ones tend to be different?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Michal Hocko
On Wed 11-01-23 09:04:41, Suren Baghdasaryan wrote:
> On Wed, Jan 11, 2023 at 8:44 AM Michal Hocko  wrote:
> >
> > On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
> > [...]
> > > Anyhow. Sounds like the overhead of the current design is small enough
> > > to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> > > support?
> >
> > Yes. Further optimizations can be done on top. Let's not over optimize
> > at this stage.
> 
> Sure, I won't optimize any further.
> Just to expand on your question. Original design would be problematic
> for embedded systems like Android. It notoriously has a high number of
> VMAs due to anonymous VMAs being named, which prevents them from
> merging.

What is the usual number of VMAs in that environment?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Michal Hocko
On Wed 11-01-23 08:28:49, Suren Baghdasaryan wrote:
[...]
> Anyhow. Sounds like the overhead of the current design is small enough
> to remove CONFIG_PER_VMA_LOCK and let it depend only on architecture
> support?

Yes. Further optimizations can be done on top. Let's not over optimize
at this stage.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread Michal Hocko
On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso  wrote:
> >
> > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> >
> > >This configuration variable will be used to build the support for VMA
> > >locking during page fault handling.
> > >
> > >This is enabled by default on supported architectures with SMP and MMU
> > >set.
> > >
> > >The architecture support is needed since the page fault handler is called
> > >from the architecture's page faulting code which needs modifications to
> > >handle faults under VMA lock.
> >
> > I don't think that per-vma locking should be something that is 
> > user-configurable.
> > It should just be depdendant on the arch. So maybe just remove 
> > CONFIG_PER_VMA_LOCK?
> 
> Thanks for the suggestion! I would be happy to make that change if
> there are no objections. I think the only pushback might have been the
> vma size increase but with the latest optimization in the last patch
> maybe that's less of an issue?

Has vma size ever been a real problem? Sure there might be a lot of
those, but your patch increases it by an rwsem (without the last patch),
which is something like 40B on top of a 136B vma, so we are talking
about 176B in total, which even with wild mapcount limits shouldn't
really be prohibitive. With the default map count limit we are talking
about a 2M increase at most (per address space).

Or are you aware of any specific usecases where vma size is a real
problem?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: remove zap_page_range and create zap_vma_pages

2023-01-03 Thread Michal Hocko
On Tue 03-01-23 16:27:32, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma.  In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas.  When
> crossing a vma boundary, a new mmu_notifier_range_init/end call pair
> with the new vma should be made.
> 
> Instead of fixing zap_page_range, do the following:
> - Create a new routine zap_vma_pages() that will remove all pages within
>   the passed vma.  Most users of zap_page_range pass the entire vma and
>   can use this new routine.
> - For callers of zap_page_range not passing the entire vma, instead call
>   zap_page_range_single().
> - Remove zap_page_range.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.krav...@oracle.com/
> Suggested-by: Peter Xu 
> Signed-off-by: Mike Kravetz 

This looks even better than the previous version.
Acked-by: Michal Hocko 

minor nit

[...]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ad608ef2a243..ffa36cfe5884 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2713,7 +2713,7 @@ void folio_account_cleaned(struct folio *folio, struct 
> bdi_writeback *wb)
>   *
>   * The caller must hold lock_page_memcg().  Most callers have the folio
>   * locked.  A few have the folio blocked from truncation through other
> - * means (eg zap_page_range() has it mapped and is holding the page table
> + * means (eg zap_vma_pages() has it mapped and is holding the page table
>   * lock).  This can also be called from mark_buffer_dirty(), which I
>   * cannot prove is always protected against truncate.

strictly speaking this should be unmap_page_range
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

2022-12-19 Thread Michal Hocko
On Fri 16-12-22 11:20:12, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma.  In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.

Could you spend a sentence or two explaining what is wrong here?

> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range.  zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details.  The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.

I would stick with zap_page_range_single rather than adding a new
wrapper, but nothing really critical.
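
For reference, IIUC the wrapper boils down to something like this
(untested sketch):

static inline void zap_vma_page_range(struct vm_area_struct *vma,
		unsigned long address, unsigned long size)
{
	/* whole range, no zap_details - as described in the changelog */
	zap_page_range_single(vma, address, size, NULL);
}

so open coding zap_page_range_single at the call sites wouldn't be much
more typing.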

> Also, change madvise_dontneed_single_vma to use this new routine.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.krav...@oracle.com/
> Suggested-by: Peter Xu 
> Signed-off-by: Mike Kravetz 

Other than that LGTM
Acked-by: Michal Hocko 

Thanks!

> ---
>  arch/arm64/kernel/vdso.c|  4 ++--
>  arch/powerpc/kernel/vdso.c  |  2 +-
>  arch/powerpc/platforms/book3s/vas-api.c |  2 +-
>  arch/powerpc/platforms/pseries/vas.c|  2 +-
>  arch/riscv/kernel/vdso.c|  4 ++--
>  arch/s390/kernel/vdso.c |  2 +-
>  arch/s390/mm/gmap.c |  2 +-
>  arch/x86/entry/vdso/vma.c   |  2 +-
>  drivers/android/binder_alloc.c  |  2 +-
>  include/linux/mm.h  |  7 --
>  mm/madvise.c|  4 ++--
>  mm/memory.c | 30 -
>  mm/page-writeback.c |  2 +-
>  net/ipv4/tcp.c  |  6 ++---
>  14 files changed, 22 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index e59a32aa0c49..a7b10e182f78 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -141,10 +141,10 @@ int vdso_join_timens(struct task_struct *task, struct 
> time_namespace *ns)
>   unsigned long size = vma->vm_end - vma->vm_start;
>  
>   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
>  #ifdef CONFIG_COMPAT_VDSO
>   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
>  #endif
>   }
>  
> diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
> index 507f8228f983..479d70fe8c55 100644
> --- a/arch/powerpc/kernel/vdso.c
> +++ b/arch/powerpc/kernel/vdso.c
> @@ -123,7 +123,7 @@ int vdso_join_timens(struct task_struct *task, struct 
> time_namespace *ns)
>   unsigned long size = vma->vm_end - vma->vm_start;
>  
>   if (vma_is_special_mapping(vma, &vvar_spec))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
>   }
>   mmap_read_unlock(mm);
>  
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c 
> b/arch/powerpc/platforms/book3s/vas-api.c
> index eb5bed333750..8f57388b760b 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
>   /*
>* When the LPAR lost credits due to core removal or during
>* migration, invalidate the existing mapping for the current
> -  * paste addresses and set windows in-active (zap_page_range in
> +  * paste addresses and set windows in-active (zap_vma_page_range in
>* reconfig_close_windows()).
>* New mapping will be done later after migration or new credits
>* available. So continue to receive faults if the user space
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> index 4ad6e510d405..2aef8d9295a2 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -760,7 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, 
> int excess_creds,
>* is done before the original mmap() and after the ioctl.
>*/
>   if (vma)
> - 

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-07 Thread Michal Hocko
[I haven't read through respective patches due to lack of time but let
 me comment on the general idea and the underlying justification]

On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
> On 06.05.21 17:26, Zi Yan wrote:
> > From: Zi Yan 
> > 
> > Hi all,
> > 
> > This patchset tries to remove the restriction on memory hotplug/hotremove
> > granularity, which is always greater or equal to memory section size[1].
> > With the patchset, kernel is able to online/offline memory at a size 
> > independent
> > of memory section size, as small as 2MB (the subsection size).
> 
> ... which doesn't make any sense as we can only online/offline whole memory
> block devices.

Agreed. The subsection thingy is just a hack to work around pmem
alignment problems. For real memory hotplug it is quite hard to argue
for reasonable hotplug scenarios for very small physical memory ranges
wrt. the existing sparsemem memory model.
 
> > The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> > size without increasing memory hotplug/hotremove granularity at the same 
> > time,
> 
> Gah, no. Please no. No.

Agreed. Those are completely independent concepts. MAX_ORDER can be
really arbitrary irrespective of the section size with the vmemmap
sparse model. The existing restriction is due to the old sparse model
not being able to do page pointer arithmetic across memory sections. Is
there any reason to stick with that memory model for an advanced feature
you are working on?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

2020-11-03 Thread Michal Hocko
On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
> Let's use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
> PG_offline, such that they will definitely not get touched - e.g.,
> when hibernating. When freeing memory, try to revert what we did.
> 
> The original idea was discussed in:
>  https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915...@redhat.com
> 
> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
> architectures, whereby only single pages are unmapped from the linear
> mapping. Let's mimic what memory hot(un)plug would do with the linear
> mapping.
> 
> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
> 
> Simple test under QEMU TCG (10GB RAM, single NUMA node):
> 
> sh-5.0# mount -t debugfs none /sys/kernel/debug/
> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
> 4000
> sh-5.0# echo 0x4000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   71.052836][  T356] memtrace: Allocated trace memory on node 0 at 
> 0x8000
> sh-5.0# echo 0x8000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   75.424302][  T356] radix-mmu: Mapped 
> 0x8000-0xc000 with 64.0 KiB pages
> [   75.430549][  T356] memtrace: Freed trace memory back on node 0
> [   75.604520][  T356] memtrace: Allocated trace memory on node 0 at 
> 0x8000
> sh-5.0# echo 0x1 > /sys/kernel/debug/powerpc/memtrace/enable
> [   80.418835][  T356] radix-mmu: Mapped 
> 0x8000-0x0001 with 64.0 KiB pages
> [   80.430493][  T356] memtrace: Freed trace memory back on node 0
> [   80.433882][  T356] memtrace: Failed to allocate trace memory on node 0
> sh-5.0# echo 0x4000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   91.920158][  T356] memtrace: Allocated trace memory on node 0 at 
> 0x8000
> 
> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
>   pages are not movable. However, as we don't run with any memory
>   hot(un)plug mechanism around, we could make an exception to
>   increase the chance of allocations succeeding.
> 
> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
>   along PG_reserved in hibernation code will always return "true"
>   on powerpc, resulting in the pages getting touched. It's too
>   generic - e.g., indicates boot allocations.
> 
> Note 3: For now, we keep using memory_block_size_bytes() as minimum
>   granularity. I'm not able to come up with a better guess (most
>   probably, doing it on a section basis could be possible).
> 
> Suggested-by: Michal Hocko 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Rashmica Gupta 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Thanks! This looks like a move into the right direction. I cannot really
judge implementation details because I am not familiar with the code.
I have only one tiny concern:
[...]
> -/* called with device_hotplug_lock held */
> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>  {
> - const unsigned long start = PFN_PHYS(start_pfn);
> - const unsigned long size = PFN_PHYS(nr_pages);
> + const unsigned long nr_pages = PHYS_PFN(size);
> + unsigned long pfn, start_pfn;
> + struct page *page;
>  
> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
> - return false;
> -
> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
> -change_memblock_state);
> -
> - if (offline_pages(start_pfn, nr_pages)) {
> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
> -change_memblock_state);
> - return false;
> - }
> + /*
> +  * Trace memory needs to be aligned to the size, which is guaranteed
> +  * by alloc_contig_pages().
> +  */
> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
> +   nid, NULL);

__GFP_THISNODE without other modifiers looks suspicious. I suspect you
want to enforce node locality and exclude movable zones by this. While
this works, it is an antipattern. I would rather use GFP_KERNEL |
__GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.

If for no other reason, we want to be able to work inside a normal
compaction context (compared to the effectively GFP_NOIO which the above
implies). Also this looks like a sleepable context.
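
IOW something like this (untested):

	page = alloc_contig_pages(nr_pages,
				  GFP_KERNEL | __GFP_THISNODE | __GFP_NOWARN,
				  nid, NULL);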
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-08-18 Thread Michal Hocko
On Tue 18-08-20 09:32:52, David Hildenbrand wrote:
> On 12.08.20 08:01, Srikar Dronamraju wrote:
> > Hi Andrew, Michal, David
> > 
> > * Andrew Morton  [2020-08-06 21:32:11]:
> > 
> >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju 
> >>  wrote:
> >>
> >>>> The memory hotplug changes that somehow because you can hotremove numa
> >>>> nodes and therefore make the nodemask sparse but that is not a common
> >>>> case. I am not sure what would happen if a completely new node was added
> >>>> and its corresponding node was already used by the renumbered one
> >>>> though. It would likely conflate the two I am afraid. But I am not sure
> >>>> this is really possible with x86 and a lack of a bug report would
> >>>> suggest that nobody is doing that at least.
> >>>>
> >>>
> >>> JFYI,
> >>> Satheesh copied in this mailchain had opened a bug a year on crash with 
> >>> vcpu
> >>> hotplug on memoryless node. 
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> >>
> >> So...  do we merge this patch or not?  Seems that the overall view is
> >> "risky but nobody is likely to do anything better any time soon"?
> > 
> > Can we decide on this one way or the other?
> 
> Hmm, not sure who's the person to decide. I tend to prefer doing the
> node renaming, handling this in ppc code;

Agreed. That would be a safer option.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> >>>>> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> >>>>> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> >>>>
> >>>> This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen 
> >> Date:   Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen 
> >> Signed-off-by: Linus Torvalds 
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> > 
> > Thinking about this some more, this logic makes some sense afterall.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> > 
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> I think the ACPI code takes care of properly mapping PXM to nodes.
> 
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
> 
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U 
> /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U 
> /var/tmp/monitor
> 
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
> 
> [   50.174435] [ cut here ]
> [   50.175436] node 1 was absent from the node_possible_map
> [   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 
> add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remaping is not
adding the new node into possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not 
> wrong).
> Not sure why that triggers here.

This was a misbehaving Qemu IIRC providing a garbage map.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
> 
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > > 
> > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > > 
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> > 
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen 
> Date:   Tue Jan 11 15:35:48 2005 -0800
> 
> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> 
> Fix fallout from the recent nodemask_t changes. The node ids assigned
> in the SRAT parser were off by one.
> 
> I added a new first_unset_node() function to nodemask.h to allocate
> IDs sanely.
> 
> Signed-off-by: Andi Kleen 
> Signed-off-by: Linus Torvalds 
> 
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense afterall.
Especially in the world without memory hotplug which was very likely the
case back then. It is much better to have compact node mask rather than
sparse one. After all node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except for the kernel ring buffer) though.

The memory hotplug changes that somehow because you can hotremove numa
nodes and therefore make the nodemask sparse but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node was already used by the renumbered one
though. It would likely conflate the two I am afraid. But I am not sure
this is really possible with x86 and a lack of a bug report would
suggest that nobody is doing that at least.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > > 
> > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > 
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
> 
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen 
Date:   Tue Jan 11 15:35:48 2005 -0800

[PATCH] x86_64: Fix ACPI SRAT NUMA parsing

Fix fallout from the recent nodemask_t changes. The node ids assigned
in the SRAT parser were off by one.

I added a new first_unset_node() function to nodemask.h to allocate
IDs sanely.

Signed-off-by: Andi Kleen 
Signed-off-by: Linus Torvalds 

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-02 Thread Michal Hocko
On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-07-01 14:21:10]:
> 
> > > >>>>>>
> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent 
> > > >>>>>> information. The
> > > >>>>>> number of online nodes is inconsistent with the information in the
> > > >>>>>> device-tree and resource-dump
> > > >>>>>>
> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems 
> > > >>>>>> end up showing
> > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean 
> > > >>>>>> we take
> > > >>>>>> the hit from the unnecessary numa hinting faults.
> > > >>>>>
> > > >>>>> I have to say that I dislike the node online/offline state and 
> > > >>>>> directly
> > > >>>>> exporting that to the userspace. Users should only care whether the 
> > > >>>>> node
> > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > > >>>>> offline all the present memory blocks but do not physically hot 
> > > >>>>> remove
> > > >>>>> them and you are in the same situation. If users are confused by an
> > > >>>>> output of tools like numactl -H then those could be updated and hide
> > > >>>>> nodes without any memory&cpus.
> > > >>>>>
> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > >>>>> really solve the underlying problem because I strongly suspect that 
> > > >>>>> the
> > > >>>>> problem is still there when a numa node gets all its memory offline 
> > > >>>>> as
> > > >>>>> mentioned above.
> > 
> > I would really appreciate a feedback to these two as well.
> 
> 1. Its not just numactl that's to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> in the output given by the kernel but do further verification?  

No, what we are saying is that even an online node might have zero
number of online pages/cpus. So the online status is not really
something that matters. If people are confused by that output then user
space tools can make their confusion go away. I really do not understand
why the kernel should do any logic there.

> Also how would the user space differentiate between the case where the
> Kernel missed marking a node as offline to the case where the memory was
> offlined on a cpuless node but node wasn't offline?.

What I am arguing is that those two shouldn't be any different. Really!

> 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> so if there is a performance hit, its something that's driven by his
> user/admin actions. Also how often do we see users offline complete memory
> of cpuless node on a 2 node system?

How often do we see crippled HW configurations like that? Really if
autonuma should be made more clever for one case it should recognize the
other as well.

> > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > 
> > This begs a question whether ppc can do the same thing?
> 
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not. As replied in another email in this thread, I would hope for
having fewer hacks in the numa initialization. Cleaning up the mess
would be a lot of work and testing on all NUMA capable architectures.
This is a heritage from the past I am afraid. All that I am arguing here
is that your touching the generic code with a very simple looking patch
might have side effects which are pretty much impossible to review.
Moreover it seems that nothing but ppc really needs this treatment, so
fixing it in ppc specific code sounds much safer.

Normally I would really push for a generic solution but after getting
burned several times in this area I do not dare anymore. The problem is
not in the code complexity but in how spread it is in places where you
do not expect side effects.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter  [2020-06-29 14:58:40]:
> 
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> > 
> 
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/20200316085425.gb11...@dhcp22.suse.cz/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware x86 already does the renumbering. So there is a
certain precedent. As I've said I do not really like that but this is
what is already happening. If renumbering is not an option then just
handle that in the ppc code explicitly. A generic solution would be
preferable of course but as I've said it is really hard to check for
correctness and potential subtle issues.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand  [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >>>> * Michal Hocko  [2020-07-01 10:42:00]:
> >>>>
> >>>>>
> >>>>>>
> >>>>>> 2. Also existence of dummy node also leads to inconsistent 
> >>>>>> information. The
> >>>>>> number of online nodes is inconsistent with the information in the
> >>>>>> device-tree and resource-dump
> >>>>>>
> >>>>>> 3. When the dummy node is present, single node non-Numa systems end up 
> >>>>>> showing
> >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we 
> >>>>>> take
> >>>>>> the hit from the unnecessary numa hinting faults.
> >>>>>
> >>>>> I have to say that I dislike the node online/offline state and directly
> >>>>> exporting that to the userspace. Users should only care whether the node
> >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> >>>>> offline all the present memory blocks but do not physically hot remove
> >>>>> them and you are in the same situation. If users are confused by an
> >>>>> output of tools like numactl -H then those could be updated and hide
> >>>>> nodes without any memory&cpus.
> >>>>>
> >>>>> The autonuma problem sounds interesting but again this patch doesn't
> >>>>> really solve the underlying problem because I strongly suspect that the
> >>>>> problem is still there when a numa node gets all its memory offline as
> >>>>> mentioned above.

I would really appreciate a feedback to these two as well.

> >>>>> While I completely agree that making node 0 special is wrong, I have
> >>>>> still hard time to review this very simply looking patch because all the
> >>>>> numa initialization is so spread around that this might just blow up
> >>>>> at unexpected places. IIRC we have discussed testing in the previous
> >>>>> version and David has provided a way to emulate these configurations
> >>>>> on x86. Did you manage to use those instruction for additional testing
> >>>>> on other than ppc architectures?
> >>>>>
> >>>>
> >>>> I have tried all the steps that David mentioned and reported back at
> >>>> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> >>>>
> >>>> As a summary, David's steps are still not creating a memoryless/cpuless 
> >>>> on
> >>>> x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> > 
> > I was just correcting your statement, because it was wrong.
> > 
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> > 
> 
> Yep, looks like it.
> 
> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]

This begs a question whether ppc can do the same thing?

I would swear that we've had an x86 system with a memoryless node 0 but
I cannot really find it and it is possible that it was not x86 after
all...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko


On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
> 
> v5.8-rc2
>  available: 2 nodes (0,2)
>  node 0 cpus:
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31490 MB
>  node distances:
>  node   0   2
>0:  10  20
>2:  20  10
> 
> proc and sys files
> --
>  /sys/devices/system/node/online:0,2
>  /proc/sys/kernel/numa_balancing:1
>  /sys/devices/system/node/has_cpu:   2
>  /sys/devices/system/node/has_memory:2
>  /sys/devices/system/node/has_normal_memory: 2
>  /sys/devices/system/node/possible:  0-31
> 
> v5.8-rc2 + patch
> --
>  available: 1 nodes (2)
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31487 MB
>  node distances:
>  node   2
>2:  10
> 
> proc and sys files
> --
> /sys/devices/system/node/online:2
> /proc/sys/kernel/numa_balancing:0
> /sys/devices/system/node/has_cpu:   2
> /sys/devices/system/node/has_memory:2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible:  0-31
> 
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
> 
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
> 
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
> 
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by an
output of tools like numactl -H then those could be updated and hide
nodes without any memory&cpus.

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I have
still hard time to review this very simply looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instruction for additional testing
on other than ppc architectures?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: 
> http://lore.kernel.org/lkml/20200512132937.19295-1-sri...@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: 
> https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u
> 
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertion

Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 05:23:21, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 01:31:57PM +0200, Michal Hocko wrote:
> > On Wed 17-06-20 04:08:20, Matthew Wilcox wrote:
> > > If you call vfree() under
> > > a spinlock, you're in trouble.  in_atomic() only knows if we hold a
> > > spinlock for CONFIG_PREEMPT, so it's not safe to check for in_atomic()
> > > in __vfree().  So we need the warning in order that preempt people can
> > > tell those without that there is a bug here.
> > 
> > ... Unless I am missing something in_interrupt depends on preempt_count() as
> > well so neither of the two is reliable without PREEMPT_COUNT configured.
> 
> preempt_count() always tracks whether we're in interrupt context,
> regardless of CONFIG_PREEMPT.  The difference is that CONFIG_PREEMPT
> will track spinlock acquisitions as well.

Right you are! Thanks for the clarification. I find the situation
around preempt_count() quite confusing TBH. Looking at existing users
of in_atomic() (e.g. a random one, zd_usb_iowrite16v_async(), which
checks in_atomic() and then does a GFP_KERNEL allocation, which would
be obviously broken on !PREEMPT if the function can be called from
atomic context), I am wondering whether it would make sense to track
the atomic context also for !PREEMPT. This check is just terribly error
prone.
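
To make the confusion concrete, here is a simplified model of the
counters involved (deliberately not the real <linux/preempt.h>; the
MODEL_* names are mine and the exact masks differ between versions):

/*
 * The hardirq/softirq/NMI bits of preempt_count() are maintained on
 * every configuration, so an in_interrupt()-style check always works.
 * The preemption-disable count is only maintained with
 * CONFIG_PREEMPT_COUNT, so spin_lock() does not bump the counter on
 * !PREEMPT kernels and a spinlocked section is indistinguishable from
 * plain process context for an in_atomic()-style check.
 */
#define MODEL_SOFTIRQ_MASK	0x0000ff00UL
#define MODEL_HARDIRQ_MASK	0x000f0000UL
#define MODEL_NMI_MASK		0x00100000UL

#define model_in_interrupt(pc) \
	((pc) & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK | MODEL_SOFTIRQ_MASK))

#define model_in_atomic(pc)	((pc) != 0)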

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 04:08:20, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 09:12:12AM +0200, Michal Hocko wrote:
> > On Tue 16-06-20 17:37:11, Matthew Wilcox wrote:
> > > Not just performance critical, but correctness critical.  Since kvfree()
> > > may allocate from the vmalloc allocator, I really think that kvfree()
> > > should assert that it's !in_atomic().  Otherwise we can get into trouble
> > > if we end up calling vfree() and have to take the mutex.
> > 
> > FWIW __vfree already checks for atomic context and put the work into a
> > deferred context. So this should be safe. It should be used as a last
> > resort, though.
> 
> Actually, it only checks for in_interrupt().

You are right. I have misremembered. You have made me look (thanks) ...

> If you call vfree() under
> a spinlock, you're in trouble.  in_atomic() only knows if we hold a
> spinlock for CONFIG_PREEMPT, so it's not safe to check for in_atomic()
> in __vfree().  So we need the warning in order that preempt people can
> tell those without that there is a bug here.

... Unless I am missing something, in_interrupt() depends on preempt_count()
as well, so neither of the two is reliable without PREEMPT_COUNT configured.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-17 Thread Michal Hocko
On Tue 16-06-20 17:37:11, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 01:01:30AM +0200, David Sterba wrote:
> > On Tue, Jun 16, 2020 at 11:53:50AM -0700, Joe Perches wrote:
> > > On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote:
> > > >  v4:
> > > >   - Break out the memzero_explicit() change as suggested by Dan 
> > > > Carpenter
> > > > so that it can be backported to stable.
> > > >   - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
> > > > now as there can be a bit more discussion on what is best. It will 
> > > > be
> > > > introduced as a separate patch later on after this one is merged.
> > > 
> > > To this larger audience and last week without reply:
> > > https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/
> > > 
> > > Are there _any_ fastpath uses of kfree or vfree?
> > 
> > I'd consider kfree performance critical for cases where it is called
> > under locks. If possible the kfree is moved outside of the critical
> > section, but we have rbtrees or lists that get deleted under locks and
> > restructuring the code to do eg. splice and free it outside of the lock
> > is not always possible.
> 
> Not just performance critical, but correctness critical.  Since kvfree()
> may allocate from the vmalloc allocator, I really think that kvfree()
> should assert that it's !in_atomic().  Otherwise we can get into trouble
> if we end up calling vfree() and have to take the mutex.

FWIW __vfree already checks for atomic context and puts the work into a
deferred context. So this should be safe. It should be used as a last
resort, though.
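
For reference, the logic in question looks roughly like this (a sketch of
mm/vmalloc.c's __vfree() around that time, reproduced from memory rather
than copied, so double check the tree):

/* Sketch, not verbatim kernel code. */
static void __vfree(const void *addr)
{
	if (unlikely(in_interrupt()))
		__vfree_deferred(addr);	/* punt the actual free to a workqueue */
	else
		__vunmap(addr, 1);	/* may sleep: takes vmap locks, purges lazy areas */
}

As the follow-ups above note, the check is in_interrupt() rather than a
full atomic-context test.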

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()

2020-06-15 Thread Michal Hocko
On Mon 15-06-20 21:57:16, Waiman Long wrote:
> The kzfree() function is normally used to clear some sensitive
> information, like encryption keys, in the buffer before freeing it back
> to the pool. Memset() is currently used for the buffer clearing. However,
> it is entirely possible that the compiler may choose to optimize away the
> memory clearing especially if LTO is being used. To make sure that this
> optimization will not happen, memzero_explicit(), which is introduced
> in v3.18, is now used in kzfree() to do the clearing.
> 
> Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Waiman Long 

Acked-by: Michal Hocko 

Although I am not really sure this is stable material. Is there any
known instance where the memset was optimized out of kzfree()?
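
The failure mode being guarded against is ordinary dead store
elimination. A userspace illustration of the pattern (plain C, not
kernel code; kzfree() and memzero_explicit() themselves live in
mm/slab_common.c and lib/string.c):

#include <stdlib.h>
#include <string.h>

/*
 * The buffer is dead after free(), so the compiler is allowed to drop
 * the memset() entirely, leaving the key material in memory.
 */
static void drop_key(char *key, size_t len)
{
	memset(key, 0, len);	/* may be elided as a dead store */
	free(key);
}

memzero_explicit() avoids this by placing a compiler barrier after the
memset so the store cannot be proven dead; whether any in-tree kzfree()
caller has actually been miscompiled this way is exactly the question
above.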

> ---
>  mm/slab_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 9e72ba224175..37d48a56431d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1726,7 +1726,7 @@ void kzfree(const void *p)
>   if (unlikely(ZERO_OR_NULL_PTR(mem)))
>   return;
>   ks = ksize(mem);
> - memset(mem, 0, ks);
> + memzero_explicit(mem, ks);
>   kfree(mem);
>  }
>  EXPORT_SYMBOL(kzfree);
> -- 
> 2.18.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-04 Thread Michal Hocko
On Thu 30-04-20 12:48:20, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-04-29 14:22:11]:
> 
> > On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > > 
> > > > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 
> > > > > 0 is
> > > > > always online.
> > > > > 
> > > > > ...
> > > > >
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > > >   */
> > > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > > >   [N_POSSIBLE] = NODE_MASK_ALL,
> > > > > +#ifdef CONFIG_NUMA
> > > > > + [N_ONLINE] = NODE_MASK_NONE,
> > > > > +#else
> > > > >   [N_ONLINE] = { { [0] = 1UL } },
> > > > > -#ifndef CONFIG_NUMA
> > > > >   [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > > >  #ifdef CONFIG_HIGHMEM
> > > > >   [N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > > 
> > > > So on all other NUMA machines, when does node 0 get marked online?
> > > > 
> > > > This change means that for some time during boot, such machines will
> > > > now be running with node 0 marked as offline.  What are the
> > > > implications of this?  Will something break?
> > > 
> > > Till the nodes are detected, marking Node 0 as online tends to be 
> > > redundant.
> > > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > > Once we detect the nodes, we online them immediately. Hence I don't see 
> > > any
> > > side-effects or negative implications of this change.
> > > 
> > > However if I am missing anything, please do let me know.
> > > 
> > > From my part, I have tested this on
> > > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > > 3. NUMA Multi node but with CPUs and memory from node 0.
> > > 4. NUMA Multi node but with no CPUs and memory from node 0.
> > 
> > Have you tested on something else than ppc? Each arch does the NUMA
> > setup separately and this is a big mess. E.g. x86 marks even memory less
> > nodes (see init_memory_less_node) as online.
> > 
> 
> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> enabled/disabled on both single node and multi node machines.
> However, I dont have a cpuless/memoryless x86 system.

This should be possible to emulate inside KVM, I believe.

> > Honestly I have hard time to evaluate the effect of this patch. It makes
> > some sense to assume all nodes offline before they get online but this
> > is a land mine territory.
> > 
> > I am also not sure what kind of problem this is going to address. You
> > have mentioned numa balancing without many details.
> 
> 1. On a machine with just one node with node number not being 0,
> the current setup will end up showing 2 online nodes. And when there are
> more than one online nodes, numa_balancing gets enabled.
> 
> Without patch
> $ grep numa /proc/vmstat
> numa_hit 95179
> numa_miss 0
> numa_foreign 0
> numa_interleave 3764
> numa_local 95179
> numa_other 0
> numa_pte_updates 1206973 <--
> numa_huge_pte_updates 4654 <--
> numa_hint_faults 19560 <--
> numa_hint_faults_local 19560 <--
> numa_pages_migrated 0
> 
> 
> With patch
> $ grep numa /proc/vmstat 
> numa_hit 322338756
> numa_miss 0
> numa_foreign 0
> numa_interleave 3790
> numa_local 322338756
> numa_other 0
> numa_pte_updates 0     <--
> numa_huge_pte_updates 0 <--
> numa_hint_faults 0 <--
> numa_hint_faults_local 0 <--
> numa_pages_migrated 0
> 
> So we have a redundant page hinting numa faults which we can avoid.

Interesting. Does this lead to any observable differences? Btw. it would
be really great to describe how the node online state influences NUMA
balancing.
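
For the record, my understanding of the mechanism is roughly the
following (a sketch of check_numabalancing_enable() in mm/mempolicy.c,
reproduced from memory, so please double check against the tree):

/*
 * Sketch: unless the user forced a value via the numa_balancing= boot
 * option (numabalancing_override below), automatic NUMA balancing is
 * switched on as soon as more than one node is online. A spurious
 * online node 0 therefore enables it on what is really a single-node
 * machine.
 */
static void __init check_numabalancing_enable(void)
{
	bool numabalancing_default =
		IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED);

	if (num_online_nodes() > 1 && !numabalancing_override)
		set_numabalancing_state(numabalancing_default);
}

Spelling that out in the changelog would make the motivation much
easier to follow.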

> 2. Few people have complained about existence of this dummy node when
> parsing lscpu and numactl o/p. They somehow start to think that the tools
> are reporting incorrectly or the kernel is not able to recognize resources
> connected to the node.

Please be more specific.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-04-29 Thread Michal Hocko
On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > 
> > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > > always online.
> > > 
> > > ...
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > >   */
> > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > >   [N_POSSIBLE] = NODE_MASK_ALL,
> > > +#ifdef CONFIG_NUMA
> > > + [N_ONLINE] = NODE_MASK_NONE,
> > > +#else
> > >   [N_ONLINE] = { { [0] = 1UL } },
> > > -#ifndef CONFIG_NUMA
> > >   [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > >  #ifdef CONFIG_HIGHMEM
> > >   [N_HIGH_MEMORY] = { { [0] = 1UL } },
> > 
> > So on all other NUMA machines, when does node 0 get marked online?
> > 
> > This change means that for some time during boot, such machines will
> > now be running with node 0 marked as offline.  What are the
> > implications of this?  Will something break?
> 
> Till the nodes are detected, marking Node 0 as online tends to be redundant.
> Because the system doesn't know if its a NUMA or a non-NUMA system.
> Once we detect the nodes, we online them immediately. Hence I don't see any
> side-effects or negative implications of this change.
> 
> However if I am missing anything, please do let me know.
> 
> From my part, I have tested this on
> 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> 3. NUMA Multi node but with CPUs and memory from node 0.
> 4. NUMA Multi node but with no CPUs and memory from node 0.

Have you tested on something other than ppc? Each arch does the NUMA
setup separately and this is a big mess. E.g. x86 marks even memoryless
nodes (see init_memory_less_node) as online.

Honestly, I have a hard time evaluating the effect of this patch. It
makes some sense to assume all nodes are offline before they get
onlined, but this is land mine territory.

I am also not sure what kind of problem this is going to address. You
have mentioned numa balancing without many details.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-14 Thread Michal Hocko
On Mon 13-04-20 17:15:49, Waiman Long wrote:
> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.
> 
>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
> 
> Suggested-by: Joe Perches 
> Signed-off-by: Waiman Long 

Makes sense. I haven't checked all the conversions and will rely on the
script doing the right thing. The core MM part is correct.

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-04-09 Thread Michal Hocko
On Thu 09-04-20 22:41:19, Baoquan He wrote:
> On 04/02/20 at 10:01am, Michal Hocko wrote:
> > On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> > > Hi,
> > > 
> > > On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
> > [...]
> > > > From above information, we can remove HAVE_MEMBLOCK_NODE_MAP, and
> > > > replace it with CONFIG_NUMA. That sounds more sensible to store nid into
> > > > memblock when NUMA support is enabled.
> > >  
> > > Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> > > this will not help cleaning up the whole node/zone initialization mess and
> > > we'll be stuck with two implementations.
> > 
> > Yeah, this is far from optimal.
> > 
> > > The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only for init time as
> > > most architectures will anyway discard the entire memblock, so having it 
> > > in
> > > a UMA arch won't be a problem. The only exception is arm that uses
> > > memblock for pfn_valid(), here we may also think about a solution to
> > > compensate the addition of nid to the memblock structures. 
> > 
> > Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
> > memblock_get_region_node would then unconditionally return 0 on UMA.
> > Essentially the same way we do NUMA for other MM code. I only see few
> > direct usage of region->nid.
> 
> Checked code again, seems HAVE_MEMBLOCK_NODE_MAP is selected directly in
> all ARCHes which support it. Means HAVE_MEMBLOCK_NODE_MAP is enabled by
> default on those ARCHes, and has no dependency on CONFIG_NUMA at all.
> E.g on x86, it just calls free_area_init_nodes() in generic code path,
> while free_area_init_nodes() is defined in CONFIG_HAVE_MEMBLOCK_NODE_MAP
> ifdeffery scope. So I tend to agree with Mike to remove
> HAVE_MEMBLOCK_NODE_MAP firstly on all ARCHes. We can check if it's worth
> only defining memblock_region->nid for CONFIG_NUMA case after
> HAVE_MEMBLOCK_NODE_MAP is removed.

This can surely go in separate patches. What I meant to say is that
region->nid is by definition 0 on !CONFIG_NUMA.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] powerpc/pseries/hotplug-memory: stop checking is_mem_section_removable()

2020-04-09 Thread Michal Hocko
On Thu 09-04-20 10:12:20, David Hildenbrand wrote:
> On 09.04.20 09:59, Michal Hocko wrote:
> > On Thu 09-04-20 17:26:01, Michael Ellerman wrote:
> >> David Hildenbrand  writes:
> >>
> >>> In commit 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory
> >>> blocks as removable"), the user space interface to compute whether a 
> >>> memory
> >>> block can be offlined (exposed via
> >>> /sys/devices/system/memory/memoryX/removable) has effectively been
> >>> deprecated. We want to remove the leftovers of the kernel implementation.
> >>>
> >>> When offlining a memory block (mm/memory_hotplug.c:__offline_pages()),
> >>> we'll start by:
> >>> 1. Testing if it contains any holes, and reject if so
> >>> 2. Testing if pages belong to different zones, and reject if so
> >>> 3. Isolating the page range, checking if it contains any unmovable pages
> >>>
> >>> Using is_mem_section_removable() before trying to offline is not only 
> >>> racy,
> >>> it can easily result in false positives/negatives. Let's stop manually
> >>> checking is_mem_section_removable(), and let device_offline() handle it
> >>> completely instead. We can remove the racy is_mem_section_removable()
> >>> implementation next.
> >>>
> >>> We now take more locks (e.g., memory hotplug lock when offlining and the
> >>> zone lock when isolating), but maybe we should optimize that
> >>> implementation instead if this ever becomes a real problem (after all,
> >>> memory unplug is already an expensive operation). We started using
> >>> is_mem_section_removable() in commit 51925fb3c5c9 ("powerpc/pseries:
> >>> Implement memory hotplug remove in the kernel"), with the initial
> >>> hotremove support of lmbs.
> >>
> >> It's also not very pretty in dmesg.
> >>
> >> Before:
> >>
> >>   pseries-hotplug-mem: Attempting to hot-add 10 LMB(s)
> >>   pseries-hotplug-mem: Memory hot-add failed, removing any added LMBs
> >>   dlpar: Could not handle DLPAR request "memory add count 10"
> > 
> > Yeah, there is more output but isn't that useful? Or put it differently
> > what is the actual problem from having those messages in the kernel log?
> > 
> > From the below you can clearly tell that there are kernel allocations
> > which prevent hot remove from happening.
> > 
> > If the overall size of the debugging output is a concern then we can
> > think of a way to reduce it. E.g. once you have a couple of pages
> > reported then all others from the same block are likely not interesting
> > much.
> > 
> 
> IIRC, we only report one page per block already. (and stop, as we
> detected something unmovable)

You are right.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] powerpc/pseries/hotplug-memory: stop checking is_mem_section_removable()

2020-04-09 Thread Michal Hocko
On Thu 09-04-20 17:26:01, Michael Ellerman wrote:
> David Hildenbrand  writes:
> 
> > In commit 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory
> > blocks as removable"), the user space interface to compute whether a memory
> > block can be offlined (exposed via
> > /sys/devices/system/memory/memoryX/removable) has effectively been
> > deprecated. We want to remove the leftovers of the kernel implementation.
> >
> > When offlining a memory block (mm/memory_hotplug.c:__offline_pages()),
> > we'll start by:
> > 1. Testing if it contains any holes, and reject if so
> > 2. Testing if pages belong to different zones, and reject if so
> > 3. Isolating the page range, checking if it contains any unmovable pages
> >
> > Using is_mem_section_removable() before trying to offline is not only racy,
> > it can easily result in false positives/negatives. Let's stop manually
> > checking is_mem_section_removable(), and let device_offline() handle it
> > completely instead. We can remove the racy is_mem_section_removable()
> > implementation next.
> >
> > We now take more locks (e.g., memory hotplug lock when offlining and the
> > zone lock when isolating), but maybe we should optimize that
> > implementation instead if this ever becomes a real problem (after all,
> > memory unplug is already an expensive operation). We started using
> > is_mem_section_removable() in commit 51925fb3c5c9 ("powerpc/pseries:
> > Implement memory hotplug remove in the kernel"), with the initial
> > hotremove support of lmbs.
> 
> It's also not very pretty in dmesg.
> 
> Before:
> 
>   pseries-hotplug-mem: Attempting to hot-add 10 LMB(s)
>   pseries-hotplug-mem: Memory hot-add failed, removing any added LMBs
>   dlpar: Could not handle DLPAR request "memory add count 10"

Yeah, there is more output, but isn't it useful? Or, to put it differently,
what is the actual problem with having those messages in the kernel log?

From the below you can clearly tell that there are kernel allocations
which prevent the hot remove from happening.

If the overall size of the debugging output is a concern then we can
think of a way to reduce it. E.g. once a couple of pages have been
reported, all others from the same block are likely not very
interesting.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 2/2] mm/memory_hotplug: remove is_mem_section_removable()

2020-04-07 Thread Michal Hocko
On Tue 07-04-20 15:54:16, David Hildenbrand wrote:
> Fortunately, all users of is_mem_section_removable() are gone. Get rid of
> it, including some now unnecessary functions.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  include/linux/memory_hotplug.h |  7 
>  mm/memory_hotplug.c| 75 --

\o/

Thanks!

>  2 files changed, 82 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 93d9ada74ddd..7dca9cd6076b 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -314,19 +314,12 @@ static inline void pgdat_resize_init(struct pglist_data 
> *pgdat) {}
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  
> -extern bool is_mem_section_removable(unsigned long pfn, unsigned long 
> nr_pages);
>  extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  
>  #else
> -static inline bool is_mem_section_removable(unsigned long pfn,
> - unsigned long nr_pages)
> -{
> - return false;
> -}
> -
>  static inline void try_offline_node(int nid) {}
>  
>  static inline int offline_pages(unsigned long start_pfn, unsigned long 
> nr_pages)
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 47cf6036eb31..4d338d546d52 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1112,81 +1112,6 @@ int add_memory(int nid, u64 start, u64 size)
>  EXPORT_SYMBOL_GPL(add_memory);
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -/*
> - * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
> - * set and the size of the free page is given by page_order(). Using this,
> - * the function determines if the pageblock contains only free pages.
> - * Due to buddy contraints, a free page at least the size of a pageblock will
> - * be located at the start of the pageblock
> - */
> -static inline int pageblock_free(struct page *page)
> -{
> - return PageBuddy(page) && page_order(page) >= pageblock_order;
> -}
> -
> -/* Return the pfn of the start of the next active pageblock after a given 
> pfn */
> -static unsigned long next_active_pageblock(unsigned long pfn)
> -{
> - struct page *page = pfn_to_page(pfn);
> -
> - /* Ensure the starting page is pageblock-aligned */
> - BUG_ON(pfn & (pageblock_nr_pages - 1));
> -
> - /* If the entire pageblock is free, move to the end of free page */
> - if (pageblock_free(page)) {
> - int order;
> - /* be careful. we don't have locks, page_order can be changed.*/
> - order = page_order(page);
> - if ((order < MAX_ORDER) && (order >= pageblock_order))
> - return pfn + (1 << order);
> - }
> -
> - return pfn + pageblock_nr_pages;
> -}
> -
> -static bool is_pageblock_removable_nolock(unsigned long pfn)
> -{
> - struct page *page = pfn_to_page(pfn);
> - struct zone *zone;
> -
> - /*
> -  * We have to be careful here because we are iterating over memory
> -  * sections which are not zone aware so we might end up outside of
> -  * the zone but still within the section.
> -  * We have to take care about the node as well. If the node is offline
> -  * its NODE_DATA will be NULL - see page_zone.
> -  */
> - if (!node_online(page_to_nid(page)))
> - return false;
> -
> - zone = page_zone(page);
> - pfn = page_to_pfn(page);
> - if (!zone_spans_pfn(zone, pfn))
> - return false;
> -
> - return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
> - MEMORY_OFFLINE);
> -}
> -
> -/* Checks if this range of memory is likely to be hot-removable. */
> -bool is_mem_section_removable(unsigned long start_pfn, unsigned long 
> nr_pages)
> -{
> - unsigned long end_pfn, pfn;
> -
> - end_pfn = min(start_pfn + nr_pages,
> - zone_end_pfn(page_zone(pfn_to_page(start_pfn;
> -
> - /* Check the starting page of each pageblock within the range */
> - for (pfn = start_pfn; pfn < end_pfn; pfn = next_active_pageblock(pfn)) {
> - if (!is_pageblock_removable_nolock(pfn))
> - return false;
> - cond_resched();
> - }
> -
> - /* All pageblocks in the memory block are likely to be hot-removable */
> - return true;
> -}
> -
>  /*
>   * Confirm all pages in a range [start, end) belong to the same zone 
> (skipping
>   * memory holes). When true, return the zone.
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/2] powerpc/pseries/hotplug-memory: stop checking is_mem_section_removable()

2020-04-07 Thread Michal Hocko
On Tue 07-04-20 15:54:15, David Hildenbrand wrote:
> In commit 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory
> blocks as removable"), the user space interface to compute whether a memory
> block can be offlined (exposed via
> /sys/devices/system/memory/memoryX/removable) has effectively been
> deprecated. We want to remove the leftovers of the kernel implementation.
> 
> When offlining a memory block (mm/memory_hotplug.c:__offline_pages()),
> we'll start by:
> 1. Testing if it contains any holes, and reject if so
> 2. Testing if pages belong to different zones, and reject if so
> 3. Isolating the page range, checking if it contains any unmovable pages
> 
> Using is_mem_section_removable() before trying to offline is not only racy,
> it can easily result in false positives/negatives. Let's stop manually
> checking is_mem_section_removable(), and let device_offline() handle it
> completely instead. We can remove the racy is_mem_section_removable()
> implementation next.
> 
> We now take more locks (e.g., memory hotplug lock when offlining and the
> zone lock when isolating), but maybe we should optimize that
> implementation instead if this ever becomes a real problem (after all,
> memory unplug is already an expensive operation). We started using
> is_mem_section_removable() in commit 51925fb3c5c9 ("powerpc/pseries:
> Implement memory hotplug remove in the kernel"), with the initial
> hotremove support of lmbs.

I am not familiar with this code, but it makes sense to keep it in sync
with the global behavior.

> Cc: Nathan Fontenot 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Baoquan He 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  .../platforms/pseries/hotplug-memory.c| 26 +++
>  1 file changed, 3 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index b2cde1732301..5ace2f9a277e 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -337,39 +337,19 @@ static int pseries_remove_mem_node(struct device_node 
> *np)
>  
>  static bool lmb_is_removable(struct drmem_lmb *lmb)
>  {
> - int i, scns_per_block;
> - bool rc = true;
> - unsigned long pfn, block_sz;
> - u64 phys_addr;
> -
>   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
>   return false;
>  
> - block_sz = memory_block_size_bytes();
> - scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> - phys_addr = lmb->base_addr;
> -
>  #ifdef CONFIG_FA_DUMP
>   /*
>* Don't hot-remove memory that falls in fadump boot memory area
>* and memory that is reserved for capturing old kernel memory.
>*/
> - if (is_fadump_memory_area(phys_addr, block_sz))
> + if (is_fadump_memory_area(lmb->base_addr, memory_block_size_bytes()))
>   return false;
>  #endif
> -
> - for (i = 0; i < scns_per_block; i++) {
> - pfn = PFN_DOWN(phys_addr);
> - if (!pfn_in_present_section(pfn)) {
> - phys_addr += MIN_MEMORY_BLOCK_SIZE;
> - continue;
> - }
> -
> - rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
> - phys_addr += MIN_MEMORY_BLOCK_SIZE;
> - }
> -
> - return rc;
> + /* device_offline() will determine if we can actually remove this lmb */
> + return true;
>  }
>  
>  static int dlpar_add_lmb(struct drmem_lmb *);
> -- 
> 2.25.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC] mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP (was: Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA)

2020-04-02 Thread Michal Hocko
On Wed 01-04-20 10:51:55, Mike Rapoport wrote:
> Hi,
> 
> On Wed, Apr 01, 2020 at 01:42:27PM +0800, Baoquan He wrote:
[...]
> > From above information, we can remove HAVE_MEMBLOCK_NODE_MAP, and
> > replace it with CONFIG_NUMA. That sounds more sensible to store nid into
> > memblock when NUMA support is enabled.
>  
> Replacing CONFIG_HAVE_MEMBLOCK_NODE_MAP with CONFIG_NUMA will work, but
> this will not help cleaning up the whole node/zone initialization mess and
> we'll be stuck with two implementations.

Yeah, this is far from optimal.

> The overhead of enabling HAVE_MEMBLOCK_NODE_MAP is only for init time as
> most architectures will anyway discard the entire memblock, so having it in
> a UMA arch won't be a problem. The only exception is arm that uses
> memblock for pfn_valid(), here we may also think about a solution to
> compensate the addition of nid to the memblock structures. 

Well, we can make memblock_region->nid defined only for CONFIG_NUMA.
memblock_get_region_node would then unconditionally return 0 on UMA.
Essentially the same way we do NUMA for other MM code. I only see a few
direct usages of region->nid.
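
Roughly what I have in mind (a sketch against include/linux/memblock.h
with the condition switched over to CONFIG_NUMA; untested):

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	enum memblock_flags flags;
#ifdef CONFIG_NUMA			/* was CONFIG_HAVE_MEMBLOCK_NODE_MAP */
	int nid;
#endif
};

#ifdef CONFIG_NUMA
static inline int memblock_get_region_node(const struct memblock_region *r)
{
	return r->nid;
}
#else
static inline int memblock_get_region_node(const struct memblock_region *r)
{
	return 0;			/* UMA: everything is "node" 0 */
}
#endif
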
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-31 Thread Michal Hocko
On Tue 31-03-20 22:03:32, Baoquan He wrote:
> Hi Michal,
> 
> On 03/31/20 at 10:55am, Michal Hocko wrote:
> > On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> > > Maybe I mis-read the code, but I don't see how this could happen. In the
> > > HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> > > calculate_node_totalpages() that ensures that node->node_zones are 
> > > entirely
> > > within the node because this is checked in zone_spanned_pages_in_node().
> > 
> > zone_spanned_pages_in_node does chech the zone boundaries are within the
> > node boundaries. But that doesn't really tell anything about other
> > potential zones interleaving with the physical memory range.
> > zone->spanned_pages simply gives the physical range for the zone
> > including holes. Interleaving nodes are essentially a hole
> > (__absent_pages_in_range is going to skip those).
> > 
> > That means that when free_area_init_core simply goes over the whole
> > physical zone range including holes and that is why we need to check
> > both for physical and logical holes (aka other nodes).
> > 
> > The life would be so much easier if the whole thing would simply iterate
> > over memblocks...
> 
> The memblock iterating sounds a great idea. I tried with putting the
> memblock iterating in the upper layer, memmap_init(), which is used for
> boot mem only anyway. Do you think it's doable and OK? If yes, I can
> work out a formal patch to make this simpler as you said. The draft code
> is as below. Like this it uses the existing code and involves little change.

Doing this would be a step in the right direction! I haven't checked the
code very closely though. The below sounds way too simple to be true, I
am afraid. First, for_each_mem_pfn_range is available only for
CONFIG_HAVE_MEMBLOCK_NODE_MAP (which is one of the reasons why I keep
saying that I really hate that being conditional). Also I haven't really
checked the deferred initialization path - I have a very vague
recollection that it has been converted to the memblock api but I have
happily dropped all that memory.
 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 138a56c0f48f..558d421f294b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6007,14 +6007,6 @@ void __meminit memmap_init_zone(unsigned long size, 
> int nid, unsigned long zone,
>* function.  They do not exist on hotplugged memory.
>*/
>   if (context == MEMMAP_EARLY) {
> - if (!early_pfn_valid(pfn)) {
> - pfn = next_pfn(pfn);
> - continue;
> - }
> - if (!early_pfn_in_nid(pfn, nid)) {
> - pfn++;
> - continue;
> - }
>   if (overlap_memmap_init(zone, &pfn))
>   continue;
>   if (defer_init(nid, pfn, end_pfn))
> @@ -6130,9 +6122,17 @@ static void __meminit zone_init_free_lists(struct zone 
> *zone)
>  }
>  
>  void __meminit __weak memmap_init(unsigned long size, int nid,
> -   unsigned long zone, unsigned long start_pfn)
> +   unsigned long zone, unsigned long 
> range_start_pfn)
>  {
> - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> + unsigned long start_pfn, end_pfn;
> + unsigned long range_end_pfn = range_start_pfn + size;
> + int i;
> + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> + start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> + end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> + if (end_pfn > start_pfn)
> + memmap_init_zone(size, nid, zone, start_pfn, 
> MEMMAP_EARLY, NULL);
> + }
>  }
>  
>  static int zone_batchsize(struct zone *zone)

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-31 Thread Michal Hocko
On Tue 31-03-20 11:14:23, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 08:23:01PM +0200, Michal Hocko wrote:
> > On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> > > On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > > > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > > > In NUMA layout which nodes have memory ranges that span across other 
> > > > > nodes,
> > > > > the mm driver can detect the memory node id incorrectly.
> > > > > 
> > > > > For example, with layout below
> > > > > Node 0 address:    
> > > > > Node 1 address:    
> > > > > 
> > > > > Note:
> > > > >  - Memory from low to high
> > > > >  - 0/1: Node id
> > > > >  - x: Invalid memory of a node
> > > > > 
> > > > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > > > config, mm only checks the memory validity but not the node id.
> > > > > Because of that, Node 1 also detects the memory from node 0 as below
> > > > > when it scans from the start address to the end address of node 1.
> > > > > 
> > > > > Node 0 address:    
> > > > > Node 1 address:    
> > > > > 
> > > > > This layout could occur on any architecture. Most of them enables
> > > > > this config by default with CONFIG_NUMA. This patch, by default, 
> > > > > enables
> > > > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > > > 
> > > > I am not opposed to this at all. It reduces the config space and that is
> > > > a good thing on its own. The history has shown that meory layout might
> > > > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > > > which is clearly an overkill.
> > > > 
> > > > Your description doesn't really explain why this is safe though. The
> > > > history of this config is somehow messy, though. Mike has tried
> > > > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > > > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> > > > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > > > reasoning what so ever. This doesn't make it really easy see whether
> > > > reasons for reintroduction are still there. Maybe there are some subtle
> > > > dependencies. I do not see any TBH but that might be burried deep in an
> > > > arch specific code.
> > > 
> > > I've looked at this a bit more and it seems that the check for
> > > early_pfn_in_nid() in memmap_init_zone() can be simply removed.
> > > 
> > > The commits you've mentioned were way before the addition of
> > > HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
> > > sizes and boundaries based on the memblock node map.
> > > So, the memmap_init_zone() is called when zone boundaries are already
> > > within a node.
> > 
> > But zones from different nodes might overlap in the pfn range. And this
> > check is there to skip over those overlapping areas.
> 
> Maybe I mis-read the code, but I don't see how this could happen. In the
> HAVE_MEMBLOCK_NODE_MAP=y case, free_area_init_node() calls
> calculate_node_totalpages() that ensures that node->node_zones are entirely
> within the node because this is checked in zone_spanned_pages_in_node().

zone_spanned_pages_in_node does check that the zone boundaries are within
the node boundaries. But that doesn't really tell anything about other
potential zones interleaving with the physical memory range.
zone->spanned_pages simply gives the physical range for the zone
including holes. Interleaving nodes are essentially a hole
(__absent_pages_in_range is going to skip those).

That means that free_area_init_core simply goes over the whole physical
zone range, including holes, and that is why we need to check for both
physical and logical holes (aka other nodes).

The life would be so much easier if the whole thing would simply iterate
over memblocks...
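
To spell out the walk I am describing, a heavily simplified sketch of
the memmap_init_zone() loop (illustrative only, details omitted):

static void sketch_memmap_init(unsigned long start_pfn, unsigned long end_pfn,
			       unsigned long zone, int nid)
{
	unsigned long pfn;

	/*
	 * The loop spans the zone's whole pfn range including holes, so
	 * every pfn has to be checked both for a physical hole and for
	 * belonging to another (interleaved) node.
	 */
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))		/* physical hole */
			continue;
		if (!early_pfn_in_nid(pfn, nid))	/* range owned by another node */
			continue;
		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
	}
}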

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Michal Hocko
On Mon 30-03-20 20:51:00, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In NUMA layout which nodes have memory ranges that span across other 
> > > nodes,
> > > the mm driver can detect the memory node id incorrectly.
> > > 
> > > For example, with layout below
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > Note:
> > >  - Memory from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > config, mm only checks the memory validity but not the node id.
> > > Because of that, Node 1 also detects the memory from node 0 as below
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > This layout could occur on any architecture. Most of them enables
> > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > a good thing on its own. The history has shown that meory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > which is clearly an overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somehow messy, though. Mike has tried
> > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > reasoning what so ever. This doesn't make it really easy see whether
> > reasons for reintroduction are still there. Maybe there are some subtle
> > dependencies. I do not see any TBH but that might be burried deep in an
> > arch specific code.
> 
> I've looked at this a bit more and it seems that the check for
> early_pfn_in_nid() in memmap_init_zone() can be simply removed.
> 
> The commits you've mentioned were way before the addition of
> HAVE_MEMBLOCK_NODE_MAP and the whole infrastructure that calculates zone
> sizes and boundaries based on the memblock node map.
> So, the memmap_init_zone() is called when zone boundaries are already
> within a node.

But zones from different nodes might overlap in the pfn range, and this
check is there to skip over those overlapping areas. The only way I can
see to skip this check is to do a different pfn walk and go through
memblock ranges, which are guaranteed to belong to a single node.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Michal Hocko
On Mon 30-03-20 12:21:27, Mike Rapoport wrote:
> On Mon, Mar 30, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> > > In NUMA layout which nodes have memory ranges that span across other 
> > > nodes,
> > > the mm driver can detect the memory node id incorrectly.
> > > 
> > > For example, with layout below
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > Note:
> > >  - Memory from low to high
> > >  - 0/1: Node id
> > >  - x: Invalid memory of a node
> > > 
> > > When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> > > config, mm only checks the memory validity but not the node id.
> > > Because of that, Node 1 also detects the memory from node 0 as below
> > > when it scans from the start address to the end address of node 1.
> > > 
> > > Node 0 address:    
> > > Node 1 address:    
> > > 
> > > This layout could occur on any architecture. Most of them enables
> > > this config by default with CONFIG_NUMA. This patch, by default, enables
> > > CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.
> > 
> > I am not opposed to this at all. It reduces the config space and that is
> > a good thing on its own. The history has shown that meory layout might
> > be really wild wrt NUMA. The config is only used for early_pfn_in_nid
> > which is clearly an overkill.
> > 
> > Your description doesn't really explain why this is safe though. The
> > history of this config is somehow messy, though. Mike has tried
> > to remove it a94b3ab7eab4 ("[PATCH] mm: remove arch independent
> > NODES_SPAN_OTHER_NODES") just to be reintroduced by 7516795739bd
> > ("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
> > reasoning what so ever. This doesn't make it really easy see whether
> > reasons for reintroduction are still there. Maybe there are some subtle
> > dependencies. I do not see any TBH but that might be burried deep in an
> > arch specific code.
> 
> Well, back then early_pfn_in_nid() was arch-dependant, today everyone
> except ia64 rely on HAVE_MEMBLOCK_NODE_MAP.

What would it take to make ia64 use HAVE_MEMBLOCK_NODE_MAP? I would
really love to see that thing go away. It is causing problems when
people try to use the memblock API.

> So, if the memblock node map
> is correct, that using CONFIG_NUMA instead of CONFIG_NODES_SPAN_OTHER_NODES
> would only mean that early_pfn_in_nid() will cost several cycles more on
> architectures that didn't select CONFIG_NODES_SPAN_OTHER_NODES (i.e. arm64
> and sh).

Do we have any idea how much of an overhead that is? Because this is per
pfn, so it can accumulate a lot!

> Agian, ia64 is an exception here.

Thanks for the clarification!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Michal Hocko
On Sun 29-03-20 08:19:24, Baoquan He wrote:
> On 03/28/20 at 11:31am, Hoan Tran wrote:
> > In NUMA layout which nodes have memory ranges that span across other nodes,
> > the mm driver can detect the memory node id incorrectly.
> > 
> > For example, with layout below
> > Node 0 address:    
> > Node 1 address:    
> 
> Sorry, I read this example several times, but still don't get what it
> means. Can it be given with real hex number address as an example? I
> mean just using the memory layout you have seen from some systems. The
> change looks interesting though.

Does this make it more clear?
   physical address range and its node association
 [0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2020-03-30 Thread Michal Hocko
On Sat 28-03-20 11:31:17, Hoan Tran wrote:
> In NUMA layout which nodes have memory ranges that span across other nodes,
> the mm driver can detect the memory node id incorrectly.
> 
> For example, with layout below
> Node 0 address:    
> Node 1 address:    
> 
> Note:
>  - Memory from low to high
>  - 0/1: Node id
>  - x: Invalid memory of a node
> 
> When mm probes the memory map, without CONFIG_NODES_SPAN_OTHER_NODES
> config, mm only checks the memory validity but not the node id.
> Because of that, Node 1 also detects the memory from node 0 as below
> when it scans from the start address to the end address of node 1.
> 
> Node 0 address:    
> Node 1 address:    
> 
> This layout could occur on any architecture. Most of them enables
> this config by default with CONFIG_NUMA. This patch, by default, enables
> CONFIG_NODES_SPAN_OTHER_NODES or uses early_pfn_in_nid() for NUMA.

I am not opposed to this at all. It reduces the config space and that is
a good thing on its own. The history has shown that memory layout might
be really wild wrt NUMA. The config is only used for early_pfn_in_nid,
which is clearly overkill.

Your description doesn't really explain why this is safe, though. The
history of this config is somewhat messy. Mike has tried to remove it in
a94b3ab7eab4 ("[PATCH] mm: remove arch independent
NODES_SPAN_OTHER_NODES") just for it to be reintroduced by 7516795739bd
("[PATCH] Reintroduce NODES_SPAN_OTHER_NODES for powerpc") without any
reasoning whatsoever. This doesn't make it easy to see whether the
reasons for the reintroduction are still there. Maybe there are some
subtle dependencies. I do not see any TBH, but those might be buried
deep in arch-specific code.

> v3:
>  * Revise the patch description
> 
> V2:
>  * Revise the patch description
> 
> Hoan Tran (5):
>   mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA
>   powerpc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   sparc: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
>   s390: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES
> 
>  arch/powerpc/Kconfig | 9 -
>  arch/s390/Kconfig| 8 
>  arch/sparc/Kconfig   | 9 -
>  arch/x86/Kconfig | 9 -
>  mm/page_alloc.c  | 2 +-
>  5 files changed, 1 insertion(+), 36 deletions(-)
> 
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Thu 26-03-20 19:02:35, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> The crash is due to NULL dereference at
> 
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
> pfn_section_valid()
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_mem(). This
> was done so that pfn_page() can work correctly with kernel config that 
> disables
> SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used 
> to
> check the pfn validity (pfn_valid()). Since section_deactivate release
> mem_section->usage if a section is fully deactivated, pfn_valid() check after
> a subsection_deactivate cause a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> For architectures like ppc64 where large pages are used for vmemmap mapping 
> (16MB),
> a specific vmemmap mapping can cover multiple sections. Hence before a vmemmap
> mapping page can be freed, the kernel needs to make sure there are no valid 
> sections
> within that mapping. Clearing the section valid bit before
> depopulate_section_memap enables this.

I believe that the necessity of clearing the section before the tear
down is worth a comment in the code, because this is just way too easy
to miss, or to not be aware of at all, while looking at the code without
git blame.

> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Cc: Baoquan He 
> Cc: Michael Ellerman 
> Cc: Dan Williams 
> Cc: Pankaj Gupta 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/sparse.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..65599e8bd636 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>   memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /*
> +  * Mark the section invalid so that valid_section()
> +  * return false. This prevents code from dereferencing
> +  * ms->usage array.
> +  */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>   }
>  
>   if (section_is_early && memmap)
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Thu 26-03-20 11:16:33, Michal Hocko wrote:
> On Thu 26-03-20 15:26:22, Aneesh Kumar K.V wrote:
> > On 3/26/20 3:10 PM, Michal Hocko wrote:
> > > On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> > > > Fixes the below crash
> > > > 
> > > > BUG: Kernel NULL pointer dereference on read at 0x
> > > > Faulting instruction address: 0xc0c3447c
> > > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > > > CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> > > > ...
> > > > NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> > > > LR [c0088354] vmemmap_free+0x144/0x320
> > > > Call Trace:
> > > >   section_deactivate+0x220/0x240
> > > 
> > > It would be great to match this to the specific source code.
> > 
> > The crash is due to NULL dereference at
> > 
> > test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL;
> 
> It would be nice to call that out here as well
> 
> [...]
> > > Why do we have to free usage before deactivating section memmap? Now that
> > > we have a late section_mem_map reset shouldn't we tear down the usage in
> > > the same branch?
> > > 
> > 
> > We still need to make the section invalid before we call into
> > depopulate_section_memmap(), because architectures like powerpc can share
> > a vmemmap area across sections (16MB mapping of the vmemmap area) and we use
> > vmemmap_populated() to make that decision.
> 
> This should be noted in a comment as well.
> 
> > > > Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> > > > SPARSEMEM|!VMEMMAP case")
> > > > Cc: Baoquan He 
> > > > Reported-by: Sachin Sant 
> > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ---
> > > >   mm/sparse.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aadb7298dcef..3012d1f3771a 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > > > unsigned long nr_pages,
> > > > ms->usage = NULL;
> > > > }
> > > > memmap = sparse_decode_mem_map(ms->section_mem_map, 
> > > > section_nr);
> > > > +   /* Mark the section invalid */
> > > > +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > > 
> > > Btw. this comment is not really helping at all.
> > 
> > That is marking the section invalid so that
> > 
> > static inline int valid_section(struct mem_section *section)
> > {
> > return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> > }
> > 
> > 
> > returns false.
> 
> Yes, that is obvious once you are clear where to look. I was really
> hoping for a comment that would simply point you in the right
> direction without chasing SECTION_HAS_MEM_MAP usage. This code is
> subtle, and useful comments, even when they state something that is
> obvious to you _right_now_, can be really helpful.

Btw. forgot to add. With the improved comment feel free to add
Acked-by: Michal Hocko 

-- 
Michal Hocko
SUSE Labs

