Re: [PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-24 Thread Claire Chang
On Mon, May 24, 2021 at 11:49 PM Konrad Rzeszutek Wilk
 wrote:
>
> On Tue, May 18, 2021 at 02:48:35PM +0800, Claire Chang wrote:
> > I didn't move this to a separate file because I feel it might be
> > confusing for swiotlb_alloc/free (and it would need more functions to
> > be non-static).
> > Maybe instead of moving to a separate file, we can try to come up with
> > a better name?
>
> I think you are referring to:
>
> rmem_swiotlb_setup
>
> ?

Yes, and the following swiotlb_alloc/free.

>
> Which is ARM specific and inside the generic code?
>
> 
>
> Christopher wants to unify it in all the code so there is one single
> source, but the "you separate arch code out from generic" saying
> makes me want to move it out.
>
> I agree that if you move it out from generic to arch-specific we have to
> expose more of the swiotlb functions, which will undo Christopher's
> cleanup code.
>
> How about this - let's leave it as is now, and when there are more
> use-cases we can revisit it and then, if needed, move the code?
>
Ok! Sounds good!


Re: [PATCH v7 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-24 Thread Claire Chang
On Mon, May 24, 2021 at 11:51 PM Konrad Rzeszutek Wilk
 wrote:
>
> On Tue, May 18, 2021 at 02:51:52PM +0800, Claire Chang wrote:
> > I still keep this function because directly using dev->dma_io_tlb_mem
> > will cause issues for memory allocation for existing devices. The pool
> > can't support atomic coherent allocation, so we need to distinguish the
> > per-device pool from the default pool in swiotlb_alloc.
>
> The above should really be rolled into the commit message. You can
> prefix it with "The reason it was done this way was because directly
> using .."
>

Will add it.


Re: [PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-24 Thread Claire Chang
On Mon, May 24, 2021 at 11:53 PM Konrad Rzeszutek Wilk
 wrote:
>
> > > do the set_memory_decrypted()+memset(). Is this okay or should
> > > swiotlb_init_io_tlb_mem() add an additional argument to do this
> > > conditionally?
> >
> > I'm actually not sure if this is okay. If not, I will add an
> > additional argument for it.
>
> Any observations discovered? (Want to make sure my memory-cache has the
> correct semantics for set_memory_decrypted in mind).

It works fine on my arm64 device.

> >
> > > --
> > > Florian
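
A minimal sketch of the conditional variant discussed above, in case the
unconditional decrypt+memset turns out to be a problem (the set_decrypted
argument is illustrative and not part of the v7 series):

static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
				    unsigned long nslabs, bool set_decrypted)
{
	void *vaddr = phys_to_virt(start);

	if (set_decrypted) {
		/* Decrypt and clear the pool only when the caller asks. */
		set_memory_decrypted((unsigned long)vaddr,
				     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
		memset(vaddr, 0, nslabs << IO_TLB_SHIFT);
	}
	/* ... rest of the slot/index initialization as in the v7 patch ... */
}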


Re: [PATCH v4 2/3] audit: add support for the openat2 syscall

2021-05-24 Thread Paul Moore
On Thu, May 20, 2021 at 3:58 AM Christian Brauner
 wrote:
> On Wed, May 19, 2021 at 04:00:21PM -0400, Richard Guy Briggs wrote:
> > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> > ("open: introduce openat2(2) syscall")
> >
> > Add the openat2(2) syscall to the audit syscall classifier.
> >
> > Link: https://github.com/linux-audit/audit-kernel/issues/67
> > Signed-off-by: Richard Guy Briggs 
> > Link: 
> > https://lore.kernel.org/r/f5f1a4d8699613f8c02ce762807228c841c2e26f.1621363275.git@redhat.com
> > ---
> >  arch/alpha/kernel/audit.c   | 2 ++
> >  arch/ia64/kernel/audit.c| 2 ++
> >  arch/parisc/kernel/audit.c  | 2 ++
> >  arch/parisc/kernel/compat_audit.c   | 2 ++
> >  arch/powerpc/kernel/audit.c | 2 ++
> >  arch/powerpc/kernel/compat_audit.c  | 2 ++
> >  arch/s390/kernel/audit.c| 2 ++
> >  arch/s390/kernel/compat_audit.c | 2 ++
> >  arch/sparc/kernel/audit.c   | 2 ++
> >  arch/sparc/kernel/compat_audit.c| 2 ++
> >  arch/x86/ia32/audit.c   | 2 ++
> >  arch/x86/kernel/audit_64.c  | 2 ++
> >  include/linux/auditsc_classmacros.h | 1 +
> >  kernel/auditsc.c| 3 +++
 lib/audit.c | 4 ++++
 lib/compat_audit.c  | 4 ++++
> >  16 files changed, 36 insertions(+)

...

> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index d775ea16505b..3f59ab209dfd 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -76,6 +76,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "audit.h"
> >
> > @@ -196,6 +197,8 @@ static int audit_match_perm(struct audit_context *ctx, 
> > int mask)
> >   return ((mask & AUDIT_PERM_WRITE) && ctx->argv[0] == 
> > SYS_BIND);
> >   case AUDITSC_EXECVE:
> >   return mask & AUDIT_PERM_EXEC;
> > + case AUDITSC_OPENAT2:
> > + return mask & ACC_MODE((u32)((struct open_how 
> > *)ctx->argv[2])->flags);
>
> That's a lot of dereferencing, casting and masking all at once. Maybe a
> small static inline helper would be good for the sake of legibility?
> Something like:
>
> static inline u32 audit_openat2_acc(struct open_how *how, int mask)
> {
> 	u32 flags = how->flags;
> 	return mask & ACC_MODE(flags);
> }
>
> but not sure. Just seems more legible to me.
> Otherwise.

I'm on the fence about this.  I understand Christian's concern, but I
have a bit of hatred towards single caller functions like this.  Since
this function isn't really high-touch, and I don't expect that to
change in the near future, let's leave the casting mess as-is.

-- 
paul moore
www.paul-moore.com


Re: Linux powerpc new system call instruction and ABI

2021-05-24 Thread Matheus Castanho


Matheus Castanho  writes:

> Dmitry V. Levin  writes:
>
>> On Fri, May 21, 2021 at 05:00:36PM -0300, Matheus Castanho wrote:
>>> Florian Weimer  writes:
>>> > * Matheus Castanho via Libc-alpha:
>>> >> From: Nicholas Piggin 
>>> >> Subject: [PATCH 1/1] powerpc: Fix handling of scv return error codes
>>> >>
>>> >> When using scv on templated ASM syscalls, current code interprets any
>>> >> negative return value as error, but the only valid error codes are in
>>> >> the range -4095..-1 according to the ABI.
>>> >>
>>> >> Reviewed-by: Matheus Castanho 
>>> >
>>> > Please reference bug 27892 in the commit message.  I'd also appreciate a
>>> > backport to the 2.33 release branch (where you need to add NEWS manually
>>> > to add the bug reference).
>>>
>>> No problem. [BZ #27892] appended to the commit title. I'll make sure to
>>> backport to 2.33 as well.
>>
>> Could you also mention in the commit message that the change fixes
>> 'signal.gen.test' strace test where it was observed initially?
>
> Sure, no problem. I'll commit it later today.

Since the patch falls into the less-than-15-LOC category and this is
Nick's first contribution to glibc, it looks like he doesn't need a
copyright assignment.

Pushed to master as 7de36744ee1325f35d3fe0ca079dd33c40b12267

Backported to 2.33 via commit 0ef0e6de7fdfa18328b09ba2afb4f0112d4bdab4

Thanks,
Matheus Castanho


Re: [PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout

2021-05-24 Thread Linus Torvalds
On Mon, May 24, 2021 at 3:38 AM Aneesh Kumar K.V
 wrote:
>
> Avoid the above race with MOVE_PMD by holding the pte ptl in mremap and
> waiting for the parallel pagetable walk to finish operating on the pte
> before updating new_pmd.

Ack on the concept.

However, not so much on the patch.

Odd whitespace change:

> @@ -254,6 +254,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
> unsigned long old_addr,
> if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
> return false;
>
> +
> /*
>  * We don't have to worry about the ordering of src and dst
>  * ptlocks because exclusive mmap_lock prevents deadlock.

And new optimization for empty pmd, which seems unrelated to the
change and should presumably be separate:

> @@ -263,6 +264,10 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
> unsigned long old_addr,
> if (new_ptl != old_ptl)
> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>
> +   if (pmd_none(*old_pmd))
> +   goto unlock_out;
> +
> +   pte_ptl = pte_lockptr(mm, old_pmd);
> /* Clear the pmd */
> pmd = *old_pmd;
> pmd_clear(old_pmd);

And also, why does the above assign 'pte_ptl' without using it, when
the actual use is ten lines further down?

So I think this patch needs some cleanup.

  Linus


Re: [PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush

2021-05-24 Thread Linus Torvalds
On Sun, May 23, 2021 at 11:04 PM Aneesh Kumar K.V
 wrote:
>
> Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and
> page walk cache where TLB entries are mapped with page size PAGE_SIZE.

So I dislike this patch for two reasons:

 (a) naming.

If the ppc people want to use crazy TLAs that have no meaning outside
of the powerpc community, that's fine. But only in powerpc code.

"pwc" makes no sense to me, or to anybody else that isn't intimately
involved in low-level powerpc stuff. I assume it's "page walk cache",
but honestly, outside of this area, PWC is mostly used for a specific
type of webcam.

So there's no way I'd accept this as-is, simply because of that.
flush_pte_tlb_pwc_range() is simply not an acceptable name. You would
have to spell it out, not use an obscure TLA.

But I think you don't even want to do that, because of

 (b) is this even worth it as a public interface?

Why doesn't the powerpc radix TLB flushing code just always flush the
page table walking cache when the range is larger than a PMD?

Once you have big flush ranges like that, I don't believe it makes any
sense not to flush the walking cache too.

NOTE! This is particularly true as "flush the walking cache" isn't a
well-defined operation anyway. Which _levels_ of the walking cache?
Again, the size (and alignment) of the flush would actually tell you.
A new boolean "flush" parameter does *NOT* tell that at all.

So I think this new interface is mis-named, but I also think it's
pointless. Just DTRT automatically when somebody asks for a flush that
covers a PMD range (or a PUD range).
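
A minimal sketch of that idea, reusing the __radix__flush_tlb_range_psize()
helper posted elsewhere in this series (the heuristic shown is illustrative,
not an actual patch):

static void radix__flush_tlb_range_psize(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end, int psize)
{
	/*
	 * A flush covering at least a PMD's worth of address space means
	 * page table pages may have been freed or moved, so the page walk
	 * cache for that range is stale as well.
	 */
	bool also_pwc = (end - start) >= PMD_SIZE;

	__radix__flush_tlb_range_psize(mm, start, end, psize, also_pwc);
}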

  Linus


Re: [PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-24 Thread Konrad Rzeszutek Wilk
> > do the set_memory_decrypted()+memset(). Is this okay or should
> > swiotlb_init_io_tlb_mem() add an additional argument to do this
> > conditionally?
> 
> I'm actually not sure if this is okay. If not, I will add an
> additional argument for it.

Any observations discovered? (Want to make sure my memory-cache has the
correct semantics for set_memory_decrypted in mind).
> 
> > --
> > Florian


Re: [PATCH v7 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-24 Thread Konrad Rzeszutek Wilk
On Tue, May 18, 2021 at 02:51:52PM +0800, Claire Chang wrote:
> I still keep this function because directly using dev->dma_io_tlb_mem
> will cause issues for memory allocation for existing devices. The pool
> can't support atomic coherent allocation, so we need to distinguish the
> per-device pool from the default pool in swiotlb_alloc.

The above should really be rolled into the commit message. You can
prefix it with "The reason it was done this way was because directly
using .."




Re: [PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-24 Thread Konrad Rzeszutek Wilk
On Tue, May 18, 2021 at 02:48:35PM +0800, Claire Chang wrote:
> I didn't move this to a separate file because I feel it might be
> confusing for swiotlb_alloc/free (and it would need more functions to
> be non-static).
> Maybe instead of moving to a separate file, we can try to come up with
> a better name?

I think you are referring to:

rmem_swiotlb_setup

?

Which is ARM specific and inside the generic code?



Christopher wants to unify it in all the code so there is one single
source, but the "you separate arch code out from generic" saying
makes me want to move it out.

I agree that if you move it out from generic to arch-specific we have to
expose more of the swiotlb functions, which will undo Christopher's
cleanup code.

How about this - let's leave it as is now, and when there are more
use-cases we can revisit it and then, if needed, move the code?



Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-24 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-24 15:16:09]:

> On 21/05/21 14:58, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2021-05-21 10:14:10]:
> >
> >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> >> > * Peter Zijlstra  [2021-05-20 20:56:31]:
> >> >
> >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> >> > > > Currently scheduler populates the distance map by looking at distance
> >> > > > of each node from all other nodes. This should work for most
> >> > > > architectures and platforms.
> >> > > >
> >> > > > However there are some architectures like POWER that may not expose
> >> > > > the distance of nodes that are not yet onlined because those 
> >> > > > resources
> >> > > > are not yet allocated to the OS instance. Such architectures have
> >> > > > other means to provide valid distance data for the current platform.
> >> > > >
> >> > > > For example distance info from numactl from a fully populated 8 node
> >> > > > system at boot may look like this.
> >> > > >
> >> > > > node distances:
> >> > > > node   0   1   2   3   4   5   6   7
> >> > > >   0:  10  20  40  40  40  40  40  40
> >> > > >   1:  20  10  40  40  40  40  40  40
> >> > > >   2:  40  40  10  20  40  40  40  40
> >> > > >   3:  40  40  20  10  40  40  40  40
> >> > > >   4:  40  40  40  40  10  20  40  40
> >> > > >   5:  40  40  40  40  20  10  40  40
> >> > > >   6:  40  40  40  40  40  40  10  20
> >> > > >   7:  40  40  40  40  40  40  20  10
> >> > > >
> >> > > > However the same system when only two nodes are online at boot, then 
> >> > > > the
> >> > > > numa topology will look like
> >> > > > node distances:
> >> > > > node   0   1
> >> > > >   0:  10  20
> >> > > >   1:  20  10
> >> > > >
> >> > > > It may be implementation dependent what node_distance(0,3) returns
> >> > > > when node 0 is online and node 3 is offline. In the POWER case, it
> >> > > > returns LOCAL_DISTANCE (10). Here at boot the scheduler would assume
> >> > > > that the max distance between nodes is 20. However that would not be
> >> > > > true.
> >> > > >
> >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> >> > > > the max node distance would be 40.
> >> > > >
> >> > > > To handle such scenarios, let scheduler allow architectures to 
> >> > > > populate
> >> > > > the distance map. Architectures that like to populate the distance 
> >> > > > map
> >> > > > can overload arch_populate_distance_map().
> >> > >
> >> > > Why? Why can't your node_distance() DTRT? The arch interface is
> >> > > nr_node_ids and node_distance(), I don't see why we need something new
> >> > > and then replace one special use of it.
> >> > >
> >> > > By virtue of you being able to actually implement this new hook, you
> >> > > supposedly can actually do node_distance() right too.
> >> >
> >> > Since for an offline node, arch interface code doesn't have the info.
> >> > As far as I know/understand, in POWER, unless there is an active memory 
> >> > or
> >> > CPU that's getting onlined, arch can't fetch the correct node distance.
> >> >
> >> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> >> > where X is anything other than 3, is not reliable. The moment node 3 is
> >> > onlined, the node distance is reliable.
> >> >
> >> > This problem will not happen even on POWER if all the nodes have either
> >> > memory or CPUs active at the time of boot.
> >>
> >> But then how can you implement this new hook? Going by the fact that
> >> both nr_node_ids and distance_ref_points_depth are fixed, how many
> >> possible __node_distance() configurations are there left?
> >>
> >
> > distance_ref_point_depth is provided as a different property and is readily
> > available at boot. The new API will just use that. So based on the
> > distance_ref_point_depth, we know all possible node distances for that
> > platform.
> >
> > For an offline node, we don't have that specific node's
> > distance_lookup_table array entries. Each array would have
> > distance_ref_point_depth entries. Without the distance_lookup_table
> > populated for a node, we will not be able to tell how far it is from
> > other nodes.
> >
> > We can look up the correct distance_lookup_table for a node based on the
> > memory or the CPUs attached to that node. Since for an offline node
> > neither would be around, the distance_lookup_table will have stale values.
> >
> 
> Ok so from your arch you can figure out the *size* of the set of unique
> distances, but not the individual node_distance(a, b)... That's quite
> unfortunate.

Yes, that's true.

> 
> I suppose one way to avoid the hook would be to write some "fake" distance
> values into your distance_lookup_table[] for offline nodes using your
> distance_ref_point_depth thing, i.e. ensure an iteration of
> node_distance(a, b) covers all distance values [1]. You can then keep patch
> 3 around, and that should roughly be it.
> 

Yes, this would suffice but to me it's not very 

Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly

2021-05-24 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-24 15:16:22]:

> On 20/05/21 21:14, Srikar Dronamraju wrote:
> > +int arch_populate_distance_map(unsigned long *distance_map)
> > +{
> > +   int i;
> > +   int distance = LOCAL_DISTANCE;
> > +
> > +   bitmap_set(distance_map, distance, 1);
> > +
> > +   if (!form1_affinity) {
> > +   bitmap_set(distance_map, REMOTE_DISTANCE, 1);
> > +   return 0;
> > +   }
> > +
> > +   for (i = 0; i < distance_ref_points_depth; i++) {
> > +   distance *= 2;
> > +   bitmap_set(distance_map, distance, 1);
> 
> Do you have guarantees your distance values will always be in the form of
> 
>   LOCAL_DISTANCE * 2^i
> 
> because that certainly isn't true for x86/arm64.
> 

This has been true till now. I don't think that's going to change anytime
soon, but we never know what lies ahead.

For all practical purposes (unless a newer, shinier property is proposed),
distance_ref_points_depth is going to give us the unique distances.

> > +   }
> > +   return 0;
> > +}
> > +
> >  /*
> >   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >   * info is found.
> > --
> > 2.27.0

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 7/9] mm/mremap: Move TLB flush outside page table lock

2021-05-24 Thread Aneesh Kumar K.V
Linus Torvalds  writes:

> On Fri, May 21, 2021 at 3:04 AM Aneesh Kumar K.V
>  wrote:
>>
>> We could do MOVE_PMD with something like below? A equivalent MOVE_PUD
>> will be costlier which makes me wonder whether we should even support that?
>
> Well, without USE_SPLIT_PTE_PTLOCKS the pud case would be trivial too.
> But everybody uses split pte locks in practice.
>
> I get the feeling that the rmap code might have to use
> pud_lock/pmd_lock. I wonder how painful that would be.
>

Looking at this further, I guess we need to do the above to close the
race window. We do:

static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
	/* The pte is dereferenced without the pte ptl held ... */
	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
	..
	/* ... and the lock is only taken afterwards. */
	pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
	spin_lock(pvmw->ptl);
}

That is, we walk the table without holding the pte ptl. Hence we can
still race with the optimized PMD move.

-aneesh




Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly

2021-05-24 Thread Valentin Schneider
On 20/05/21 21:14, Srikar Dronamraju wrote:
> +int arch_populate_distance_map(unsigned long *distance_map)
> +{
> + int i;
> + int distance = LOCAL_DISTANCE;
> +
> + bitmap_set(distance_map, distance, 1);
> +
> + if (!form1_affinity) {
> + bitmap_set(distance_map, REMOTE_DISTANCE, 1);
> + return 0;
> + }
> +
> + for (i = 0; i < distance_ref_points_depth; i++) {
> + distance *= 2;
> + bitmap_set(distance_map, distance, 1);

Do you have guarantees your distance values will always be in the form of

  LOCAL_DISTANCE * 2^i

because that certainly isn't true for x86/arm64.

> + }
> + return 0;
> +}
> +
>  /*
>   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
>   * info is found.
> --
> 2.27.0


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-24 Thread Valentin Schneider
On 21/05/21 14:58, Srikar Dronamraju wrote:
> * Peter Zijlstra  [2021-05-21 10:14:10]:
>
>> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
>> > * Peter Zijlstra  [2021-05-20 20:56:31]:
>> >
>> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
>> > > > Currently scheduler populates the distance map by looking at distance
>> > > > of each node from all other nodes. This should work for most
>> > > > architectures and platforms.
>> > > >
>> > > > However there are some architectures like POWER that may not expose
>> > > > the distance of nodes that are not yet onlined because those resources
>> > > > are not yet allocated to the OS instance. Such architectures have
>> > > > other means to provide valid distance data for the current platform.
>> > > >
>> > > > For example distance info from numactl from a fully populated 8 node
>> > > > system at boot may look like this.
>> > > >
>> > > > node distances:
>> > > > node   0   1   2   3   4   5   6   7
>> > > >   0:  10  20  40  40  40  40  40  40
>> > > >   1:  20  10  40  40  40  40  40  40
>> > > >   2:  40  40  10  20  40  40  40  40
>> > > >   3:  40  40  20  10  40  40  40  40
>> > > >   4:  40  40  40  40  10  20  40  40
>> > > >   5:  40  40  40  40  20  10  40  40
>> > > >   6:  40  40  40  40  40  40  10  20
>> > > >   7:  40  40  40  40  40  40  20  10
>> > > >
>> > > > However the same system when only two nodes are online at boot, then 
>> > > > the
>> > > > numa topology will look like
>> > > > node distances:
>> > > > node   0   1
>> > > >   0:  10  20
>> > > >   1:  20  10
>> > > >
>> > > > It may be implementation dependent what node_distance(0,3) returns
>> > > > when node 0 is online and node 3 is offline. In the POWER case, it
>> > > > returns LOCAL_DISTANCE (10). Here at boot the scheduler would assume
>> > > > that the max distance between nodes is 20. However that would not be
>> > > > true.
>> > > >
>> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
>> > > > the max node distance would be 40.
>> > > >
>> > > > To handle such scenarios, let scheduler allow architectures to populate
>> > > > the distance map. Architectures that like to populate the distance map
>> > > > can overload arch_populate_distance_map().
>> > >
>> > > Why? Why can't your node_distance() DTRT? The arch interface is
>> > > nr_node_ids and node_distance(), I don't see why we need something new
>> > > and then replace one special use of it.
>> > >
>> > > By virtue of you being able to actually implement this new hook, you
>> > > supposedly can actually do node_distance() right too.
>> >
>> > Since for an offline node, arch interface code doesn't have the info.
>> > As far as I know/understand, in POWER, unless there is an active memory or
>> > CPU that's getting onlined, arch can't fetch the correct node distance.
>> >
>> > Taking the above example: node 3 is offline, then node_distance of (3,X)
>> > where X is anything other than 3, is not reliable. The moment node 3 is
>> > onlined, the node distance is reliable.
>> >
>> > This problem will not happen even on POWER if all the nodes have either
>> > memory or CPUs active at the time of boot.
>>
>> But then how can you implement this new hook? Going by the fact that
>> both nr_node_ids and distance_ref_points_depth are fixed, how many
>> possible __node_distance() configurations are there left?
>>
>
> distance_ref_point_depth is provided as a different property and is readily
> available at boot. The new API will just use that. So based on the
> distance_ref_point_depth, we know all possible node distances for that
> platform.
>
> For an offline node, we don't have that specific node's
> distance_lookup_table array entries. Each array would have
> distance_ref_point_depth entries. Without the distance_lookup_table
> populated for a node, we will not be able to tell how far it is from
> other nodes.
>
> We can look up the correct distance_lookup_table for a node based on the
> memory or the CPUs attached to that node. Since for an offline node
> neither would be around, the distance_lookup_table will have stale values.
>

Ok so from your arch you can figure out the *size* of the set of unique
distances, but not the individual node_distance(a, b)... That's quite
unfortunate.

I suppose one way to avoid the hook would be to write some "fake" distance
values into your distance_lookup_table[] for offline nodes using your
distance_ref_point_depth thing, i.e. ensure an iteration of
node_distance(a, b) covers all distance values [1]. You can then keep patch
3 around, and that should roughly be it.
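
A rough sketch of that idea, using names from the powerpc form1-affinity
code discussed in this thread (the sentinel scheme is illustrative, not a
tested patch):

static void __init seed_offline_node_distances(void)
{
	int node, i;

	for_each_node(node) {
		if (node_online(node))
			continue;
		/*
		 * Give each offline node an associativity that matches no
		 * other node at any level, so node_distance() reports the
		 * platform's maximum distance for it until the node is
		 * onlined with real values.
		 */
		for (i = 0; i < distance_ref_points_depth; i++)
			distance_lookup_table[node][i] = 0x10000 + node;
	}
}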


>> The example provided above does not suggest there's much room for
>> alternatives, and hence for actual need of this new interface.
>>
>
> --
> Thanks and Regards
> Srikar Dronamraju


[PATCH v6 updated 9/11] mm/mremap: Fix race between mremap and pageout

2021-05-24 Thread Aneesh Kumar K.V
CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

                                addr = old_addr
                                lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pmd_ptl)
                                ptep_get_and_clear()
                                flush_tlb_range(old_addr)

                                old pfn is free.
                                                                Stale TLB entry

Avoid the above race with MOVE_PMD by holding the pte ptl in mremap and
waiting for the parallel pagetable walk to finish operating on the pte
before updating new_pmd.

With MOVE_PUD, the optimization is enabled only if USE_SPLIT_PTE_PTLOCKS
is disabled. In that case both the pte ptl and the pud ptl point to
mm->page_table_lock.

Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V 
---
Change:
* Check for split PTL before taking pte ptl lock.

 mm/mremap.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 8967a3707332..2fa3e0cb6176 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -224,7 +224,7 @@ static inline void flush_pte_tlb_pwc_range(struct 
vm_area_struct *vma,
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
 {
-   spinlock_t *old_ptl, *new_ptl;
+   spinlock_t *pte_ptl, *old_ptl, *new_ptl;
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;
 
@@ -254,6 +254,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
return false;
 
+
/*
 * We don't have to worry about the ordering of src and dst
 * ptlocks because exclusive mmap_lock prevents deadlock.
@@ -263,6 +264,10 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 
+   if (pmd_none(*old_pmd))
+   goto unlock_out;
+
+   pte_ptl = pte_lockptr(mm, old_pmd);
/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);
@@ -270,9 +275,20 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 * flush the TLB before we move the page table entries.
 */
flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
+
+   /*
+* Take the ptl here so that we wait for parallel page table walk
+* and operations (eg: pageout) using old addr to finish.
+*/
+   if (USE_SPLIT_PTE_PTLOCKS)
+   spin_lock(pte_ptl);
+
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
+   if (USE_SPLIT_PTE_PTLOCKS)
+   spin_unlock(pte_ptl);
 
+unlock_out:
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -296,6 +312,14 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pud_t pud;
 
+   /*
+* Disable MOVE_PUD until we get the pageout done with all
+* higher level page table locks held. With SPLIT_PTE_PTLOCKS
+* we use mm->page_table_lock for both pte ptl and pud ptl
+*/
+   if (USE_SPLIT_PTE_PTLOCKS)
+   return false;
+
/*
 * The destination pud shouldn't be established, free_pgtables()
 * should have released it.
-- 
2.31.1



Re: [PATCH] KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path

2021-05-24 Thread Fabiano Rosas
Nicholas Piggin  writes:

> Similar to commit 25edcc50d76c ("KVM: PPC: Book3S HV: Save and restore
> FSCR in the P9 path"), ensure the P7/8 path saves and restores the host
> FSCR. The logic explained in that patch actually applies to the old
> path as well: a context switch can be made before kvmppc_vcpu_run_hv
> restores the host FSCR and returns.
>
> Fixes: b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs")
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 5e634db4809b..2b98e710c7a1 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -44,7 +44,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>  #define NAPPING_UNSPLIT  3
>  
>  /* Stack frame offsets for kvmppc_hv_entry */
> -#define SFS  208
> +#define SFS  216
>  #define STACK_SLOT_TRAP  (SFS-4)
>  #define STACK_SLOT_SHORT_PATH(SFS-8)
>  #define STACK_SLOT_TID   (SFS-16)
> @@ -59,8 +59,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>  #define STACK_SLOT_UAMOR (SFS-88)
>  #define STACK_SLOT_DAWR1 (SFS-96)
>  #define STACK_SLOT_DAWRX1(SFS-104)
> +#define STACK_SLOT_FSCR  (SFS-112)
>  /* the following is used by the P9 short path */
> -#define STACK_SLOT_NVGPRS(SFS-152)   /* 18 gprs */
> +#define STACK_SLOT_NVGPRS(SFS-160)   /* 18 gprs */
>  
>  /*
>   * Call kvmppc_hv_entry in real mode.
> @@ -686,6 +687,8 @@ BEGIN_FTR_SECTION
>   std r6, STACK_SLOT_DAWR0(r1)
>   std r7, STACK_SLOT_DAWRX0(r1)
>   std r8, STACK_SLOT_IAMR(r1)
> + mfspr   r5, SPRN_FSCR
> + std r5, STACK_SLOT_FSCR(r1)
>  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  BEGIN_FTR_SECTION
>   mfspr   r6, SPRN_DAWR1
> @@ -1663,6 +1666,10 @@ FTR_SECTION_ELSE
>   ld  r7, STACK_SLOT_HFSCR(r1)
>   mtspr   SPRN_HFSCR, r7
>  ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
> +BEGIN_FTR_SECTION
> + ld  r5, STACK_SLOT_FSCR(r1)
> + mtspr   SPRN_FSCR, r5
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   /*
>* Restore various registers to 0, where non-zero values
>* set by the guest could disrupt the host.

So it seems this line in kvmppc_vcpu_run_hv loses its purpose now?

do {
(...)
} while (is_kvmppc_resume_guest(r));

/* Restore userspace EBB and other register values */
if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
mtspr(SPRN_EBBHR, ebb_regs[0]);
mtspr(SPRN_EBBRR, ebb_regs[1]);
mtspr(SPRN_BESCR, ebb_regs[2]);
mtspr(SPRN_TAR, user_tar);
--->mtspr(SPRN_FSCR, current->thread.fscr);
}


Re: Linux powerpc new system call instruction and ABI

2021-05-24 Thread Matheus Castanho


Dmitry V. Levin  writes:

> On Fri, May 21, 2021 at 05:00:36PM -0300, Matheus Castanho wrote:
>> Florian Weimer  writes:
>> > * Matheus Castanho via Libc-alpha:
>> >> From: Nicholas Piggin 
>> >> Subject: [PATCH 1/1] powerpc: Fix handling of scv return error codes
>> >>
>> >> When using scv on templated ASM syscalls, current code interprets any
>> >> negative return value as error, but the only valid error codes are in
>> >> the range -4095..-1 according to the ABI.
>> >>
>> >> Reviewed-by: Matheus Castanho 
>> >
>> > Please reference bug 27892 in the commit message.  I'd also appreciate a
>> > backport to the 2.33 release branch (where you need to add NEWS manually
>> > to add the bug reference).
>>
>> No problem. [BZ #27892] appended to the commit title. I'll make sure to
>> backport to 2.33 as well.
>
> Could you also mention in the commit message that the change fixes
> 'signal.gen.test' strace test where it was observed initially?

Sure, no problem. I'll commit it later today.

--
Matheus Castanho


[PATCH] powerpc/configs: Enable STACK_TRACER and FTRACE_SYSCALLS in some of the configs

2021-05-24 Thread Naveen N. Rao
Both these config options are generally enabled in distro kernels.
Enable the same in a few powerpc64 configs to get better coverage and
testing.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/configs/powernv_defconfig | 1 +
 arch/powerpc/configs/ppc64_defconfig   | 2 ++
 arch/powerpc/configs/pseries_defconfig | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig 
b/arch/powerpc/configs/powernv_defconfig
index 2c87e856d839b0..8bfeea6c7de7b4 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -309,6 +309,7 @@ CONFIG_SOFTLOCKUP_DETECTOR=y
 CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_FUNCTION_TRACER=y
 CONFIG_SCHED_TRACER=y
+CONFIG_STACK_TRACER=y
 CONFIG_FTRACE_SYSCALLS=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_PPC_EMULATED_STATS=y
diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index 701811c91a6f3f..0ad2291337a713 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -368,7 +368,9 @@ CONFIG_SOFTLOCKUP_DETECTOR=y
 CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_FUNCTION_TRACER=y
+CONFIG_FTRACE_SYSCALLS=y
 CONFIG_SCHED_TRACER=y
+CONFIG_STACK_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_CODE_PATCHING_SELFTEST=y
 CONFIG_FTR_FIXUP_SELFTEST=y
diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 50168dde4ea598..b183629f1bcfb8 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -289,7 +289,9 @@ CONFIG_DEBUG_STACKOVERFLOW=y
 CONFIG_SOFTLOCKUP_DETECTOR=y
 CONFIG_HARDLOCKUP_DETECTOR=y
 CONFIG_FUNCTION_TRACER=y
+CONFIG_FTRACE_SYSCALLS=y
 CONFIG_SCHED_TRACER=y
+CONFIG_STACK_TRACER=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_CODE_PATCHING_SELFTEST=y
 CONFIG_FTR_FIXUP_SELFTEST=y

base-commit: 8dbbcb8a8856c6b4e56ae705218d8dad1f9cf1e9
-- 
2.30.2



[PATCH -next] macintosh/therm_adt746x: Replaced simple_strtol() with kstrtoint()

2021-05-24 Thread Liu Shixin
The simple_strtol() function is deprecated in some situations since
it does not check for range overflow. Use kstrtoint() instead.

Signed-off-by: Liu Shixin 
---
 drivers/macintosh/therm_adt746x.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/therm_adt746x.c 
b/drivers/macintosh/therm_adt746x.c
index 7e218437730c..0d7ef55126ce 100644
--- a/drivers/macintosh/therm_adt746x.c
+++ b/drivers/macintosh/therm_adt746x.c
@@ -352,7 +352,8 @@ static ssize_t store_##name(struct device *dev, struct 
device_attribute *attr, c
struct thermostat *th = dev_get_drvdata(dev);   \
int val;\
int i;  \
-   val = simple_strtol(buf, NULL, 10); \
+   if (unlikely(kstrtoint(buf, 10, &val)))  \
+   return -EINVAL; \
printk(KERN_INFO "Adjusting limits by %d degrees\n", val);  \
limit_adjust = val; \
for (i=0; i < 3; i++)   \
@@ -364,7 +365,8 @@ static ssize_t store_##name(struct device *dev, struct 
device_attribute *attr, c
 static ssize_t store_##name(struct device *dev, struct device_attribute *attr, 
const char *buf, size_t n) \
 {  \
int val;\
-   val = simple_strtol(buf, NULL, 10); \
+   if (unlikely(kstrtoint(buf, 10, &val)))  \
+   return -EINVAL; \
if (val < 0 || val > 255)   \
return -EINVAL; \
printk(KERN_INFO "Setting specified fan speed to %d\n", val);   \
-- 
2.18.0.huawei.25



[PATCH v6 07/11] mm/mremap: Use range flush that does TLB and page walk cache flush

2021-05-24 Thread Aneesh Kumar K.V
Some architectures have the concept of a page walk cache which needs
to be flushed when updating higher levels of page tables. A fast mremap
that involves moving page table pages instead of copying pte entries
should flush the page walk cache since the old translation cache is no
longer valid.

Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and
page walk cache where TLB entries are mapped with page size PAGE_SIZE.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 10 ++
 mm/mremap.c   | 14 --
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index f9f8a3a264f7..e84fee9db106 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -80,6 +80,16 @@ static inline void flush_hugetlb_tlb_range(struct 
vm_area_struct *vma,
return flush_hugetlb_tlb_pwc_range(vma, start, end, false);
 }
 
+#define flush_pte_tlb_pwc_range flush_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start, unsigned long 
end)
+{
+   if (radix_enabled())
+   return radix__flush_tlb_pwc_range_psize(vma->vm_mm, start,
+   end, mmu_virtual_psize, 
true);
+   return hash__flush_tlb_range(vma, start, end);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
index 7372c8c0cf26..000a71917557 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -210,6 +210,16 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+#ifndef flush_pte_tlb_pwc_range
+#define flush_pte_tlb_pwc_range flush_pte_tlb_pwc_range
+static inline void flush_pte_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end)
+{
+   return flush_tlb_range(vma, start, end);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -260,7 +270,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 
-   flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -307,7 +317,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
VM_BUG_ON(!pud_none(*new_pud));
 
pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
-   flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
-- 
2.31.1



[PATCH v6 11/11] powerpc/mm: Enable HAVE_MOVE_PMD support

2021-05-24 Thread Aneesh Kumar K.V
mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:
1GB mremap - Source PTE-aligned, Destination PTE-aligned
  mremap time:  1122062ns
1GB mremap - Source PMD-aligned, Destination PMD-aligned
  mremap time:   522062ns
1GB mremap - Source PUD-aligned, Destination PUD-aligned
  mremap time:   523345ns

There is not much impact with HAVE_MOVE_PUD because the kernel by default
enables USE_SPLIT_PTE_PTLOCKS. That results in the optimized pud move being
disabled.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 3ce907523b1e..2e666e569fdf 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -97,6 +97,8 @@ config PPC_BOOK3S_64
select PPC_HAVE_PMU_SUPPORT
select SYS_SUPPORTS_HUGETLBFS
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+   select HAVE_MOVE_PMD
+   select HAVE_MOVE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_NUMA_BALANCING
select IRQ_WORK
-- 
2.31.1



[PATCH v6 09/11] mm/mremap: Fix race between mremap and pageout

2021-05-24 Thread Aneesh Kumar K.V
CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

                                addr = old_addr
                                lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pmd_ptl)
                                ptep_get_and_clear()
                                flush_tlb_range(old_addr)

                                old pfn is free.
                                                                Stale TLB entry

Avoid the above race with MOVE_PMD by holding the pte ptl in mremap and
waiting for the parallel pagetable walk to finish operating on the pte
before updating new_pmd.

With MOVE_PUD, the optimization is enabled only if USE_SPLIT_PTE_PTLOCKS
is disabled. In that case both the pte ptl and the pud ptl point to
mm->page_table_lock.

Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wgxvr04ebntxqfevontwnp6fdm+oj5vauqxp3s-huw...@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 8967a3707332..e70b8e3b9568 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -224,7 +224,7 @@ static inline void flush_pte_tlb_pwc_range(struct 
vm_area_struct *vma,
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
 {
-   spinlock_t *old_ptl, *new_ptl;
+   spinlock_t *pte_ptl, *old_ptl, *new_ptl;
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;
 
@@ -254,6 +254,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
return false;
 
+
/*
 * We don't have to worry about the ordering of src and dst
 * ptlocks because exclusive mmap_lock prevents deadlock.
@@ -263,6 +264,14 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 
+   if (pmd_none(*old_pmd))
+   goto unlock_out;
+
+   /*
+* Take the ptl here so that we wait for parallel page table walk
+* and operations (eg: pageout) using old addr to finish.
+*/
+   pte_ptl = pte_lockptr(mm, old_pmd);
/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);
@@ -270,9 +279,14 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 * flush the TLB before we move the page table entries.
 */
flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
+
+   spin_lock(pte_ptl);
+
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
+   spin_unlock(pte_ptl);
 
+unlock_out:
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -296,6 +310,14 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pud_t pud;
 
+   /*
+* Disable MOVE_PUD until we get the pageout done with all
+* higher level page table locks held. With SPLIT_PTE_PTLOCKS
+* we use mm->page_table_lock for both pte ptl and pud ptl
+*/
+   if (USE_SPLIT_PTE_PTLOCKS)
+   return false;
+
/*
 * The destination pud shouldn't be established, free_pgtables()
 * should have released it.
-- 
2.31.1



[PATCH v6 10/11] mm/mremap: Allow arch runtime override

2021-05-24 Thread Aneesh Kumar K.V
Architectures like ppc64 support faster mremap only with radix
translation. Hence allow a runtime check w.r.t support for fast mremap.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/tlb.h |  6 ++
 mm/mremap.c| 15 ++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..09a9ae5f3656 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -83,5 +83,11 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 }
 #endif
 
+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+   return radix_enabled();
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_TLB_H */
diff --git a/mm/mremap.c b/mm/mremap.c
index e70b8e3b9568..42485d19c490 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -25,7 +25,7 @@
 #include 
 
 #include 
-#include 
+#include 
 #include 
 
 #include "internal.h"
@@ -220,6 +220,15 @@ static inline void flush_pte_tlb_pwc_range(struct 
vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_supports_page_table_move
+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+   return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
+   IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
+}
+#endif
+
 #ifdef CONFIG_HAVE_MOVE_PMD
 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -228,6 +237,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;
 
+   if (!arch_supports_page_table_move())
+   return false;
/*
 * The destination pmd shouldn't be established, free_pgtables()
 * should have released it.
@@ -318,6 +329,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
if (USE_SPLIT_PTE_PTLOCKS)
return false;
 
+   if (!arch_supports_page_table_move())
+   return false;
/*
 * The destination pud shouldn't be established, free_pgtables()
 * should have released it.
-- 
2.31.1



[PATCH v6 08/11] mm/mremap: properly flush the TLB on mremap.

2021-05-24 Thread Aneesh Kumar K.V
As explained in
commit eb66ae030829 ("mremap: properly flush TLB before releasing the page"),
mremap is special in that it doesn't take ownership of the page. The
optimized version for PUD/PMD aligned mremap also doesn't hold the ptl lock.
Hence flush the TLB before we update the new page table location. This
ensures the kernel invalidates the older translation cache before it can
free the page via the newly inserted translation.

Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: 
https://lore.kernel.org/linux-mm/CAHk-=wjq8thag3unv-2mmu75ogx5ybmon7gzduhywzetwcz...@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 000a71917557..8967a3707332 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -266,11 +266,13 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);
-
+   /*
+* flush the TLB before we move the page table entries.
+*/
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
VM_BUG_ON(!pmd_none(*new_pmd));
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 
-   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -313,11 +315,14 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
/* Clear the pud */
pud = *old_pud;
pud_clear(old_pud);
+   /*
+* flush the TLB before we move the page table entries.
+*/
+   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE);
 
VM_BUG_ON(!pud_none(*new_pud));
 
pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
-   flush_pte_tlb_pwc_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
-- 
2.31.1



[PATCH v6 06/11] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument

2021-05-24 Thread Aneesh Kumar K.V
No functional change in this patch

Signed-off-by: Aneesh Kumar K.V 
---
 .../include/asm/book3s/64/tlbflush-radix.h| 19 +++-
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 23 ---
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  4 +--
 arch/powerpc/mm/book3s64/radix_tlb.c  | 29 +++
 4 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 8b33601cdb9d..171441a43b35 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -56,15 +56,18 @@ static inline void radix__flush_all_lpid_guest(unsigned int 
lpid)
 }
 #endif
 
-extern void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
-  unsigned long start, unsigned long 
end);
-extern void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long 
start,
-unsigned long end, int psize);
-extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
-  unsigned long start, unsigned long end);
-extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long 
start,
+void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end,
+   bool flush_pwc);
+void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end,
+   bool flush_pwc);
+void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long 
start,
+ unsigned long end, int psize, bool 
flush_pwc);
+void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end);
-extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
+void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
 
 extern void radix__local_flush_tlb_mm(struct mm_struct *mm);
 extern void radix__local_flush_all_mm(struct mm_struct *mm);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 215973b4cb26..f9f8a3a264f7 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -45,13 +45,30 @@ static inline void tlbiel_all_lpid(bool radix)
hash__tlbiel_all(TLB_INVAL_SCOPE_LPID);
 }
 
+static inline void flush_pmd_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end,
+  bool flush_pwc)
+{
+   if (radix_enabled())
+   return radix__flush_pmd_tlb_range(vma, start, end, flush_pwc);
+   return hash__flush_tlb_range(vma, start, end);
+}
 
 #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
 static inline void flush_pmd_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
+{
+   return flush_pmd_tlb_pwc_range(vma, start, end, false);
+}
+
+static inline void flush_hugetlb_tlb_pwc_range(struct vm_area_struct *vma,
+  unsigned long start,
+  unsigned long end,
+  bool flush_pwc)
 {
if (radix_enabled())
-   return radix__flush_pmd_tlb_range(vma, start, end);
+   return radix__flush_hugetlb_tlb_range(vma, start, end, 
flush_pwc);
return hash__flush_tlb_range(vma, start, end);
 }
 
@@ -60,9 +77,7 @@ static inline void flush_hugetlb_tlb_range(struct 
vm_area_struct *vma,
   unsigned long start,
   unsigned long end)
 {
-   if (radix_enabled())
-   return radix__flush_hugetlb_tlb_range(vma, start, end);
-   return hash__flush_tlb_range(vma, start, end);
+   return flush_hugetlb_tlb_pwc_range(vma, start, end, false);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c 
b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index cb91071eef52..e62f5679b119 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -26,13 +26,13 @@ void radix__local_flush_hugetlb_page(struct vm_area_struct 
*vma, unsigned long v
 }
 
 void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long 
start,
-  unsigned long end)
+   unsigned long end, bool flush_pwc)
 {
int psize;
struct hstate *hstate = 

[PATCH v6 04/11] mm/mremap: Use pmd/pud_populate to update page table entries

2021-05-24 Thread Aneesh Kumar K.V
pmd/pud_populate is the right interface to be used to set the respective
page table entries. Some architectures like ppc64 do assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a hugepage
PTE here, use the pmd/pud_populate interface.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 1d6fadbd4820..7372c8c0cf26 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
pmd_clear(old_pmd);
 
VM_BUG_ON(!pmd_none(*new_pmd));
+   pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 
-   /* Set the new pmd */
-   set_pmd_at(mm, new_addr, new_pmd, pmd);
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
unsigned long old_addr,
 
VM_BUG_ON(!pud_none(*new_pud));
 
-   /* Set the new pud */
-   set_pud_at(mm, new_addr, new_pud, pud);
+   pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
-- 
2.31.1



[PATCH v6 05/11] powerpc/mm/book3s64: Fix possible build error

2021-05-24 Thread Aneesh Kumar K.V
Update _tlbiel_pid() such that we can avoid build errors like below when
using this function in other places.

arch/powerpc/mm/book3s64/radix_tlb.c: In function ‘__radix__flush_tlb_range_psize’:
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: warning: ‘asm’ operand 3 probably does not match constraints
  114 |  asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
  |  ^~~
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: error: impossible constraint in ‘asm’
make[4]: *** [scripts/Makefile.build:271: arch/powerpc/mm/book3s64/radix_tlb.o] Error 1

With this fix, we can also drop the __always_inline in
__radix__flush_tlb_range_psize which was added by commit e12d6d7d46a6
("powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as
__always_inline").

Reviewed-by: Christophe Leroy 
Acked-by: Michael Ellerman 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..817a02ef6032 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -291,22 +291,30 @@ static inline void fixup_tlbie_lpid(unsigned long lpid)
 /*
  * We use 128 set in radix mode and 256 set in hpt mode.
  */
-static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
+static inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
 {
int set;
 
asm volatile("ptesync": : :"memory");
 
-   /*
-* Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
-* also flush the entire Page Walk Cache.
-*/
-   __tlbiel_pid(pid, 0, ric);
+   switch (ric) {
+   case RIC_FLUSH_PWC:
 
-   /* For PWC, only one flush is needed */
-   if (ric == RIC_FLUSH_PWC) {
+   /* For PWC, only one flush is needed */
+   __tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
ppc_after_tlbiel_barrier();
return;
+   case RIC_FLUSH_TLB:
+   __tlbiel_pid(pid, 0, RIC_FLUSH_TLB);
+   break;
+   case RIC_FLUSH_ALL:
+   default:
+   /*
+* Flush the first set of the TLB, and if
+* we're doing a RIC_FLUSH_ALL, also flush
+* the entire Page Walk Cache.
+*/
+   __tlbiel_pid(pid, 0, RIC_FLUSH_ALL);
}
 
if (!cpu_has_feature(CPU_FTR_ARCH_31)) {
@@ -1176,7 +1184,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
}
 }
 
-static __always_inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
+static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
unsigned long start, unsigned long end,
int psize, bool also_pwc)
 {
-- 
2.31.1
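
A minimal sketch of why the switch helps (my own hypothetical standalone
file, not kernel code; compile with gcc -O2 -c). The PPC_TLBIEL template
used by __tlbiel_pid() takes operands with the "i" (immediate) constraint,
which only works when the argument constant-folds at each call site:

	static inline void emit(int ric)
	{
		/* "i" demands a compile-time integer constant */
		asm volatile("# ric %0" : : "i" (ric));
	}

	void flush_pwc(void)
	{
		emit(1);		/* constant-folds: builds fine */
	}

	void flush_any(int ric)
	{
		/* emit(ric); */	/* non-constant ric: fails to build with
					 * "impossible constraint in 'asm'" */
		(void)ric;
	}

With the switch in _tlbiel_pid(), every __tlbiel_pid() call site passes a
literal RIC_* value, so the function no longer depends on __always_inline
for the constants to propagate.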



[PATCH v6 02/11] selftest/mremap_test: Avoid crash with static build

2021-05-24 Thread Aneesh Kumar K.V
With a large mmap size, the source mapping can overlap the text area, and
using MAP_FIXED then results in unmapping that area. Switch to
MAP_FIXED_NOREPLACE and handle the resulting EEXIST error.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
 retry:
addr += c.src_alignment;
src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-   MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+   MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+   -1, 0);
if (src_addr == MAP_FAILED) {
-   if (errno == EPERM)
+   if (errno == EPERM || errno == EEXIST)
goto retry;
goto error;
}
-- 
2.31.1
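
A self-contained userspace sketch of the same retry pattern (my own demo,
not part of the patch; assumes glibc 2.28+ for the flag definition and a
kernel >= 4.17, where MAP_FIXED_NOREPLACE is actually honored):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		unsigned long addr = 0x700000000000UL;	/* arbitrary hint */
		void *p;

		for (;;) {
			p = mmap((void *)addr, 4096, PROT_READ | PROT_WRITE,
				 MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_PRIVATE,
				 -1, 0);
			if (p != MAP_FAILED)
				break;
			if (errno == EEXIST || errno == EPERM) {
				addr += 4096;	/* slide past the occupied range */
				continue;
			}
			perror("mmap");
			return 1;
		}
		printf("mapped at %p\n", p);
		return 0;
	}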



[PATCH v6 01/11] selftest/mremap_test: Update the test to handle pagesize other than 4K

2021-05-24 Thread Aneesh Kumar K.V
Instead of hardcoding a 4K page size, fetch it using sysconf(). For the
performance measurements the test still assumes 2M and 1G are hugepage sizes.

Reviewed-by: Kalesh Singh 
Signed-off-by: Aneesh Kumar K.V 
---
 tools/testing/selftests/vm/mremap_test.c | 113 ---
 1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
_4MB = 4ULL << 20,
_1GB = 1ULL << 30,
_2GB = 2ULL << 30,
-   PTE = _4KB,
PMD = _2MB,
PUD = _1GB,
 };
 
+#define PTE page_size
+
 #define MAKE_TEST(source_align, destination_align, size,   \
  overlaps, should_fail, test_name) \
-{  \
+(struct test){ \
.name = test_name,  \
.config = { \
.src_alignment = source_align,  \
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
return 0;
 }
 
+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
int failures = 0;
int i, run_perf_tests;
unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
unsigned int pattern_seed;
+   struct test test_cases[MAX_TEST];
+   struct test perf_test_cases[MAX_PERF_TEST];
+   int page_size;
time_t t;
 
	pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
ksft_print_msg("Test 
configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n",
   threshold_mb, pattern_seed);
 
-   struct test test_cases[] = {
-   /* Expected mremap failures */
-   MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source and Destination Regions Overlapping"),
-   MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Destination Address Misaligned (1KB-aligned)"),
-   MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source Address Misaligned (1KB-aligned)"),
-
-   /* Src addr PTE aligned */
-   MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "8KB mremap - Source PTE-aligned, Destination PTE-aligned"),
-
-   /* Src addr 1MB aligned */
-   MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"),
-   MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"),
-
-   /* Src addr PMD aligned */
-   MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PMD-aligned"),
-
-   /* Src addr PUD aligned */
-   MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"),
-   MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-   };
-
-   struct test perf_test_cases[] = {
-   /*
-* mremap 1GB region - Page table level aligned time
-* comparison.
-*/
-   MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PTE-aligned, Destination PTE-aligned"),
-   MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PMD-aligned, Destination PMD-aligned"),
-   MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-   };
+   page_size = sysconf(_SC_PAGESIZE);
+
+   /* Expected mremap failures */
+   test_cases[0] = 
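
A small standalone illustration of the sysconf() approach the test now
uses (my own demo, not part of the patch):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);	/* 4K, 64K, ... */

		/* sizes previously hardcoded as _4KB/_8KB can now be
		 * derived at runtime */
		printf("PTE-level unit: %ld bytes\n", page_size);
		printf("two pages:      %ld bytes\n", 2 * page_size);
		return 0;
	}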

[PATCH v6 03/11] mm/mremap: Convert huge PUD move to separate helper

2021-05-24 Thread Aneesh Kumar K.V
With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD entries.
Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of PUD_SIZE-aligned,
level-4 PTE mapped addresses.

This also makes sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mremap.c | 80 -
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..1d6fadbd4820 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+   pud_t pud;
+
+   /*
+* The destination pud shouldn't be established, free_pgtables()
+* should have released it.
+*/
+   if (WARN_ON_ONCE(!pud_none(*new_pud)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_lock prevents deadlock.
+*/
+   old_ptl = pud_lock(vma->vm_mm, old_pud);
+   new_ptl = pud_lockptr(mm, new_pud);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pud */
+   pud = *old_pud;
+   pud_clear(old_pud);
+
+   VM_BUG_ON(!pud_none(*new_pud));
+
+   /* Set the new pud */
+   /* mark soft_dirty when we add pud level soft dirty support */
+   set_pud_at(mm, new_addr, new_pud, pud);
+   flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   return true;
+}
+#else
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+   WARN_ON_ONCE(1);
+   return false;
+
+}
+#endif
+
 enum pgt_entry {
NORMAL_PMD,
HPAGE_PMD,
NORMAL_PUD,
+   HPAGE_PUD,
 };
 
 /*
@@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
mask = PMD_MASK;
size = PMD_SIZE;
break;
+   case HPAGE_PUD:
case NORMAL_PUD:
mask = PUD_MASK;
size = PUD_SIZE;
@@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
move_huge_pmd(vma, old_addr, new_addr, old_entry,
  new_entry);
break;
+   case HPAGE_PUD:
+   moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
+ new_entry);
+   break;
+
default:
WARN_ON_ONCE(1);
break;
@@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long extent, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
+   pud_t *old_pud, *new_pud;
 
old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
@@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 * PUD level if possible.
 */
extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
-   if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
-   pud_t *old_pud, *new_pud;
 
-   old_pud = get_old_pud(vma->vm_mm, old_addr);
-   if (!old_pud)
+   old_pud = get_old_pud(vma->vm_mm, old_addr);
+   if (!old_pud)
+   continue;
+   new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+   if (!new_pud)
+   break;
+   if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+   if (extent == HPAGE_PUD_SIZE) {
+   move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
+  old_pud, new_pud, need_rmap_locks);
+   /* We ignore and continue on error? */
continue;
-   new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
-   if (!new_pud)
-   break;
+   }
+   } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
+
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
   

[PATCH v6 00/11] Speedup mremap on ppc64

2021-05-24 Thread Aneesh Kumar K.V


This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. This also needs to invalidate the Page Walk
Cache on architecture supporting the same.

Changes from v5:
* Drop patch mm/mremap: Move TLB flush outside page table lock
* Add fixes for race between optimized mremap and page out

Changes from v4:
* Change function name and arguments based on review feedback.

Changes from v3:
* Fix build error reported by kernel test robot
* Address review feedback.

Changes from v2:
* switch from using mmu_gather to flush_pte_tlb_pwc_range() 

Changes from v1:
* Rebase to recent upstream
* Fix build issues with tlb_gather_mmu changes



Aneesh Kumar K.V (11):
  selftest/mremap_test: Update the test to handle pagesize other than 4K
  selftest/mremap_test: Avoid crash with static build
  mm/mremap: Convert huge PUD move to separate helper
  mm/mremap: Use pmd/pud_populate to update page table entries
  powerpc/mm/book3s64: Fix possible build error
  powerpc/mm/book3s64: Update tlb flush routines to take a page walk
cache flush argument
  mm/mremap: Use range flush that does TLB and page walk cache flush
  mm/mremap: properly flush the TLB on mremap.
  mm/mremap: Fix race between mremap and pageout
  mm/mremap: Allow arch runtime override
  powerpc/mm: Enable HAVE_MOVE_PMD support

 .../include/asm/book3s/64/tlbflush-radix.h|  19 ++-
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  29 +++-
 arch/powerpc/include/asm/tlb.h|   6 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |   4 +-
 arch/powerpc/mm/book3s64/radix_tlb.c  |  55 ---
 arch/powerpc/platforms/Kconfig.cputype|   2 +
 mm/mremap.c   | 145 --
 tools/testing/selftests/vm/mremap_test.c  | 118 +++---
 8 files changed, 269 insertions(+), 109 deletions(-)

-- 
2.31.1



Re: [PATCH 14/26] md: convert to blk_alloc_disk/blk_cleanup_disk

2021-05-24 Thread Hannes Reinecke

On 5/24/21 9:26 AM, Christoph Hellwig wrote:

On Sun, May 23, 2021 at 10:12:49AM +0200, Hannes Reinecke wrote:

+   blk_set_stacking_limits(&mddev->queue->limits);
blk_queue_write_cache(mddev->queue, true, true);
/* Allow extended partitions.  This makes the
 * 'mdp' device redundant, but we can't really


Wouldn't it make sense to introduce a helper 'blk_queue_from_disk()' or
somesuch to avoid having to keep an explicit 'queue' pointer?


My rough plan is that a few series from now, bio-based drivers will
never directly deal with the request_queue at all.


Go for it.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] KVM: PPC: Book3S HV: Fix reverse map real-mode address lookup with huge vmalloc

2021-05-24 Thread Aneesh Kumar K.V
Christophe Leroy  writes:

> Nicholas Piggin  a écrit :
>
>> real_vmalloc_addr() does not currently work for huge vmalloc, which is
>> what the reverse map can be allocated with for radix host, hash guest.
>>
>> Add huge page awareness to the function.
>>
>> Fixes: 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc mappings")
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 17 -
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c  
>> b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
>> index 7af7c70f1468..5f68cb5cc009 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
>> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
>> @@ -26,16 +26,23 @@
>>  static void *real_vmalloc_addr(void *x)
>>  {
>>  unsigned long addr = (unsigned long) x;
>> +unsigned long mask;
>> +int shift;
>>  pte_t *p;
>> +
>>  /*
>> - * assume we don't have huge pages in vmalloc space...
>> - * So don't worry about THP collapse/split. Called
>> - * Only in realmode with MSR_EE = 0, hence won't need irq_save/restore.
>> + * This is called only in realmode with MSR_EE = 0, hence won't need
>> + * irq_save/restore around find_init_mm_pte.
>>   */
>> -p = find_init_mm_pte(addr, NULL);
>> +p = find_init_mm_pte(addr, &shift);
>>  if (!p || !pte_present(*p))
>>  return NULL;
>> -addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
>> +if (!shift)
>> +shift = PAGE_SHIFT;
>> +
>> +mask = (1UL << shift) - 1;
>> +addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & mask);
>
> Looks strange, before we have ~MASK now we have mask without the ~

#define PAGE_MASK  (~((1 << PAGE_SHIFT) - 1))

-aneesh
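
A quick standalone check of the equivalence (my own demo, not from the
patch): since PAGE_MASK already carries the ~, ~PAGE_MASK equals
(1UL << PAGE_SHIFT) - 1, which is exactly what mask holds when
shift == PAGE_SHIFT:

	#include <assert.h>
	#include <stdio.h>

	#define PAGE_SHIFT 12
	#define PAGE_MASK  (~((1UL << PAGE_SHIFT) - 1))

	int main(void)
	{
		unsigned long addr = 0x123456789abcUL;
		unsigned long mask = (1UL << PAGE_SHIFT) - 1;

		assert((addr & ~PAGE_MASK) == (addr & mask));
		printf("page offset = 0x%lx\n", addr & mask);
		return 0;
	}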


Re: [PATCH 18/26] nvme-multipath: convert to blk_alloc_disk/blk_cleanup_disk

2021-05-24 Thread Christoph Hellwig
On Sun, May 23, 2021 at 10:20:27AM +0200, Hannes Reinecke wrote:
> What about the check for GENHD_FL_UP a bit further up in line 766?
> Can this still happen with the new allocation scheme, ie is there still a 
> difference in lifetime between ->disk and ->disk->queue?

Yes, nvme_free_ns_head can still be called before device_add_disk was
called for an allocated nshead gendisk during error handling of the
setup path.  There is still a difference in the lifetime in that they
are separately refcounted, but it does not matter to the driver.


Re: [PATCH 14/26] md: convert to blk_alloc_disk/blk_cleanup_disk

2021-05-24 Thread Christoph Hellwig
On Sun, May 23, 2021 at 10:12:49AM +0200, Hannes Reinecke wrote:
>> +blk_set_stacking_limits(&mddev->queue->limits);
>>  blk_queue_write_cache(mddev->queue, true, true);
>>  /* Allow extended partitions.  This makes the
>>   * 'mdp' device redundant, but we can't really
>>
> Wouldn't it make sense to introduce a helper 'blk_queue_from_disk()' or 
> somesuch to avoid having to keep an explicit 'queue' pointer?

My rough plan is that a few series from now, bio-based drivers will
never directly deal with the request_queue at all.


Re: [PATCH 13/26] dm: convert to blk_alloc_disk/blk_cleanup_disk

2021-05-24 Thread Christoph Hellwig
On Sun, May 23, 2021 at 10:10:34AM +0200, Hannes Reinecke wrote:
> Can't these conditionals be merged into a single 'if (md->disk)'?
> Eg like:
>
>   if (md->disk) {
>   spin_lock(&_minor_lock);
>   md->disk->private_data = NULL;
>   spin_unlock(&_minor_lock);
>   del_gendisk(md->disk);
>   dm_queue_destroy_keyslot_manager(md->queue);
>   blk_cleanup_disk(md->queue);
>   }
>
> We're now always allocating 'md->disk' and 'md->queue' together,
> so how can we end up in a situation where one is set without the other?

I guess we could do that, not sure it is worth the churn, though.


Re: [PATCH 06/26] brd: convert to blk_alloc_disk/blk_cleanup_disk

2021-05-24 Thread Christoph Hellwig
On Sun, May 23, 2021 at 09:58:48AM +0200, Hannes Reinecke wrote:
>> +/*
>> + * This is so fdisk will align partitions on 4k, because of
>> + * direct_access API needing 4k alignment, returning a PFN
>> + * (This is only a problem on very small devices <= 4M,
>> + *  otherwise fdisk will align on 1M. Regardless this call
>> + *  is harmless)
>> + */
>> +blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
>>   
>
> Maybe converting the comment to refer to 'PAGE_SIZE' instead of 4k while 
> you're at it ...

I really do not want to touch these kinds of unrelated things here.


Re: [dm-devel] [PATCH 05/26] block: add blk_alloc_disk and blk_cleanup_disk APIs

2021-05-24 Thread Christoph Hellwig
On Fri, May 21, 2021 at 05:44:07PM +, Luis Chamberlain wrote:
> It's not obvious to me why using this new API then requires you to
> set minors explicitly to 1, and yet here underneath we see that the minors
> argument passed is 0.
> 
> Nor is it clear from the documentation.

Basically, for all new drivers no one should set minors at all, and the
dynamic dev_t mechanism does all the work.  For converted old drivers,
minors is set manually instead of being passed as an argument; it should
be 0 for all new drivers.
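
A rough sketch of the new-driver pattern being described (kernel context,
not a standalone program; mydrv_fops and the disk name are hypothetical):

	struct gendisk *disk;

	disk = blk_alloc_disk(NUMA_NO_NODE);	/* disk + queue in one call */
	if (!disk)
		return -ENOMEM;

	/* no major/minors setup at all: leaving minors at 0 means the
	 * dev_t is allocated dynamically from BLOCK_EXT_MAJOR */
	disk->fops = &mydrv_fops;
	snprintf(disk->disk_name, sizeof(disk->disk_name), "mydrv0");
	add_disk(disk);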


Re: [PATCH 01/26] block: refactor device number setup in __device_add_disk

2021-05-24 Thread Christoph Hellwig
On Sun, May 23, 2021 at 09:46:01AM +0200, Hannes Reinecke wrote:
> ... and it also fixes an issue where GENHD_FL_UP remained set in an error
> path in __device_add_disk().

Well, the error path in __device_add_disk is a complete disaster right
now, but fortunately Luis is looking into it.


Re: [dm-devel] [PATCH 01/26] block: refactor device number setup in __device_add_disk

2021-05-24 Thread Christoph Hellwig
On Fri, May 21, 2021 at 05:16:46PM +, Luis Chamberlain wrote:
> > -   /* in consecutive minor range? */
> > -   if (bdev->bd_partno < disk->minors) {
> > -   *devt = MKDEV(disk->major, disk->first_minor + bdev->bd_partno);
> > -   return 0;
> > -   }
> > -
> 
> It is not obvious to me why this was part of the add_disk()
> path, and ...
> 
> > diff --git a/block/partitions/core.c b/block/partitions/core.c
> > index dc60ecf46fe6..504297bdc8bf 100644
> > --- a/block/partitions/core.c
> > +++ b/block/partitions/core.c
> > @@ -379,9 +380,15 @@ static struct block_device *add_partition(struct 
> > gendisk *disk, int partno,
> > pdev->type = _type;
> > pdev->parent = ddev;
> >  
> > -   err = blk_alloc_devt(bdev, );
> > -   if (err)
> > -   goto out_put;
> > +   /* in consecutive minor range? */
> > +   if (bdev->bd_partno < disk->minors) {
> > +   devt = MKDEV(disk->major, disk->first_minor + bdev->bd_partno);
> > +   } else {
> > +   err = blk_alloc_ext_minor();
> > +   if (err < 0)
> > +   goto out_put;
> > +   devt = MKDEV(BLOCK_EXT_MAJOR, err);
> > +   }
> > pdev->devt = devt;
> >  
> > /* delay uevent until 'holders' subdir is created */
> 
> ... and why we only add this here now.

For a gendisk, minors == 0 (aka GENHD_FL_EXT_DEVT) implies having to
allocate a dynamic dev_t, so it can be folded into another conditional.


[PATCH] docs: kernel-parameters: mark numa=off is supported by a bundle of architectures

2021-05-24 Thread Barry Song
risc-v and arm64 support numa=off via the common arch_numa_init() in
drivers/base/arch_numa.c; x86, ppc, mips, and sparc support it via
arch-level early_param handlers. numa=off is widely used in Linux
distributions, so it is better to document it.

Signed-off-by: Barry Song 
---
 Documentation/admin-guide/kernel-parameters.txt | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cb89dbdedc46..a388fbdaa2ec 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3513,6 +3513,9 @@
 
nr_uarts=   [SERIAL] maximum number of UARTs to be registered.
 
+   numa=off        [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA. Only
+                   set up a single NUMA node spanning all memory.
+
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
Allowed values are enable and disable
-- 
2.25.1