Re: [PATCH v2] memblock: make memblock_find_in_range method private

2021-08-02 Thread Kirill A. Shutemov
On Mon, Aug 02, 2021 at 09:37:37AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> There are a lot of uses of memblock_find_in_range() along with
> memblock_reserve() from the times memblock allocation APIs did not exist.
> 
> memblock_find_in_range() is the very core of memblock allocations, so any
> future changes to its internal behaviour would mandate updates of all the
> users outside memblock.
> 
> Replace the calls to memblock_find_in_range() with an equivalent calls to
> memblock_phys_alloc() and memblock_phys_alloc_range() and make
> memblock_find_in_range() private method of memblock.
> 
> This simplifies the callers, ensures that (unlikely) errors in
> memblock_reserve() are handled and improves maintainability of
> memblock_find_in_range().
> 
> Signed-off-by: Mike Rapoport 

Looks good to me:

Acked-by: Kirill A. Shutemov 
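
For anyone not familiar with memblock, the conversion at a typical call
site looks roughly like this (an illustrative sketch, not code quoted
from the patch):

	phys_addr_t addr;

	/* Before: find and reserve in two steps; the return value of
	 * memblock_reserve() was usually ignored. */
	addr = memblock_find_in_range(start, end, size, align);
	if (!addr)
		panic("cannot allocate memory");
	memblock_reserve(addr, size);

	/* After: one call that both finds and reserves the range. */
	addr = memblock_phys_alloc_range(size, align, start, end);
	if (!addr)
		panic("cannot allocate memory");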

-- 
 Kirill A. Shutemov


Re: [PATCH 1/4] treewide: remove unused address argument from pte_alloc functions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:37:16AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 12, 2018 at 06:31:57PM -0700, Joel Fernandes (Google) wrote:
> > This series speeds up mremap(2) syscall by copying page tables at the
> > PMD level even for non-THP systems. There is concern that the extra
> > 'address' argument that mremap passes to pte_alloc may do something
> > subtle architecture related in the future that may make the scheme not
> > work.  Also we find that there is no point in passing the 'address' to
> > pte_alloc since its unused. So this patch therefore removes this
> > argument tree-wide resulting in a nice negative diff as well. Also
> > ensuring along the way that the enabled architectures do not do anything
> > funky with 'address' argument that goes unnoticed by the optimization.
> 
> Did you happen to look at the history of where that address argument
> came from? -- just being curious here. ISTR something vague about
> architectures having different paging structure for different memory
> ranges.

I see some architectures (e.g. sparc and, I believe, power) used the address
for coloring. It's not needed anymore. The page allocator and SL?B are good
enough now.

See 3c936465249f ("[SPARC64]: Kill pgtable quicklists and use SLAB.")
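
For context, the tree-wide change boils down to dropping the second
argument (sketch; exact return types vary per architecture):

	/* before */
	pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
	struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address);

	/* after */
	pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
	struct page *pte_alloc_one(struct mm_struct *mm);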

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-25 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..2fd163cff406 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct 
> > > > > *vma, pmd_t *old_pmd,
> > > > >   drop_rmap_locks(vma);
> > > > >  }
> > > > >  
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned 
> > > > > long old_addr,
> > > > > +   unsigned long new_addr, unsigned long old_end,
> > > > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > + spinlock_t *old_ptl, *new_ptl;
> > > > > + struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > + || old_end - old_addr < PMD_SIZE)
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > > > +  * should have release it.
> > > > > +  */
> > > > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * We don't have to worry about the ordering of src and dst
> > > > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > +  */
> > > > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > + if (old_ptl) {
> > > > 
> > > > How can it ever be false?
> 
> Kirill,
> It cannot, you are right. I'll remove the test.
> 
> By the way, there are new changes upstream by Linus which flush the TLB
> before releasing the ptlock instead of after. I'm guessing that patch came
> about because of reviews of this patch and someone spotted an issue in the
> existing code :)
> 
> Anyway the patch in concern is:
> eb66ae030829 ("mremap: properly flush TLB before releasing the page")
> 
> I need to rebase on top of that with appropriate modifications, but I worry
> that this patch will slow down performance since we have to flush at every
> PMD/PTE move before releasing the ptlock. Where as with my patch, the
> intention is to flush only at once in the end of move_page_tables. When I
> tried to flush TLB on every PMD move, it was quite slow on my arm64 device 
> [2].
> 
> Further observation [1] is, it seems like the move_huge_pmds and move_ptes 
> code
> is a bit sub optimal in the sense, we are acquiring and releasing the same
> ptlock for a bunch of PMDs if the said PMDs are on the same page-table page
> right? Instead we can do better by acquiring and release the ptlock less
> often.
> 
> I think this observation [1] and the frequent TLB flush issue [2] can be 
> solved
> by acquiring the ptlock once for a bunch of PMDs, move them all, then flush
> the tlb and then release the ptlock, and then proceed to doing the same thing
> for the PMDs in the next page-table page. What do you think?

Yeah, that's a viable optimization.

The tricky part is that one PMD page table can have PMD entries of
different types: THP, a page table that you can move as a whole, and one
that you cannot (for whatever reason).

If we cannot move the PMD entry as a whole and must go down to the PTE
page table, we would need to drop the PMD ptl and take the PTE ptl (it
might be the same lock in some configurations).

Also, we don't want to take the PMD lock unless it's required.

I expect it won't be trivial to get everything right. But take a
shot :)
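
To make the shape of it concrete, here is a minimal sketch of the batched
loop, assuming every entry covered by one PMD page-table page is a normal
page table that can be moved whole (i.e. ignoring the THP and
must-go-to-PTE cases above) and that the batch does not cross into
another PMD page; the function name is made up:

static void move_normal_pmds_batched(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long new_addr,
		unsigned long len, pmd_t *old_pmd, pmd_t *new_pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *old_ptl, *new_ptl;
	unsigned long off;

	/* Take the ptlock(s) once for the whole batch. */
	old_ptl = pmd_lock(mm, old_pmd);
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	for (off = 0; off < len; off += PMD_SIZE, old_pmd++, new_pmd++) {
		pmd_t pmd = *old_pmd;

		if (pmd_none(pmd))
			continue;
		/* Move the page table to the new pmd slot. */
		pmd_clear(old_pmd);
		set_pmd_at(mm, new_addr + off, new_pmd, pmd);
	}

	/* One TLB flush for the batch, before dropping the lock. */
	flush_tlb_range(vma, old_addr, old_addr + len);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);
}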

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > > pmd_t *old_pmd,
> > >   drop_rmap_locks(vma);
> > >  }
> > >  
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > > old_addr,
> > > +   unsigned long new_addr, unsigned long old_end,
> > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > +
> > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > + || old_end - old_addr < PMD_SIZE)
> > > + return false;
> > > +
> > > + /*
> > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > +  * should have release it.
> > > +  */
> > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > + return false;
> > > +
> > > + /*
> > > +  * We don't have to worry about the ordering of src and dst
> > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +  */
> > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > + if (old_ptl) {
> > 
> > How can it ever be false?
> > 
> > > + pmd_t pmd;
> > > +
> > > +     new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t mmap_sem helping protect against lock order issues applies as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not
copy the code blindly.
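
The difference, roughly (illustrative snippet, not from the patch):

	spinlock_t *ptl;

	/* pmd_lock() unconditionally takes and returns the lock. */
	ptl = pmd_lock(mm, pmd);

	/*
	 * __pmd_trans_huge_lock() takes the lock but returns NULL (after
	 * unlocking again) when the entry is not a huge pmd, which is why
	 * move_huge_pmd() has to check its return value.
	 */
	ptl = __pmd_trans_huge_lock(pmd, vma);
	if (!ptl)
		return false;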

-- 
 Kirill A. Shutemov


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
> *old_pmd,
>   drop_rmap_locks(vma);
>  }
>  
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> old_addr,
> +   unsigned long new_addr, unsigned long old_end,
> +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> + || old_end - old_addr < PMD_SIZE)
> + return false;
> +
> + /*
> +  * The destination pmd shouldn't be established, free_pgtables()
> +  * should have release it.
> +  */
> + if (WARN_ON(!pmd_none(*new_pmd)))
> + return false;
> +
> + /*
> +  * We don't have to worry about the ordering of src and dst
> +  * ptlocks because exclusive mmap_sem prevents deadlock.
> +  */
> + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> + if (old_ptl) {

How can it ever be false?

> + pmd_t pmd;
> +
> + new_ptl = pmd_lockptr(mm, new_pmd);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pmd */
> + pmd = *old_pmd;
> + pmd_clear(old_pmd);
> +
> + VM_BUG_ON(!pmd_none(*new_pmd));
> +
> + /* Set the new pmd */
> + set_pmd_at(mm, new_addr, new_pmd, pmd);
> + if (new_ptl != old_ptl)
> +     spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + *need_flush = true;
> + return true;
> + }
> + return false;
> +}
> +
-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-13 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:42:24PM +0100, Anton Ivanov wrote:
> 
> On 10/12/18 3:48 PM, Anton Ivanov wrote:
> > On 12/10/2018 15:37, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> > > > On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during
> > > > > memory management
> > > > > related operations. The mremap system call can be really
> > > > > slow if THP is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map.
> > > > > Turning on THP
> > > > > may not be a viable option, and is not for us. This patch
> > > > > speeds up the
> > > > > performance for non-THP system by copying at the PMD level
> > > > > when possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both
> > > > > x86-64 and arm64.
> > > > > 
> > > > > Cc: minc...@kernel.org
> > > > > Cc: pan...@google.com
> > > > > Cc: hu...@google.com
> > > > > Cc: lokeshgi...@google.com
> > > > > Cc: dan...@google.com
> > > > > Cc: mho...@kernel.org
> > > > > Cc: kir...@shutemov.name
> > > > > Cc: a...@linux-foundation.org
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > ---
> > > > >    mm/mremap.c | 62
> > > > > +
> > > > >    1 file changed, 62 insertions(+)
> > > > > 
> > > > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > > > index 9e68a02a52b1..d82c485822ef 100644
> > > > > --- a/mm/mremap.c
> > > > > +++ b/mm/mremap.c
> > > > > @@ -191,6 +191,54 @@ static void move_ptes(struct
> > > > > vm_area_struct *vma, pmd_t *old_pmd,
> > > > >    drop_rmap_locks(vma);
> > > > >    }
> > > > > +static bool move_normal_pmd(struct vm_area_struct *vma,
> > > > > unsigned long old_addr,
> > > > > +  unsigned long new_addr, unsigned long old_end,
> > > > > +  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > > > +{
> > > > > +    spinlock_t *old_ptl, *new_ptl;
> > > > > +    struct mm_struct *mm = vma->vm_mm;
> > > > > +
> > > > > +    if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > > > +    || old_end - old_addr < PMD_SIZE)
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * The destination pmd shouldn't be established, free_pgtables()
> > > > > + * should have release it.
> > > > > + */
> > > > > +    if (WARN_ON(!pmd_none(*new_pmd)))
> > > > > +    return false;
> > > > > +
> > > > > +    /*
> > > > > + * We don't have to worry about the ordering of src and dst
> > > > > + * ptlocks because exclusive mmap_sem prevents deadlock.
> > > > > + */
> > > > > +    old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > > > +    if (old_ptl) {
> > > > > +    pmd_t pmd;
> > > > > +

Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-13 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 09:57:19AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 04:19:46PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> > > On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > > > Android needs to mremap large regions of memory during memory 
> > > > > management
> > > > > related operations. The mremap system call can be really slow if THP 
> > > > > is
> > > > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > > > pte at a time, and can be really slow across a large map. Turning on 
> > > > > THP
> > > > > may not be a viable option, and is not for us. This patch speeds up 
> > > > > the
> > > > > performance for non-THP system by copying at the PMD level when 
> > > > > possible.
> > > > > 
> > > > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > > > completion times drops from 160-250 millesconds to 380-400 
> > > > > microseconds.
> > > > > 
> > > > > Before:
> > > > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > > > 
> > > > > After:
> > > > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > > > 
> > > > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > > > tlb every time we do this optimization since I couldn't find a way to
> > > > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > > > doing so is not much compared the improvement, on both x86-64 and 
> > > > > arm64.
> > > > 
> > > > I looked into the code more and noticed move_pte() helper called from
> > > > move_ptes(). It changes PTE entry to suite new address.
> > > > 
> > > > It is only defined in non-trivial way on Sparc. I don't know much about
> > > > Sparc and it's hard for me to say if the optimization will break 
> > > > anything
> > > > there.
> > > 
> > > Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It 
> > > is
> > > not modifying the PTE itself AFAICS:
> > > 
> > > #ifdef DCACHE_ALIASING_POSSIBLE
> > > #define __HAVE_ARCH_MOVE_PTE
> > > #define move_pte(pte, prot, old_addr, new_addr) \
> > > ({  \
> > > pte_t newpte = (pte);   \
> > > if (tlb_type != hypervisor && pte_present(pte)) {   \
> > > unsigned long this_pfn = pte_pfn(pte);  \
> > > \
> > > if (pfn_valid(this_pfn) &&  \
> > > (((old_addr) ^ (new_addr)) & (1 << 13)))\
> > > flush_dcache_page_all(current->mm,  \
> > >   pfn_to_page(this_pfn));   \
> > > }   \
> > > newpte; \
> > > })
> > > #endif
> > > 
> > > If its an issue, then how do transparent huge pages work on Sparc?  I 
> > > don't
> > > see the huge page code (move_huge_pages) during mremap doing anything 
> > > special
> > > for Sparc architecture when moving PMDs..
> > 
> > My *guess* is that it will work fine on Sparc as it apprarently it only
> > cares about change in bit 13 of virtual address. It will never happen for
> > huge pages or when PTE page tables move.
> > 
> > But I just realized that the problem is bigger: since we pass new_addr to
> > the set_pte_at() we would need to audit all implementations that they are
> > safe with just moving PTE page table.

Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> On 10/12/18 2:37 AM, Joel Fernandes (Google) wrote:
> > Android needs to mremap large regions of memory during memory management
> > related operations. The mremap system call can be really slow if THP is
> > not enabled. The bottleneck is move_page_tables, which is copying each
> > pte at a time, and can be really slow across a large map. Turning on THP
> > may not be a viable option, and is not for us. This patch speeds up the
> > performance for non-THP system by copying at the PMD level when possible.
> > 
> > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > 
> > Before:
> > Total mremap time for 1GB data: 242321014 nanoseconds.
> > Total mremap time for 1GB data: 196842467 nanoseconds.
> > Total mremap time for 1GB data: 167051162 nanoseconds.
> > 
> > After:
> > Total mremap time for 1GB data: 385781 nanoseconds.
> > Total mremap time for 1GB data: 388959 nanoseconds.
> > Total mremap time for 1GB data: 402813 nanoseconds.
> > 
> > Incase THP is enabled, the optimization is skipped. I also flush the
> > tlb every time we do this optimization since I couldn't find a way to
> > determine if the low-level PTEs are dirty. It is seen that the cost of
> > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > Cc: minc...@kernel.org
> > Cc: pan...@google.com
> > Cc: hu...@google.com
> > Cc: lokeshgi...@google.com
> > Cc: dan...@google.com
> > Cc: mho...@kernel.org
> > Cc: kir...@shutemov.name
> > Cc: a...@linux-foundation.org
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >   mm/mremap.c | 62 +
> >   1 file changed, 62 insertions(+)
> > 
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..d82c485822ef 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > pmd_t *old_pmd,
> > drop_rmap_locks(vma);
> >   }
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > old_addr,
> > + unsigned long new_addr, unsigned long old_end,
> > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +   spinlock_t *old_ptl, *new_ptl;
> > +   struct mm_struct *mm = vma->vm_mm;
> > +
> > +   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +   || old_end - old_addr < PMD_SIZE)
> > +   return false;
> > +
> > +   /*
> > +* The destination pmd shouldn't be established, free_pgtables()
> > +* should have release it.
> > +*/
> > +   if (WARN_ON(!pmd_none(*new_pmd)))
> > +   return false;
> > +
> > +   /*
> > +* We don't have to worry about the ordering of src and dst
> > +* ptlocks because exclusive mmap_sem prevents deadlock.
> > +*/
> > +   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +   if (old_ptl) {
> > +   pmd_t pmd;
> > +
> > +   new_ptl = pmd_lockptr(mm, new_pmd);
> > +   if (new_ptl != old_ptl)
> > +   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +   /* Clear the pmd */
> > +   pmd = *old_pmd;
> > +   pmd_clear(old_pmd);
> > +
> > +   VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +       /* Set the new pmd */
> > +   set_pmd_at(mm, new_addr, new_pmd, pmd);
> 
> UML does not have set_pmd_at at all

Every architecture does. :)

But it may not come from the arch code.

> If I read the code right, MIPS completely ignores the address argument so
> set_pmd_at there may not have the effect which this patch is trying to
> achieve.

Ignoring the address is fine. Most architectures do that.
The idea is to move the page table to the new pmd slot. It has nothing to do
with the address passed to set_pmd_at().

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/2] treewide: remove unused address argument from pte_alloc functions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:55PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/arch/m68k/include/asm/mcf_pgalloc.h 
> b/arch/m68k/include/asm/mcf_pgalloc.h
> index 12fe700632f4..4399d712f6db 100644
> --- a/arch/m68k/include/asm/mcf_pgalloc.h
> +++ b/arch/m68k/include/asm/mcf_pgalloc.h
> @@ -12,8 +12,7 @@ extern inline void pte_free_kernel(struct mm_struct *mm, 
> pte_t *pte)
>  
>  extern const char bad_pmd_string[];
>  
> -extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
> - unsigned long address)
> +extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>  {
>   unsigned long page = __get_free_page(GFP_DMA);
>  
> @@ -32,8 +31,6 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned 
> long address)
>  #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)  ({ BUG(); ((pmd_t *)2); })
>  
> -#define pte_alloc_one_fast(mm, addr) pte_alloc_one(mm, addr)
> -

I believe this one was done manually, right?
Please explicitly state everything you did on top of the semantic patch.

...

> diff --git a/arch/microblaze/include/asm/pgalloc.h 
> b/arch/microblaze/include/asm/pgalloc.h
> index 7c89390c0c13..f4cc9ffc449e 100644
> --- a/arch/microblaze/include/asm/pgalloc.h
> +++ b/arch/microblaze/include/asm/pgalloc.h
> @@ -108,10 +108,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
>  #define pmd_alloc_one_fast(mm, address)  ({ BUG(); ((pmd_t *)1); })
>  #define pmd_alloc_one(mm, address)   ({ BUG(); ((pmd_t *)2); })
>  
> -extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
> +extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
>  
> -static inline struct page *pte_alloc_one(struct mm_struct *mm,
> - unsigned long address)
> +static inline struct page *pte_alloc_one(struct mm_struct *mm)
>  {
>   struct page *ptepage;
>  
> @@ -132,20 +131,6 @@ static inline struct page *pte_alloc_one(struct 
> mm_struct *mm,
>   return ptepage;
>  }
>  
> -static inline pte_t *pte_alloc_one_fast(struct mm_struct *mm,
> - unsigned long address)
> -{
> - unsigned long *ret;
> -
> - ret = pte_quicklist;
> - if (ret != NULL) {
> - pte_quicklist = (unsigned long *)(*ret);
> - ret[0] = 0;
> - pgtable_cache_size--;
> - }
> - return (pte_t *)ret;
> -}
> -

Ditto.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > Android needs to mremap large regions of memory during memory management
> > > related operations. The mremap system call can be really slow if THP is
> > > not enabled. The bottleneck is move_page_tables, which is copying each
> > > pte at a time, and can be really slow across a large map. Turning on THP
> > > may not be a viable option, and is not for us. This patch speeds up the
> > > performance for non-THP system by copying at the PMD level when possible.
> > > 
> > > The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> > > completion times drops from 160-250 millesconds to 380-400 microseconds.
> > > 
> > > Before:
> > > Total mremap time for 1GB data: 242321014 nanoseconds.
> > > Total mremap time for 1GB data: 196842467 nanoseconds.
> > > Total mremap time for 1GB data: 167051162 nanoseconds.
> > > 
> > > After:
> > > Total mremap time for 1GB data: 385781 nanoseconds.
> > > Total mremap time for 1GB data: 388959 nanoseconds.
> > > Total mremap time for 1GB data: 402813 nanoseconds.
> > > 
> > > Incase THP is enabled, the optimization is skipped. I also flush the
> > > tlb every time we do this optimization since I couldn't find a way to
> > > determine if the low-level PTEs are dirty. It is seen that the cost of
> > > doing so is not much compared the improvement, on both x86-64 and arm64.
> > 
> > I looked into the code more and noticed move_pte() helper called from
> > move_ptes(). It changes PTE entry to suite new address.
> > 
> > It is only defined in non-trivial way on Sparc. I don't know much about
> > Sparc and it's hard for me to say if the optimization will break anything
> > there.
> 
> Sparc's move_pte seems to be flushing the D-cache to prevent aliasing. It is
> not modifying the PTE itself AFAICS:
> 
> #ifdef DCACHE_ALIASING_POSSIBLE
> #define __HAVE_ARCH_MOVE_PTE
> #define move_pte(pte, prot, old_addr, new_addr) \
> ({  \
> pte_t newpte = (pte);   \
> if (tlb_type != hypervisor && pte_present(pte)) {   \
> unsigned long this_pfn = pte_pfn(pte);  \
> \
> if (pfn_valid(this_pfn) &&  \
> (((old_addr) ^ (new_addr)) & (1 << 13)))\
> flush_dcache_page_all(current->mm,  \
>   pfn_to_page(this_pfn));   \
> }   \
> newpte; \
> })
> #endif
> 
> If its an issue, then how do transparent huge pages work on Sparc?  I don't
> see the huge page code (move_huge_pages) during mremap doing anything special
> for Sparc architecture when moving PMDs..

My *guess* is that it will work fine on Sparc as apparently it only
cares about a change in bit 13 of the virtual address. That will never
happen for huge pages or when PTE page tables move.

But I just realized that the problem is bigger: since we pass new_addr to
set_pte_at(), we would need to audit all implementations to make sure they
are safe with just moving the PTE page table.

I would rather go with per-architecture enabling. It's much safer.
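
A per-architecture opt-in could look roughly like this (sketch; the
Kconfig symbol name is made up here, and the locals come from the
quoted move_page_tables() hunk):

	/*
	 * An architecture that has audited its set_pmd_at() selects the
	 * opt-in symbol in its Kconfig:
	 *
	 *	select HAVE_MOVE_PMD
	 *
	 * and mm/mremap.c gates the fast path on it:
	 */
	bool moved = false;

	if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) && extent == PMD_SIZE)
		moved = move_normal_pmd(vma, old_addr, new_addr, old_end,
					old_pmd, new_pmd, &need_flush);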

> Also, do we not flush the caches from any path when we munmap address space?
> We do call do_munmap on the old mapping from mremap after moving to the new 
> one.

Are you sure about that? It can be hidden deeper in architecture-specific
code.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> Android needs to mremap large regions of memory during memory management
> related operations. The mremap system call can be really slow if THP is
> not enabled. The bottleneck is move_page_tables, which is copying each
> pte at a time, and can be really slow across a large map. Turning on THP
> may not be a viable option, and is not for us. This patch speeds up the
> performance for non-THP system by copying at the PMD level when possible.
> 
> The speed up is three orders of magnitude. On a 1GB mremap, the mremap
> completion times drops from 160-250 millesconds to 380-400 microseconds.
> 
> Before:
> Total mremap time for 1GB data: 242321014 nanoseconds.
> Total mremap time for 1GB data: 196842467 nanoseconds.
> Total mremap time for 1GB data: 167051162 nanoseconds.
> 
> After:
> Total mremap time for 1GB data: 385781 nanoseconds.
> Total mremap time for 1GB data: 388959 nanoseconds.
> Total mremap time for 1GB data: 402813 nanoseconds.
> 
> Incase THP is enabled, the optimization is skipped. I also flush the
> tlb every time we do this optimization since I couldn't find a way to
> determine if the low-level PTEs are dirty. It is seen that the cost of
> doing so is not much compared the improvement, on both x86-64 and arm64.

I looked into the code more and noticed the move_pte() helper called from
move_ptes(). It changes the PTE entry to suit the new address.

It is only defined in a non-trivial way on Sparc. I don't know much about
Sparc and it's hard for me to say if the optimization will break anything
there.

I think it's worth disabling the optimization if __HAVE_ARCH_MOVE_PTE is
defined. Or make architectures state explicitly that the optimization is
safe.
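
Mechanically, the conservative variant could be as simple as (sketch;
the flag name is made up, and the extent == PMD_SIZE branch from the
hunk quoted below would additionally test it):

	/*
	 * __HAVE_ARCH_MOVE_PTE is only defined where move_pte() is
	 * non-trivial (sparc), so use it to opt out of the PMD-level
	 * shortcut.
	 */
#ifdef __HAVE_ARCH_MOVE_PTE
	const bool pmd_move_safe = false;	/* e.g. sparc flushes the D-cache here */
#else
	const bool pmd_move_safe = true;
#endif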

> @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> *vma,
>   split_huge_pmd(vma, old_pmd, old_addr);
>   if (pmd_trans_unstable(old_pmd))
>   continue;
> + } else if (extent == PMD_SIZE) {

Hm. What guarantees that new_addr is PMD_SIZE-aligned?
It's not obvious to me.

-- 
 Kirill A. Shutemov


Re: [PATCH v2 2/2] mm: speed up mremap by 500x on large regions

2018-10-12 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct 
> > *vma,
> > split_huge_pmd(vma, old_pmd, old_addr);
> > if (pmd_trans_unstable(old_pmd))
> > continue;
> > +   } else if (extent == PMD_SIZE) {
> 
> Hm. What guarantees that new_addr is PMD_SIZE-aligned?
> It's not obvious to me.

Ignore this :)

-- 
 Kirill A. Shutemov