Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-19 Thread Andy Lutomirski
On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski  wrote:
> [I added PeterZ and Vitaly -- can you see any way in which this would
> break something obscure?  I don't.]
>
> On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel  wrote:
>> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
>> next?
>
> Yes, AFAICS.
>
>>
>> On to the lazy TLB mm_struct refcounting stuff :)
>>
>>>
>>> Which refcount?  mm_users shouldn’t be hot, so I assume you’re talking about
>>> mm_count. My suggestion is to get rid of mm_count instead of trying to
>>> optimize it.
>>
>>
>> Do you have any suggestions on how? :)
>>
>> The TLB shootdown sent at __exit_mm time does not get rid of the
>> kernelthread->active_mm pointer pointing at the mm that is exiting.
>>
>
> Ah, but that's conceptually very easy to fix.  Add a #define like
> ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> #define is set.  After some grepping, there are very few users.  The
> only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> are involved in the rather complicated dance of refcounting active_mm.
> If that field goes away, it doesn't need to be refcounted.  Instead, I
> think the refcounting can get replaced with something like:
>
> /*
>  * Release any arch-internal references to mm.  Only called when
>  * mm_users is zero and all tasks using mm have either been
>  * switch_mm()'d away or have had enter_lazy_tlb() called.
>  */
> extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
>
> which the kernel calls in __mmput() after tearing down all the page
> tables.  The body can be something like:
>
> if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...))) {
>   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> }
>
> (You'll also have to fix up the highly questionable users in
> arch/x86/platform/efi/efi_64.c, but that's easy.)
>
> Does all that make sense?  Basically, as I understand it, the
> expensive atomic ops you're seeing are all pointless because they're
> enabling an optimization that hasn't actually worked for a long time,
> if ever.

Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
arch_exit_mmap(), I think.  It's a heavier weight version of more or
less the same thing that arch_shoot_down_dead_mm() would be, except
that it happens before exit_mmap().  But maybe Xen actually has the
right idea.  In other words, rather than doing the big pagetable free in
exit_mmap() while there may still be other CPUs pointing at the page
tables, the other order might make more sense.  So maybe, if
ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
for getting rid of all secret arch references to the mm.
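
To make the ordering concrete, here is a minimal userspace sketch, not
kernel code.  Everything prefixed toy_ is hypothetical: a 64-bit mask
stands in for mm_cpumask(), a per-CPU pointer stands in for the loaded
page tables, and a plain loop stands in for the IPI.  It only shows the
invariant that the "drop arch references first" order would buy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Hypothetical stand-in for struct mm_struct. */
struct toy_mm {
    uint64_t cpumask;        /* CPUs that may still have this mm loaded */
    bool pagetables_freed;
};

/* Hypothetical per-CPU "currently loaded mm" (think CR3 on x86). */
static struct toy_mm *toy_cpu_loaded_mm[TOY_NR_CPUS];
static struct toy_mm toy_init_mm;    /* kernel-only address space */

/* Load mm on cpu and keep both cpumasks up to date. */
static void toy_load_mm(int cpu, struct toy_mm *mm)
{
    struct toy_mm *old = toy_cpu_loaded_mm[cpu];

    if (old)
        old->cpumask &= ~(UINT64_C(1) << cpu);
    toy_cpu_loaded_mm[cpu] = mm;
    mm->cpumask |= UINT64_C(1) << cpu;
}

/*
 * Models the proposed arch hook: move every other CPU off the dying mm
 * before its page tables go away.  Returns how many CPUs were kicked.
 */
static int toy_arch_exit_mmap(struct toy_mm *mm, int this_cpu)
{
    int kicked = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (cpu == this_cpu || !(mm->cpumask & (UINT64_C(1) << cpu)))
            continue;
        toy_load_mm(cpu, &toy_init_mm);   /* stands in for the IPI handler */
        kicked++;
    }
    return kicked;
}

/* Stand-in for the big pagetable free. */
static void toy_free_pagetables(struct toy_mm *mm, int this_cpu)
{
    /* The ordering guarantees no other CPU still points at the tables. */
    assert((mm->cpumask & ~(UINT64_C(1) << this_cpu)) == 0);
    mm->pagetables_freed = true;
}
```

Calling toy_arch_exit_mmap() and then toy_free_pagetables() passes the
assertion; calling them in the opposite order trips it whenever another
CPU still has the mm loaded.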

Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.

I added some more arch maintainers.  The idea here is that, on x86 at
least, task->active_mm and all its refcounting is pure overhead.  When
a process exits, __mmput() gets called, but the core kernel has a
longstanding "optimization" in which other tasks (kernel threads and
idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
complicated, and hurts performance on large systems, since it requires
extra atomic operations whenever a CPU switches between real user
threads and idle/kernel threads.

It's also almost completely worthless on x86 at least, since __mmput()
frees pagetables, and that operation *already* forces a remote TLB
flush, so we might as well zap all the active_mm references at the
same time.
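
For anyone who has not stared at this code, the cost can be shown with
a rough userspace model of the refcount dance.  It is a sketch under
assumptions, not the scheduler: the toy_ names are made up, and the
real context_switch() does much more.  The model keeps only the part
that matters here: one atomic increment when a kernel thread borrows
the previous mm, and one atomic decrement when it gives it back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for mm_struct and task_struct. */
struct toy_mm {
    atomic_int mm_count;
};

struct toy_task {
    struct toy_mm *mm;          /* NULL for kernel threads */
    struct toy_mm *active_mm;   /* borrowed mm while a kernel thread runs */
};

static unsigned long toy_atomic_ops;   /* atomics spent on mm_count */

static void toy_mmgrab(struct toy_mm *mm)
{
    atomic_fetch_add(&mm->mm_count, 1);
    toy_atomic_ops++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
    atomic_fetch_sub(&mm->mm_count, 1);
    toy_atomic_ops++;
}

/* Models only the active_mm handling of a context switch. */
static void toy_context_switch(struct toy_task *prev, struct toy_task *next)
{
    struct toy_mm *oldmm = prev->active_mm;

    if (!next->mm) {
        /* Kernel thread: borrow the previous mm and pin it. */
        next->active_mm = oldmm;
        toy_mmgrab(oldmm);
    }
    if (!prev->mm) {
        /* Leaving a kernel thread: return the borrowed mm. */
        prev->active_mm = NULL;
        toy_mmdrop(oldmm);
    }
}
```

In this model every user -> kernel thread -> user round trip costs two
atomic operations on the same mm_count, and on a big machine all CPUs
running threads of one process hit the same cache line.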

But arm64 has real HW remote flushes.  Does arm64 actually benefit
from the active_mm optimization?  What happens on arm64 when a process
exits?  How about s390?  I suspect that s390 has rather larger systems
than arm64, where the cost of the reference counting can be much
higher.

(Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
interact with your previous patch set?)


Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-19 Thread Benjamin Herrenschmidt
On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski  wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure?  I don't.]

Added Nick and Aneesh. We do have HW remote flushes on powerpc.

> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel  wrote:
> > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> > > next?
> > 
> > Yes, AFAICS.
> > 
> > > 
> > > On to the lazy TLB mm_struct refcounting stuff :)
> > > 
> > > > 
> > > > Which refcount?  mm_users shouldn’t be hot, so I assume you’re
> > > > talking about mm_count. My suggestion is to get rid of mm_count
> > > > instead of trying to optimize it.
> > > 
> > > 
> > > Do you have any suggestions on how? :)
> > > 
> > > The TLB shootdown sent at __exit_mm time does not get rid of the
> > > kernelthread->active_mm pointer pointing at the mm that is exiting.
> > > 
> > 
> > Ah, but that's conceptually very easy to fix.  Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> > #define is set.  After some grepping, there are very few users.  The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted.  Instead, I
> > think the refcounting can get replaced with something like:
> > 
> > /*
> >  * Release any arch-internal references to mm.  Only called when
> >  * mm_users is zero and all tasks using mm have either been
> >  * switch_mm()'d away or have had enter_lazy_tlb() called.
> >  */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> > 
> > which the kernel calls in __mmput() after tearing down all the page
> > tables.  The body can be something like:
> > 
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...))) {
> >   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> > }
> > 
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> > 
> > Does all that make sense?  Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
> 
> Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think.  It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap().  But maybe Xen actually has the
> right idea.  In other words, rather than doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense.  So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
> 
> Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
> 
> I added some more arch maintainers.  The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead.  When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real user
> threads and idle/kernel threads.
> 
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
> 
> But arm64 has real HW remote flushes.  Does arm64 actually benefit
> from the active_mm optimization?  What happens on arm64 when a process
> exits?  How about s390?  I suspect that s390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.
> 
> (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
> interact with your previous patch set?)


Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-20 Thread Peter Zijlstra
On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> I added some more arch maintainers.  The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead.  When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real user
> threads and idle/kernel threads.
> 
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.

So I disagree that active_mm is complicated (the code is less than ideal
but that is actually fixable). And aside from the process exit case, it
does avoid CR3 writes when switching between user and kernel threads
(which can be far more often than exit if you have longer running
tasks).

Now agreed, recent x86 work has made that less important.

And I of course also agree that not doing those refcount atomics is
better.
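
To put a number on the CR3 point, here is a toy model, a sketch only.
toy_count_cr3_writes() and its schedule format are invented for
illustration, and it charges every page table switch the same, which
ignores PCID and the other recent x86 work.  It counts how often one
CPU would have to reload its page table base for a given run of tasks,
with and without lazy TLB mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct. */
struct toy_mm {
    int id;
};

/*
 * Count simulated CR3 writes on one CPU.  sched[i] is the mm of the
 * i-th task to run; NULL means a kernel thread.  With lazy == true a
 * kernel thread keeps the previous user's page tables loaded; with
 * lazy == false it switches to a kernel-only mm every time.
 */
static unsigned toy_count_cr3_writes(struct toy_mm *const *sched, size_t n,
                                     bool lazy)
{
    static struct toy_mm toy_init_mm = { 0 };
    struct toy_mm *loaded = &toy_init_mm;
    unsigned writes = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_mm *want = sched[i];

        if (!want) {
            if (lazy)
                continue;           /* stay on the borrowed mm */
            want = &toy_init_mm;    /* eager: go to the kernel mm */
        }
        if (want != loaded) {
            loaded = want;
            writes++;
        }
    }
    return writes;
}
```

For a long-running task that keeps bouncing to a kernel thread and
back, the lazy variant writes CR3 once and the eager one writes it on
every switch.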


Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-23 Thread Rik van Riel
On Fri, 2018-07-20 at 10:30 +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> > I added some more arch maintainers.  The idea here is that, on x86 at
> > least, task->active_mm and all its refcounting is pure overhead.  When
> > a process exits, __mmput() gets called, but the core kernel has a
> > longstanding "optimization" in which other tasks (kernel threads and
> > idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> > complicated, and hurts performance on large systems, since it requires
> > extra atomic operations whenever a CPU switches between real user
> > threads and idle/kernel threads.
> > 
> > It's also almost completely worthless on x86 at least, since __mmput()
> > frees pagetables, and that operation *already* forces a remote TLB
> > flush, so we might as well zap all the active_mm references at the
> > same time.
> 
> So I disagree that active_mm is complicated (the code is less than ideal
> but that is actually fixable). And aside from the process exit case, it
> does avoid CR3 writes when switching between user and kernel threads
> (which can be far more often than exit if you have longer running
> tasks).
> 
> Now agreed, recent x86 work has made that less important.
> 
> And I of course also agree that not doing those refcount atomics is
> better.

It might be cleaner to keep the ->active_mm pointer
in place for now (at least in the first patch), even 
on architectures where we end up dropping the refcounting.

That way the code is more similar everywhere, and
we just get rid of the expensive instructions.

Let me try coding this up...

-- 
All Rights Reversed.


Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-24 Thread Will Deacon
Hi Andy,

Sorry, I missed the arm64 question at the end of this...

On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski  wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure?  I don't.]
> >
> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel  wrote:
> >> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> >> next?
> >
> > Yes, AFAICS.
> >
> >>
> >> On to the lazy TLB mm_struct refcounting stuff :)
> >>
> >>>
> >>> Which refcount?  mm_users shouldn’t be hot, so I assume you’re
> >>> talking about mm_count. My suggestion is to get rid of mm_count
> >>> instead of trying to optimize it.
> >>
> >>
> >> Do you have any suggestions on how? :)
> >>
> >> The TLB shootdown sent at __exit_mm time does not get rid of the
> >> kernelthread->active_mm pointer pointing at the mm that is exiting.
> >>
> >
> > Ah, but that's conceptually very easy to fix.  Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> > #define is set.  After some grepping, there are very few users.  The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted.  Instead, I
> > think the refcounting can get replaced with something like:
> >
> > /*
> >  * Release any arch-internal references to mm.  Only called when
> >  * mm_users is zero and all tasks using mm have either been
> >  * switch_mm()'d away or have had enter_lazy_tlb() called.
> >  */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> >
> > which the kernel calls in __mmput() after tearing down all the page
> > tables.  The body can be something like:
> >
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...))) {
> >   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> > }
> >
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> >
> > Does all that make sense?  Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
> 
> Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think.  It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap().  But maybe Xen actually has the
> right idea.  In other words, rather than doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense.  So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
> 
> Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
> 
> I added some more arch maintainers.  The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead.  When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real user
> threads and idle/kernel threads.
> 
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
> 
> But arm64 has real HW remote flushes.  Does arm64 actually benefit
> from the active_mm optimization?  What happens on arm64 when a process
> exits?  How about s390?  I suspect that s390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.

IIRC, the TLB invalidation on task exit has the fullmm field set in the
mmu_gather structure, so we don't actually do any TLB invalidation at all.
Instead, we just don't re-allocate the ASID and invalidate the whole TLB
when we run out of ASIDs (they're 16-bit on most Armv8 CPUs).
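
As a rough illustration of that scheme, here is a deliberately
simplified userspace model.  It is not the arm64 allocator: the toy_
names are invented, the ASID space is shrunk to 4 bits so rollover is
easy to hit, and the bookkeeping a real allocator needs for ASIDs that
are still live on other CPUs at rollover is left out.  It shows only
the two points above: exit does no TLB invalidation, and the full
invalidation happens once, when the ASID space is exhausted.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ASID_BITS 4                    /* tiny on purpose, not 16 */
#define TOY_NUM_ASIDS (1u << TOY_ASID_BITS)

/* Hypothetical mm: (generation << TOY_ASID_BITS) | asid, 0 means none. */
struct toy_mm {
    uint64_t context_id;
};

static uint64_t toy_generation = 1;
static uint32_t toy_next_asid = 1;         /* ASID 0 is reserved in this model */
static unsigned toy_full_tlb_flushes;

/* Return the ASID for mm, allocating one if it has none for this generation. */
static uint32_t toy_get_asid(struct toy_mm *mm)
{
    if (mm->context_id &&
        (mm->context_id >> TOY_ASID_BITS) == toy_generation)
        return (uint32_t)(mm->context_id & (TOY_NUM_ASIDS - 1));

    if (toy_next_asid == TOY_NUM_ASIDS) {
        /* Out of ASIDs: new generation, one whole-TLB invalidation. */
        toy_generation++;
        toy_next_asid = 1;
        toy_full_tlb_flushes++;
    }
    mm->context_id = (toy_generation << TOY_ASID_BITS) | toy_next_asid;
    return toy_next_asid++;
}

/* Exit path in this model: no invalidation, the ASID is simply abandoned. */
static void toy_exit_mm(struct toy_mm *mm)
{
    mm->context_id = 0;
}
```

In the model, fifteen processes can come and go without a single
invalidation; the sixteenth allocation triggers the rollover and pays
for all of them at once.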

Does that answer your question?

Will