Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Andrew Cooper
On 03/07/2019 18:02, Nadav Amit wrote:
>> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
>>
>> On 03.07.19 01:51, Nadav Amit wrote:
>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>> and hyper-v are only compile-tested).
>>> While the updated smp infrastructure is capable of running a function on
>>> a single local core, it is not optimized for this case. The multiple
>>> function calls and the indirect branch introduce some overhead, and
>>> might make local TLB flushes slower than they were before the recent
>>> changes.
>>> To restore the lost performance in this common case, check whether only
>>> a local TLB flush is needed before calling the SMP infrastructure. This
>>> requires checking mm_cpumask() one more time, but unless this mask is
>>> updated very frequently, it should not impact performance negatively.
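A minimal sketch of the fast path described above, assuming this series'
naming (flush_tlb_func_local() and flush_tlb_multi()); the rest of
flush_tlb_mm_range()'s setup is elided:

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

static void flush_tlb_mm_range_sketch(struct mm_struct *mm,
				      const struct flush_tlb_info *info)
{
	unsigned int cpu = get_cpu();	/* disables preemption */

	if (cpumask_test_cpu(cpu, mm_cpumask(mm)) &&
	    cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) {
		/*
		 * Only the local CPU runs this mm: flush it directly and
		 * skip the SMP function-call machinery entirely.
		 */
		local_irq_disable();
		flush_tlb_func_local(info);
		local_irq_enable();
	} else if (!cpumask_empty(mm_cpumask(mm))) {
		/*
		 * Remote CPUs are involved: flush them and the local CPU
		 * concurrently via the new interface.
		 */
		flush_tlb_multi(mm_cpumask(mm), info);
	}

	put_cpu();
}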
>>> Cc: "K. Y. Srinivasan" 
>>> Cc: Haiyang Zhang 
>>> Cc: Stephen Hemminger 
>>> Cc: Sasha Levin 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: Borislav Petkov 
>>> Cc: x...@kernel.org
>>> Cc: Juergen Gross 
>>> Cc: Paolo Bonzini 
>>> Cc: Dave Hansen 
>>> Cc: Andy Lutomirski 
>>> Cc: Peter Zijlstra 
>>> Cc: Boris Ostrovsky 
>>> Cc: linux-hyp...@vger.kernel.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: k...@vger.kernel.org
>>> Cc: xen-de...@lists.xenproject.org
>>> Signed-off-by: Nadav Amit 
>>> ---
>>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>>  arch/x86/kernel/paravirt.c            |  2 +-
>>>  arch/x86/mm/tlb.c                     | 65 ---
>>>  arch/x86/xen/mmu_pv.c                 | 20 ++---
>>>  include/trace/events/xen.h            |  2 +-
>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>> ...
>>
>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>> index beb44e22afdf..19e481e6e904 100644
>>> --- a/arch/x86/xen/mmu_pv.c
>>> +++ b/arch/x86/xen/mmu_pv.c
>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>  	preempt_enable();
>>>  }
>>> 
>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>> -				 const struct flush_tlb_info *info)
>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>> +				const struct flush_tlb_info *info)
>>>  {
>>>  	struct {
>>>  		struct mmuext_op op;
>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	const size_t mc_entry_size = sizeof(args->op) +
>>>  		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>> 
>>> -	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>> +	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>> 
>>>  	if (cpumask_empty(cpus))
>>>  		return;		/* nothing to do */
>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>  	args = mcs.args;
>>>  	args->op.arg2.vcpumask = to_cpumask(args->mask);
>>> 
>>> -	/* Remove us, and any offline CPUS. */
>>> +	/* Flush locally if needed and remove us */
>>> +	if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>> +		local_irq_disable();
>>> +		flush_tlb_func_local(info);
>> I think this isn't the correct function for PV guests.
>>
>> In fact it should be much easier: just don't clear your own CPU from the
>> mask; that's all that's needed. The hypervisor is fine with having the
>> current CPU in the mask and will do the right thing.
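Concretely, Juergen's suggestion reduces the quoted hunk to a single mask
operation (a sketch of the idea, not the committed v3 code):

	/*
	 * Keep the current CPU in the mask and only strip offline CPUs:
	 * Xen flushes the local vCPU as part of the same hypercall, so
	 * the special-cased flush_tlb_func_local() call above can be
	 * dropped entirely.
	 */
	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);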
> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
> the same, unfortunately, since it would induce a VM-exit on TLB flushes.

Why do you believe the vmexit matters?  You're taking one anyway for
the IPI.

Intel only have virtualised self-IPI, and while AMD do have working
non-self IPIs, you still take a vmexit anyway if any destination vcpu
isn't currently running in non-root mode (IIRC).

At that point, you might as well have the hypervisor do all the hard
work via a multi-cpu shootdown/flush hypercall, rather than trying to
arrange it locally.
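For comparison, the Hyper-V side of this series already takes that shape:
the guest hands the whole CPU mask to the hypervisor in one hypercall
instead of raising an IPI per vCPU. A rough illustration follows;
get_hypercall_input_page() and cpumask_to_hv_mask() are hypothetical
stand-ins for the input-page and VP-set handling in arch/x86/hyperv/mmu.c,
native_flush_tlb_multi() follows this series' naming, and info->mm is
assumed non-NULL:

static void pv_flush_tlb_multi(const struct cpumask *cpus,
			       const struct flush_tlb_info *info)
{
	struct hv_tlb_flush *flush;
	u64 status;

	flush = get_hypercall_input_page();		/* hypothetical */
	flush->address_space = virt_to_phys(info->mm->pgd);
	flush->flags = 0;
	flush->processor_mask = cpumask_to_hv_mask(cpus); /* hypothetical */

	/*
	 * One vmexit flushes every target vCPU, whether or not it is
	 * currently running in guest mode.
	 */
	status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
				 flush, NULL);
	if (status & HV_HYPERCALL_RESULT_MASK)
		native_flush_tlb_multi(cpus, info);	/* IPI fallback */
}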

~Andrew

Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
> On Jul 3, 2019, at 10:43 AM, Andrew Cooper  wrote:
> 
> On 03/07/2019 18:02, Nadav Amit wrote:
>> [...]
>> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
>> the same, unfortunately, since it would induce a VM-exit on TLB flushes.
> 
> Why do you believe the vmexit matters?  You're talking one anyway for
> the IPI.
> 
> Intel only have virtualised self-IPI, and while AMD do have working
> non-self IPIs, you still take a vmexit anyway if any destination vcpu
> isn't currently running in non-root mode (IIRC).
> 
> At that point, you might as well have the hypervisor do all the hard
> work via a multi-cpu shootdown/flush hypercall, rather than trying to
> arrange it locally.

I forgot that xen_flush_tlb_multi() should actually only be called when
there are some remote CPUs (as I optimized the case in which there is only a
single local CPU that needs to be flushed), so you are right.
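With the local-only short-circuit in the caller (see the sketch after the
commit message above), the paravirt hook can therefore assume at least one
remote CPU is present. Stated as a sketch of an assertion that could sit
at the top of xen_flush_tlb_multi():

	/*
	 * The caller already handled the purely-local case, so the mask
	 * must contain at least one CPU besides ourselves.
	 */
	WARN_ON_ONCE(cpumask_any_but(cpus, smp_processor_id()) >= nr_cpu_ids);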
