RE: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-30 Thread Michael Kelley via Virtualization
From: Nadav Amit  Sent: Thursday, July 18, 2019 5:59 PM
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires checking mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should not impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Sasha Levin 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: Juergen Gross 
> Cc: Paolo Bonzini 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Boris Ostrovsky 
> Cc: linux-hyp...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: k...@vger.kernel.org
> Cc: xen-de...@lists.xenproject.org
> Signed-off-by: Nadav Amit 
> ---
>  arch/x86/hyperv/mmu.c | 10 +++---
>  arch/x86/include/asm/paravirt.h   |  6 ++--
>  arch/x86/include/asm/paravirt_types.h |  4 +--
>  arch/x86/include/asm/tlbflush.h   |  8 ++---
>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>  arch/x86/kernel/kvm.c | 11 +--
>  arch/x86/kernel/paravirt.c|  2 +-
>  arch/x86/mm/tlb.c | 47 ++-
>  arch/x86/xen/mmu_pv.c | 11 +++
>  include/trace/events/xen.h|  2 +-
>  10 files changed, 62 insertions(+), 41 deletions(-)
> 

For the Hyper-V parts --
Reviewed-by: Michael Kelley 



Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-26 Thread Juergen Gross

On 19.07.19 02:58, Nadav Amit wrote:

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
  arch/x86/hyperv/mmu.c | 10 +++---
  arch/x86/include/asm/paravirt.h   |  6 ++--
  arch/x86/include/asm/paravirt_types.h |  4 +--
  arch/x86/include/asm/tlbflush.h   |  8 ++---
  arch/x86/include/asm/trace/hyperv.h   |  2 +-
  arch/x86/kernel/kvm.c | 11 +--
  arch/x86/kernel/paravirt.c|  2 +-
  arch/x86/mm/tlb.c | 47 ++-
  arch/x86/xen/mmu_pv.c | 11 +++
  include/trace/events/xen.h|  2 +-
  10 files changed, 62 insertions(+), 41 deletions(-)


Xen and paravirt parts: Reviewed-by: Juergen Gross 


Juergen


Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Peter Zijlstra
On Mon, Jul 22, 2019 at 07:27:09PM +, Nadav Amit wrote:
> > On Jul 22, 2019, at 12:14 PM, Peter Zijlstra  wrote:

> > But then we can still do something like the below, which doesn't change
> > things and still gets rid of that dual function crud, simplifying
> > smp_call_function_many again.

> Nice! I will add it on top, if you don’t mind (instead of squashing it).

Not at all.

> The original decision to have local/remote functions was mostly to provide
> generality.
> 
> I would change the last argument of __smp_call_function_many() from “wait”
> to “flags” that would indicate whether to run the function locally, since I
> don’t want to change the semantics of smp_call_function_many() and decide
> whether to run the function locally purely based on the mask. Let me know if
> you disagree.

Agreed.

Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Nadav Amit via Virtualization
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func_remote,
>> -   (void *)info, 1);
>> +__smp_call_function_many(cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> + (void *)info, 1);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>>   (void *)info, 1);
>>  }
>> }
> 
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably wise: get stuff
> merged, etc.).
> 
> struct __call_single_data {
>struct llist_node  llist;/* 0 8 */
>smp_call_func_tfunc; /* 8 8 */
>void * info; /*16 8 */
>unsigned int   flags;/*24 4 */
> 
>/* size: 32, cachelines: 1, members: 4 */
>/* padding: 4 */
>/* last cacheline: 32 bytes */
> };
> 
> struct flush_tlb_info {
>struct mm_struct * mm;   /* 0 8 */
>long unsigned int  start;/* 8 8 */
>long unsigned int  end;  /*16 8 */
>u64new_tlb_gen;  /*24 8 */
>unsigned int   stride_shift; /*32 4 */
>bool   freed_tables; /*36 1 */
> 
>/* size: 40, cachelines: 1, members: 6 */
>/* padding: 3 */
>/* last cacheline: 40 bytes */
> };
> 
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
> 
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
> 
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
>   unsigned long   start;
>   unsigned long   end;
>   u64 new_tlb_gen;
> - unsigned intstride_shift;
> - boolfreed_tables;
> + unsigned intcpu;
> + unsigned short  stride_shift;
> + unsigned char   freed_tables;
> };
> 
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
>   flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
> 
> +static void flush_tlb_func(void *info)
> +{
> + const struct flush_tlb_info *f = info;
> + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> + bool local = false;
> +
> + if (f->cpu == smp_processor_id()) {
> + local = true;
> + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
> + } else {
> + inc_irq_stat(irq_tlb_count);
> +
> + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + }
> +
> + flush_tlb_func_common(f, local, reason);
> +}
> +
> static bool tlb_is_not_lazy(int cpu)
> {
>   return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);

Nice! I will add it on top, if you don’t mind (instead of squashing it).

The original decision to have local/remote functions was mostly to provide
generality.

I would change the last argument of __smp_call_function_many() from “wait”
to “flags” that would indicate whether to run the function locally, since I
don’t want to change the semantics of smp_call_function_many() and decide
whether to run the function locally purely based on the mask. Let me know if
you disagree.
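
A minimal sketch of the proposed interface change, with hypothetical flag
names (the exact constants are not part of this thread):

/* Replace the boolean 'wait' parameter with a flags word, so callers
 * can ask both to wait for completion and to run func locally. */
#define SCF_WAIT	(1U << 0)	/* wait until all queued CPUs finish */
#define SCF_RUN_LOCAL	(1U << 1)	/* also invoke func on this CPU */

void __smp_call_function_many(const struct cpumask *mask,
			      smp_call_func_t func, void *info,
			      unsigned int flags);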

Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Peter Zijlstra
On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>* doing a speculative memory access.
>*/
>   if (info->freed_tables) {
> - smp_call_function_many(cpumask, flush_tlb_func_remote,
> -(void *)info, 1);
> + __smp_call_function_many(cpumask, flush_tlb_func_remote,
> +  flush_tlb_func_local,
> +  (void *)info, 1);
>   } else {
>   /*
>* Although we could have used on_each_cpu_cond_mask(),
> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>   if (tlb_is_not_lazy(cpu))
>   __cpumask_set_cpu(cpu, cond_cpumask);
>   }
> - smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> + __smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> +  flush_tlb_func_local,
>(void *)info, 1);
>   }
>  }

Do we really need that _local/_remote distinction? ISTR you had a patch
that frobbed flush_tlb_info into the csd and that gave space
constraints, but I'm not seeing that here (probably wise: get stuff
merged, etc.).

struct __call_single_data {
struct llist_node  llist;/* 0 8 */
smp_call_func_tfunc; /* 8 8 */
void * info; /*16 8 */
unsigned int   flags;/*24 4 */

/* size: 32, cachelines: 1, members: 4 */
/* padding: 4 */
/* last cacheline: 32 bytes */
};

struct flush_tlb_info {
struct mm_struct * mm;   /* 0 8 */
long unsigned int  start;/* 8 8 */
long unsigned int  end;  /*16 8 */
u64new_tlb_gen;  /*24 8 */
unsigned int   stride_shift; /*32 4 */
bool   freed_tables; /*36 1 */

/* size: 40, cachelines: 1, members: 6 */
/* padding: 3 */
/* last cacheline: 40 bytes */
};

IIRC what you did was make void *__call_single_data::info the last
member and a union until the full cacheline size (64). Given the above
that would get us 24 bytes for csd, leaving us 40 for that
flush_tlb_info.
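
A minimal sketch of that layout, reconstructed from the pahole output above
(the combined structure is illustrative, not code from this series):

struct csd_tlb_info {
	struct llist_node	llist;		/*  0  8 */
	smp_call_func_t		func;		/*  8  8 */
	unsigned int		flags;		/* 16  4 */
						/* 4-byte hole */
	union {					/* 24 40 */
		void			*info;		/* kept for generic csd users */
		struct flush_tlb_info	tlb_info;	/* inlined TLB flush arguments */
	};
};						/* 64 bytes: exactly one cacheline */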

But then we can still do something like the below, which doesn't change
things and still gets rid of that dual function crud, simplifying
smp_call_function_many again.

Index: linux-2.6/arch/x86/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
+++ linux-2.6/arch/x86/include/asm/tlbflush.h
@@ -546,8 +546,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intcpu;
+   unsigned short  stride_shift;
+   unsigned char   freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c
+++ linux-2.6/arch/x86/mm/tlb.c
@@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static void flush_tlb_func(void *info)
+{
+   const struct flush_tlb_info *f = info;
+   enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
+   bool local = false;
+
+   if (f->cpu == smp_processor_id()) {
+   local = true;
+   reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
+   } else {
+   inc_irq_stat(irq_tlb_count);
+
+   if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+   return;
+
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+   }
+
+   flush_tlb_func_common(f, local, reason);
+}
+
 static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
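
With a single callback, the call site records the initiating CPU in
flush_tlb_info so each invocation can tell local from remote. A hedged
sketch of such a caller (assuming the get_flush_tlb_info()/
put_flush_tlb_info() helpers from arch/x86/mm/tlb.c; not code from this
series):

	struct flush_tlb_info *info;

	preempt_disable();
	info = get_flush_tlb_info(mm, start, end, stride_shift,
				  freed_tables, new_tlb_gen);
	/* lets flush_tlb_func() distinguish the local invocation */
	info->cpu = smp_processor_id();
	/* run flush_tlb_func() on every CPU in the mask, waiting for all */
	on_each_cpu_mask(mm_cpumask(mm), flush_tlb_func, (void *)info, true);
	put_flush_tlb_info();
	preempt_enable();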



[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-18 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
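
A hedged sketch of that local fast path, with hypothetical helper names
(the actual changes follow in the diffs below):

	/*
	 * If no CPU other than the current one needs a flush, run the
	 * flush locally and skip the SMP machinery entirely.
	 */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) >= nr_cpu_ids) {
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func_local((void *)info);
		local_irq_enable();
	} else {
		flush_tlb_multi(mm_cpumask(mm), info);
	}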

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 47 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
 
void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);