Re: [Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-11 Thread Boris Ostrovsky
On 9/11/18 10:38 AM, Jan Beulich wrote:
 On 11.09.18 at 16:17,  wrote:
>> On 9/11/18 3:54 AM, Jan Beulich wrote:
>> On 10.09.18 at 23:56,  wrote:
 On 09/10/2018 10:03 AM, Jan Beulich wrote:
> Having noticed that VMLOAD alone is about as fast as a single of the
> involved WRMSRs, I thought it might be a reasonable idea to also use it
> for PV. Measurements, however, have shown that an actual improvement can
> be achieved only with an early prefetch of the VMCB (thanks to Andrew
> for suggesting to try this), which I have to admit I can't really
> explain. This way on my Fam15 box context switch takes over 100 clocks
> less on average (the measured values are heavily varying in all cases,
> though).
>
> This is intentionally not using a new hvm_funcs hook: For one, this is
> all about PV, and something similar can hardly be done for VMX.
> Furthermore the indirect to direct call patching that is meant to be
> applied to most hvm_funcs hooks would be ugly to make work with
> functions having more than 6 parameters.
>
> Signed-off-by: Jan Beulich 
> Acked-by: Brian Woods 
> ---
> v2: Re-base.
> ---
> Besides the mentioned oddity with measured performance, I've also
> noticed a significant difference (of at least 150 clocks) between
> measuring immediately around the calls to svm_load_segs() and measuring
> immediately inside the function.
>


>  
> +#ifdef CONFIG_PV
> +bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
> +   unsigned int fs_sel, unsigned long fs_base,
> +   unsigned int gs_sel, unsigned long gs_base,
> +   unsigned long gs_shadow)
> +{
> +unsigned int cpu = smp_processor_id();
> +struct vmcb_struct *vmcb = per_cpu(host_vmcb_va, cpu);
> +
> +if ( unlikely(!vmcb) )
> +return false;
> +
> +if ( !ldt_base )
> +{
> +asm volatile ( "prefetch %0" :: "m" (vmcb->ldtr) );
> +return true;


 Could you explain why this is true? We haven't loaded FS/GS here.
>>>
>>> A zero ldt_base argument indicates a prefetch request. This is an
>>> agreement between callers of the function and its implementation.
>>
>>
>> Oh, so this is what svm_load_segs(0, 0, 0, 0, 0, 0, 0) is for?
>>
>> If yes then IMO a separate call would make things a bit clearer,
>> especially since the return value is ignored.
> 
> Well, to me having a single central place where everything gets done
> seemed better. And it looks as if Brian agreed, considering I already
> have his ack for the patch. Let me know if you feel strongly.


I would at least like to have a comment explaining the calling convention.
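
Purely as an illustration (the wording is a suggestion only, not something
taken from the patch), such a comment above the function might read:

/*
 * Load a PV vCPU's LDT/FS/GS state via VMLOAD of the per-CPU host VMCB,
 * or prefetch that VMCB.
 *
 * ldt_base == 0 requests a prefetch-only call: the host VMCB is touched
 * so that its mapping / cache lines are pulled in ahead of the real
 * VMLOAD; no segment state is loaded, and the return value only says
 * whether a host VMCB mapping exists at all.
 *
 * Otherwise LDTR, FS, GS and shadow GS are loaded from the given
 * selectors/bases.  A false return means no host VMCB VA is available
 * and the caller has to fall back to loading the registers manually.
 */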


> 
 I also couldn't find discussion about prefetch --- why is prefetching
 ldtr expected to help?
>>>
>>> See the patch description. ldtr as the element is a pretty random
>>> choice between the various fields VMLOAD touches. It's (presumably)
>>> more the page walk than the actual cache line(s) that we want to
>>> be pulled in ahead of time. I can only guess that VMLOAD execution
>>> is more "synchronous" wrt its memory accesses and/or latency to
>>> completion than other (simpler) instructions.
>>
>> I think a code comment would be very helpful (including the fact that
>> ldtr is an arbitrary field), even if this is mentioned in the commit
>> message.
> 
> I would likely have added a comment if I could firmly state what's
> going on. But this is derived from experiments only - I'd require
> AMD to fill in the holes before I could write a (useful) comment.


Well, since we have actual code we should be able to explain why we have
it ;-). Even if this is speculation on your part.

Otherwise someone looking at this will (likely?) have no idea about
what's going on, and doing git blame doesn't always work.

-boris


Re: [Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-11 Thread Jan Beulich
>>> On 11.09.18 at 16:17,  wrote:
> On 9/11/18 3:54 AM, Jan Beulich wrote:
> On 10.09.18 at 23:56,  wrote:
>>> On 09/10/2018 10:03 AM, Jan Beulich wrote:
 Having noticed that VMLOAD alone is about as fast as a single of the
 involved WRMSRs, I thought it might be a reasonable idea to also use it
 for PV. Measurements, however, have shown that an actual improvement can
 be achieved only with an early prefetch of the VMCB (thanks to Andrew
 for suggesting to try this), which I have to admit I can't really
 explain. This way on my Fam15 box context switch takes over 100 clocks
 less on average (the measured values are heavily varying in all cases,
 though).

 This is intentionally not using a new hvm_funcs hook: For one, this is
 all about PV, and something similar can hardly be done for VMX.
 Furthermore the indirect to direct call patching that is meant to be
 applied to most hvm_funcs hooks would be ugly to make work with
 functions having more than 6 parameters.

 Signed-off-by: Jan Beulich 
 Acked-by: Brian Woods 
 ---
 v2: Re-base.
 ---
 Besides the mentioned oddity with measured performance, I've also
 noticed a significant difference (of at least 150 clocks) between
 measuring immediately around the calls to svm_load_segs() and measuring
 immediately inside the function.

>>>
>>>
  
 +#ifdef CONFIG_PV
 +bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
 +   unsigned int fs_sel, unsigned long fs_base,
 +   unsigned int gs_sel, unsigned long gs_base,
 +   unsigned long gs_shadow)
 +{
 +unsigned int cpu = smp_processor_id();
 +struct vmcb_struct *vmcb = per_cpu(host_vmcb_va, cpu);
 +
 +if ( unlikely(!vmcb) )
 +return false;
 +
 +if ( !ldt_base )
 +{
 +asm volatile ( "prefetch %0" :: "m" (vmcb->ldtr) );
 +return true;
>>>
>>>
>>> Could you explain why this is true? We haven't loaded FS/GS here.
>> 
>> A zero ldt_base argument indicates a prefetch request. This is an
>> agreement between callers of the function and its implementation.
> 
> 
> Oh, so this is what svm_load_segs(0, 0, 0, 0, 0, 0, 0) is for?
> 
> If yes then IMO a separate call would make things a bit clearer,
> especially since the return value is ignored.

Well, to me having a single central place where everything gets done
seemed better. And it looks as if Brian agreed, considering I already
have his ack for the patch. Let me know if you feel strongly.

>>> I also couldn't find discussion about prefetch --- why is prefetching
>>> ldtr expected to help?
>> 
>> See the patch description. ldtr as the element is a pretty random
>> choice between the various fields VMLOAD touches. It's (presumably)
>> more the page walk than the actual cache line(s) that we want to
>> be pulled in ahead of time. I can only guess that VMLOAD execution
>> is more "synchronous" wrt its memory accesses and/or latency to
>> completion than other (simpler) instructions.
> 
> I think a code comment would be very helpful (including the fact that
> ldtr is an arbitrary field), even if this is mentioned in the commit
> message.

I would likely have added a comment if I could firmly state what's
going on. But this is derived from experiments only - I'd require
AMD to fill in the holes before I could write a (useful) comment.

Jan



Re: [Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-11 Thread Boris Ostrovsky
On 9/11/18 3:54 AM, Jan Beulich wrote:
 On 10.09.18 at 23:56,  wrote:
>> On 09/10/2018 10:03 AM, Jan Beulich wrote:
>>> Having noticed that VMLOAD alone is about as fast as a single of the
>>> involved WRMSRs, I thought it might be a reasonable idea to also use it
>>> for PV. Measurements, however, have shown that an actual improvement can
>>> be achieved only with an early prefetch of the VMCB (thanks to Andrew
>>> for suggesting to try this), which I have to admit I can't really
>>> explain. This way on my Fam15 box context switch takes over 100 clocks
>>> less on average (the measured values are heavily varying in all cases,
>>> though).
>>>
>>> This is intentionally not using a new hvm_funcs hook: For one, this is
>>> all about PV, and something similar can hardly be done for VMX.
>>> Furthermore the indirect to direct call patching that is meant to be
>>> applied to most hvm_funcs hooks would be ugly to make work with
>>> functions having more than 6 parameters.
>>>
>>> Signed-off-by: Jan Beulich 
>>> Acked-by: Brian Woods 
>>> ---
>>> v2: Re-base.
>>> ---
>>> Besides the mentioned oddity with measured performance, I've also
>>> noticed a significant difference (of at least 150 clocks) between
>>> measuring immediately around the calls to svm_load_segs() and measuring
>>> immediately inside the function.
>>>
>>
>>
>>>  
>>> +#ifdef CONFIG_PV
>>> +bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
>>> +   unsigned int fs_sel, unsigned long fs_base,
>>> +   unsigned int gs_sel, unsigned long gs_base,
>>> +   unsigned long gs_shadow)
>>> +{
>>> +unsigned int cpu = smp_processor_id();
>>> +struct vmcb_struct *vmcb = per_cpu(host_vmcb_va, cpu);
>>> +
>>> +if ( unlikely(!vmcb) )
>>> +return false;
>>> +
>>> +if ( !ldt_base )
>>> +{
>>> +asm volatile ( "prefetch %0" :: "m" (vmcb->ldtr) );
>>> +return true;
>>
>>
>> Could you explain why this is true? We haven't loaded FS/GS here.
> 
> A zero ldt_base argument indicates a prefetch request. This is an
> agreement between callers of the function and its implementation.


Oh, so this is what svm_load_segs(0, 0, 0, 0, 0, 0, 0) is for?

If yes then IMO a separate call would make things a bit clearer,
especially since the return value is ignored.
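
For instance (name and shape purely hypothetical, just to illustrate the
suggestion), a thin wrapper would keep the single implementation while
making the prefetch call sites self-documenting:

/* Hypothetical helper -- not part of the patch as posted. */
static inline void svm_load_segs_prefetch(void)
{
    /* ldt_base == 0 selects the prefetch-only path in svm_load_segs(). */
    (void)svm_load_segs(0, 0, 0, 0, 0, 0, 0);
}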


> 
>> I also couldn't find discussion about prefetch --- why is prefetching
>> ldtr expected to help?
> 
> See the patch description. ldtr as the element is a pretty random
> choice between the various fields VMLOAD touches. It's (presumably)
> more the page walk than the actual cache line(s) that we want to
> be pulled in ahead of time. I can only guess that VMLOAD execution
> is more "synchronous" wrt its memory accesses and/or latency to
> completion than other (simpler) instructions.

I think a code comment would be very helpful (including the fact that
ldtr is an arbitrary field), even if this is mentioned in the commit
message.

-boris

Re: [Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-11 Thread Jan Beulich
>>> On 10.09.18 at 23:56,  wrote:
> On 09/10/2018 10:03 AM, Jan Beulich wrote:
>> Having noticed that VMLOAD alone is about as fast as a single of the
>> involved WRMSRs, I thought it might be a reasonable idea to also use it
>> for PV. Measurements, however, have shown that an actual improvement can
>> be achieved only with an early prefetch of the VMCB (thanks to Andrew
>> for suggesting to try this), which I have to admit I can't really
>> explain. This way on my Fam15 box context switch takes over 100 clocks
>> less on average (the measured values are heavily varying in all cases,
>> though).
>>
>> This is intentionally not using a new hvm_funcs hook: For one, this is
>> all about PV, and something similar can hardly be done for VMX.
>> Furthermore the indirect to direct call patching that is meant to be
>> applied to most hvm_funcs hooks would be ugly to make work with
>> functions having more than 6 parameters.
>>
>> Signed-off-by: Jan Beulich 
>> Acked-by: Brian Woods 
>> ---
>> v2: Re-base.
>> ---
>> Besides the mentioned oddity with measured performance, I've also
>> noticed a significant difference (of at least 150 clocks) between
>> measuring immediately around the calls to svm_load_segs() and measuring
>> immediately inside the function.
>>
> 
> 
>>  
>> +#ifdef CONFIG_PV
>> +bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
>> +   unsigned int fs_sel, unsigned long fs_base,
>> +   unsigned int gs_sel, unsigned long gs_base,
>> +   unsigned long gs_shadow)
>> +{
>> +unsigned int cpu = smp_processor_id();
>> +struct vmcb_struct *vmcb = per_cpu(host_vmcb_va, cpu);
>> +
>> +if ( unlikely(!vmcb) )
>> +return false;
>> +
>> +if ( !ldt_base )
>> +{
>> +asm volatile ( "prefetch %0" :: "m" (vmcb->ldtr) );
>> +return true;
> 
> 
> Could you explain why this is true? We haven't loaded FS/GS here.

A zero ldt_base argument indicates a prefetch request. This is an
agreement between callers of the function and its implementation.

> I also couldn't find discussion about prefetch --- why is prefetching
> ldtr expected to help?

See the patch description. ldtr as the element is a pretty random
choice between the various fields VMLOAD touches. It's (presumably)
more the page walk than the actual cache line(s) that we want to
be pulled in ahead of time. I can only guess that VMLOAD execution
is more "synchronous" wrt its memory accesses and/or latency to
completion than other (simpler) instructions.
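
A condensed sketch of the resulting pattern (simplified from the patch;
the FSGSBASE/32-bit/idle checks are omitted):

/* Early in __context_switch(), once the next vCPU is known: */
svm_load_segs(0, 0, 0, 0, 0, 0, 0);   /* prefetch-only: touches vmcb->ldtr */

/* ... unrelated context switch work runs here, hiding the latency ... */

/* Later, in load_segments(), the real VMLOAD-backed load (kernel mode case): */
fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
                           uregs->fs, n->arch.pv.fs_base,
                           uregs->gs, n->arch.pv.gs_base_kernel,
                           n->arch.pv.gs_base_user);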

Jan



Re: [Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-10 Thread Boris Ostrovsky
On 09/10/2018 10:03 AM, Jan Beulich wrote:
> Having noticed that VMLOAD alone is about as fast as a single of the
> involved WRMSRs, I thought it might be a reasonable idea to also use it
> for PV. Measurements, however, have shown that an actual improvement can
> be achieved only with an early prefetch of the VMCB (thanks to Andrew
> for suggesting to try this), which I have to admit I can't really
> explain. This way on my Fam15 box context switch takes over 100 clocks
> less on average (the measured values are heavily varying in all cases,
> though).
>
> This is intentionally not using a new hvm_funcs hook: For one, this is
> all about PV, and something similar can hardly be done for VMX.
> Furthermore the indirect to direct call patching that is meant to be
> applied to most hvm_funcs hooks would be ugly to make work with
> functions having more than 6 parameters.
>
> Signed-off-by: Jan Beulich 
> Acked-by: Brian Woods 
> ---
> v2: Re-base.
> ---
> Besides the mentioned oddity with measured performance, I've also
> noticed a significant difference (of at least 150 clocks) between
> measuring immediately around the calls to svm_load_segs() and measuring
> immediately inside the function.
>


>  
> +#ifdef CONFIG_PV
> +bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
> +   unsigned int fs_sel, unsigned long fs_base,
> +   unsigned int gs_sel, unsigned long gs_base,
> +   unsigned long gs_shadow)
> +{
> +unsigned int cpu = smp_processor_id();
> +struct vmcb_struct *vmcb = per_cpu(host_vmcb_va, cpu);
> +
> +if ( unlikely(!vmcb) )
> +return false;
> +
> +if ( !ldt_base )
> +{
> +asm volatile ( "prefetch %0" :: "m" (vmcb->ldtr) );
> +return true;


Could you explain why this is true? We haven't loaded FS/GS here.

I also couldn't find discussion about prefetch --- why is prefetching
ldtr expected to help?

-boris

[Xen-devel] [PATCH v2] x86: use VMLOAD for PV context switch

2018-09-10 Thread Jan Beulich
Having noticed that VMLOAD alone is about as fast as a single one of the
involved WRMSRs, I thought it might be a reasonable idea to also use it
for PV. Measurements, however, have shown that an actual improvement can
be achieved only with an early prefetch of the VMCB (thanks to Andrew
for suggesting to try this), which I have to admit I can't really
explain. This way, on my Fam15 box, a context switch takes over 100
clocks less on average (the measured values vary heavily in all cases,
though).

This is intentionally not using a new hvm_funcs hook: for one, this is
all about PV, and something similar can hardly be done for VMX.
Furthermore, the indirect-to-direct call patching that is meant to be
applied to most hvm_funcs hooks would be ugly to make work with
functions having more than 6 parameters.

Signed-off-by: Jan Beulich 
Acked-by: Brian Woods 
---
v2: Re-base.
---
Besides the mentioned oddity with measured performance, I've also
noticed a significant difference (of at least 150 clocks) between
measuring immediately around the calls to svm_load_segs() and measuring
immediately inside the function.
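
For reference, the measurements presumably amount to a simple TSC delta
around the call; a sketch only (the exact instrumentation isn't part of
the patch, and rdtsc_ordered() merely stands in for a serialising TSC
read):

uint64_t t0, t1;

t0 = rdtsc_ordered();
fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
                           uregs->fs, n->arch.pv.fs_base,
                           uregs->gs, n->arch.pv.gs_base_kernel,
                           n->arch.pv.gs_base_user);
t1 = rdtsc_ordered();
/* Accumulating t1 - t0 here vs. taking the same pair of readings inside
 * svm_load_segs() is what shows the 150+ clock difference noted above. */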

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1281,11 +1282,35 @@ static void load_segments(struct vcpu *n
 struct cpu_user_regs *uregs = &n->arch.user_regs;
 int all_segs_okay = 1;
 unsigned int dirty_segment_mask, cpu = smp_processor_id();
+bool fs_gs_done = false;
 
 /* Load and clear the dirty segment mask. */
 dirty_segment_mask = per_cpu(dirty_segment_mask, cpu);
 per_cpu(dirty_segment_mask, cpu) = 0;
 
+#ifdef CONFIG_HVM
+if ( !is_pv_32bit_vcpu(n) && !cpu_has_fsgsbase && cpu_has_svm &&
+ !((uregs->fs | uregs->gs) & ~3) &&
+ /*
+  * The remaining part is just for optimization: If only shadow GS
+  * needs loading, there's nothing to be gained here.
+  */
+ (n->arch.pv.fs_base | n->arch.pv.gs_base_user) )
+{
+fs_gs_done = n->arch.flags & TF_kernel_mode
+? svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
+uregs->fs, n->arch.pv.fs_base,
+uregs->gs, n->arch.pv.gs_base_kernel,
+n->arch.pv.gs_base_user)
+: svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
+uregs->fs, n->arch.pv.fs_base,
+uregs->gs, n->arch.pv.gs_base_user,
+n->arch.pv.gs_base_kernel);
+}
+#endif
+if ( !fs_gs_done )
+load_LDT(n);
+
 /* Either selector != 0 ==> reload. */
 if ( unlikely((dirty_segment_mask & DIRTY_DS) | uregs->ds) )
 {
@@ -1301,7 +1326,7 @@ static void load_segments(struct vcpu *n
 }
 
 /* Either selector != 0 ==> reload. */
-if ( unlikely((dirty_segment_mask & DIRTY_FS) | uregs->fs) )
+if ( unlikely((dirty_segment_mask & DIRTY_FS) | uregs->fs) && !fs_gs_done )
 {
 all_segs_okay &= loadsegment(fs, uregs->fs);
 /* non-nul selector updates fs_base */
@@ -1310,7 +1335,7 @@ static void load_segments(struct vcpu *n
 }
 
 /* Either selector != 0 ==> reload. */
-if ( unlikely((dirty_segment_mask & DIRTY_GS) | uregs->gs) )
+if ( unlikely((dirty_segment_mask & DIRTY_GS) | uregs->gs) && !fs_gs_done )
 {
 all_segs_okay &= loadsegment(gs, uregs->gs);
 /* non-nul selector updates gs_base_user */
@@ -1318,7 +1343,7 @@ static void load_segments(struct vcpu *n
 dirty_segment_mask &= ~DIRTY_GS_BASE;
 }
 
-if ( !is_pv_32bit_vcpu(n) )
+if ( !fs_gs_done && !is_pv_32bit_vcpu(n) )
 {
 /* This can only be non-zero if selector is NULL. */
 if ( n->arch.pv.fs_base | (dirty_segment_mask & DIRTY_FS_BASE) )
@@ -1653,6 +1678,12 @@ static void __context_switch(void)
 
 write_ptbase(n);
 
+#if defined(CONFIG_PV) && defined(CONFIG_HVM)
+if ( is_pv_domain(nd) && !is_pv_32bit_domain(nd) && !is_idle_domain(nd) &&
+ !cpu_has_fsgsbase && cpu_has_svm )
+svm_load_segs(0, 0, 0, 0, 0, 0, 0);
+#endif
+
 if ( need_full_gdt(nd) &&
  ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
 {
@@ -1714,10 +1745,7 @@ void context_switch(struct vcpu *prev, s
 local_irq_enable();
 
 if ( is_pv_domain(nextd) )
-{
-load_LDT(next);
 load_segments(next);
-}
 
 ctxt_switch_levelling(next);
 
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -78,6 +78,9 @@ static struct hvm_function_table svm_fun
  */
 static DEFINE_PER_CPU_READ_MOSTLY(paddr_t, hsa);
 static DEFINE_PER_CPU_READ_MOSTLY(paddr_t, host_vmcb);
+#ifdef CONFIG_PV
+static DEFINE_PER_CPU(struct vmcb_struct *, host_vmcb_va);
+#endif
 
 static bool_t amd_erratum383_found __read_mostly;
 
@@ -1567,6 +1570,14 @@ static void svm_cpu_dead(unsigned int cp
 *this_hsa =