Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check

2022-12-05 Thread Usama Arif



On 24/11/2022 13:55, Usama Arif wrote:



On 18/11/2022 00:20, Marc Zyngier wrote:

On Mon, 07 Nov 2022 12:00:44 +,
Usama Arif  wrote:




On 06/11/2022 16:35, Marc Zyngier wrote:

On Fri, 04 Nov 2022 06:20:59 +,
Usama Arif  wrote:


This patchset adds support for vcpu_is_preempted in arm64, which
allows the guest to check if a vcpu was scheduled out, which is
useful to know in case it was holding a lock. vcpu_is_preempted can
be used to improve performance in locking (see owner_on_cpu usage in
mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
and osq_lock) and scheduling (see available_idle_cpu, which is used
in several places in kernel/sched/fair.c, e.g. in wake_affine, to
determine which CPU can run soonest):


[...]


pvcy shows a smaller overall improvement (50%) compared to
vcpu_is_preempted (277%). Host-side flamegraph analysis shows that
~60% of the host time when using pvcy is spent in kvm_handle_wfx,
compared with ~1.5% when using vcpu_is_preempted; hence
vcpu_is_preempted shows the larger improvement.


And have you worked out *why* we spend so much time handling WFE?

M.


It's from the following change in the pvcy patchset:

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e778eefcf214..915644816a85 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
 }

 if (esr & ESR_ELx_WFx_ISS_WFE) {
-   kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
+   int state;
+   while ((state = kvm_pvcy_check_state(vcpu)) == 0)
+   schedule();
+
+   if (state == -1)
+   kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
 } else {
 if (esr & ESR_ELx_WFx_ISS_WFxT)
 vcpu_set_flag(vcpu, IN_WFIT);


If my understanding of the pvcy changes is correct, whenever pvcy
returns an unchanged vcpu state, we schedule to another vcpu, and it
is this constant scheduling where the time is spent. I guess the
effects are much higher when the lock contention is very high. This
can be seen in the pvcy host-side flamegraph as well (~67% of the
time is spent in the schedule() call in kvm_handle_wfx). For
reference, I have put the graph at:
https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg


The real issue here is that we don't try to pick the right vcpu to
run, and strictly rely on schedule() to eventually pick something that
can run.

An interesting thing to do would be to try and fit the directed yield
mechanism there. It would be a lot more interesting than the one-off
vcpu_is_preempted hack, as it gives us a low-level primitive on which
to construct things (pvcy is effectively an mwait-like primitive).


We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
would we determine which vcpu to yield to?
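
To make this concrete, a purely hypothetical sketch: if the guest
published the lock holder's vCPU id through some shared PV page, the
WFE handler could yield to it directly instead of calling schedule()
blindly. kvm_pvcy_lock_holder() below is a made-up helper for
illustration, not an existing API; kvm_get_vcpu() and
kvm_vcpu_yield_to() are existing KVM helpers.

/*
 * Hypothetical only: kvm_pvcy_lock_holder() is an assumed PV hint
 * (e.g. read from a shared guest/host page), not an existing API.
 */
static void kvm_handle_wfe_directed(struct kvm_vcpu *vcpu)
{
	int holder = kvm_pvcy_lock_holder(vcpu);
	struct kvm_vcpu *target;

	if (holder >= 0) {
		target = kvm_get_vcpu(vcpu->kvm, holder);
		/* kvm_vcpu_yield_to() returns > 0 if the target was boosted */
		if (target && kvm_vcpu_yield_to(target) > 0)
			return;
	}

	/* Fall back to the existing heuristic spin/yield path */
	kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
}

Without some PV hint like this, though, the host has nothing to base
the target choice on.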


IMO vcpu_is_preempted is very well integrated into a lot of core
kernel code, e.g. mutex, rtmutex, rwsem and osq_lock. It is also used
in the scheduler to better determine which CPU can run a task
soonest, to select an idle core, etc. I am not sure all of these
cases would be optimized by pvcy. Also, with vcpu_is_preempted, some
of the lock-heavy benchmarks go from spending around 50% of their
time in locking to less than 1%, so I am not sure how much more room
there is for improvement. For reference, the relevant hooks are
sketched below.
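
Paraphrased from current mainline (exact details may differ between
kernel versions):

/*
 * Paraphrased from include/linux/sched.h: optimistic spinning only
 * continues while the lock owner is genuinely running, i.e. its vCPU
 * has not been scheduled out by the host.
 */
static inline bool owner_on_cpu(struct task_struct *owner)
{
	return READ_ONCE(owner->on_cpu) &&
	       !vcpu_is_preempted(task_cpu(owner));
}

/*
 * Paraphrased from kernel/sched/core.c: a CPU only counts as
 * "available idle" to the scheduler if its vCPU is not preempted.
 */
int available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return 0;
	if (vcpu_is_preempted(cpu))
		return 0;
	return 1;
}

Without an arm64 implementation, vcpu_is_preempted() is a stub that
always returns false, so all of these call sites lose their PV
awareness on arm64.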


We could also use vcpu_is_preempted to optimize IPI performance
(along with directed yield to the target IPI vCPU), similar to how
it's done on x86
(https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpen...@tencent.com/).
This case definitely won't be covered by pvcy. A rough sketch of the
guest side of that approach follows.
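
Sketched from the linked x86 posting (simplified; an arm64 equivalent
would additionally need both vcpu_is_preempted from this series and a
yield hypercall, which does not exist yet on arm64):

/*
 * Sketched from the linked x86 patch: after sending the IPIs, check
 * whether any target vCPU is preempted and, if so, ask the host to
 * boost it so it can handle the IPI sooner.
 */
static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
	int cpu;

	/* Deliver the IPIs as usual first */
	native_send_call_func_ipi(mask);

	/* Donate our timeslice to a scheduled-out target, if any */
	for_each_cpu(cpu, mask) {
		if (vcpu_is_preempted(cpu)) {
			kvm_hypercall1(KVM_HC_SCHED_YIELD,
				       per_cpu(x86_cpu_to_apicid, cpu));
			break;
		}
	}
}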


Considering all of the above, i.e. the core kernel integration
already present and the possible future use cases of
vcpu_is_preempted, maybe it's worth making vcpu_is_preempted work on
arm64 independently of pvcy?




Hi,

Just wanted to check if there are any comments on the above? I can
send a v3 with the doc and code fixes suggested in the earlier
reviews if that makes sense.


Thanks,
Usama


Thanks,
Usama



M.



