Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-14 Thread Paolo Bonzini
Il 13/03/2014 20:49, Waiman Long ha scritto:
 On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
 Il 13/03/2014 12:21, David Vrabel ha scritto:
 On 12/03/14 18:54, Waiman Long wrote:
 This patch adds para-virtualization support to the queue spinlock in
 the same way as was done in the PV ticket lock code. In essence, the
 lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
 = 2^14) and then halt themselves. The queue head waiter will spin
 2*QSPIN_THRESHOLD times before halting itself. After it has spun
 QSPIN_THRESHOLD times, the queue head will assume that the lock
 holder may have been scheduled out and attempt to kick the lock holder's
 CPU if it has the CPU number on hand.

 I don't really understand the reasoning for kicking the lock holder.

 I agree.  If the lock holder isn't running, there's probably a good
 reason for that and going to sleep will not necessarily convince the
 scheduler to give more CPU to the lock holder.  I think there are two
 choices:

 1) use yield_to to donate part of the waiter's quantum to the lock
 holder?  For this we probably need a new, separate hypercall
 interface.  For KVM it would be the same as hlt in the guest but with
 an additional yield_to in the host.

 2) do nothing, just go to sleep.

 Could you get (or do you have) numbers for (2)?
 
 I will take out the lock holder kick portion from the patch. I will also
 try to collect more test data.
 

 More important, I think a barrier is missing:

 Lock holder
 -----------

 // queue_spin_unlock
 barrier();
 ACCESS_ONCE(qlock->lock) = 0;
 barrier();

 
 This is not the unlock code that is used when PV spinlocks are enabled.

It is __queue_spin_unlock.  But you're right:

 if (static_key_false(&paravirt_spinlocks_enabled)) {
         /*
          * Need to atomically clear the lock byte to avoid racing with
          * queue head waiter trying to set _QSPINLOCK_LOCKED_SLOWPATH.
          */
         if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                         == _QSPINLOCK_LOCKED))
                 return;
         else
                 queue_spin_unlock_slowpath(lock);
 } else {
         __queue_spin_unlock(lock);
 }

... indeed the __queue_spin_unlock/pv_kick_node pair is only done if the
waiter has already written _QSPINLOCK_LOCKED_SLOWPATH, and this means
that the lock holder must also observe PV_CPU_HALTED.

So this is correct:

 Nothing protects from writing qlock->lock before pv->cpustate is read,

but this cannot happen:

 leading to this:

 Lock holder                        Waiter
 -----------                        ------
 read pv->cpustate
     (it is PV_CPU_ACTIVE)
                                    pv->cpustate = PV_CPU_HALTED
                                    lockval = cmpxchg(...)
                                    hibernate()
 qlock->lock = 0
 if (pv->cpustate != PV_CPU_HALTED)
         return;

 
 The lock holder will read cpustate only if the lock byte has been
 changed to _QSPINLOCK_LOCKED_SLOWPATH, so the setting of the lock byte
 synchronizes the two threads.

Yes.
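
To make that ordering explicit, here is a minimal stand-alone C11 model
of the handshake (run waiter() and lock_holder() from two threads).  The
constants and field names mirror the patch, but the harness itself is
hypothetical:

#include <stdatomic.h>
#include <assert.h>

enum { LOCKED = 1, LOCKED_SLOWPATH = 3 };  /* stand-ins for _QSPINLOCK_* */
enum { PV_CPU_ACTIVE, PV_CPU_HALTED };

atomic_int lock_byte = LOCKED;
atomic_int cpustate  = PV_CPU_ACTIVE;

void waiter(void)
{
        int old = LOCKED;

        atomic_store(&cpustate, PV_CPU_HALTED);
        /* the CAS is a full barrier: the HALTED store above is ordered
           before the lock-byte update becomes visible */
        if (atomic_compare_exchange_strong(&lock_byte, &old,
                                           LOCKED_SLOWPATH))
                ;       /* __queue_hibernate() would go here */
}

void lock_holder(void)
{
        int old = LOCKED;

        if (atomic_compare_exchange_strong(&lock_byte, &old, 0))
                return;         /* fast path: waiter never halted */
        /*
         * The CAS failed, so the waiter's LOCKED_SLOWPATH write --
         * and therefore its earlier HALTED store -- must be visible.
         */
        assert(atomic_load(&cpustate) == PV_CPU_HALTED);
}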

 The only thing that I am not certain about is the case where the
 waiter is trying to go to sleep while, at the same time, the lock
 holder is trying to kick it. Will there be a missed wakeup because of
 this timing issue?

This is okay.  The kick_cpu hypercall is sticky until the next halt, if
no halt is pending.  Otherwise, pv ticketlocks would have the same issue.
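
In user-space terms, a toy model of that sticky-kick behaviour might
look like the following.  This is purely illustrative; kick_cpu/halt
and the commented helpers are made-up names, not the real hypercall
interface:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool kick_pending;

/* lock holder side: a kick posted while the target is still running
   is latched rather than lost */
void kick_cpu(void)
{
        atomic_store(&kick_pending, true);
        /* wake_target_if_sleeping();  -- hypothetical host-side wakeup */
}

/* waiter side: a latched kick consumes the next halt, so a kick that
   races with going to sleep cannot be missed */
void halt(void)
{
        if (atomic_exchange(&kick_pending, false))
                return;                 /* kick already pending */
        /* block_until_kicked();  -- hypothetical, blocks until kick_cpu() */
}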

Paolo


Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-13 Thread David Vrabel
On 12/03/14 18:54, Waiman Long wrote:
 This patch adds para-virtualization support to the queue spinlock in
 the same way as was done in the PV ticket lock code. In essence, the
 lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
 = 2^14) and then halt themselves. The queue head waiter will spin
 2*QSPIN_THRESHOLD times before halting itself. After it has spun
 QSPIN_THRESHOLD times, the queue head will assume that the lock
 holder may have been scheduled out and attempt to kick the lock holder's
 CPU if it has the CPU number on hand.

I don't really understand the reasoning for kicking the lock holder.  It
will either be: running, runnable, or halted because it's in a slow path
wait for another lock.  In any of these states I do not see how a kick
is useful.

 Enabling the PV code does have a performance impact on spinlock
 acquisitions and releases. The following table shows the execution
 time (in ms) of a spinlock micro-benchmark that does lock/unlock
 operations 5M times for each task versus the number of contending
 tasks on a Westmere-EX system.
 
   # of    Ticket lock              Queue lock
   tasks   PV off/PV on/%Change     PV off/PV on/%Change
   -----   --------------------     --------------------
     1       135/  179/+33%           137/  169/+23%
     2      1045/ 1103/ +6%          1120/ 1536/+37%
     3      1827/ 2683/+47%          2313/ 2425/ +5%
     4      2689/ 4191/+56%          2914/ 3128/ +7%
     5      3736/ 5830/+56%          3715/ 3762/ +1%
     6      4942/ 7609/+54%          4504/ 4558/ +2%
     7      6304/ 9570/+52%          5292/ 5351/ +1%
     8      7736/11323/+46%          6037/ 6097/ +1%

Do you have measurements from tests when VCPUs are overcommitted?

 +#ifdef CONFIG_PARAVIRT_SPINLOCKS
 +/**
 + * queue_spin_unlock_slowpath - kick the CPU of the queue head
 + * @lock : Pointer to queue spinlock structure
 + *
 + * The lock is released after finding the queue head to avoid a race
 + * condition between the queue head and the lock holder.
 + */
 +void queue_spin_unlock_slowpath(struct qspinlock *lock)
 +{
 +       struct qnode *node, *prev;
 +       u32 qcode = (u32)queue_get_qcode(lock);
 +
 +       /*
 +        * Get the queue tail node
 +        */
 +       node = xlate_qcode(qcode);
 +
 +       /*
 +        * Locate the queue head node by following the prev pointer from
 +        * tail to head.
 +        * It is assumed that the PV guests won't have that many CPUs so
 +        * that it won't take a long time to follow the pointers.

This isn't a valid assumption, but this isn't that different from the
search done in the ticket slow unlock path so I guess it's ok.

David


Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-13 Thread Paolo Bonzini

Il 13/03/2014 12:21, David Vrabel ha scritto:

On 12/03/14 18:54, Waiman Long wrote:

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. After it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder's
CPU if it has the CPU number on hand.


I don't really understand the reasoning for kicking the lock holder.


I agree.  If the lock holder isn't running, there's probably a good 
reason for that and going to sleep will not necessarily convince the 
scheduler to give more CPU to the lock holder.  I think there are two 
choices:


1) use yield_to to donate part of the waiter's quantum to the lock
holder?  For this we probably need a new, separate hypercall
interface.  For KVM it would be the same as hlt in the guest but with an
additional yield_to in the host.


2) do nothing, just go to sleep.

Could you get (or do you have) numbers for (2)?

More important, I think a barrier is missing:

Lock holder
-----------

// queue_spin_unlock
barrier();
ACCESS_ONCE(qlock->lock) = 0;
barrier();

// pv_kick_node:
if (pv->cpustate != PV_CPU_HALTED)
        return;
ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
__queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);

Waiter
------

// pv_head_spin_check
ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
lockval = cmpxchg(&qlock->lock,
                  _QSPINLOCK_LOCKED,
                  _QSPINLOCK_LOCKED_SLOWPATH);
if (lockval == 0) {
        /*
         * Can exit now as the lock is free
         */
        ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
        *count = 0;
        return;
}
__queue_hibernate();

Nothing protects from writing qlock->lock before pv->cpustate is read,
leading to this:


Lock holder                         Waiter
-----------                         ------
read pv->cpustate
    (it is PV_CPU_ACTIVE)
                                    pv->cpustate = PV_CPU_HALTED
                                    lockval = cmpxchg(...)
                                    hibernate()
qlock->lock = 0
if (pv->cpustate != PV_CPU_HALTED)
        return;

I think you need this:

-   if (pv->cpustate != PV_CPU_HALTED)
-   return;
-   ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
+   if (cmpxchg(&pv->cpustate, PV_CPU_HALTED, PV_CPU_KICKED)
+   != PV_CPU_HALTED)
+   return;
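
With that change folded in, the kick side would read something like the
sketch below.  The struct name pv_qvars and the surrounding function are
assumptions; only the cmpxchg comes from the suggestion above:

static void pv_kick_node(struct pv_qvars *pv)
{
        /*
         * Claim the HALTED -> KICKED transition atomically, so the
         * test of cpustate and the write of PV_CPU_KICKED can no
         * longer be separated by a concurrent update.
         */
        if (cmpxchg(&pv->cpustate, PV_CPU_HALTED, PV_CPU_KICKED)
                        != PV_CPU_HALTED)
                return;
        __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
}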

Paolo


Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-13 Thread Waiman Long

On 03/13/2014 07:21 AM, David Vrabel wrote:

On 12/03/14 18:54, Waiman Long wrote:

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. After it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder's
CPU if it has the CPU number on hand.

I don't really understand the reasoning for kicking the lock holder.  It
will either be: running, runnable, or halted because it's in a slow path
wait for another lock.  In any of these states I do not see how a kick
is useful.


You may be right. I can certainly take this part out of the patch if 
people don't think that is useful.



Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

   # of    Ticket lock              Queue lock
   tasks   PV off/PV on/%Change     PV off/PV on/%Change
   -----   --------------------     --------------------
     1       135/  179/+33%           137/  169/+23%
     2      1045/ 1103/ +6%          1120/ 1536/+37%
     3      1827/ 2683/+47%          2313/ 2425/ +5%
     4      2689/ 4191/+56%          2914/ 3128/ +7%
     5      3736/ 5830/+56%          3715/ 3762/ +1%
     6      4942/ 7609/+54%          4504/ 4558/ +2%
     7      6304/ 9570/+52%          5292/ 5351/ +1%
     8      7736/11323/+46%          6037/ 6097/ +1%

Do you have measurements from tests when VCPUs are overcommitted?


I don't have a measurement with overcommitted guests yet. I will set up 
such an environment and do some tests on it.



+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/**
+ * queue_spin_unlock_slowpath - kick the CPU of the queue head
+ * @lock : Pointer to queue spinlock structure
+ *
+ * The lock is released after finding the queue head to avoid a race
+ * condition between the queue head and the lock holder.
+ */
+void queue_spin_unlock_slowpath(struct qspinlock *lock)
+{
+       struct qnode *node, *prev;
+       u32 qcode = (u32)queue_get_qcode(lock);
+
+       /*
+        * Get the queue tail node
+        */
+       node = xlate_qcode(qcode);
+
+       /*
+        * Locate the queue head node by following the prev pointer from
+        * tail to head.
+        * It is assumed that the PV guests won't have that many CPUs so
+        * that it won't take a long time to follow the pointers.

This isn't a valid assumption, but this isn't that different from the
search done in the ticket slow unlock path so I guess it's ok.

David


I will change that to say that in most cases, the queue length will be 
short.


-Longman


Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-13 Thread Waiman Long

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:

Il 13/03/2014 12:21, David Vrabel ha scritto:

On 12/03/14 18:54, Waiman Long wrote:

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. After it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder's
CPU if it has the CPU number on hand.


I don't really understand the reasoning for kicking the lock holder.


I agree.  If the lock holder isn't running, there's probably a good 
reason for that and going to sleep will not necessarily convince the 
scheduler to give more CPU to the lock holder.  I think there are two 
choices:


1) use yield_to to donate part of the waiter's quantum to the lock
holder?  For this we probably need a new, separate hypercall
interface.  For KVM it would be the same as hlt in the guest but with
an additional yield_to in the host.


2) do nothing, just go to sleep.

Could you get (or do you have) numbers for (2)?


I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.




More important, I think a barrier is missing:

Lock holder
-----------

// queue_spin_unlock
barrier();
ACCESS_ONCE(qlock->lock) = 0;
barrier();



This is not the unlock code that is used when PV spinlocks are enabled.
The right unlock code is:


if (static_key_false(&paravirt_spinlocks_enabled)) {
        /*
         * Need to atomically clear the lock byte to avoid racing with
         * queue head waiter trying to set _QSPINLOCK_LOCKED_SLOWPATH.
         */
        if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                        == _QSPINLOCK_LOCKED))
                return;
        else
                queue_spin_unlock_slowpath(lock);
} else {
        __queue_spin_unlock(lock);
}


// pv_kick_node:
if (pv->cpustate != PV_CPU_HALTED)
        return;
ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
__queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);

Waiter
------

// pv_head_spin_check
ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
lockval = cmpxchg(&qlock->lock,
                  _QSPINLOCK_LOCKED,
                  _QSPINLOCK_LOCKED_SLOWPATH);
if (lockval == 0) {
        /*
         * Can exit now as the lock is free
         */
        ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
        *count = 0;
        return;
}
__queue_hibernate();

Nothing protects from writing qlock->lock before pv->cpustate is read,
leading to this:


Lock holder                         Waiter
-----------                         ------
read pv->cpustate
    (it is PV_CPU_ACTIVE)
                                    pv->cpustate = PV_CPU_HALTED
                                    lockval = cmpxchg(...)
                                    hibernate()
qlock->lock = 0
if (pv->cpustate != PV_CPU_HALTED)
        return;



The lock holder will read cpustate only if the lock byte has been
changed to _QSPINLOCK_LOCKED_SLOWPATH, so the setting of the lock byte
synchronizes the two threads. The only thing that I am not certain
about is the case where the waiter is trying to go to sleep while, at
the same time, the lock holder is trying to kick it. Will there be a
missed wakeup because of this timing issue?


-Longman



[PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support

2014-03-12 Thread Waiman Long
This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin up to a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. After it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder's
CPU if it has the CPU number on hand. Before being halted, the queue head
waiter will set a flag (_QSPINLOCK_LOCKED_SLOWPATH) in the lock byte
to indicate that the unlock slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous-node pointers stored in the queue
node structure until it finds one that has the wait flag turned
off. It then attempts to kick the CPU of the queue head.

After the queue head has acquired the lock, it will also check the
status of the next node and set _QSPINLOCK_LOCKED_SLOWPATH if it has
been halted.
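
In rough C-like pseudocode, the queue-head wait loop described above
looks something like this (a simplified sketch, not the exact patch
code; pv, prev_cpu, and count are assumed locals):

/* queue head wait loop, condensed */
while (ACCESS_ONCE(qlock->lock) != 0) {
        if (++count == QSPIN_THRESHOLD)
                /* lock holder may be preempted: try kicking its CPU */
                __queue_kick_cpu(prev_cpu, PV_KICK_LOCK_HOLDER);
        if (count >= 2 * QSPIN_THRESHOLD) {
                /* tell the unlocker to take the slowpath, then sleep */
                ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
                if (cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED,
                            _QSPINLOCK_LOCKED_SLOWPATH) == 0)
                        break;          /* lock was freed meanwhile */
                __queue_hibernate();
                ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
                count = 0;
        }
        cpu_relax();
}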

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

  # of    Ticket lock              Queue lock
  tasks   PV off/PV on/%Change     PV off/PV on/%Change
  -----   --------------------     --------------------
    1       135/  179/+33%           137/  169/+23%
    2      1045/ 1103/ +6%          1120/ 1536/+37%
    3      1827/ 2683/+47%          2313/ 2425/ +5%
    4      2689/ 4191/+56%          2914/ 3128/ +7%
    5      3736/ 5830/+56%          3715/ 3762/ +1%
    6      4942/ 7609/+54%          4504/ 4558/ +2%
    7      6304/ 9570/+52%          5292/ 5351/ +1%
    8      7736/11323/+46%          6037/ 6097/ +1%

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The
queue spinlock PV code, on the other hand, only has a minor drop in
performance for 3 or more contending tasks.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/paravirt.h   |   12 ++-
 arch/x86/include/asm/paravirt_types.h |   12 ++
 arch/x86/include/asm/pvqspinlock.h|  232 +
 arch/x86/include/asm/qspinlock.h  |   35 +
 arch/x86/kernel/paravirt-spinlocks.c  |5 +
 kernel/locking/qspinlock.c|   96 ++-
 6 files changed, 390 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..cabc37a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,17 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 }
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu, enum pv_kick_type type)
+{
+   PVOP_VCALL2(pv_lock_ops.kick_cpu, cpu, type);
+}
 
+static __always_inline void __queue_hibernate(void)
+{
+   PVOP_VCALL0(pv_lock_ops.hibernate);
+}
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -723,7 +733,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
-
+#endif
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..fa16aa6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,21 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_kick_type {
+   PV_KICK_LOCK_HOLDER,
+   PV_KICK_QUEUE_HEAD
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*kick_cpu)(int cpu, enum pv_kick_type);
+   void (*hibernate)(void);
+#else
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..13cbc4f
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,232 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ * Queue Spinlock Para-Virtualization (PV) Support
+ *
+ * +------+            +-----+ nxtcpu_p1  +----+
+ * | Lock |            |Queue|----------->|Next|
+ * |Holder|<-----------|Head |<-----------|Node|
+ * +------+ prev_qcode +-----+ prev_qcode +----+
+ *
+ * As