Re: [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-05-19 Thread Waiman Long

On 05/08/2014 03:12 PM, Peter Zijlstra wrote:

On Wed, May 07, 2014 at 11:01:38AM -0400, Waiman Long wrote:


No, we want the unfair thing for VIRT, not PARAVIRT.



Yes, you are right. I will change that to VIRT.


diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 9e7659e..10e87e1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -227,6 +227,14 @@ static __always_inline int get_qlock(struct qspinlock *lock)
  {
struct __qspinlock *l = (void *)lock;

+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+	if (static_key_false(&paravirt_unfairlocks_enabled))
+   /*
+* Need to use atomic operation to get the lock when
+* lock stealing can happen.
+*/
+   return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;

That's missing {}.


It is a single statement, which doesn't need braces according to the
kernel coding style. I could move the comment above the condition to
make it easier to read.
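
For illustration, the comment-moved variant would look like this (a
sketch of the same logic as the diff above; with the comment out of the
if body, the single statement no longer needs braces):

#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
	/*
	 * Need to use an atomic operation to get the lock when
	 * lock stealing can happen.
	 */
	if (static_key_false(&paravirt_unfairlocks_enabled))
		return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
#endif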



+#endif



barrier();
ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
barrier();


But no, what you want is:

static __always_inline bool virt_lock(struct qspinlock *lock)
{
#ifdef CONFIG_VIRT_MUCK
	if (static_key_false(&virt_unfairlocks_enabled)) {
		while (!queue_spin_trylock(lock))
			cpu_relax();

		return true;
	}
#endif
	return false;
}


void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
if (virt_lock(lock))
return;

...
}


This is a possible way of doing it. I can do that in the patch series to
simplify it, which should hopefully speed up the review process.


-Longman


Re: [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-05-12 Thread Radim Krčmář
(tl;dr: paravirtualization could be better than unfair qspinlock)

2014-05-07 11:01-0400, Waiman Long:
> Locking is always an issue in a virtualized environment because of 2
> different types of problems:
>  1) Lock holder preemption
>  2) Lock waiter preemption

Paravirtualized ticketlocks have a shortcoming: we don't know which
VCPU a ticket belongs to, so the hypervisor can only blindly yield to
runnable VCPUs after waiters halt in the slowpath.  There aren't enough
"free" bits in the ticket struct to improve on this, so we have
resorted to unfairness.

Qspinlock is different.

Most queued VCPUs already know the VCPU ahead of them, so we have what
it takes to mitigate lock waiter preemption: we can include the
preempted CPU id in a hypercall, the hypervisor will schedule it, and
we'll be woken up from the unlock slowpath [1].
This still isn't perfect: we can wake up a VCPU that got preempted
before it could hypercall, and these hypercalls will propagate one by
one through our queue to the preempted lock holder.
(We'd have to share the whole waiter list to avoid this.
 We could also try to send the holder's id instead and unconditionally
 kick the next-in-line on unlock, but I think that would be slower.)
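
A rough sketch of the waiter side of this idea (pv_kick_cpu() and
pv_halt_self() are made-up hypercall wrappers and SPIN_THRESHOLD is an
assumed tuning constant; this is not code from the series):

static void pv_wait_node(struct mcs_spinlock *node, int prev_cpu)
{
	int loops;

	for (;;) {
		/* Spin for a while before assuming preemption. */
		for (loops = SPIN_THRESHOLD; loops; loops--) {
			if (ACCESS_ONCE(node->locked))
				return;
			cpu_relax();
		}
		/*
		 * The VCPU ahead of us is likely preempted: pass its id
		 * to the hypervisor so it gets scheduled, then halt until
		 * we are kicked from the unlock slowpath.
		 */
		pv_kick_cpu(prev_cpu);	/* hypothetical hypercall */
		pv_halt_self();		/* hypothetical hypercall */
	}
}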

The lock holder problem is tougher because we don't always share who
the holder is.  The tail bits can be used for this, as we don't really
use them before a queue has formed.  It would cost us one bit to
differentiate between holder and tail CPU id [2] and complicate
operations a little, but only for the paravirt case, where the benefits
are expected to be far greater.  A hypercall from the lock slowpath
could then schedule the preempted VCPU right away.
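
A hypothetical encoding of that extra bit (the names are made up; the
point is only that one flag bit says whether the non-zero CPU id in the
otherwise unused tail field is a sole holder or a queue tail):

#define _Q_ID_HOLDER	(1U << 31)	/* hypothetical flag bit */

static inline u32 encode_holder_cpu(int cpu)
{
	return ((u32)cpu + 1) | _Q_ID_HOLDER;	/* 0 still means "empty" */
}

static inline bool id_is_holder(u32 val)
{
	return !!(val & _Q_ID_HOLDER);
}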

I think this could obsolete unfair locks, and I will prepare RFC
patches soon-ish [3] (unless the idea is proved infeasible first).


---
1: It is possible that we could avoid O(N) traversal and hypercall in
   unlock slowpath by scheduling VCPUs in the right order often.
2: Or even less. idx=3 is a bug: if we are spinning in NMI, we are
   almost deadlocked, so we should WARN/BUG if it were to happen; which
   leaves the combination free to mean that the CPU id is a sole holder,
   not a tail.  (I prefer clean code though.)
3: I already tried and got quickly fed up by refactoring, so it might
   get postponed till the series gets merged.


Re: [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-05-08 Thread Peter Zijlstra
On Wed, May 07, 2014 at 11:01:38AM -0400, Waiman Long wrote:


No, we want the unfair thing for VIRT, not PARAVIRT.


> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 9e7659e..10e87e1 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -227,6 +227,14 @@ static __always_inline int get_qlock(struct qspinlock *lock)
>  {
>   struct __qspinlock *l = (void *)lock;
>  
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +	if (static_key_false(&paravirt_unfairlocks_enabled))
> + /*
> +  * Need to use atomic operation to get the lock when
> +  * lock stealing can happen.
> +  */
> + return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;

That's missing {}.

> +#endif


>   barrier();
>   ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
>   barrier();


But no, what you want is:

static __always_inline bool virt_lock(struct qspinlock *lock)
{
#ifdef CONFIG_VIRT_MUCK
	if (static_key_false(&virt_unfairlocks_enabled)) {
		while (!queue_spin_trylock(lock))
			cpu_relax();

		return true;
	}
#endif
	return false;
}


void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
if (virt_lock(lock))
return;

...
}


[PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-05-07 Thread Waiman Long
Locking is always an issue in a virtualized environment because of 2
different types of problems:
 1) Lock holder preemption
 2) Lock waiter preemption

One solution to the lock waiter preemption problem is to allow unfair
locking in a virtualized environment. In this case, a new lock acquirer
can come in and steal the lock if the next-in-line CPU to get the lock
is scheduled out.

A simple unfair lock is the test-and-set byte lock, where a lock
acquirer constantly spins on the lock word and attempts to grab it
when the lock is freed (sketched after this list). This simple unfair
lock has 2 main problems:
 1) The constant spinning on the lock word puts a lot of cacheline
contention traffic on the affected cacheline, thus slowing tasks
that need to access the cacheline.
 2) Lock starvation is a real possibility, especially if the number of
virtual CPUs is large.
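
For reference, such a test-and-set byte lock is essentially the
following (a minimal sketch for illustration, not code from the patch):

/*
 * Minimal test-and-set byte lock: every acquirer spins with an atomic
 * op directly on the shared lock byte, which is the source of the
 * cacheline contention described above.
 */
static inline void tas_spin_lock(u8 *lock)
{
	while (xchg(lock, 1))
		cpu_relax();
}

static inline void tas_spin_unlock(u8 *lock)
{
	barrier();
	ACCESS_ONCE(*lock) = 0;
	barrier();
}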

A simple unfair queue spinlock can be implemented by allowing lock
stealing in the fast path (see the sketch below). The slowpath will
still be the same as before, and all the pending lock acquirers will
have to wait in the queue in FIFO order. This cannot completely solve
the lock waiter preemption problem, but it does help to alleviate its
impact.
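
Concretely, the stealing in the fast path boils down to an atomic
trylock on the lock byte, along the lines of the cmpxchg() in the diff
above (a sketch, not the exact patch code):

static inline int unfair_trylock(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* Steal the lock byte if it is free, even if waiters are queued. */
	return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
}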

To illustrate the performance impact of the various approaches, the
disk workload of the AIM7 benchmark and the ebizzy test were run on
a 4-socket 40-core Westmere-EX system (bare metal, HT off, ramdisk)
on a 3.14-based kernel. The tables below show the performance of the
different kernel flavors.

AIM7 XFS Disk Test
  kernel                     JPM       Real Time   Sys Time   Usr Time
  ------                     ---       ---------   --------   --------
  ticketlock                 5678233   3.17        96.61      5.81
  qspinlock                  5750799   3.13        94.83      5.97
  simple test-and-set        5625000   3.20        98.29      5.93
  simple unfair qspinlock    5750799   3.13        95.91      5.98

AIM7 EXT4 Disk Test
  kernel                     JPM       Real Time   Sys Time   Usr Time
  ------                     ---       ---------   --------   --------
  ticketlock                 1114551   16.15       509.72     7.11
  qspinlock                  2184466   8.24        232.99     6.01
  simple test-and-set        593081    30.35       967.55     9.00
  simple unfair qspinlock    2292994   7.85        222.84     5.89

Ebizzy -m test
  kernel                     records/s   Real Time   Sys Time   Usr Time
  ------                     ---------   ---------   --------   --------
  ticketlock                 2075        10.00       216.35     3.49
  qspinlock                  3023        10.00       198.20     4.80
  simple test-and-set        1667        10.00       198.93     2.89
  simple unfair qspinlock    2915        10.00       165.68     4.31

The disk-xfs workload spent only about 2.88% of CPU time in
_raw_spin_lock(), whereas the disk-ext4 workload spent 57.8% of CPU
time there. With the low spinlock contention of the disk-xfs workload,
there wasn't much performance difference between the lock types. With
heavy spinlock contention, however, the performance of the simple
test-and-set lock plummets compared with the ticket and queue
spinlocks.

An unfair lock in a native environment is generally not a good idea,
as there is a possibility of lock starvation for a heavily contended
lock.

This patch adds a new configuration option for the x86 architecture
to enable the use of unfair queue spinlock (PARAVIRT_UNFAIR_LOCKS) in
a para-virtualized guest. A jump label (paravirt_unfairlocks_enabled)
is used to switch between a fair and an unfair version of the spinlock
code. This jump label will only be enabled in a virtual guest where
the X86_FEATURE_HYPERVISOR feature bit is set.
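
The enabling logic is roughly the following (a sketch of the mechanism
described above; the exact initcall details may differ from the patch):

struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;

static __init int unfair_locks_init_jump(void)
{
	/* Flip the jump label only when running as a guest. */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		static_key_slow_inc(&paravirt_unfairlocks_enabled);
	return 0;
}
early_initcall(unfair_locks_init_jump);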

Enabling this configuration option causes a slight decrease in the
performance of an uncontended lock-unlock operation, by about 1-2%,
mainly due to the use of a static key. However, uncontended lock-unlock
operations are really just a tiny percentage of a real workload, so
there should be no noticeable change in application performance.

With the unfair locking activated on a bare-metal 4-socket Westmere-EX
box, the execution times (in ms) of a spinlock micro-benchmark were
as follows:

  # of    Ticket   Fair         Unfair simple   Unfair
  tasks   lock     queue lock   queue lock      byte lock
  -----   ------   ----------   -------------   ---------
    1        135        135          137             137
    2        890       1082          421             718
    3       1932       2248          708            1263
    4       2829       2819         1030            1916
    5       3834       3522         1323            2327
    6       4963       4173         1723            2938
    7       6299       4875         2067            3292
    8       7691       5563         2360            3768

Executing one task per node, the performance data were:

  # of    Ticket   Fair         Unfair simple   Unfair
  nodes   lock     queue lock   queue lo