On Sat, Mar 19, 2016 at 11:21:19PM -0400, Waiman Long wrote:

> In qrwlock, the reader that is spinning on the lock will need to notify
> the next reader in the queue when the lock is free. That introduces a
> reader-to-reader latency that is not present in the original rwlock.
How did you find this 'problem'?

> That is the price for reducing lock cacheline contention. It also
> reduces the performance benefit of qrwlock on reader heavy workloads.
>
> However, if we allow a limited number of readers to spin on the
> lock simultaneously, we can eliminate some of the reader-to-reader
> latencies at the expense of a bit more cacheline contention and
> probably more power consumption.

So the embedded people might not like that much.

> This patch changes the reader slowpath to allow multiple readers to
> spin on the lock. The maximum number of concurrent readers allowed
> is currently set to 4 to limit the amount of additional cacheline
> contention while improving reader performance on most workloads. If
> a writer comes to the queue head, however, it will stop additional
> readers from coming out.
>
> Using a multi-threaded locking microbenchmark on a 4-socket 40-core
> Haswell-EX system, the locking throughput of the 4.5-rc6 kernel with
> or without the patch was as follows:

Do you have an actual real world benchmark where this makes a difference?

> /**
>  * queued_read_lock_slowpath - acquire read lock of a queue rwlock
>  * @lock: Pointer to queue rwlock structure
>  * @cnts: Current qrwlock lock value
>  */
> void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> {
> +	bool locked = true;
> +
> 	/*
> 	 * Readers come here when they cannot get the lock without waiting
> 	 */
> @@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> 	 * semantics) until the lock is available without waiting in
> 	 * the queue.
> 	 */
> +		while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> +			cpu_relax_lowlatency();
> +			cnts = atomic_read_acquire(&lock->cnts);
> +		}
> 		return;
> 	}
> 	atomic_sub(_QR_BIAS, &lock->cnts);
> @@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> 	 * The ACQUIRE semantics of the following spinning code ensure
> 	 * that accesses can't leak upwards out of our subsequent critical
> 	 * section in the case that the lock is currently held for write.
> +	 *
> +	 * The reader increments the reader count & waits until the writer
> +	 * releases the lock.
> 	 */
> 	cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> +	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> +		if (locked && ((cnts >> _QR_SHIFT) < MAX_SPINNING_READERS)) {
> +			/*
> +			 * Unlock the wait queue so that more readers can
> +			 * come forward and wait for the writer to exit
> +			 * as long as no more than MAX_SPINNING_READERS
> +			 * readers are present.
> +			 */
> +			arch_spin_unlock(&lock->wait_lock);
> +			locked = false;

Only 1 more can come forward with this logic. How can you ever get to 4?

Also, what says the next in queue is a reader?

> +		}
> +		cpu_relax_lowlatency();
> +		cnts = atomic_read_acquire(&lock->cnts);
> +	}
>
> 	/*
> 	 * Signal the next one in queue to become queue head
> 	 */
> +	if (locked)
> +		arch_spin_unlock(&lock->wait_lock);
> }
> EXPORT_SYMBOL(queued_read_lock_slowpath);
>
> --
> 1.7.1
>