Spinlock implementation on x86_64 (was Re: [HACKERS] Better LWLocks with compare-and-swap (9.4))

Heikki Linnakangas Wed, 28 Aug 2013 10:06:56 -0700

On 21.05.2013 00:20, Heikki Linnakangas wrote:

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<[email protected]> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:


- 64.09% postgres postgres [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+ 2.97% postgres postgres [.] tas
+ 1.53% postgres libc-2.13.so [.] 0x119873
+ 1.44% postgres postgres [.] GetSnapshotData
+ 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
+ 1.18% postgres postgres [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory
is that after that point all the cores are busy, and processes start
to be
sometimes context switched while holding the spinlock, which kills
performance.


I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes>> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature


Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
static __inline__ int
tas(volatile slock_t *lock)
{

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning, is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/


On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait
loops is attempting to spin on an atomic instruction instead of
spinning on a volatile read. Spinning on a dirty read instead of
attempting to acquire a lock consumes less time and resources. This
allows an application to only attempt to acquire a lock only when it
is free.


Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.

I think we should apply the attached patch to put a non-locked test inTAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found littledifference. And on a 80-core Xeon (160 with hyperthreading) system, hesaw a quite significant gain from it [2]. Reading the thread, I don'tunderstand why it wasn't applied to x86_64 back then. Maybe it was forthe fear of having a negative impact on performance on old Opterons, butI strongly suspect that the performance would be OK even on old Opteronsif you only do the non-locked test in TAS_SPIN() and not in TAS(). Backwhen the comment about poor performance with a non-locking test onOpteron was written, we used the same TAS() macro in the first and whilespinning.

I tried to find some sort of an authoritative guide from AMD that wouldsay how spinlocks should be implemented, but didn't find any. The Intelguide I linked above says we should apply the patch. Linux uses a ticketspinlock thingie nowadays, which is slightly different, but if I'mreading the older pre-ticket version correctly, they used to do thesame. Ie. non-locking test while spinning, but not before the firstattempt to grab the lock. Same in FreeBSD.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch tomaster. But I would love to get feedback from people running differentx86_64 hardware.


[1] http://www.postgresql.org/message-id/[email protected]
[2] http://www.postgresql.org/message-id/[email protected]

- Heikki

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 180c013..2a9c1c4 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -199,6 +199,15 @@ spin_delay(void)
 typedef unsigned char slock_t;
 
 #define TAS(lock) tas(lock)
+/*
+ * On Intel EM64T, it's a win to use a non-locking test before the xchg proper.
+ *
+ * See also Implementing Scalable Atomic Locks for Multi-Core Intel(tm) EM64T
+ * and IA32 by Michael Chynoweth and Mary R. Lee. As of this writing, it is
+ * available at:
+ * http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
+ */
+#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
 
 static __inline__ int
 tas(volatile slock_t *lock)

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Spinlock implementation on x86_64 (was Re: [HACKERS] Better LWLocks with compare-and-swap (9.4))

Reply via email to