I wrote:
> It might be that this hardware is capable of showing a difference with a
> better-tuned pgbench test, but with an untuned pgbench run, we just aren't
> sufficiently sensitive to the spinlock properties.  (Which I guess is good
> news, really.)

It occurred to me that if we don't insist on a semi-realistic test case,
it's not that hard to just pound on a spinlock and see what happens.
I made up a simple C function (attached) to repeatedly call
XLogGetLastRemovedSegno, which is basically just a spinlock
acquire/release.  Using this as a "transaction":

$ cat bench.sql
select drive_spinlocks(50000);

I get this with HEAD:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 127597
latency average = 0.470 ms
tps = 2126.479699 (including connections establishing)
tps = 2126.595015 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 108979
latency average = 1.101 ms
tps = 1816.051930 (including connections establishing)
tps = 1816.150556 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 42862
latency average = 5.601 ms
tps = 714.202152 (including connections establishing)
tps = 714.237301 (excluding connections establishing)

(With only 4 high-performance cores, it's probably not interesting
to go further; involving the slower cores will just confuse matters.)

And this with the patch:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 130455
latency average = 0.460 ms
tps = 2174.098284 (including connections establishing)
tps = 2174.217097 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 51533
latency average = 2.329 ms
tps = 858.765176 (including connections establishing)
tps = 858.811132 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 31154
latency average = 7.705 ms
tps = 519.116788 (including connections establishing)
tps = 519.144375 (excluding connections establishing)

So at least on Apple's hardware, it seems like the CAS
implementation might be a shade faster when uncontended,
but it's very clearly worse when there is contention for
the spinlock.  That's interesting, because the argument
that CAS should involve strictly less work seems valid ...
but that's what I'm getting.

It might be useful to try this on other ARM platforms,
but I lack the energy right now (plus the only other
thing I've got is a Raspberry Pi, which might not be
something we particularly care about performance-wise).

			regards, tom lane

/*

create function drive_spinlocks(count int) returns void
strict volatile language c as '.../spinlocktest.so';

 */

#include "postgres.h"

#include "access/xlog.h"
#include "fmgr.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

/*
 * drive_spinlocks(count int) returns void
 */
PG_FUNCTION_INFO_V1(drive_spinlocks);
Datum
drive_spinlocks(PG_FUNCTION_ARGS)
{
	int32		count = PG_GETARG_INT32(0);

	while (count-- > 0)
	{
		XLogGetLastRemovedSegno();

		CHECK_FOR_INTERRUPTS();
	}

	PG_RETURN_VOID();
}
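
For readers who want to see the two acquisition styles side by side, here is a minimal standalone sketch in C11 atomics. It is not the code from the patch or from PostgreSQL's s_lock.h, and it assumes the comparison is between a swap-style test-and-set and a compare-and-swap, which the message above does not spell out. All of the demo_* names are hypothetical. Both variants follow the convention that a zero result means the lock was obtained.

```c
/* Hypothetical illustration only; not PostgreSQL's spinlock code. */
#include <assert.h>
#include <stdatomic.h>

typedef atomic_int demo_slock_t;	/* 0 = free, nonzero = held */

/*
 * Swap-style test-and-set: unconditionally store 1 and look at the old
 * value.  Returns 0 if we obtained the lock, nonzero if it was already held.
 */
static int
demo_tas_swap(demo_slock_t *lock)
{
	return atomic_exchange_explicit(lock, 1, memory_order_acquire);
}

/*
 * CAS-style test-and-set: store 1 only if the current value is 0.
 * Returns 0 if we obtained the lock, nonzero if it was already held.
 */
static int
demo_tas_cas(demo_slock_t *lock)
{
	int			expected = 0;

	return !atomic_compare_exchange_strong_explicit(lock, &expected, 1,
													memory_order_acquire,
													memory_order_relaxed);
}

/* Release the lock; the caller must currently hold it. */
static void
demo_unlock(demo_slock_t *lock)
{
	assert(atomic_load_explicit(lock, memory_order_relaxed) != 0);
	atomic_store_explicit(lock, 0, memory_order_release);
}

/*
 * Acquire and release the lock "count" times using the given TAS variant,
 * loosely mirroring the loop in drive_spinlocks().  Returns the number of
 * failed acquisition attempts (always 0 when nothing else holds the lock).
 */
static long
demo_drive(demo_slock_t *lock, int (*tas) (demo_slock_t *), int count)
{
	long		failed = 0;

	while (count-- > 0)
	{
		while (tas(lock) != 0)
			failed++;			/* busy-wait; no backoff in this sketch */
		demo_unlock(lock);
	}
	return failed;
}
```

Calling demo_drive(&lock, demo_tas_swap, n) and demo_drive(&lock, demo_tas_cas, n) from several threads at once would be one way to reproduce the contended case outside the server, though the sketch has no backoff or delay logic and so would not necessarily behave like the numbers reported above.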