Re: [HACKERS] spinlock contention

2011-07-21 Thread Florian Pflug
On Jul18, 2011, at 04:36 , Robert Haas wrote:
 On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug f...@phlo.org wrote:
 I don't want to fiddle with your git repo, but if you attach a patch
 that applies to the master branch I'll give it a spin if I have time.
 
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 [ Back from vacation, catching up on email. ]
 
 gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)
 
 pgbench -n -S -T 180 -c 32 -j 32
 
 with lwlock-part patch:
 tps = 36974.644091 (including connections establishing)
 
 unpatched cd34647c666be867f95ef8fc0492c30356043f10:
 tps = 39432.064293 (including connections establishing)
 
 And with -c 8 -j 8:
 
 tps = 26946.202428 (including connections establishing)
 tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least.

I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff
made a huge difference. There's also supposed to be hardly any invalidation
going on during a pgbench -S run. So, since the patch removes two of the
three spin-lock acquisitions from SIGetDataEntries() (so long as there are
no exclusive lockers of SInvalReadLock), there should be some effect.
Or so I'd think at least...

If anyone has a theory, I'd love to hear it.

best regards,
Florian Pflug




Re: [HACKERS] spinlock contention

2011-07-17 Thread Robert Haas
On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug f...@phlo.org wrote:
 I don't want to fiddle with your git repo, but if you attach a patch
 that applies to the master branch I'll give it a spin if I have time.

 Patch attached.

 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.

[ Back from vacation, catching up on email. ]

gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

pgbench -n -S -T 180 -c 32 -j 32

with lwlock-part patch:
tps = 36974.644091 (including connections establishing)

unpatched cd34647c666be867f95ef8fc0492c30356043f10:
tps = 39432.064293 (including connections establishing)

And with -c 8 -j 8:

tps = 26946.202428 (including connections establishing)
tps = 27206.507424 (including connections establishing)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-07-13 Thread Robert Haas
On Jul 12, 2011, at 8:10 PM, Florian Pflug f...@phlo.org wrote:
 On Jul13, 2011, at 00:10 , Robert Haas wrote:
 On Jul 12, 2011, at 8:03 AM, Florian Pflug f...@phlo.org wrote:
 The algorithm is quite straight forward, if one assumes a lock-free
 implementation of a queue (More on that below)
 
 This is similar to the CAS-based LWLocks I played around with, though
 I didn't use a lock-free queue.  I think I probably need to devote some
 time to figuring out why that didn't work out to a win...
 
 Yeah, the non-waitqueue-related parts are mostly identical. The only
 significant difference there is that the waiters-present bit is replaced
 by fence-and-recheck.
 
 What motivated me to research this was your theory that the problem is
 not so much spin-lock contention, but rather those unlucky processes who
 lose their time-slice while holding the lock.
 
 If that is true, then maybe the problem with your CAS patch is that
 once the LWLocks is contested heavily, the waiters-present bit will be
 set pretty much all the time, and the code will thus fall back to
 using the spin-lock.
 
 *Reading the code again*
 
 Hm, I just realized that you only clear the waiters-present bit after
 emptying the queue completely. With rising contention, you might reach a
 point where you never empty the queue completely (unless the lock is
 only ever acquired in share mode, that is). As it stands, the CAS patch
 is effectively neutralized at this point. It'll even have an adverse
 effect due to the higher number of atomic operations compared to the
 unpatched version.

Hmm, yeah.

 I wonder if clearing the waiters-present bit only upon clearing the
 queue completely is necessary for correctness. Wouldn't it be OK
 to clear the bit after waking up at least one locker? That'd still
 guarantee that you cannot end up with a blocked process but no lock
 holder, I believe.

I don't think so - how does the next process that releases the lock know to 
release waiters?

...Robert


Re: [HACKERS] spinlock contention

2011-07-13 Thread Florian Pflug
On Jul13, 2011, at 22:04 , Robert Haas wrote:
 On Jul 12, 2011, at 8:10 PM, Florian Pflug f...@phlo.org wrote:
 I wonder if clearing the waiters-present bit only upon clearing the
 queue completely is necessary for correctness. Wouldn't it be OK
 to clear the bit after waking up at least one locker? That'd still
 guarantee that you cannot end up with a blocked process but no lock
 holder, I believe.
 
 I don't think so - how does the next process that releases the lock
 know to release waiters?

It won't :-(. Not that easily, at least.

The idea stemmed from the fact that my shared counter partitioning
patch gets away with something similar. But that only works because it
always re-checks for possible interference after doing the fast path,
so the idea isn't directly applicable to your patch it seems.

Porting that idea to your CAS patch would probably make it nearly
identical to the shared-counter partitioning patch with the number of
partitions set to 1, so I see little point in that.

BTW, I think I got a workable atomic queue that suits our needs sketched
up - I'll post the details once I've convinced myself that it's actually
correct. So should you believe that approach to be unworkable for some
reason, please speak up and save me the effort.

best regards,
Florian Pflug




Re: [HACKERS] spinlock contention

2011-07-12 Thread Florian Pflug
On Jul7, 2011, at 03:35 , Robert Haas wrote:
 Some poking around suggests that the problem isn't that
 spinlocks are routinely contended - it seems that we nearly always get
 the spinlock right off the bat.  I'm wondering if the problem may be
 not so much that we have continuous spinlock contention, but rather
 that every once in a while a process gets time-sliced out while it
 holds a spinlock.  If it's an important spinlock (like the one
 protecting SInvalReadLock), the system will quickly evolve into a
 state where every single processor is doing nothing but trying to get
 that spinlock.  Even after the hapless lock-holder gets to run again
 and lets go of the lock, you have a whole pile of other backends who
 are sitting there firing off lock xchgb in a tight loop, and they can
 only get it one at a time, so you have ferocious cache line contention
 until the backlog clears.

Pondering this some more, I came up with another idea, pretty much
orthogonal to the shared counter partitioning I posted a patch for.

If indeed that problem isn't spin lock contention, but rather losing
control of the CPU while holding a spin lock, then why not try to get
rid of the spin lock entirely? On Linux, that's easy - this is exactly
what futexes are about. But it occurred to me that kernel support isn't
actually needed to do that - futexes can effectively be emulated in
user-space using just a semaphore, and without doing a syscall except
when there's contention.

The algorithm is quite straightforward, if one assumes a lock-free
implementation of a queue (more on that below):

  LockAcquire:
(1) CAS the lock state to increment the reader count or set
the exclusive bit in a loop while the lock looks free.
If successful, we're done, otherwise we continue with (2)
(2) Add ourself to the wait queue
[ Since we've added ourself to the queue, we *must* now
  decrement the semaphore no matter what, to keep the
  increment/decrement calls balanced. We're careful to
  maintain that invariant below. ]
(3) Fence (AKA issue full memory barrier)
(4) Re-check if the lock still looks taken. If it does,
we decrement the semaphore (PGSemaphoreLock), and
(upon wake-up) restart at (1).
Otherwise, continue with (5)
(5) Remove the first waiter from the queue and increment
her semaphore. Rinse-and-repeat until we either removed
ourself from the queue or the queue is empty.
(6) Decrement our semaphore.
[ (6) is necessary in the general case, see the remark
  below (2). But we can of course detect the case where
  we'd increment our own semaphore in (5) only to
  decrement it again in (6), and skip both operations ]

  LockRelease:
(1) Set the lock state to 0, i.e. release the lock.
(2) Fence (AKA issue full memory barrier)
(3) If the lock still looks free, remove the first waiter from
the queue and increment her semaphore. Rinse-and-repeat
while the lock looks free and the queue is non-empty.
[ From a correctness POV, we only have to wake up one
  waiter here, and that only if there isn't one who has been
  woken up but hasn't yet retried to take the lock. In reality,
  the decision whether and how many waiters to wake up would depend
  on their desired lock level, and some variant of what we
  currently call releaseOK. ]

The fencing steps (LockAcquire:3) and (LockRelease:2) guarantee
that if we block in LockAcquire() a lock holder will see our
queue entry and thus will wake us up eventually.

Because we use a semaphore and not, say, a simple signal, we don't
have to worry about the precise ordering of block and unblock
operations - we just need to ensure they're balanced.
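
To make the steps above concrete, here's a minimal C sketch of the
shared-mode acquire path. All helper names are placeholders made up for
illustration (semaphore_down/semaphore_up stand in for PGSemaphoreLock/
PGSemaphoreUnlock, queue_push/queue_pop for the lock-free queue discussed
below); a real implementation would differ in many details.

#include <stdbool.h>
#include <stdint.h>

typedef struct Waiter Waiter;                  /* one wait-queue entry per backend */

/* Hypothetical primitives, not an existing API */
extern bool    atomic_cas_u32(volatile uint32_t *p, uint32_t expected, uint32_t newval);
extern void    memory_fence(void);             /* full memory barrier */
extern void    queue_push(Waiter *w);          /* lock-free enqueue */
extern Waiter *queue_pop(void);                /* lock-free dequeue, NULL if empty */
extern void    semaphore_down(Waiter *w);      /* ~ PGSemaphoreLock */
extern void    semaphore_up(Waiter *w);        /* ~ PGSemaphoreUnlock */

#define EXCLUSIVE_BIT 0x80000000u              /* high bit: exclusive holder */

static void
lock_acquire_shared(volatile uint32_t *state, Waiter *me)
{
    for (;;)
    {
        uint32_t cur = *state;

        /* (1) fast path: bump the reader count while no exclusive holder */
        while ((cur & EXCLUSIVE_BIT) == 0)
        {
            if (atomic_cas_u32(state, cur, cur + 1))
                return;                        /* got the lock */
            cur = *state;
        }

        queue_push(me);                        /* (2) enqueue ourselves */
        memory_fence();                        /* (3) entry visible before re-check */

        if ((*state & EXCLUSIVE_BIT) != 0)     /* (4) still taken: sleep, then retry */
        {
            semaphore_down(me);
            continue;
        }

        /* (5)+(6): lock became free after all - drain the queue */
        for (;;)
        {
            Waiter *w = queue_pop();

            if (w == me)
                break;                         /* our own entry: skip (5) and (6) */
            if (w == NULL)
            {
                semaphore_down(me);            /* (6) someone else woke us; balance it */
                break;
            }
            semaphore_up(w);                   /* (5) wake another waiter */
        }
        /* in either case, retry the fast path from (1) */
    }
}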

Now to the enqueue() and dequeue() primitives that the algorithm
above depends on. There are multiple published algorithms
for lock-free queues. Some googling turned up

  An optimistic approach to lock-free FIFO queues,
   E. Ladan-Mozes, N. Shavit,
  DOI 10.1007/s00446-007-0050-0

and
  
  CAS-Based Lock-Free Algorithm for Shared Deques,
  M.M. Michael

The second one looks promising, since it only requires a single
CAS to enqueue and dequeue entries in the common case. Thus, it
should be just as efficient as our current spin-lock-based queue
in the uncontended case (and much better in the contested case, of
course).

[ Our case is, in fact, simpler than the generic setting that these
algorithms support. We only ever enqueue our *own* proc array entry
(so allocation and re-use of queue entries isn't much of an issue),
and always *process* the queue after enqueuing something - either
directly in LockAcquire or later in LockRelease. We thus don't really
need to support concurrent removal of queue entries, but might get
away with simply skipping queue processing if we detect that somebody
else is in the process of doing so. I think I have an idea for a
simpler lock-less queue that fits our needs, but I haven't yet
ironed 

Re: [HACKERS] spinlock contention

2011-07-12 Thread Robert Haas
On Jul 12, 2011, at 8:03 AM, Florian Pflug f...@phlo.org wrote:
 The algorithm is quite straight forward, if one assumes a lock-free
 implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though I didn't 
use a lock-free queue.  I think I probably need to devote some time to figuring 
out why that didn't work out to a win...

...Robert


Re: [HACKERS] spinlock contention

2011-07-12 Thread Florian Pflug
On Jul13, 2011, at 00:10 , Robert Haas wrote:
 On Jul 12, 2011, at 8:03 AM, Florian Pflug f...@phlo.org wrote:
 The algorithm is quite straight forward, if one assumes a lock-free
 implementation of a queue (More on that below)
 
 This is similar to the CAS-based LWLocks I played around with, though
 I didn't use a lock-free queue.  I think I probably need to devote some
 time to figuring out why that didn't work out to a win...

Yeah, the non-waitqueue-related parts are mostly identical. The only
significant difference there is that the waiters-present bit is replaced
by fence-and-recheck.

What motivated me to research this was your theory that the problem is
not so much spin-lock contention, but rather those unlucky processes who
lose their time-slice while holding the lock.

If that is true, then maybe the problem with your CAS patch is that
once an LWLock is contested heavily, the waiters-present bit will be
set pretty much all the time, and the code will thus fall back to
using the spin-lock.

*Reading the code again*

Hm, I just realized that you only clear the waiters-present bit after
emptying the queue completely. With rising contention, you might reach a
point where you never empty the queue completely (unless the lock is
only ever acquired in share mode, that is). As it stands, the CAS patch
is effectively neutralized at this point. It'll even have an adverse
effect due to the higher number of atomic operations compared to the
unpatched version.

I wonder if clearing the waiters-present bit only upon clearing the
queue completely is necessary for correctness. Wouldn't it be OK
to clear the bit after waking up at least one locker? That'd still
guarantee that you cannot end up with a blocked process but no lock
holder, I believe.

best regards
Florian Pflug




Re: [HACKERS] spinlock contention

2011-07-08 Thread Florian Pflug
On Jul7, 2011, at 18:09 , Robert Haas wrote:
 On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug f...@phlo.org wrote:
 In effect, the resulting thing is an LWLock with a partitioned shared
 counter. The partition one backend operates on for shared locks is
 determined by its backend id.
 
 I've added the implementation to the lock benchmarking tool at
  https://github.com/fgp/lockbench
 and also pushed a patched version of postgres to
  https://github.com/fgp/postgres/tree/lwlock_part
 
 The number of shared counter partitions is current 4, but can easily
 be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
 intrinsic if available, otherwise it falls back to using a per-partition
 spin lock.
 
 I think this is probably a good trade-off for locks that are most
 frequently taken in shared mode (like SInvalReadLock), but it seems
 like it could be a very bad trade-off for locks that are frequently
 taken in both shared and exclusive mode (e.g. ProcArrayLock,
 BufMappingLocks).

That's definitely a concern. One idea I had to alleviate that is to
add a per-partition used flag to the LWLock struct, and set that
to true (if not already set) before incrementing a partition's
shared counter. Exclusive lockers would then only need to inspect those
partitions for which the flag is set, and would clear all flags after
having acquired the lock.

I actually tried to do that initially, but figuring out where and how
to safely clear the flag again turned out to be a bit hairy. So I decided
to keep it simple first, and wait to see whether that problem manifests
itself in practice.

 I don't want to fiddle with your git repo, but if you attach a patch
 that applies to the master branch I'll give it a spin if I have time.

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of locked xadd to increment the shared counters.

best regards,
Florian Pflug


pg_lwlock_part.patch
Description: Binary data



Re: [HACKERS] spinlock contention

2011-07-08 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 Patch attached.

 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.

That's already sufficient reason to reject the patch.  Not everyone
uses gcc, let alone very recent versions of gcc.

regards, tom lane



Re: [HACKERS] spinlock contention

2011-07-08 Thread Florian Pflug
On Jul8, 2011, at 16:21 , Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 That's already sufficient reason to reject the patch.  Not everyone
 uses gcc, let alone very recent versions of gcc.

This is a WIP version meant for testing, not a finished patch!

Spending time on making this work on every conceivable compiler before we
even know whether or not the approach is worthwhile at all seems ludicrous
to me.

A finished version would use inline assembly to avoid the GCC version
dependency, and would support as many additional compilers as there are
people with access to these compilers who offer to help...

But yeah, that will very probably leave some compilers unsupported
(in the sense of falling back to a spin lock per partition, which, if the
patch proves worthwhile at all, probably still provides a benefit over the
current code).

If that is reason enough to reject the patch, i.e. if the policy is we
don't want it for any if we cannot have it for all, then consider it
withdrawn.

best regards,
Florian Pflug




Re: [HACKERS] spinlock contention

2011-07-08 Thread Stefan Kaltenbrunner
On 07/08/2011 04:21 PM, Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 That's already sufficient reason to reject the patch.  Not everyone
 uses gcc, let alone very recent versions of gcc.

hmm - 4.1.0 was released in February 2006, which will be much older than
even the 5-year support policy we have on pg when 9.2 is released;
not sure how much it will matter if we don't support a specific
optimization on a gcc that old...


Stefan




Re: [HACKERS] spinlock contention

2011-07-08 Thread Florian Pflug
On Jul8, 2011, at 22:27 , Stefan Kaltenbrunner wrote:
 On 07/08/2011 04:21 PM, Tom Lane wrote:
 Florian Pflug f...@phlo.org writes:
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 That's already sufficient reason to reject the patch.  Not everyone
 uses gcc, let alone very recent versions of gcc.
 
 hmm - 4.1.0 was released in february 2006, which will be much older than
 even the 5 year support policy we have on pg when 9.2 will be released,
 not sure how much it will matter if we don't support as specific
 optimization on a gcc that old..

Still, it's not really hard to support older versions, at least on
x86 and x86-64. All it takes is some inline assembly. I just don't
want to put effort into this until we know whether or not the whole
approach is worthwhile.

Should someone want to test this patch, but can't because of the GCC
version restriction, please speak up.

best regards,
Florian Pflug




Re: [HACKERS] spinlock contention

2011-07-07 Thread Florian Pflug
On Jul7, 2011, at 03:35 , Robert Haas wrote:
 On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
 
 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.
 
 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...
 
 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...
 
 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.
 
 I also tried inserting an
 acquire-and-release cycle on MyProc->backendLock, which was just as
 good.  That seems like a pretty encouraging result, since there appear
 to be several ways of reimplementing SIGetDataEntries() that would
 work with a per-backend lock rather than a global one.

I did some research and found "TLRW: Return of the Read-Write Lock"
by D. Dice and N. Shavit. The PDF can be found here:
  http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called bytelock with a set of slots
for shared lockers, where each shared locker only needs to increment its
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared
locker must issue a fencing instruction and then verify that no writer
has taken the lock in the meantime. In their application, they
distinguish between slotted threads - those which own a designated
slot - and unslotted threads - those which operate on the normal
shared counter.

I implemented that idea, but with a few modifications. First, I don't
distinguish between slotted and unslotted threads, but allow
multiple backends to share a slot. This means my implementation cannot
use a simple unlocked increment to update the slot, but instead uses
locked xadd. I also moved the slots out of the lock itself, and into
a separate set of LWLockPart structures. Each such structure contains
the counters for one slot id and multiple locks.

In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.

I've added the implementation to the lock benchmarking tool at
  https://github.com/fgp/lockbench
and also pushed a patched version of postgres to
  https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is currently 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch-and-add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.
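
For illustration only, here's a sketch of what the shared-counter fast
path and the GCC-version fallback could look like; the struct layout and
names below are mine, not necessarily those of the lwlock_part branch:

#include <stdint.h>

#define LWLOCK_PART_COUNT 4                        /* number of shared-counter partitions */

extern void spin_lock(volatile int *mutex);        /* hypothetical spinlock primitives */
extern void spin_unlock(volatile int *mutex);

typedef struct
{
    volatile uint32_t shared[LWLOCK_PART_COUNT];   /* per-partition shared counts */
    volatile int      mutex[LWLOCK_PART_COUNT];    /* fallback per-partition spinlocks */
} PartitionedSharedCount;

/* Record one more shared holder, touching only the caller's partition. */
static void
shared_count_inc(PartitionedSharedCount *c, int my_backend_id)
{
    int part = my_backend_id % LWLOCK_PART_COUNT;

#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 401
    /* GCC 4.1+: compiles to a single LOCK XADD on x86/x86-64 */
    __sync_fetch_and_add(&c->shared[part], 1);
#else
    /* Older compilers: short critical section under a per-partition spin lock */
    spin_lock(&c->mutex[part]);
    c->shared[part]++;
    spin_unlock(&c->mutex[part]);
#endif
}

/* An exclusive locker still has to inspect (and wait out) all partitions. */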

Benchmarking results look promising so far. This is the throughput
I get, in acquisition/release cycles per second, for the LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the
partitioned lock with N lock partitions and using LOCKED XADD if
X == yes. The numbers are normalized to the throughput in the
single-worker case.

-- Cycles / Second (Wall-Time) -----------------------
1      2      4      8      16     worker
wall   wall   wall   wall   wall   time
1      1.9    3.9    3.9    3.9    none
1      0.2    0.9    0.8    0.6    pg_lwlock_orig
1      0.4    0.3    0.3    0.3    pg_lwlock_cas
1      0.3    1.0    0.8    0.6    pg_lwlock_part,1,no
1      0.4    0.4    0.4    0.4    pg_lwlock_part,1,yes
1      2.0    0.4    0.4    0.3    pg_lwlock_part,2,no
1      2.0    0.7    0.8    0.7    pg_lwlock_part,2,yes
1      2.0    4.1    2.1    1.0    pg_lwlock_part,4,no
1      2.1    3.8    1.8    2.2    pg_lwlock_part,4,yes

These numbers show that as long as the number of workers does not
exceed the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but that's to be expected -
the test executes LWLockAcquire/Release in a tight loop, so once there's
any kind of contention (even cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned
lock with fewer than 4 partitions here - effectively, at least half of
the workers are sleeping at any given time, as the following
user vs. wall time numbers show.

-- User-Time / Wall-Time 
1   2   4   8   16  worker
1.0 1.9 3.9 3.9 3.9 none
1.0 1.9 1.1 1.3 1.8 

Re: [HACKERS] spinlock contention

2011-07-07 Thread Robert Haas
On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug f...@phlo.org wrote:
 In effect, the resulting thing is an LWLock with a partitioned shared
 counter. The partition one backend operates on for shared locks is
 determined by its backend id.

 I've added the implementation to the lock benchmarking tool at
  https://github.com/fgp/lockbench
 and also pushed a patched version of postgres to
  https://github.com/fgp/postgres/tree/lwlock_part

 The number of shared counter partitions is current 4, but can easily
 be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
 intrinsic if available, otherwise it falls back to using a per-partition
 spin lock.

I think this is probably a good trade-off for locks that are most
frequently taken in shared mode (like SInvalReadLock), but it seems
like it could be a very bad trade-off for locks that are frequently
taken in both shared and exclusive mode (e.g. ProcArrayLock,
BufMappingLocks).

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-07-06 Thread Robert Haas
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).

 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.

 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...

 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...

 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.

I did some benchmarking, on the 32-core system from Nate Boley, with
pgbench -n -S -c 80 -j 80.   With master+fastlock+lazyvxid, I can hit
180-200k TPS in the 32-client range.  But at 80 clients throughput
starts to fall off quite a bit, dropping down to about 80k TPS.  So
then, just for giggles, I inserted a return; statement at the top of
AcceptInvalidationMessages(), turning it into a noop.  Performance at
80 clients shot up to 210k TPS.  I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good.  That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.

I did some other experiments, too.  In the hopes of finding a general
way to reduce spinlock contention, I implemented a set of elimination
buffers, where backends that can't get a spinlock go and try to
combine their requests and then send a designated representative to
acquire the spinlock and acquire shared locks simultaneously for all
group members.  It took me a while to debug the code, and I still
can't get it to do anything other than reduce performance, which may
mean that I haven't found all the bugs yet, but I'm not feeling
encouraged.  Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat.  I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
that every once in a while a process gets time-sliced out while it
holds a spinlock.  If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock.  Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing off lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears.  Then things are OK again for a bit until
the same thing happens to some other backend.  This is just a theory,
I might be totally wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-28 Thread Florian Pflug
On Jun23, 2011, at 23:40 , Robert Haas wrote:
 I tried rewriting the LWLocks using CAS.  It actually seems to make
 things slightly worse on the tests I've done so far, perhaps because I
 didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
 be better, but I'm not holding my breath.  Everything I'm seeing so
 far leads me to the belief that we need to get rid of the contention
 altogether, not just contend more quickly.

I got curious and put a tool together to benchmark this. The code can
be found at https://github.com/fgp/lockbench.
  
The tool starts N workers who each acquire a lock in SHARE mode, increment
a per-worker counter and release the lock again, rinse, repeat. That is
done for T seconds, after which it reports the sum of the individual
counters, the elapsed (wall) time and the sums of the user and system times
consumed by the workers. The lock implementations currently supported are

atomicinc:  RW-Lock using atomic increment/decrement instructions
(locked xaddl) to acquire and release the lock in SHARE
mode in the non-contested case. The contested case is
unimplemented.

cmpxchng:   Like atomicinc, but uses locked cmpxchng loop instead of
locked xaddl.

spin:   RW-Lock built around a simple cmpxchng-based spinlock (i.e.,
to lock/unlock spinlock is taken, shared-lockers count is
incremented/ decremented, spinlock is released). Similar to
pg_lwlock, but without the sleep() in the contested path of
the spinlock. The contested case of the RW-Lock is again
unimplemented.

none:   No locking.

pg_lwlock:  Postgres LWLock implementation. The contested case doesn't
work because the workers currently don't have a valid MyProc.

pg_lwlock_cas:  Like pg_lwlock but with your CAS patch applied.

On a 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get
the following results:

              |    1 worker     |   2 workers     |   3 workers     |   4 workers     |
              | wallsec usersec | wallsec usersec | wallsec usersec | wallsec usersec |
              | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  |
---------------------------------------------------------------------------------------
none          | 2.5e-09 2.5e-09 | 1.3e-09 2.5e-09 | 8.5e-10 2.5e-09 | 6.8e-10 2.5e-09 |
atomicinc     | 1.8e-08 1.8e-08 | 2.8e-08 5.6e-08 | 2.7e-08 8.1e-08 | 2.9e-08 1.1e-07 |
cmpxchng      | 3.0e-08 3.0e-08 | 7.1e-08 1.4e-07 | 6.9e-08 2.0e-07 | 7.2e-08 2.9e-07 |
spin          | 5.0e-08 5.0e-08 | 3.0e-07 5.9e-07 | 3.8e-07 1.1e-06 | 4.0e-07 1.5e-06 |
pg_lwlock     | 2.8e-08 2.8e-08 | 2.9e-08 3.0e-08 | 3.0e-08 3.2e-08 | 2.9e-08 3.1e-08 |
pg_lwlock_cas | 2.6e-08 2.6e-08 | 6.2e-08 1.2e-07 | 7.7e-08 2.3e-07 | 7.8e-08 3.1e-07 |
---------------------------------------------------------------------------------------

These results allow the following conclusions to be drawn

First, our current pg_lwlock is quite good at preserving resources, i.e.
at not spinning excessively - at least for up to four workers. It's the only
lock implementation that consistently uses about 30ns of processor time for
one acquire/release cycle. Only atomicinc with one worker (i.e. no cacheline
fighting) manages to beat that. It does, however, effectively serialize
execution with this workload - the consumed user time is approximately equal
to the elapsed wall clock time, even in the case of 4 workers, meaning that
most of the time 3 of those 4 workers are sleeping.

Secondly, pg_lwlock_cas and cmpxchng perform similarly, which comes as no
surprise, since they use effectively the same algorithm. One could expect spin
and pg_lwlock to also perform similarly, but these two don't. The reason
is probably that spin uses a rather brain-dead spinning method, while
pg_lwlock uses rep; nop.
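
For reference, a hedged sketch of the two styles of busy-waiting being
compared (the function names are made up; "rep; nop" is the x86 PAUSE hint):

static void
spin_wait_naive(volatile int *taken)
{
    /* hammers the contested cache line as fast as possible */
    while (*taken)
        ;
}

static void
spin_wait_pause(volatile int *taken)
{
    /* PAUSE tells the core it is busy-waiting, which reduces power use
     * and pipeline/bus pressure between re-checks of the flag */
    while (*taken)
        __asm__ __volatile__("rep; nop" ::: "memory");
}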

Now to atomicinc, which is (the fast-path of) the most optimized RW-lock
possible, at least on x86 and x86_64. One interesting result here is that
wall-time seconds / cycle are suspiciously similar for atomicinc and pg_lwlock.
This proves, I think, that the limiting factor in both of these tests is
the speed at which cache line invalidation can occur. It doesn't seem to
matter whether the other cores are idle (as in the pg_lwlock case), or
trying to obtain ownership of the cache line themselves (as in the
atomicinc case). 

Finally, the no-locking test (none) shows that, even though the counter
is written to shared memory after every increment, memory bandwidth isn't
an issue here, since performance scaled nearly linearly with the number
of workers.

Here's the result of another run, this time with the per-worker counter
being incremented 100 times in each acquire/release cycle.

--- 100 Increments per Cycle -
 |   1 worker|   2 workers   |   3 workers   |   4 workers   |

Re: [HACKERS] spinlock contention

2011-06-28 Thread Robert Haas
On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug f...@phlo.org wrote:
 [ testing of various spinlock implementations ]

I set T=30 and N=1 2 4 8 16 32 and tried this out on a 32-core
loaner from Nate Boley:

100 counter increments per cycle
worker          1               2               4               8               16              32
time            wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none            2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc       3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng        3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin            4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock       3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas   3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

I wrote a little script to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:

wall-1:  none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
user-1:  none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
wall-2:  none(1.0),atomicinc(1.7),pg_lwlock(1.8),spin(1.9),pg_lwlock_cas(1.9),cmpxchng(2.3)
user-2:  none(1.0),atomicinc(1.7),pg_lwlock(1.8),pg_lwlock_cas(1.9),spin(1.9),cmpxchng(2.3)
wall-4:  none(1.0),atomicinc(1.7),pg_lwlock_cas(1.7),pg_lwlock(1.9),spin(2.0),cmpxchng(4.0)
user-4:  none(1.0),atomicinc(1.7),pg_lwlock_cas(1.8),pg_lwlock(1.9),spin(2.0),cmpxchng(4.1)
wall-8:  none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(6.9),pg_lwlock(9.3),spin(28.6)
user-8:  none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(7.0),pg_lwlock(9.4),spin(28.5)
wall-16: none(1.0),atomicinc(7.1),pg_lwlock_cas(9.0),cmpxchng(20.0),pg_lwlock(76.2),spin(181.0)
user-16: none(1.0),atomicinc(7.0),pg_lwlock_cas(9.1),cmpxchng(20.0),pg_lwlock(75.8),spin(184.8)
wall-32: none(1.0),atomicinc(13.6),pg_lwlock_cas(21.8),cmpxchng(40.9),pg_lwlock(581.8),spin(1272.7)
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

Here's the result of the same calculation applied to your second set of data:

wall-1:  none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
user-1:  none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
wall-2:  none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),spin(1.2),pg_lwlock_cas(1.2)
user-2:  none(1.0),cmpxchng(1.0),pg_lwlock(1.0),atomicinc(1.0),spin(1.2),pg_lwlock_cas(1.2)
wall-3:  none(1.0),pg_lwlock(1.0),atomicinc(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
user-3:  none(1.0),atomicinc(1.0),pg_lwlock(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
wall-4:  none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.2)
user-4:  none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.4)

There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention.  I'm not even sure
that you can see this problem on a real workload on a 16-core system.
But as you add cores the problem gets rapidly worse, because the more
cores you have, the more likely it is that someone else is already
holding the spinlock (or has just very recently held the spinlock)
that you want.  And, of course, the problem multiplies itself, because
once your lock acquisitions start taking longer, they become a larger
percentage of your execution time, which makes it proportionately more
likely that you'll find someone already there when you go to grab the
lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-28 Thread Merlin Moncure
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 user-32: 
 none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the CAS-based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contention?

merlin



Re: [HACKERS] spinlock contention

2011-06-28 Thread Robert Haas
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 user-32: 
 none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

 I may not be following all this correctly, but doesn't this suggest a
 huge potential upside for the cas based patch you posted upthread when
 combined with your earlier patches that were bogging down on spinlock
 contentionl?

Well, you'd think so, but in fact that patch makes it slower.  Don't
ask me why, 'cuz I dunno.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-28 Thread Florian Pflug
On Jun28, 2011, at 23:48 , Robert Haas wrote:
 On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 user-32: 
 none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
 
 I may not be following all this correctly, but doesn't this suggest a
 huge potential upside for the cas based patch you posted upthread when
 combined with your earlier patches that were bogging down on spinlock
 contentionl?
 
 Well, you'd think so, but in fact that patch makes it slower.  Don't
 ask me why, 'cuz I dunno.  :-(

Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?

best regards,
Florian Pflug





Re: [HACKERS] spinlock contention

2011-06-28 Thread Robert Haas
On Tue, Jun 28, 2011 at 5:55 PM, Florian Pflug f...@phlo.org wrote:
 On Jun28, 2011, at 23:48 , Robert Haas wrote:
 On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 user-32: 
 none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

 I may not be following all this correctly, but doesn't this suggest a
 huge potential upside for the cas based patch you posted upthread when
 combined with your earlier patches that were bogging down on spinlock
 contentionl?

 Well, you'd think so, but in fact that patch makes it slower.  Don't
 ask me why, 'cuz I dunno.  :-(

 Huh? Where do you see your CAS patch being slower than unpatched LWLocks
 in these results? Or are you referring to pgbench runs you made, not
 to these artificial benchmarks?

pgbench -S

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-28 Thread Florian Pflug
On Jun28, 2011, at 22:18 , Robert Haas wrote:
 On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug f...@phlo.org wrote:
 [ testing of various spinlock implementations ]
 
 I set T=30 and N=1 2 4 8 16 32 and tried this out on a 32-core
 loaner from Nate Boley:

Cool, thanks!

 100 counter increments per cycle
 worker1   2   4   8   
 16  32  
 time  walluserwalluserwalluserwalluserwall
 userwalluser
 none  2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 
 3.3e-07 1.1e-08 3.4e-07
 atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 
 1.5e-07 2.3e-06 1.5e-07 4.9e-06
 cmpxchng  3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 
 4.2e-07 6.6e-06 4.5e-07 1.4e-05
 spin  4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 
 6.1e-05 1.4e-05 4.3e-04
 pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 
 1.6e-06 2.5e-05 6.4e-06 2.0e-04
 pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 
 1.9e-07 3.0e-06 2.4e-07 7.5e-06

Here's the same table, formatted with spaces.

worker          1               2               4               8               16              32
time            wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none            2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc       3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng        3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin            4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock       3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas   3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

And here's the throughput table calculated from your results,
i.e. the wall time per cycle for 1 worker divided by the wall time
per cycle for N workers.

workers         2     4     8     16    32
none            1.9   3.5   6.7   13    26
atomicinc       1.4   2.6   2.6   2.4   2.4
cmpxchng        1.1   1.1   1.2   0.9   0.8
spin            1.5   2.6   0.3   0.1   0.0
pg_lwlock       1.4   2.5   1.0   0.2   0.1
pg_lwlock_cas   1.3   2.6   2.6   1.9   1.5

Hm, so in the best case we get 2.6x the throughput of a single core,
and that only for 4 and 8 workers (1.4e-7 second / cycle vs 3.6e-7).
In that case, there also seems to be little difference between
pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least
sustain that throughput when the worker count is increased, while
for the others the throughput actually *decreases*.

What totally puzzles me is that your results don't show any
trace of a decreased system load for the pg_lwlock implementation,
which I'd have expected due to the sleep() in the contested
path. Here are the user vs. wall time ratios - I'd have expected
to see value significantly below the number of workers for pg_lwlock

none            1.0  2.0  4.0  7.9  16  31
atomicinc       1.0  2.0  3.9  7.9  15  33
cmpxchng        1.0  2.0  4.1  7.9  16  31
spin            1.0  2.0  3.9  7.8  16  31
pg_lwlock       1.0  2.0  4.1  7.9  16  31
pg_lwlock_cas   1.0  2.0  4.1  7.9  16  31

 I wrote a little script to show to reorganize this data in a
 possibly-easier-to-understand format - ordering each column from
 lowest to highest, and showing each algorithm as a multiple of the
 cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench
repo.

 There seems to be something a bit funky in your 3-core data, but
 overall I read this data to indicate that 4 cores aren't really enough
 to see a severe problem with spinlock contention.

Hm, it starts to show if you lower the counter increment per cycle
(the D constant in run.sh). But yeah, it's never as bad as the
32-core results above..

best regards,
Florian Pflug




Re: [HACKERS] spinlock contention

2011-06-28 Thread Robert Haas
On Tue, Jun 28, 2011 at 8:11 PM, Florian Pflug f...@phlo.org wrote:
 I wrote a little script to show to reorganize this data in a
 possibly-easier-to-understand format - ordering each column from
 lowest to highest, and showing each algorithm as a multiple of the
 cheapest value for that column:

 If you're OK with that, I'd like to add that to the lockbench
 repo.

It's a piece of crap, but you're welcome to try to fix it up.  See attached.

With respect to the apparent lack of impact of the sleep loop, I can
only speculate that this test case doesn't really mimic real-world
conditions inside the backend very well.  There is a lot more going on
there - multiple locks, lots of memory access, etc.  But I don't
really understand it either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


sortby
Description: Binary data



Re: [HACKERS] spinlock contention

2011-06-27 Thread Robert Haas
On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark st...@mit.edu wrote:
 On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote:
 ProcArrayLock looks like a tougher nut to crack - there's simply no
 way, with the system we have right now, that you can take a snapshot
 without locking the list of running processes.  I'm not sure what to
 do about that, but we're probably going to have to come up with
 something, because it seems clear that once we eliminate the lock
 manager LWLock contention, this is a major bottleneck.

 Well as Tom observed earlier the kernel of a snapshot is actually a
 LSN. A snapshot contains a set of xids which all committed before some
 LSN and none which committed after it.

 So if we had a record of what log sequence number the commit record
 for any given transaction is we could build the snapshot at our
 leisure without any exclusive lock. In fact we could even build it
 lazily as a kind of cache only when we actually are interested in a
 given xid.

Yeah, I've been thinking about that.  I think what we might do is set
up a new SLRU that works like CLOG, but each entry is an LSN rather
than just two bits.  When a transaction commits, we save the commit
LSN under the entry for that XID.  We truncate away SLRU pages that
contain no still-running XIDs.  When we need to check whether an XID
is visible to our snapshot, we just look up the commit LSN and compare
it with our snapshot LSN.  If it's before and non-zero, we can see it.
 If it's after or all-zeroes, we can't.
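
In code, the visibility test being described would boil down to something
like this (CommitLSNGet() is a hypothetical accessor for the proposed SLRU,
and the LSN type is simplified to a plain integer here):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitLsn;          /* simplified stand-in for an LSN */

/* Hypothetical SLRU lookup: returns 0 ("all-zeroes") if the XID has no
 * commit record, i.e. it is still running or it aborted. */
extern CommitLsn CommitLSNGet(TransactionId xid);

/* Visible iff the XID committed, and committed before the snapshot's LSN. */
static bool
XidVisibleToSnapshot(TransactionId xid, CommitLsn snapshot_lsn)
{
    CommitLsn commit_lsn = CommitLSNGet(xid);

    return commit_lsn != 0 && commit_lsn < snapshot_lsn;
}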

But I'm not sure how much this would really help.  It might (subject
to working out the details) make the actual process of taking a
snapshot faster.  But it's not clear that taking snapshots more
quickly will actually help anything, because the problem is not the
amount of time spending taking the snapshot.  The problem is rather
that doing so requires acquiring and releasing an LWLock, and each of
those operations requires taking and releasing a spinlock.  And it is
the spinlock contention that is killing us.   That makes me think we
need a way to take a snapshot without taking a spinlock.  Or if we
must take spinlocks, we at least have to avoid every backend that
needs a snapshot lusting after the *same* spinlock.

What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure.  When a
backend starts up, it would add an entry to the queue saying "I'm
running".  When it commits, it would add an entry to the queue saying
"I'm committed".  All entries would be added at the *end* of the
queue, so a backend scanning the queue to build up a snapshot wouldn't
ever be able to see commits out of order.  We would need some memory
barrier operations on weak-memory-ordering machines to make sure that
the queue writes became visible before the end-of-queue pointer bump.
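
A rough sketch of such an append, with the barrier placed before the
end-of-queue pointer bump; all names here are made up, and (as noted
further down) writers would still have to serialize on a lock:

#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum { ENTRY_RUNNING, ENTRY_COMMITTED } XidEntryType;

typedef struct
{
    TransactionId xid;
    XidEntryType  type;
} XidQueueEntry;

#define XID_QUEUE_SIZE 65536                   /* illustrative fixed size */

typedef struct
{
    XidQueueEntry     entries[XID_QUEUE_SIZE];
    volatile uint32_t end;                     /* readers scan slots [0, end) */
    volatile int      insert_lock;             /* writers serialize on this */
} XidQueue;

extern void spin_lock(volatile int *l);        /* hypothetical primitives */
extern void spin_unlock(volatile int *l);
extern void write_barrier(void);

static void
xid_queue_append(XidQueue *q, TransactionId xid, XidEntryType type)
{
    spin_lock(&q->insert_lock);

    uint32_t slot = q->end;                    /* wraparound handling omitted */

    q->entries[slot].xid = xid;
    q->entries[slot].type = type;

    /* the entry must be globally visible before the end pointer moves,
     * or a concurrent snapshot scan could read an uninitialized slot */
    write_barrier();
    q->end = slot + 1;

    spin_unlock(&q->insert_lock);
}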

The trick is figuring out how to clean up the queue.  Since commit
entries exist only to guard against running entries earlier in the
queue, the start-of-queue pointer can be advanced whenever it points
to a commit entry.  Also, if it points to a running entry for
which there is a later commit entry, then the start-of-queue pointer
can be advanced over that as well.  However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans
needn't worry about those entries, we *don't* know that there isn't
already a scan in flight that still needs them.  Furthermore, if a
transaction runs for a long time, we can never advance the
start-of-queue pointer past the running entry for its XID, which is
obviously problematic since the queue would get very long.

To work around that problem, I think we could use Florian's idea
upthread of an RCU system.  We keep two copies of the queue around, an
A copy and a B copy.  When the currently active copy fills up, we
rewrite it into the other queue, omitting all committed entries and
any running entries that have matching committed entries, and then
tell everyone to start looking at that copy instead.   We would need
some kind of gymnastics to make sure that we don't flip from the A
copy to the B copy and back to the A copy while some laggardly backend
is still hoping to scan the old A copy.  A simple algorithm (there's
probably a smarter one) would be to have each backend take a spinlock
while it's scanning either copy, and to have the backend that is doing
the rewrite take and release all of those spinlocks one at a time
before beginning the rewrite, thus guaranteeing that any scans still
in progress when the rewrite is requested have completed before it's
actually performed.  Any new scans started in the meanwhile will
certainly be looking at the current copy rather than the old copy
we're about to overwrite.

We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue.  But 

Re: [HACKERS] spinlock contention

2011-06-25 Thread Greg Stark
On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote:
 ProcArrayLock looks like a tougher nut to crack - there's simply no
 way, with the system we have right now, that you can take a snapshot
 without locking the list of running processes.  I'm not sure what to
 do about that, but we're probably going to have to come up with
 something, because it seems clear that once we eliminate the lock
 manager LWLock contention, this is a major bottleneck.

Well as Tom observed earlier the kernel of a snapshot is actually an
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.

So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.





-- 
greg



Re: [HACKERS] spinlock contention

2011-06-23 Thread Heikki Linnakangas

On 23.06.2011 18:42, Robert Haas wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.


ProcArrayLock is currently held for a relatively long period of time
when a snapshot is taken, especially if there are a lot of backends. There
are three operations on the procarray:


1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that 
setting an xid in shared memory is atomic. To end a transaction, you 
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire 
ProcArrayLock in shared mode, and scan the whole procarray.


I wonder if it would be a better tradeoff to keep a materialized 
snapshot in shared memory that's updated every time a new xid is 
assigned or a transaction ends. Getting a snapshot would become much 
cheaper, as you could just memcpy the ready-built snapshot from shared 
memory into local memory.
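
A minimal sketch of that idea, with a made-up layout for the shared,
ready-built snapshot; taking a snapshot then reduces to a bounded copy
under a brief shared lock:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define MAX_BACKENDS 1024                      /* illustrative bound only */

typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
    uint32_t      xcnt;                        /* number of in-progress XIDs below */
    TransactionId xip[MAX_BACKENDS];
} MaterializedSnapshot;

extern MaterializedSnapshot *sharedSnapshot;   /* updated at XID assignment and commit/abort */
extern void snapshot_lock_shared(void);        /* stand-ins for LWLockAcquire/Release */
extern void snapshot_unlock(void);

static void
get_snapshot(MaterializedSnapshot *dst)
{
    snapshot_lock_shared();                    /* only long enough for a consistent copy */
    memcpy(dst, sharedSnapshot,
           offsetof(MaterializedSnapshot, xip) +
           sharedSnapshot->xcnt * sizeof(TransactionId));
    snapshot_unlock();
}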


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 17:42 , Robert Haas wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
 
 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.
 
 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...
 
 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...
 
 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bytes, though -
so measuring this certainly isn't easy. Also, I'm absolutely no
expert on this, so it might very well be that I'm missing something
fundamental.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's a userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torvalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exclusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.
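
For illustration only, the two uncontested fast paths can be sketched with
GCC's atomic builtins. This is not Linus's code; the structure, the bias
constant, and the busy-wait slow paths (which real code would replace with
futex-style blocking) are all placeholders:

typedef struct
{
    volatile int value;     /* 0 = free, >0 = # of shared holders,
                             * <= -RW_WRITE_BIAS = exclusively locked */
} atomic_rwlock;

#define RW_WRITE_BIAS 0x40000000

static void
rwlock_acquire_shared(atomic_rwlock *l)
{
    /* fast path: a single atomic add when no writer holds the lock */
    while (__sync_add_and_fetch(&l->value, 1) < 0)
    {
        /* slow path: undo our increment and wait for the writer */
        __sync_sub_and_fetch(&l->value, 1);
        while (l->value < 0)
            ;                   /* real code would block, not spin */
    }
}

static void
rwlock_acquire_exclusive(atomic_rwlock *l)
{
    /* fast path: a single compare-and-exchange when the lock is free */
    while (!__sync_bool_compare_and_swap(&l->value, 0, -RW_WRITE_BIAS))
        ;                       /* real code would queue and block */
}

static void
rwlock_release_shared(atomic_rwlock *l)
{
    __sync_sub_and_fetch(&l->value, 1);
}

static void
rwlock_release_exclusive(atomic_rwlock *l)
{
    __sync_add_and_fetch(&l->value, RW_WRITE_BIAS);
}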

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts of it)
before updating it, and to commit the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a current pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the current one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of 
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the current structure,
and *then* updates the current pointer.
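
A rough sketch of what the reader side of that could look like. All
structure and constant names here are invented, and __sync_synchronize()
stands in for a proper barrier primitive:

#define SI_MAX_READERS 1024             /* hypothetical upper bound on backends */

typedef struct SIQueueCopy
{
    unsigned    version;
    /* ... the message ring itself would live here ... */
} SIQueueCopy;

typedef struct SISharedState
{
    SIQueueCopy       queues[2];
    volatile int      currentIdx;                       /* which copy is current */
    volatile unsigned readerVersion[SI_MAX_READERS];    /* 0 = not reading */
} SISharedState;

/* Reader: publish the version we're about to read, then verify that the
 * current pointer didn't move underneath us before we published. */
static SIQueueCopy *
si_begin_read(SISharedState *st, int myBackendId)
{
    for (;;)
    {
        SIQueueCopy *q = &st->queues[st->currentIdx];

        st->readerVersion[myBackendId] = q->version;
        __sync_synchronize();           /* publication must be visible first */
        if (q == &st->queues[st->currentIdx])
            return q;
    }
}

static void
si_end_read(SISharedState *st, int myBackendId)
{
    st->readerVersion[myBackendId] = 0; /* cleanup may now recycle old copies */
}

SICleanupQueue() would then, under its own lock, compare the non-current
copy's version against every published readerVersion and out-wait any
stragglers before recycling that copy.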

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has not yet finished writing.  I believe the
 idea here is that sinval messages should be emitted before releasing
 the related locks, so if we've successfully acquired a lock on 

Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 23.06.2011 18:42, Robert Haas wrote:
 ProcArrayLock looks like a tougher nut to crack - there's simply no
 way, with the system we have right now, that you can take a snapshot
 without locking the list of running processes.  I'm not sure what to
 do about that, but we're probably going to have to come up with
 something, because it seems clear that once we eliminate the lock
 manager LWLock contention, this is a major bottleneck.

 ProcArrayLock is currently held for a relatively long period of time when a
 snapshot is taken, especially if there are a lot of backends. There are three
 operations on the procarray:

 1. Advertising a new xid belonging to my backend.
 2. Ending a transaction.
 3. Getting a snapshot.

 Advertising a new xid is currently done without a lock, assuming that
 setting an xid in shared memory is atomic.

The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects.  But I think even if they do,
there's no real harm done.  Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotData() will ignore it anyway.

I am rather doubtful that this code is strictly safe:

myproc->subxids.xids[nxids] = xid;
myproc->subxids.nxids = nxids + 1;

In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you managed to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.
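
If we did want to make it strictly safe, forcing the store order would be
enough; a sketch, using GCC's __sync_synchronize() as a stand-in since we
have no portable barrier primitive today:

myproc->subxids.xids[nxids] = xid;
__sync_synchronize();       /* make the xid visible before the new count */
myproc->subxids.nxids = nxids + 1;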

 To end a transaction, you acquire
 ProcArrayLock in exclusive mode. To get a snapshot, you acquire
 ProcArrayLock in shared mode, and scan the whole procarray.

 I wonder if it would be a better tradeoff to keep a materialized snapshot
 in shared memory that's updated every time a new xid is assigned or a
 transaction ends. Getting a snapshot would become much cheaper, as you could
 just memcpy the ready-built snapshot from shared memory into local memory.

At least in the case of the pgbench -S workload, I doubt that would
help at all.  The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway.  The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.
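
To spell that out: even the pure shared-mode path funnels every backend
through the same spinlock. Grossly simplified (from memory, with queueing
and error handling omitted), the acquisition looks something like this:

volatile LWLock *lock = &(LWLockArray[lockid].lock);
bool    mustwait;

SpinLockAcquire(&lock->mutex);      /* every backend serializes here ... */
if (lock->exclusive == 0)
{
    lock->shared++;
    mustwait = false;
}
else
    mustwait = true;
SpinLockRelease(&lock->mutex);      /* ... and again at LWLockRelease() */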

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-23 Thread Merlin Moncure
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug f...@phlo.org wrote:
 On Jun23, 2011, at 17:42 , Robert Haas wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).

 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.

 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...

 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...

 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.

 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert at this, so it might very well be that I'm missing something
 fundamental.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

 This is the case we should optimize for, I think. Currently, there's
 contention even between two SHARE lockers of a LWLock because our
 LWLock implementation uses a spinlock underneath. I've googled around
 a bit, and found this:

 http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

 It's a userspace rwlock implementation based on atomic instructions
 and futexes (for blocking) written by Linus Torvalds as a response
 to the following thread

 http://lwn.net/Articles/284741/

 It seems to require only a single atomic increment instruction to
 acquire a share lock in the uncontested case, and a single compare-
 and-exchange instruction to acquire an exclusive lock (again in
 the uncontested case).

 The code doesn't seem to have a license attached, and seems to have
 been written over a weekend and never touched since, so we might
 not want (or be able) to use this directly. But it'd still be
 interesting to see if replacing our LWLocks with this implementation
 affects throughput.

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

 This sounds like something which could be done with RCU (Read-Copy-Update)
 in a lockless way. The idea is to copy the data structure (or parts of it)
 before updating it, and to commit the change afterwards by
 adjusting a single pointer to point to the new structure (kinda like
 we do atomic commits by an atomic update of one bit in the clog).

 The hard part is usually to figure when it's OK to clean up or
 reuse the old version of the data structure - for that, you need to
 know that all previous readers have completed.

 Here's a rough description of how that could work for the invalidation
 queue. We'd have two queue structures in shared memory, each containing
 a version number. We'd also have a current pointer pointing to the
 most recent one. Each reader would publish the version number of the
 queue he's about to read (the current one) in shared memory
 (and issue a write barrier). SICleanupQueue() would check whether the
 non-current queue structure is unused by comparing its version to the
 published versions of all backends (after issuing a read barrier, and
 after taking a lock that prevents concurrent SICleanupQueue runs of
 course). If there still are users of that version, SICleanupQueue()
 out-waits them. Afterwards, it simply copies the current structure
 over the old one, does the necessary cleanups, increments the version
 number to be larger than the one of the current structure,
 and *then* updates the current pointer.

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has not yet finished writing.  I believe the
 idea here is that sinval 

Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote:
 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert at this, so it might very well be that I'm missing something
 fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one.  Again, please feel free
to provide contrary evidence.  I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x.  (The less busy lock may get a larger reduction
but that may not help you much.)  If what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.  That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

 This is the case we should optimize for, I think. Currently, there's
 contention even between two SHARE lockers of a LWLock because our
 LWLock implementation uses a spinlock underneath. I've googled around
 a bit, and found this:

 http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

 It's a userspace rwlock implementation based on atomic instructions
 and futexes (for blocking) written by Linus Torvalds as a response
 to the following thread

 http://lwn.net/Articles/284741/

 It seems to require only a single atomic increment instruction to
 acquire a share lock in the uncontested case, and a single compare-
 and-exchange instruction to acquire an exclusive lock (again in
 the uncontested case).

 The code doesn't seem to have a license attached, and seems to have
 been written over a weekend and never touched since, so we might
 not want (or be able) to use this directly. But it'd still be
 interesting to see if replacing our LWLocks with this implementation
 affects throughput.

I tried rewriting the LWLocks using CAS.  It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
be better, but I'm not holding my breath.  Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

 This sounds like something which could be done with RCU (Read-Copy-Update)
 in a lockless way. The idea is to copy the data structure (or parts of it)
 before updating it, and to commit the change afterwards by
 adjusting a single pointer to point to the new structure (kinda like
 we do atomic commits by an atomic update of one bit in the clog).

 The hard part is usually to figure when it's OK to clean up or
 reuse the old version of the data structure - for that, you need to
 know that all previous readers have completed.

 Here's a rough description of how that could work for the invalidation
 queue. We'd have two queue structures in shared memory, each containing
 a version number. We'd also have a current pointer pointing to the
 most recent one. Each reader would publish the version number of the
 queue he's about to read (the current one) in shared memory
 (and issue a write barrier). SICleanupQueue() would check whether the
 non-current queue structure is unused by comparing its version to the
 published versions of all backends (after issuing a read barrier, and
 after taking a lock that prevents concurrent SICleanupQueue runs of
 course). If there still are users of that version, SICleanupQueue()
 out-waits them. Afterwards, it simply copies the current structure
 over the old one, does the necessary cleanups, increments the version
 number to be larger than the one of the current structure,
 and *then* updates the current pointer.

Interesting idea.

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has 

Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 22:15 , Robert Haas wrote:
 On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote:
 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert at this, so it might very well be that I'm missing something
 fundamental.
 
 Well, I'm sure there is some effect, but my experiments seem to
 indicate that it's not a very important one.  Again, please feel free
 to provide contrary evidence.  I think the basic issue is that - in
 the best possible case - padding the LWLocks so that you don't have
 two locks sharing a cache line can reduce contention on the busier
 lock by at most 2x.  (The less busy lock may get a larger reduction
 but that may not help you much.)  If what you really need is for
 the contention to decrease by 1000x, you're just not really moving the
 needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks from the others might still be
worthwhile.

 That's why the basic fast-relation-lock patch helps so much:
 it replaces a system where every lock request results in contention
 with a system where NONE of them do.
 
 I tried rewriting the LWLocks using CAS.  It actually seems to make
 things slightly worse on the tests I've done so far, perhaps because I
 didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
 be better, but I'm not holding my breath.  Everything I'm seeing so
 far leads me to the belief that we need to get rid of the contention
 altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
  while (atomic_inc_post(lock) < 0) {
atomic_dec(lock);
block(lock);
  }
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).
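
For what it's worth, that race is exactly the gap FUTEX_WAIT closes: the
kernel re-checks the lock word against the value the caller observed
atomically with enqueueing the sleeper, so a release that slips in between
the atomic_dec() and the block() just makes the wait return immediately.
A Linux-specific sketch:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep only if *lockword still equals 'observed'; the check and the
 * enqueue happen atomically in the kernel, so a concurrent release
 * cannot be lost -- the call simply returns and the caller retries. */
static void
futex_block(volatile int *lockword, int observed)
{
    syscall(SYS_futex, (int *) lockword, FUTEX_WAIT, observed,
            NULL, NULL, 0);
}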

 There is an obvious (potential) memory-ordering problem here.  If it's
 possible for a backend doing AcceptInvalidationMessages() to see a
 stale version of the counter, then it might fail to read messages from
 the queue that it really ought to have seen.  Protecting the global
 counter with a spinlock defeats the purpose of doing any of this in
 the first place, because the whole problem is too many people fighting
 over the spinlock protecting SInvalReadLock.  It should be sufficient,
 though, for the writing backend to execute a memory-barrier operation
 just after bumping the global counter and the reading backend to
 execute one just before.  As it happens, LWLockRelease() is such a
 barrier, since it must acquire and release a spinlock, so we should be
 able to just bump the counter right before releasing the LWLock and
 call it good.  On the reading side, the immediately-preceding lock
 acquisition seems like it ought to be sufficient - surely it can't be
 possible to acquire a heavyweight lock on an object without seeing
 memory updates that some other process made before releasing the same
 heavyweight lock.
 
 If we start doing lockless algorithms, I really think we should add
 explicit barrier primitives. We could start out with something
 simple such as
 
 #define MemoryBarrierWrite \
  do { slock_t l; S_UNLOCK(l); } while (0)
 #define MemoryBarrierRead \
  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); } while (0)
 #define MemoryBarrierReadWrite \
  do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); } while (0)
 
 But yeah, it seems that in the case of the sinval queue, the surrounding
 LWLocks actually provide the necessary barriers.
 
 Yes, a set of general memory barrier primitives would be handy.
 Unfortunately, it would probably be quite a bit of work to develop
 this and verify that it works properly on all supported platforms.

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.
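
Something along these lines, say. This is purely a sketch with a made-up
macro name; the fallback relies on the fact that a spinlock acquire/release
pair acts as a full barrier on every platform we support:

#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
/* GCC 4.1 and later have a full-fence builtin */
#define pg_memory_barrier()  __sync_synchronize()
#else
/* Fallback: a spinlock round-trip is a full barrier, just a slow one */
#define pg_memory_barrier() \
    do { \
        slock_t dummy; \
        S_INIT_LOCK(&dummy); \
        S_LOCK(&dummy); \
        S_UNLOCK(&dummy); \
    } while (0)
#endif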

best regards,
Florian Pflug





Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug f...@phlo.org wrote:
 Well, I'm sure there is some effect, but my experiments seem to
 indicate that it's not a very important one.  Again, please feel free
 to provide contrary evidence.  I think the basic issue is that - in
 the best possible case - padding the LWLocks so that you don't have
 two locks sharing a cache line can reduce contention on the busier
 lock by at most 2x.  (The less busy lock may get a larger reduction
 but that may not help you much.)  If what you really need is for
 the contention to decrease by 1000x, you're just not really moving the
 needle.

 Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
 the most heavily contested LWLocks from the others might still be
 worthwhile.

Hey, if we can show that it works, sign me up.

 That's why the basic fast-relation-lock patch helps so much:
 it replaces a system where every lock request results in contention
 with a system where NONE of them do.

 I tried rewriting the LWLocks using CAS.  It actually seems to make
 things slightly worse on the tests I've done so far, perhaps because I
 didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
 be better, but I'm not holding my breath.  Everything I'm seeing so
 far leads me to the belief that we need to get rid of the contention
 altogether, not just contend more quickly.

 Is there a patch available? How did you do the slow path (i.e. the
 case where there's contention and you need to block)? It seems to
 me that without some kernel support like futexes it's impossible
 to do better than LWLocks already do, because any simpler scheme
 like
  while (atomic_inc_post(lock) < 0) {
    atomic_dec(lock);
    block(lock);
  }
 for the shared-locker case suffers from a race condition (the lock
 might be released before you actually block()).

Attached...

 The idea would be to start out with something trivial like the above.
 Maybe with an #if for compilers which have something like GCC's
 __sync_synchronize(). We could then gradually add implementations
 for specific architectures, hopefully done by people who actually
 own the hardware and can test.

Yes.  But if we go that route, then we have to also support a code
path for architectures for which we don't have that support.  That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


lwlock-v1.patch
Description: Binary data
