Is there a TODO here?

---------------------------------------------------------------------------
Tom Lane wrote:
> The test case I just posted shows that our spinlock code, which we had
> thought largely done, is once again becoming a performance bottleneck.
> It's time to resurrect some of the ideas we kicked around in early
> 2002, and didn't pursue because we decided spinlocks weren't our major
> performance problem.
>
> I started investigating this by trying to understand why the "futex
> spinlock" patch had helped the Red Hat customer I mentioned, when it
> hadn't done anything much in last year's experiments.  It turns out
> that this customer was using an 8-way Opteron, and the futex patch
> helps a lot more on that architecture than others.  Some random
> experimentation eventually turned up part of the reason: the x86 TAS
> assembly code we are using is pessimal for x86_64.  Specifically,
> s_lock.h uses this for both architectures:
>
>     /* Use a non-locking test before asserting the bus lock */
>     __asm__ __volatile__(
>         "   cmpb    $0,%1   \n"
>         "   jne     1f      \n"
>         "   lock            \n"
>         "   xchgb   %0,%1   \n"
>         "1: \n"
>
> and it turns out that deleting the cmpb and jne makes things go
> significantly faster on Opterons.  So the "non-locking test" is
> not a win at all on x86_64.  As best I can tell from testing,
> removing it wins because it causes the rate of spinlock delays
> to drop significantly.  Why this should be is not completely
> obvious.  I speculate that the Opterons are capable of pulling
> in a copy of the cache line that contains the spinlock and then
> running 100 iterations of the cmpb test without ever noticing
> that the processor holding the lock has now released it.  Without
> the cmpb, they are forced by the locking xchgb test to actually
> look at the real state of the spinlock each time.
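[The two TAS idioms being compared above can be sketched portably with C11 atomics. This is an editorial illustration, not code from the patch: `tas_plain` and `tas_ttas` are made-up names, and the real s_lock.h uses inline assembly rather than `<stdatomic.h>`.]

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_uchar slock_t;

/* Plain TAS: every attempt issues a locked xchg, as in the patched
 * x86_64 code.  Returns true if the lock was already held. */
static bool
tas_plain(slock_t *lock)
{
    return atomic_exchange(lock, 1) != 0;
}

/* Test-and-test-and-set: the cmpb/jne idea -- take a non-locking peek
 * first, and only issue the locked xchg when the lock looks free.  The
 * Opteron behavior described above suggests the peek can keep reading
 * a cached copy even after the lock has in fact been released. */
static bool
tas_ttas(slock_t *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
        return true;            /* looked busy; skip the bus lock */
    return atomic_exchange(lock, 1) != 0;
}
```

[Either function acquires the lock when it returns false; releasing is a plain store of zero.]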
>
> I kinda suspect that the cmpb test is a no-op or loss on all
> Intelish processors: it can only be a win if you expect a lot
> of contention for the spin lock, but in percentage terms we still
> have a very low conflict rate, so in most executions of the TAS
> macro, the cmpb is just wasted cycles.  Bottom line: we definitely
> don't want it for x86_64, and maybe not at all, but we need more
> research to decide the latter.
>
> The second reason that the futex patch is helping is that when
> a spinlock delay does occur, it allows the delaying process to be
> awoken almost immediately, rather than delaying 10 msec or more
> as the existing code does.  However, given that we are only expecting
> the spinlock to be held for a couple dozen instructions, using the
> kernel futex mechanism is huge overkill --- the in-kernel overhead
> to manage the futex state is almost certainly several orders of
> magnitude more than the delay we actually want.
>
> I looked into several other methods of doing the spinlock delay
> instead.  I think all of these were suggested at one point or
> another in our earlier discussions of spinlocks:
>
> 1. Use sched_yield() if available: it does just what we want,
> ie, yield the processor without forcing any useless time delay
> before we can be rescheduled.  This doesn't exist everywhere
> but it exists in recent Linuxen, so I tried it.  It made for a
> beautiful improvement in my test case numbers: CPU utilization
> went to 100% and the context swap rate to almost nil.  Unfortunately,
> I also saw fairly frequent "stuck spinlock" panics when running
> more queries than there were processors --- this despite increasing
> NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
> anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor.
> In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)
>
> 2. Reduce the length of the select() delays in s_lock.  The current code
> delays in quanta of 10msec, but on recent Linuxen (and I think other
> platforms too) the scheduling quantum is 1msec, so we can get the
> processor back faster if we ask for a 1msec delay.  I tried this and it
> is definitely a win on Linux; it doesn't seem to hurt anything on older
> systems either, they just round the delay up to 10msec like before.
> So I think we should do this, even though it's only a partial answer.
>
> 3. Modify the spin loop to do a little more work between TAS() tests.
> In the existing code there are only about half a dozen instructions
> total in the normal spin loop, most of them very fast, with the result
> that the spinning processor does its best to monopolize the system bus
> with locked probe instructions.  This is obviously not good, as it'll
> interfere with the ability of the spinlock holder to complete its work
> and release the lock.  (The bulk of the spinlock uses are for LWLocks,
> and with the current data structures the LWLock's own state is usually
> going to be in the same cache line as the spinlock proper, so that
> cache grabs for the spinlock hurt the LWLock state updating as well
> as the eventual spinlock release.)  I modified the code to use an
> integer modulo operation as part of the test for SPINS_PER_DELAY;
> on most processors this should keep the processor's attention and
> keep it away from the system bus for several times longer than an
> average instruction.  (It's relevant here that the "rep;nop" used
> to slow the spin loop on recent Intel processors appears to be
> completely ignored by Opterons.  Basically we need a non-hardware-
> specific trick to slow the loop.)
>
> 4. Modify SPINS_PER_DELAY.
> Looping more times instead of entering the
> kernel is a win for SMP boxes: whenever you give up and call the kernel
> to delay, it's likely that the spinlock holder will have released the
> lock before you even get all the way into the kernel, let alone get
> rescheduled again.  I found that pushing SPINS_PER_DELAY above 200
> caused the select delay rate to fall to almost nil in this test case.
> However, that would be counterproductive for uniprocessors where the
> optimal strategy is normally to give up the processor immediately.
> We've talked about making the spin count user configurable, but I never
> cared for that answer.  Instead, I've worked up some code to
> automatically adjust the spin count depending on whether we seem to be
> in a uniprocessor or multiprocessor.  The idea is to decrease the spin
> count (down to a lower limit) whenever s_lock has to delay in order to
> get a lock, and to increase the count (up to a limit) whenever we see
> the lock come free before we've had to delay.  The former condition
> suggests that we are in a uniprocessor and the latter suggests we are in
> a multiprocessor.  I recall having complained that this idea would
> probably be unstable, but in testing it seems to do pretty nicely at
> converging to the minimum spin count on a 1-CPU machine and the maximum
> limit on an SMP box.  It could do with more testing though to see if it
> works the same for other people.
>
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64.  Together, these changes bring
> my test case on a 4-way Opteron from
>
>     N, runtime:  1 36s   2 61s   4 105s   8 198s
>
> to
>
>     N, runtime:  1 35s   2 48s   4 58s    8 105s
>
> At 4 queries, CPU utilization is 100% and the context swap rate nearly
> nil, which seems to mean we've gained another (temporary no doubt)
> victory over the spinlock problem.
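[Point #4's adjustment rule can be restated as a small standalone sketch. Editorial illustration only: `adjust_spins_per_delay` is a hypothetical helper, while the limits and the asymmetric +100/-1 steps mirror the patch.]

```c
#include <stdbool.h>

#define MIN_SPINS_PER_DELAY     10
#define MAX_SPINS_PER_DELAY     1000
#define DEFAULT_SPINS_PER_DELAY 100

static int spins_per_delay = DEFAULT_SPINS_PER_DELAY;

/* Apply one observation from the spin loop: the lock coming free
 * without a sleep suggests SMP (increase the spin count rapidly);
 * having had to sleep suggests a uniprocessor (decrease it slowly). */
static void
adjust_spins_per_delay(bool had_to_delay)
{
    if (!had_to_delay)
    {
        spins_per_delay += 100;
        if (spins_per_delay > MAX_SPINS_PER_DELAY)
            spins_per_delay = MAX_SPINS_PER_DELAY;
    }
    else
    {
        spins_per_delay -= 1;
        if (spins_per_delay < MIN_SPINS_PER_DELAY)
            spins_per_delay = MIN_SPINS_PER_DELAY;
    }
}
```

[With these steps, nine contention-free acquisitions take the count from the default to the maximum, while ninety delayed ones are needed to reach the minimum: increase rapidly, decrease slowly.]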
>
> I attach two proposed patches: the first removes the cmpb/jne from
> the x86 TAS assembly code, and the second one does the s_lock changes
> enumerated as points #2, #3, #4.  The first one in particular needs
> more testing to see if it hurts performance on any non-Opteron x86
> chips.  (If so, we'd just make it conditional to x86_64.)
>
> Comments and testing invited.
>
> 			regards, tom lane
>
> Content-Description: slock-no-cmpb.patch
>
> *** REL8_0/src/include/storage/s_lock.h	Sun Aug 28 20:41:44 2005
> --- new/s_lock.h	Fri Sep  9 14:58:44 2005
> ***************
> *** 120,132 ****
>   {
>   	register slock_t _res = 1;
>   
> ! 	/* Use a non-locking test before asserting the bus lock */
>   	__asm__ __volatile__(
> - 		"	cmpb	$0,%1	\n"
> - 		"	jne		1f		\n"
> - 		"	lock			\n"
>   		"	xchgb	%0,%1	\n"
> - 		"1: \n"
>   		:	"+q"(_res), "+m"(*lock)
>   		:
>   		:	"memory", "cc");
> --- 120,128 ----
>   {
>   	register slock_t _res = 1;
>   
> ! 	/* xchg implies a LOCK prefix, so no need to say LOCK explicitly */
>   	__asm__ __volatile__(
>   		"	xchgb	%0,%1	\n"
>   		:	"+q"(_res), "+m"(*lock)
>   		:
>   		:	"memory", "cc");
>
> Content-Description: spin-delay.patch
>
> *** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig	Sat Aug 20 19:26:24 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c	Sun Sep 11 15:07:20 2005
> ***************
> *** 170,175 ****
> --- 170,177 ----
>   
>   	ProcGlobal->freeProcs = INVALID_OFFSET;
>   
> + 	ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> + 
>   	/*
>   	 * Pre-create the PGPROC structures and create a semaphore for
>   	 * each.
> ***************
> *** 224,232 ****
> --- 226,239 ----
>   	/*
>   	 * Try to get a proc struct from the free list.  If this fails, we
>   	 * must be out of PGPROC structures (not to mention semaphores).
> + 	 *
> + 	 * While we are holding the ProcStructLock, also copy the current
> + 	 * shared estimate of spins_per_delay to local storage.
>   	 */
>   	SpinLockAcquire(ProcStructLock);
>   
> + 	set_spins_per_delay(procglobal->spins_per_delay);
> + 
>   	myOffset = procglobal->freeProcs;
>   
>   	if (myOffset != INVALID_OFFSET)
> ***************
> *** 318,338 ****
>   
>   	Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>   
>   	dummyproc = &DummyProcs[proctype];
>   
>   	/*
>   	 * dummyproc should not presently be in use by anyone else
>   	 */
>   	if (dummyproc->pid != 0)
>   		elog(FATAL, "DummyProc[%d] is in use by PID %d",
>   			 proctype, dummyproc->pid);
>   	MyProc = dummyproc;
>   
>   	/*
>   	 * Initialize all fields of MyProc, except MyProc->sem which was set
>   	 * up by InitProcGlobal.
>   	 */
> - 	MyProc->pid = MyProcPid;	/* marks dummy proc as in use by me */
>   	SHMQueueElemInit(&(MyProc->links));
>   	MyProc->waitStatus = STATUS_OK;
>   	MyProc->xid = InvalidTransactionId;
> --- 325,362 ----
>   
>   	Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>   
> + 	/*
> + 	 * Just for paranoia's sake, we use the ProcStructLock to protect
> + 	 * assignment and releasing of DummyProcs entries.
> + 	 *
> + 	 * While we are holding the ProcStructLock, also copy the current
> + 	 * shared estimate of spins_per_delay to local storage.
> + 	 */
> + 	SpinLockAcquire(ProcStructLock);
> + 
> + 	set_spins_per_delay(ProcGlobal->spins_per_delay);
> + 
>   	dummyproc = &DummyProcs[proctype];
>   
>   	/*
>   	 * dummyproc should not presently be in use by anyone else
>   	 */
>   	if (dummyproc->pid != 0)
> + 	{
> + 		SpinLockRelease(ProcStructLock);
>   		elog(FATAL, "DummyProc[%d] is in use by PID %d",
>   			 proctype, dummyproc->pid);
> + 	}
>   	MyProc = dummyproc;
>   
> + 	MyProc->pid = MyProcPid;	/* marks dummy proc as in use by me */
> + 
> + 	SpinLockRelease(ProcStructLock);
> + 
>   	/*
>   	 * Initialize all fields of MyProc, except MyProc->sem which was set
>   	 * up by InitProcGlobal.
>   	 */
>   	SHMQueueElemInit(&(MyProc->links));
>   	MyProc->waitStatus = STATUS_OK;
>   	MyProc->xid = InvalidTransactionId;
> ***************
> *** 509,514 ****
> --- 533,541 ----
>   	/* PGPROC struct isn't mine anymore */
>   	MyProc = NULL;
>   
> + 	/* Update shared estimate of spins_per_delay */
> + 	procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
> + 
>   	SpinLockRelease(ProcStructLock);
>   }
>   
> ***************
> *** 532,542 ****
> --- 559,576 ----
>   	/* Release any LW locks I am holding (see notes above) */
>   	LWLockReleaseAll();
>   
> + 	SpinLockAcquire(ProcStructLock);
> + 
>   	/* Mark dummy proc no longer in use */
>   	MyProc->pid = 0;
>   
>   	/* PGPROC struct isn't mine anymore */
>   	MyProc = NULL;
> + 
> + 	/* Update shared estimate of spins_per_delay */
> + 	ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
> + 
> + 	SpinLockRelease(ProcStructLock);
>   }
>   
>   
> *** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig	Fri Aug 26 10:47:35 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c	Sun Sep 11 17:43:20 2005
> ***************
> *** 21,26 ****
> --- 21,30 ----
>   #include "storage/s_lock.h"
>   #include "miscadmin.h"
>   
> + 
> + static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> + 
> + 
>   /*
>    * s_lock_stuck() - complain about a stuck spinlock
>    */
> ***************
> *** 49,102 ****
>    * We loop tightly for awhile, then delay using pg_usleep() and try
>    * again.  Preferably, "awhile" should be a small multiple of the
>    * maximum time we expect a spinlock to be held.  100 iterations seems
> !  * about right.  In most multi-CPU scenarios, the spinlock is probably
> !  * held by a process on another CPU and will be released before we
> !  * finish 100 iterations.  However, on a uniprocessor, the tight loop
> !  * is just a waste of cycles, so don't iterate thousands of times.
>    *
>    * Once we do decide to block, we use randomly increasing pg_usleep()
> !  * delays.
> !  * The first delay is 10 msec, then the delay randomly
> !  * increases to about one second, after which we reset to 10 msec and
>    * start again.  The idea here is that in the presence of heavy
>    * contention we need to increase the delay, else the spinlock holder
>    * may never get to run and release the lock.  (Consider situation
>    * where spinlock holder has been nice'd down in priority by the
>    * scheduler --- it will not get scheduled until all would-be
> !  * acquirers are sleeping, so if we always use a 10-msec sleep, there
>    * is a real possibility of starvation.)  But we can't just clamp the
>    * delay to an upper bound, else it would take a long time to make a
>    * reasonable number of tries.
>    *
>    * We time out and declare error after NUM_DELAYS delays (thus, exactly
>    * that many tries).  With the given settings, this will usually take
> !  * 3 or so minutes.  It seems better to fix the total number of tries
>    * (and thus the probability of unintended failure) than to fix the
>    * total time spent.
>    *
> !  * The pg_usleep() delays are measured in centiseconds (0.01 sec) because
> !  * 10 msec is a common resolution limit at the OS level.
>    */
> ! #define SPINS_PER_DELAY		100
>   #define NUM_DELAYS			1000
> ! #define MIN_DELAY_CSEC		1
> ! #define MAX_DELAY_CSEC		100
>   
> ! 	int			spins = 0;
>   	int			delays = 0;
> ! 	int			cur_delay = MIN_DELAY_CSEC;
>   
>   	while (TAS(lock))
>   	{
>   		/* CPU-specific delay each time through the loop */
>   		SPIN_DELAY();
>   
> ! 		/* Block the process every SPINS_PER_DELAY tries */
> ! 		if (++spins > SPINS_PER_DELAY)
>   		{
>   			if (++delays > NUM_DELAYS)
>   				s_lock_stuck(lock, file, line);
>   
> ! 			pg_usleep(cur_delay * 10000L);
>   
>   #if defined(S_LOCK_TEST)
>   			fprintf(stdout, "*");
> --- 53,121 ----
>    * We loop tightly for awhile, then delay using pg_usleep() and try
>    * again.  Preferably, "awhile" should be a small multiple of the
>    * maximum time we expect a spinlock to be held.  100 iterations seems
> !  * about right as an initial guess.  However, on a uniprocessor the
> !  * loop is a waste of cycles, while in a multi-CPU scenario it's usually
> !  * better to spin a bit longer than to call the kernel, so we try to
> !  * adapt the spin loop count depending on whether we seem to be in
> !  * a uniprocessor or multiprocessor.
>    *
>    * Once we do decide to block, we use randomly increasing pg_usleep()
> !  * delays.  The first delay is 1 msec, then the delay randomly
> !  * increases to about one second, after which we reset to 1 msec and
>    * start again.  The idea here is that in the presence of heavy
>    * contention we need to increase the delay, else the spinlock holder
>    * may never get to run and release the lock.  (Consider situation
>    * where spinlock holder has been nice'd down in priority by the
>    * scheduler --- it will not get scheduled until all would-be
> !  * acquirers are sleeping, so if we always use a 1-msec sleep, there
>    * is a real possibility of starvation.)  But we can't just clamp the
>    * delay to an upper bound, else it would take a long time to make a
>    * reasonable number of tries.
>    *
>    * We time out and declare error after NUM_DELAYS delays (thus, exactly
>    * that many tries).  With the given settings, this will usually take
> !  * 2 or so minutes.  It seems better to fix the total number of tries
>    * (and thus the probability of unintended failure) than to fix the
>    * total time spent.
>    *
> !  * The pg_usleep() delays are measured in milliseconds because 1 msec
> !  * is a common resolution limit at the OS level for newer platforms.
> !  * On older platforms the resolution limit is usually 10 msec, in
> !  * which case the total delay before timeout will be a bit more.
>    */
> ! #define MIN_SPINS_PER_DELAY	10
> ! #define MAX_SPINS_PER_DELAY	1000
>   #define NUM_DELAYS			1000
> ! #define MIN_DELAY_MSEC		1
> ! #define MAX_DELAY_MSEC		1000
>   
> ! 	int			spins = spins_per_delay;
>   	int			delays = 0;
> ! 	int			cur_delay = 0;
>   
>   	while (TAS(lock))
>   	{
>   		/* CPU-specific delay each time through the loop */
>   		SPIN_DELAY();
>   
> ! 		/*
> ! 		 * Block the process every spins_per_delay tries.
> ! 		 *
> ! 		 * What we are really testing for here is spins being decremented
> ! 		 * to zero.  We insert an unnecessary integer modulo operation
> ! 		 * into the test because we'd like this loop to run longer than
> ! 		 * just two or three instructions: ideally, the processor should
> ! 		 * not be contending for the system bus for a little while here.
> ! 		 */
> ! 		if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>   		{
>   			if (++delays > NUM_DELAYS)
>   				s_lock_stuck(lock, file, line);
>   
> ! 			if (cur_delay == 0)		/* first time to delay? */
> ! 				cur_delay = MIN_DELAY_MSEC;
> ! 
> ! 			pg_usleep(cur_delay * 1000L);
>   
>   #if defined(S_LOCK_TEST)
>   			fprintf(stdout, "*");
> ***************
> *** 107,119 ****
>   			cur_delay += (int) (cur_delay *
>   				(((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
>   			/* wrap back to minimum delay when max is exceeded */
> ! 			if (cur_delay > MAX_DELAY_CSEC)
> ! 				cur_delay = MIN_DELAY_CSEC;
>   
> ! 			spins = 0;
>   		}
>   	}
>   }
>   
>   /*
>    * Various TAS implementations that cannot live in s_lock.h as no inline
> --- 126,200 ----
>   			cur_delay += (int) (cur_delay *
>   				(((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
>   			/* wrap back to minimum delay when max is exceeded */
> ! 			if (cur_delay > MAX_DELAY_MSEC)
> ! 				cur_delay = MIN_DELAY_MSEC;
>   
> ! 			spins = spins_per_delay;
>   		}
>   	}
> + 
> + 	/*
> + 	 * If we were able to acquire the lock without delaying, it's a good
> + 	 * indication we are in a multiprocessor.  If we had to delay, it's
> + 	 * a sign (but not a sure thing) that we are in a uniprocessor.
> + 	 * Hence, we decrement spins_per_delay slowly when we had to delay,
> + 	 * and increase it rapidly when we didn't.  It's expected that
> + 	 * spins_per_delay will converge to the minimum value on a uniprocessor
> + 	 * and to the maximum value on a multiprocessor.
> + 	 *
> + 	 * Note: spins_per_delay is local within our current process.
> + 	 * We want to average these observations across multiple backends,
> + 	 * since it's relatively rare for this function to even get entered,
> + 	 * and so a single backend might not live long enough to converge on
> + 	 * a good value.  That is handled by the two routines below.
> + 	 */
> + 	if (cur_delay == 0)
> + 	{
> + 		/* we never had to delay */
> + 		spins_per_delay += 100;
> + 		spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY);
> + 	}
> + 	else
> + 	{
> + 		spins_per_delay -= 1;
> + 		spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY);
> + 	}
> + }
> + 
> + 
> + /*
> +  * Set local copy of spins_per_delay during backend startup.
> +  *
> +  * NB: this has to be pretty fast as it is called while holding a spinlock
> +  */
> + void
> + set_spins_per_delay(int shared_spins_per_delay)
> + {
> + 	spins_per_delay = shared_spins_per_delay;
> + }
> + 
> + /*
> +  * Update shared estimate of spins_per_delay during backend exit.
> +  *
> +  * NB: this has to be pretty fast as it is called while holding a spinlock
> +  */
> + int
> + update_spins_per_delay(int shared_spins_per_delay)
> + {
> + 	/*
> + 	 * We use an exponential moving average with a relatively slow
> + 	 * adaption rate, so that noise in any one backend's result won't
> + 	 * affect the shared value too much.  As long as both inputs are
> + 	 * within the allowed range, the result must be too, so we need not
> + 	 * worry about clamping the result.
> + 	 *
> + 	 * We deliberately truncate rather than rounding; this is so that
> + 	 * single adjustments inside a backend can affect the shared estimate
> + 	 * (see the asymmetric adjustment rules above).
> + 	 */
> + 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
>   }
> + 
>   
>   /*
>    * Various TAS implementations that cannot live in s_lock.h as no inline
> *** /home/postgres/pgsql/src/include/storage/proc.h.orig	Sat Aug 20 19:26:34 2005
> --- /home/postgres/pgsql/src/include/storage/proc.h	Sun Sep 11 15:07:09 2005
> ***************
> *** 105,110 ****
> --- 105,112 ----
>   {
>   	/* Head of list of free PGPROC structures */
>   	SHMEM_OFFSET freeProcs;
> + 	/* Current shared estimate of appropriate spins_per_delay value */
> + 	int			spins_per_delay;
>   } PROC_HDR;
>   
>   
> *** /home/postgres/pgsql/src/include/storage/s_lock.h.orig	Sun Aug 28 20:41:34 2005
> --- /home/postgres/pgsql/src/include/storage/s_lock.h	Sun Sep 11 15:07:09 2005
> ***************
> *** 826,829 ****
> --- 826,835 ----
>    */
>   extern void s_lock(volatile slock_t *lock, const char *file, int line);
>   
> + /* Support for dynamic adjustment of spins_per_delay */
> + #define DEFAULT_SPINS_PER_DELAY  100
> + 
> + extern void set_spins_per_delay(int shared_spins_per_delay);
> + extern int	update_spins_per_delay(int shared_spins_per_delay);
> + 
>   #endif	 /* S_LOCK_H */
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
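[A note on the cross-backend averaging in update_spins_per_delay() quoted above: it is an exponential moving average that moves the shared estimate 1/16 of the way toward the exiting backend's local value. A standalone sketch, with a hypothetical function name but the same arithmetic as the patch:]

```c
/* Move the shared estimate 1/16 of the way toward one backend's local
 * value.  Integer division truncates, as in the patch; with both inputs
 * inside [MIN_SPINS_PER_DELAY, MAX_SPINS_PER_DELAY] the result stays in
 * range, so no clamping is needed. */
static int
update_shared_estimate(int shared, int local)
{
    return (shared * 15 + local) / 16;
}
```

[With this weighting, one noisy backend barely moves the shared value, but a stream of backends that all converged to the SMP maximum pulls a low shared estimate most of the way up within a few dozen exits.]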