On Thu, Jul 30, 2015 at 8:22 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2015-07-30 09:03:01 -0700, Jeff Janes wrote: > > On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <and...@anarazel.de> > wrote: > > > What do you think about something roughly like the attached? > > > > > > > I've not evaluated the code, but applying it does solve the problem I was > > seeing. > > Cool, thanks for testing! How long did you run the test and how long did > it, on average, previously take for the issue to occur? > I had run it for 24 hours, while it usually took less than 8 hours to look up before. I did see it within a few minutes one time when I checked out a new HEAD and then forgot to re-apply your or Heikki's patch. But now I've got the same situation again, after 15 hours, with your patch. This is probably all down to luck. The only differences that I can think of is that I advanced the base to e8e86fbc8b3619da54c, and turned on JJ_vac and set log_autovacuum_min_duration=0. checkpointer, 31516 (gdb) bt #0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6 #1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd3c5f8) at pg_sema.c:387 #2 0x00000000006a5754 in LWLockAcquireCommon (l=0x7f84c42edf80, valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:1041 #3 LWLockAcquireWithVar (l=0x7f84c42edf80, valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:915 #4 0x00000000004cbd86 in WALInsertLockAcquireExclusive () at xlog.c:1445 #5 0x00000000004d4195 in CreateCheckPoint (flags=64) at xlog.c:8181 #6 0x000000000064ea53 in CheckpointerMain () at checkpointer.c:493 #7 0x00000000004e328a in AuxiliaryProcessMain (argc=2, argv=0x7fff6eed1f50) at bootstrap.c:428 #8 0x00000000006563f7 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5167 #9 0x000000000065884a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2670 (gdb) f 2 (gdb) p *(lock->owner) $1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x7f84ccd27c88, sem = {semId = 1480359975, semNum = 12}, waitStatus = 0, procLatch = {is_set = 1, is_shared = 1 '\001', owner_pid = 31540}, lxid = 3209, pid = 31540, pgprocno = 92, backendId = 12, databaseId = 16384, roleId = 10, recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 2 '\002', lwWaitLink = {prev = 0x7f84c42edf10, next = 0x7f84ccd38e60}, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0}, myProcLocks = {{prev = 0x7f84ccd38bb8, next = 0x7f84ccd38bb8}, {prev = 0x7f84ccd38bc8, next = 0x7f84ccd38bc8}, {prev = 0x7f84ccd38bd8, next = 0x7f84ccd38bd8}, {prev = 0x7f84ccd38be8, next = 0x7f84ccd38be8}, {prev = 0x7f84ccd38bf8, next = 0x7f84ccd38bf8}, {prev = 0x7f84ccd38c08, next = 0x7f84ccd38c08}, {prev = 0x7f84ccd38c18, next = 0x7f84ccd38c18}, {prev = 0x7f84ccd38c28, next = 0x7f84ccd38c28}, {prev = 0x7f84ccd38c38, next = 0x7f84ccd38c38}, {prev = 0x7f84ccd38c48, next = 0x7f84ccd38c48}, {prev = 0x7f84ccd38c58, next = 0x7f84ccd38c58}, {prev = 0x7f84ccd38c68, next = 0x7f84ccd38c68}, {prev = 0x7f84ccd38c78, next = 0x7f84ccd38c78}, {prev = 0x7f84ccd38c88, next = 0x7f84ccd38c88}, {prev = 0x7f84ccb0ceb0, next = 0x7f84ccb0ceb0}, {prev = 0x7f84ccd38ca8, next = 0x7f84ccd38ca8}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock = 0x7f84c42e9800, fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, 0, 0, 0, 0, 0, 2679, 2610, 16413, 16411, 16415, 16414, 16408}, fpVXIDLock = 1 '\001', fpLocalTransactionId = 3209} backend 31540: #0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6 #1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd38b38) at pg_sema.c:387 #2 0x00000000006a53d4 in LWLockWaitForVar (lock=0x7f84c42edf00, valptr=0x7f84c42edf28, oldval=0, newval=0x7fff6eeceaf8) at lwlock.c:1438 #3 0x00000000004d3276 in WaitXLogInsertionsToFinish (upto=183341449216) at xlog.c:1576 #4 0x00000000004d3b33 in AdvanceXLInsertBuffer (upto=183345636352, opportunistic=<value optimized out>) at xlog.c:1888 #5 0x00000000004d8df2 in GetXLogBuffer (ptr=183345636352) at xlog.c:1669 #6 0x00000000004d8fea in CopyXLogRecordToWAL (rdata=0xbd2410, fpw_lsn=<value optimized out>) at xlog.c:1276 #7 XLogInsertRecord (rdata=0xbd2410, fpw_lsn=<value optimized out>) at xlog.c:1008 #8 0x00000000004dc72c in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453 #9 0x0000000000478f41 in ginPlaceToPage (btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=<value optimized out>, updateblkno=<value optimized out>, childbuf=0, buildStats=0x0) at ginbtree.c:418 #10 0x00000000004798a3 in ginInsertValue (btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=0x7fff6eeceea0, buildStats=0x0) at ginbtree.c:748 #11 0x00000000004734aa in ginEntryInsert (ginstate=0x7fff6eecf200, attnum=39408, key=1, category=<value optimized out>, items=0x1c81c20, nitem=54, buildStats=0x0) at gininsert.c:234 #12 0x000000000047edbb in ginInsertCleanup (ginstate=0x7fff6eecf200, vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843 #13 0x000000000047fae6 in ginHeapTupleFastInsert (ginstate=0x7fff6eecf200, collector=<value optimized out>) at ginfast.c:436 #14 0x000000000047386a in gininsert (fcinfo=<value optimized out>) at gininsert.c:531 ... (gdb) f 2 (gdb) p *(lock->owner) $1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x0, sem = {semId = 1480425513, semNum = 0}, waitStatus = 0, procLatch = {is_set = 0, is_shared = 1 '\001', owner_pid = 31516}, lxid = 0, pid = 31516, pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0, recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0 '\000', lwWaitLink = {prev = 0x7f84ccd3cc10, next = 0x7f84ccd3a5e0}, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0}, myProcLocks = {{prev = 0x7f84ccd3c678, next = 0x7f84ccd3c678}, {prev = 0x7f84ccd3c688, next = 0x7f84ccd3c688}, {prev = 0x7f84ccd3c698, next = 0x7f84ccd3c698}, {prev = 0x7f84ccd3c6a8, next = 0x7f84ccd3c6a8}, {prev = 0x7f84ccd3c6b8, next = 0x7f84ccd3c6b8}, {prev = 0x7f84ccd3c6c8, next = 0x7f84ccd3c6c8}, {prev = 0x7f84ccd3c6d8, next = 0x7f84ccd3c6d8}, {prev = 0x7f84ccd3c6e8, next = 0x7f84ccd3c6e8}, {prev = 0x7f84ccd3c6f8, next = 0x7f84ccd3c6f8}, {prev = 0x7f84ccd3c708, next = 0x7f84ccd3c708}, {prev = 0x7f84ccd3c718, next = 0x7f84ccd3c718}, {prev = 0x7f84ccd3c728, next = 0x7f84ccd3c728}, {prev = 0x7f84ccd3c738, next = 0x7f84ccd3c738}, {prev = 0x7f84ccd3c748, next = 0x7f84ccd3c748}, {prev = 0x7f84ccd3c758, next = 0x7f84ccd3c758}, {prev = 0x7f84ccd3c768, next = 0x7f84ccd3c768}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock = 0x7f84c42e9d00, fpLockBits = 0, fpRelId = {0 <repeats 16 times>}, fpVXIDLock = 0 '\000', fpLocalTransactionId = 0} Cheers, Jeff