Re: backends stuck in "startup"

2017-11-25 Thread Justin Pryzby
On Sat, Nov 25, 2017 at 05:45:59PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > We never had any issue during the ~2 years running PG96 on this VM, until
> > upgrading Monday to PG10.1, and we've now hit it 5+ times.
> 
> > BTW this is a VM run on a hypervisor managed by our customer:
> > DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
> > BIOS 6.00 06/22/2012
> 
> > Linux TS-DB 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 
> > x86_64 x86_64 x86_64 GNU/Linux
> 
> Actually ... I was focusing on the wrong part of that.  It's not
> your hypervisor, it's your kernel.  Running four-year-old kernels
> is seldom a great idea, and in this case, the one you're using
> contains the well-reported missed-futex-wakeups bug:
> 
> https://bugs.centos.org/view.php?id=8371
> 
> While rebuilding PG so it doesn't use POSIX semaphores will dodge
> that bug, I think a kernel update would be a far better idea.
> There are lots of other known bugs in that version.
> 
> Relevant to our discussion, the fix involves inserting a memory
> barrier into the kernel's futex call handling:

Ouch!  Thanks for the heads-up, and sorry for the noise.

I'm still trying to coax 3 customers off centos5.x, so the 2 customers left
running centos6.5 weren't on any of my mental lists.

Justin



Re: backends stuck in "startup"

2017-11-25 Thread Tom Lane
Justin Pryzby  writes:
> We never had any issue during the ~2 years running PG96 on this VM, until
> upgrading Monday to PG10.1, and we've now hit it 5+ times.

> BTW this is a VM run on a hypervisor managed by our customer:
> DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
> BIOS 6.00 06/22/2012

> Linux TS-DB 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 
> x86_64 x86_64 GNU/Linux

Actually ... I was focusing on the wrong part of that.  It's not
your hypervisor, it's your kernel.  Running four-year-old kernels
is seldom a great idea, and in this case, the one you're using
contains the well-reported missed-futex-wakeups bug:

https://bugs.centos.org/view.php?id=8371

While rebuilding PG so it doesn't use POSIX semaphores will dodge
that bug, I think a kernel update would be a far better idea.
There are lots of other known bugs in that version.

Relevant to our discussion, the fix involves inserting a memory
barrier into the kernel's futex call handling:

https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0

regards, tom lane



Re: backends stuck in "startup"

2017-11-22 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
>> My hypothesis about a missed memory barrier would imply that there's (at
>> least) one process that's waiting but is not in the lock's wait queue and

> Do I have to also check the wait queue to verify?  Give a hint/pointer please?

Andres probably knows more about this data structure than I do, but I
believe that the values in the LWLock's proclist_head field are indexes
into the PGProc array, and that the PGProc.lwWaitLink proclist_node fields
contain the fore and aft pointers in a doubly-linked list of waiting
processes.  But chasing through that by hand is going to be darn tedious
if there are a bunch of processes queued for the same lock.  In any case,
if the process is blocked right there and its lwWaiting field is not set,
that is sufficient proof of a bug IMO.  What is not quite proven yet is
why it failed to detect that it'd been woken.
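
(For reference, the pieces involved look about like this in the v10 tree;
I'm quoting from memory, so double-check against proclist_types.h and proc.h
before trusting the details:)

    typedef struct proclist_node
    {
        int     next;   /* pgprocno of the next waiter, INVALID_PGPROCNO at end */
        int     prev;   /* pgprocno of the previous waiter */
    } proclist_node;

    typedef struct proclist_head
    {
        int     head;   /* pgprocno of the first waiter */
        int     tail;   /* pgprocno of the last waiter */
    } proclist_head;

    /* and in struct PGPROC: */
    bool            lwWaiting;      /* true if waiting for an LW lock */
    uint8           lwWaitMode;     /* lwlock mode being waited for */
    proclist_node   lwWaitLink;     /* position in LW lock wait list */

So "waiters = {head = 147, tail = 308}" means the queue starts at
ProcGlobal->allProcs[147], and you follow each entry's lwWaitLink.next from
there until you reach the tail.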

I think really the most useful thing at this point is just to wait and
see if your SYSV-semaphore build exhibits the same problem or not.
If it does not, we can be pretty confident that *something* is wrong
with the POSIX-semaphore code, even if my current theory isn't it.

>> My theory suggests that any contended use of an LWLock is at risk,
>> in which case just running pgbench with about as many sessions as
>> you have in the live server ought to be able to trigger it.  However,
>> that doesn't really account for your having observed the problem
>> only during session startup,

> Remember, this issue breaks existing sessions, too.

Well, once one session is hung up, anything else that came along wanting
access to that same LWLock would also get stuck.  Since the lock in
question is a buffer partition lock controlling access to something like
1/128'th of the shared buffer pool, it would not take too long for every
active session to get stuck there, whether it were doing anything related
or not.
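
(The 128 is NUM_BUFFER_PARTITIONS.  Roughly, from memory, the mapping in the
v10 headers looks like this; check lwlock.h and buf_internals.h for the real
thing:)

    #define NUM_BUFFER_PARTITIONS  128

    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)
    #define BufMappingPartitionLock(hashcode) \
        (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
            BufTableHashPartition(hashcode)].lock)

Every buffer tag hashes to one of those 128 partitions, so a single stuck
partition lock takes out any lookup that happens to hash into it.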

In any case, if you feel like trying the pgbench approach, I'd suggest
setting up a script to run a lot of relatively short runs rather than one
long one.  If there is something magic about the first blockage in a
session, that would help catch it.

> Am I right this won't help for lwlocks? ALTER SYSTEM SET log_lock_waits=yes

Nope, that's just for heavyweight locks.  LWLocks are lightweight
precisely because they don't have stuff like logging, timeouts,
or deadlock detection ...

regards, tom lane



Re: backends stuck in "startup"

2017-11-22 Thread Justin Pryzby
On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > For starters, I found that PID 27427 has:
> 
> > (gdb) p proc->lwWaiting
> > $1 = 0 '\000'
> > (gdb) p proc->lwWaitMode
> > $2 = 1 '\001'
> 
> To confirm, this is LWLockAcquire's "proc", equal to MyProc?
> If so, and if LWLockAcquire is blocked at PGSemaphoreLock,
> that sure seems like a smoking gun.

Right:

(gdb) bt
#0  0x003f6ee0d930 in ?? ()
#1  0x7f19f3400d38 in ?? ()
#2  0x006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
#3  0x00714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) 
at lwlock.c:1233
#4  0x006f75e5 in BufferAlloc (smgr=0x1f3c368, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5  ReadBuffer_common (smgr=0x1f3c368, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff6ef1c5af "") at bufmgr.c:745
[...]

(gdb) up
#1  0x7f19f3400d38 in ?? ()
(gdb) up
#2  0x006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
310 errStatus = sem_wait(PG_SEM_REF(sema));
(gdb) up
#3  0x00714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) 
at lwlock.c:1233
1233PGSemaphoreLock(proc->sem);
(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'
(gdb) p proc
$3 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc
$4 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc==proc
$5 = 1

I suppose one needs to show that this backend was stuck and not "just
happened to start 1ms before gdb connected", which I actually have:
postgres 27427 17665  0 16:57 ?00:00:00 postgres: indigo indigo 
207.241.169.248(37226) startup
...
postgres 30298 17665  0 17:01 ?00:00:00 postgres: main main 
207.241.169.100(43088) startup

..so that process was in "startup" 3-4 minutes after being forked.

> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and

Do I have to also check the wait queue to verify?  Give a hint/pointer please?

> > Would you suggest how I can maximize the likelihood/speed of triggering
> > that?
> > Five years ago, with a report of similar symptoms, you said "You need to 
> > hack
> > pgbench to suppress the single initialization connection it normally likes 
> > to
> > make, else the test degenerates to the one-incoming-connection case"
> > https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us
> 
> I don't think that case was related at all.
> 
> My theory suggests that any contended use of an LWLock is at risk,
> in which case just running pgbench with about as many sessions as
> you have in the live server ought to be able to trigger it.  However,
> that doesn't really account for your having observed the problem
> only during session startup,

Remember, this issue breaks existing sessions, too.

After it'd happened a couple of times, I made sure to leave a session open to
allow collecting diagnostics (and to try to un-stick it), but when it recurred,
I was unable to even SELECT * FROM pg_locks.

Am I right this won't help for lwlocks? ALTER SYSTEM SET log_lock_waits=yes

Justin



Re: backends stuck in "startup"

2017-11-22 Thread Tom Lane
Justin Pryzby  writes:
> For starters, I found that PID 27427 has:

> (gdb) p proc->lwWaiting
> $1 = 0 '\000'
> (gdb) p proc->lwWaitMode
> $2 = 1 '\001'

To confirm, this is LWLockAcquire's "proc", equal to MyProc?
If so, and if LWLockAcquire is blocked at PGSemaphoreLock,
that sure seems like a smoking gun.

> Note: I've locally compiled PG 10.1 with PREFERRED_SEMAPHORES=SYSV to keep the
> service up (and to the degree that serves to verify that SysV semaphores avoid
> the issue, great).

Good idea, I was going to suggest that.  It will be very interesting
to see if that makes the problem go away.

> Would you suggest how I can maximize the likelihood/speed of triggering that?
> Five years ago, with a report of similar symptoms, you said "You need to hack
> pgbench to suppress the single initialization connection it normally likes to
> make, else the test degenerates to the one-incoming-connection case"
> https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

I don't think that case was related at all.

My theory suggests that any contended use of an LWLock is at risk,
in which case just running pgbench with about as many sessions as
you have in the live server ought to be able to trigger it.  However,
that doesn't really account for your having observed the problem
only during session startup, so there may be some other factor
involved.  I wonder if it only happens during the first wait for
an LWLock ... and if so, how could that be?

regards, tom lane



Re: backends stuck in "startup"

2017-11-22 Thread Justin Pryzby
On Wed, Nov 22, 2017 at 01:27:12PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> >> Could you try stracing next time?
> 
> > I straced all the "startup" PIDs, which were all in futex, without 
> > exception:
> 
> If you've got debug symbols installed, could you investigate the states
> of the LWLocks the processes are stuck on?
> 
> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and
> has MyProc->lwWaiting == false, while the rest are in the wait queue and
> have MyProc->lwWaiting == true.  Actually chasing through the list
> pointers would be slightly tedious, but checking MyProc->lwWaiting,
> and maybe MyProc->lwWaitMode, in each process shouldn't be too hard.

> Also verify that they're all waiting for the same LWLock (by address).

I believe my ~40 cores are actually for backends from two separate instances of
this issue on the VM, as evidenced by different argv pointers.

And for each instance, I have cores for only a fraction of the backends
(max_connections=400).

For starters, I found that PID 27427 has:

(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'

..where all the others have lwWaiting=1

For #27427:
(gdb) p *lock
$27 = {tranche = 59, state = {value = 1627389952}, waiters = {head = 147, tail = 308}}

(gdb) info locals
mustwait = 1 '\001'
proc = 0x7f1a77dba500
result = 1 '\001'
extraWaits = 0
__func__ = "LWLockAcquire"

And at this point I have to ask for help with how to finish traversing these
structures.  I could upload cores for someone (I don't think there's anything
too private), but so far I have 16GB of gz-compressed cores.

Note: I've locally compiled PG 10.1 with PREFERRED_SEMAPHORES=SYSV to keep the
service up (and to the degree that serves to verify that SysV semaphores avoid
the issue, great).

But I could start an instance running pgbench to try to trigger the issue on
this VM, with smaller shared_buffers and fewer backends/clients, to allow full
cores of every backend (I don't think I'll be able to dump all 400 cores, each
up to 2GB, from the production instance).

Would you suggest how I can maximize the likelihood/speed of triggering that?
Five years ago, with a report of similar symptoms, you said "You need to hack
pgbench to suppress the single initialization connection it normally likes to
make, else the test degenerates to the one-incoming-connection case"
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

..but pgbench --connect seems to do what's needed(?)  (I see that option dates
back to 2001, having been added in ba708ea3).

(I don't know that there's any suggestion or reason to believe the bug is
specific to the connection/startup phase, or that it's necessary or sufficient
to hit the bug, but it's at least known to be impacted and is all I have to go
on for now.)

Justin



Re: backends stuck in "startup"

2017-11-22 Thread Justin Pryzby
On Wed, Nov 22, 2017 at 01:27:12PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> [ in an earlier post: ]
> > BTW this is a VM run on a hypervisor managed by our customer:
> > DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
> > BIOS 6.00 06/22/2012
> 
> Hmm.  Can't avoid the suspicion that that's relevant somehow.

In case it helps or is relevant, I'm told this is vSphere version 6.0 U3.

Justin



Re: backends stuck in "startup"

2017-11-22 Thread Tom Lane
Justin Pryzby  writes:
> On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
>> Could you try stracing next time?

> I straced all the "startup" PIDs, which were all in futex, without exception:

If you've got debug symbols installed, could you investigate the states
of the LWLocks the processes are stuck on?

My hypothesis about a missed memory barrier would imply that there's (at
least) one process that's waiting but is not in the lock's wait queue and
has MyProc->lwWaiting == false, while the rest are in the wait queue and
have MyProc->lwWaiting == true.  Actually chasing through the list
pointers would be slightly tedious, but checking MyProc->lwWaiting,
and maybe MyProc->lwWaitMode, in each process shouldn't be too hard.
Also verify that they're all waiting for the same LWLock (by address).

I recognize Andres' point that on x86 lock-prefixed instructions should
be full memory barriers, and at least on my Linux machines, there do
seem to be lock-prefixed instructions in the fast paths through sem_wait
and sem_post.  But the theory fits the reported evidence awfully well,
and we have no other theory that fits at all.
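
(The first part is easy to check for yourself: compile the manpage's
compare-and-swap in isolation and look at the assembly.  A minimal sketch,
nothing PG-specific about it:)

    /* cas_demo.c: build with "cc -O2 -S cas_demo.c" and look for
     * "lock cmpxchg" in cas_demo.s; the LOCK prefix is what makes the
     * operation a full memory barrier on x86. */
    int
    try_take(int *futexp)
    {
        /* the same builtin the futex(2) manpage demo uses */
        return __sync_bool_compare_and_swap(futexp, 1, 0);
    }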

[ in an earlier post: ]
> BTW this is a VM run on a hypervisor managed by our customer:
> DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
> BIOS 6.00 06/22/2012

Hmm.  Can't avoid the suspicion that that's relevant somehow.

regards, tom lane



Re: backends stuck in "startup"

2017-11-22 Thread Justin Pryzby
On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> > I'm sorry to report this previously reported problem is happening again,
> > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> > 
> > As $subject: backends are stuck in startup for minutes at a time.  I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
> 
> >  - I'm planning to disable sync seqscan for the night (although as I
> > understand it, that may not matter (?))
> 
> I doubt it matters here.

The system still got wedged/stuck after a few hours even with:

synchronize_seqscans=off; use of pg_buffercache disabled (but the extension not
actually dropped); and our maintenance (vacuum and reindex) scripts disabled.
(These normally run only at night, so at least the REINDEX script wasn't
involved in several earlier incidents of this issue; but I was still running
the vacuum/ANALYZE script post-upgrade.)

Justin



Re: backends stuck in "startup"

2017-11-21 Thread Justin Pryzby
On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> > I'm sorry to report this previously reported problem is happening again,
> > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> > 
> > As $subject: backends are stuck in startup for minutes at a time.  I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
> 
> Could you try stracing next time?

I straced all the "startup" PIDs, which were all in futex, without exception:

[pryzbyj@TS-DB ~]$ sudo strace `ps h --sort start_time -o pid,args -u postgres 
|awk '/startup$/{print "-p"$1}'` 2>&1 |tee -a /tmp/strace-pg-backends

Process 3611 attached
Process 6081 attached
[pid  6081] futex(0x7ff4314000f8, FUTEX_WAIT, 0, NULL 
[pid  3611] futex(0x7ff431400138, FUTEX_WAIT, 0, NULL 
[pid  3354] futex(0x7ff431400258, FUTEX_WAIT, 0, NULL 
[pid  3117] futex(0x7ff4314002b8, FUTEX_WAIT, 0, NULL 
[pid  2948] futex(0x7ff431400378, FUTEX_WAIT, 0, NULL 
[pid  2925] futex(0x7ff4314003d8, FUTEX_WAIT, 0, NULL 
[pid  2789] futex(0x7ff431400438, FUTEX_WAIT, 0, NULL 
[pid  2752] futex(0x7ff4314004b8, FUTEX_WAIT, 0, NULL 
[pid  2663] futex(0x7ff431400618, FUTEX_WAIT, 0, NULL 
[pid  2661] futex(0x7ff431400638, FUTEX_WAIT, 0, NULL 
[pid  2624] futex(0x7ff431400678, FUTEX_WAIT, 0, NULL 
[pid  2576] futex(0x7ff4314006f8, FUTEX_WAIT, 0, NULL 
[pid   365] futex(0x7ff431400718, FUTEX_WAIT, 0, NULL 
[pid   326] futex(0x7ff431400798, FUTEX_WAIT, 0, NULL 
[pid 32704] futex(0x7ff431400898, FUTEX_WAIT, 0, NULL 
[pid 32309] futex(0x7ff431400938, FUTEX_WAIT, 0, NULL 
[pid 32292] futex(0x7ff431400a18, FUTEX_WAIT, 0, NULL 
[pid 32170] futex(0x7ff431400ad8, FUTEX_WAIT, 0, NULL 
[pid 32134] futex(0x7ff431400b58, FUTEX_WAIT, 0, NULL 
[pid 31987] futex(0x7ff431400bf8, FUTEX_WAIT, 0, NULL 
[pid 31962] futex(0x7ff431400c58, FUTEX_WAIT, 0, NULL 
[pid 31817] futex(0x7ff431400d38, FUTEX_WAIT, 0, NULL 
[pid 31807] futex(0x7ff431400d98, FUTEX_WAIT, 0, NULL 
[pid 29664] futex(0x7ff431400e98, FUTEX_WAIT, 0, NULL 
[pid 29462] futex(0x7ff431400ef8, FUTEX_WAIT, 0, NULL 
[pid 29426] futex(0x7ff431400f38, FUTEX_WAIT, 0, NULL 
[pid 29258] futex(0x7ff431401058, FUTEX_WAIT, 0, NULL 
[pid 29227] futex(0x7ff431402f58, FUTEX_WAIT, 0, NULL 
[pid 29107] futex(0x7ff431401138, FUTEX_WAIT, 0, NULL 
[pid 29071] futex(0x7ff431401198, FUTEX_WAIT, 0, NULL 
[pid 28942] futex(0x7ff431401258, FUTEX_WAIT, 0, NULL 
[pid 28924] futex(0x7ff431401298, FUTEX_WAIT, 0, NULL 
[pid 26942] futex(0x7ff431401418, FUTEX_WAIT, 0, NULL 
[pid 25278] futex(0x7ff431401438, FUTEX_WAIT, 0, NULL 
[pid 24909] futex(0x7ff431401598, FUTEX_WAIT, 0, NULL 
[pid 24908] futex(0x7ff4314015b8, FUTEX_WAIT, 0, NULL 
[pid 24852] futex(0x7ff4314015f8, FUTEX_WAIT, 0, NULL 
[pid 24851] futex(0x7ff431401618, FUTEX_WAIT, 0, NULL 
[pid 24585] futex(0x7ff4314016f8, FUTEX_WAIT, 0, NULL 
[pid 24584] futex(0x7ff431401718, FUTEX_WAIT, 0, NULL 
[pid 24478] futex(0x7ff431401918, FUTEX_WAIT, 0, NULL 
[pid 24476] futex(0x7ff431401958, FUTEX_WAIT, 0, NULL 
[pid 24470] futex(0x7ff4314019d8, FUTEX_WAIT, 0, NULL 
[pid 24464] futex(0x7ff431401ad8, FUTEX_WAIT, 0, NULL 
[pid 24458] futex(0x7ff431401bb8, FUTEX_WAIT, 0, NULL 
[pid 24456] futex(0x7ff431401b98, FUTEX_WAIT, 0, NULL 
[pid 24450] futex(0x7ff431401c58, FUTEX_WAIT, 0, NULL 
[pid 24448] futex(0x7ff431401c98, FUTEX_WAIT, 0, NULL 
[pid 24439] futex(0x7ff431401cf8, FUTEX_WAIT, 0, NULL 
[pid 24323] futex(0x7ff431401e38, FUTEX_WAIT, 0, NULL 
[pid 24322] futex(0x7ff431401e58, FUTEX_WAIT, 0, NULL 
[pid 23057] futex(0x7ff431401fd8, FUTEX_WAIT, 0, NULL 
[pid 23055] futex(0x7ff431402038, FUTEX_WAIT, 0, NULL 
[pid 21826] futex(0x7ff4314021f8, FUTEX_WAIT, 0, NULL 
[pid 21824] futex(0x7ff431402238, FUTEX_WAIT, 0, NULL 
[pid 21794] futex(0x7ff431402298, FUTEX_WAIT, 0, NULL 
[pid 21792] futex(0x7ff431402638, FUTEX_WAIT, 0, NULL 
[pid 21791] futex(0x7ff431402af8, FUTEX_WAIT, 0, NULL 
[pid 21790] futex(0x7ff431402c98, FUTEX_WAIT, 0, NULL 
[pid 21786] futex(0x7ff431402478, FUTEX_WAIT, 0, NULL 
[pid 21785] futex(0x7ff431402758, FUTEX_WAIT, 0, NULL 
[pid 21784] futex(0x7ff431403098, FUTEX_WAIT, 0, NULL 
[pid 21768] futex(0x7ff431402718, FUTEX_WAIT, 0, NULL 
[pid 21662] futex(0x7ff431402338, FUTEX_WAIT, 0, NULL 
[pid 21648] futex(0x7ff431402c38, FUTEX_WAIT, 0, NULL 
[pid 21630] futex(0x7ff4314024d8, FUTEX_WAIT, 0, NULL 
[pid 21591] futex(0x7ff431402278, FUTEX_WAIT, 0, NULL 
[pid 21590] futex(0x7ff431402bf8, FUTEX_WAIT, 0, NULL

Also:

[pryzbyj@TS-DB ~]$ date;ps -O lstart,wchan=99 -u postgres 
--sort start_time |grep startup$
Wed Nov 22 00:47:41 EST 2017
 1990 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ?00:00:00
postgres: xx xx xx.247(53934) startup
 2007 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ?00:00:00
postgres: xx xx xx.247(53945) startup
 2014 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ?00:00:00
postgres: xx xx xx.247(53946) startup
 2042 Wed Nov 22 0

Re: backends stuck in "startup"

2017-11-21 Thread Justin Pryzby
On Tue, Nov 21, 2017 at 03:45:58PM -0800, Andres Freund wrote:
> On 2017-11-21 18:21:16 -0500, Tom Lane wrote:
> > Justin Pryzby  writes:
> > > As $subject: backends are stuck in startup for minutes at a time.  I 
> > > didn't
> > > strace this time, but I believe last time I saw one was waiting in a 
> > > futex.
> 
> A futex? Hm, that was stock postgres?

Yes, from PGDG:

[pryzbyj@TS-DB ~]$ rpm -q postgresql10
postgresql10-10.1-1PGDG.rhel6.x86_64

> How many different databases and roles do you have on that system?

postgres=# SELECT COUNT(1) FROM pg_roles ;
count | 25

postgres=# SELECT COUNT(1) FROM pg_database;
count | 15

> I guess iii is a sanitized name?

Yes

> Right. I took this to be referring to something before the current
> migration, but I might have overinterpreted things.

We never had any issue during the ~2 years running PG96 on this VM, until
upgrading Monday to PG10.1, and we've now hit it 5+ times.

BTW this is a VM run on a hypervisor managed by our customer:
DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
BIOS 6.00 06/22/2012

Linux TS-DB 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 
x86_64 x86_64 GNU/Linux

Justin



Re: backends stuck in "startup"

2017-11-21 Thread Tom Lane
Rakesh Kumar  writes:
> why is it that I did not receive the first 4 emails on this topic?

Perhaps you need to adjust your mail filters.

> I see that only the old email address "pgsql-gene...@postgresql.org" is
> mentioned.  Could that be the reason?
> ps: I am adding the new list's address.

Please do not do that; it just results in the mail getting delivered
to the list twice, like this did.

(I believe the list admins are working on putting in de-duplication
logic, because you won't be the last one to make this mistake.  But
it is a mistake, and right now we don't have a defense against it.)

regards, tom lane



Re: backends stuck in "startup"

2017-11-21 Thread Rakesh Kumar
Why is it that I did not receive the first 4 emails on this topic?  I see that
only the old email address "pgsql-gene...@postgresql.org" is mentioned.  Could
that be the reason?

ps: I am adding the new list's address.



Re: backends stuck in "startup"

2017-11-21 Thread Tom Lane
I wrote:
> ...  Maybe we need
> to take a closer look at where LWLocks devolve to blocking on the process
> semaphore and see if there's any implicit assumptions about barriers there.

Like, say, here:

    for (;;)
    {
        PGSemaphoreLock(proc->sem);
        if (!proc->lwWaiting)
            break;
        extraWaits++;
    }

which is exactly where Justin's backend is stuck.  Without a read barrier
after the PGSemaphoreLock, it seems possible that we might not see our
lwWaiting as cleared yet, causing the loop to go around an extra time and
wait for a sema wakeup that will never come.  Then, we are holding the
lock but are blocked anyway, and everyone else who wants that same lock
will queue up behind us.
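
If that's what's happening, the kind of change the theory would call for
looks about like this (a rough, untested sketch, not a committed fix;
pg_read_barrier() is the generic fence from port/atomics.h):

    for (;;)
    {
        PGSemaphoreLock(proc->sem);
        pg_read_barrier();      /* hypothetical: don't trust a stale lwWaiting */
        if (!proc->lwWaiting)
            break;
        extraWaits++;
    }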

regards, tom lane



Re: backends stuck in "startup"

2017-11-21 Thread Andres Freund
On 2017-11-21 19:02:01 -0500, Tom Lane wrote:
> and...@anarazel.de (Andres Freund) writes:
> > On 2017-11-21 18:50:05 -0500, Tom Lane wrote:
> >> (If Justin saw that while still on 9.6, then it'd be worth looking
> >> closer.)
> 
> > Right. I took this to be referring to something before the current
> > migration, but I might have overinterpreted things. There've been
> > various forks/ports of pg around that had hand-coded replacements with
> > futex usage, and there were definitely buggy versions going around a few
> > years back.
> 
> Poking around in the archives reminded me of this thread:
> https://www.postgresql.org/message-id/flat/14947.1475690...@sss.pgh.pa.us
> which describes symptoms uncomfortably close to what Justin is showing.
> 
> I remember speculating that the SysV-sema implementation, because it'd
> always enter the kernel, would provide some memory barrier behavior
> that POSIX-sema code based on futexes might miss when taking the no-wait
> path.

I think I was speculating that, but with the benefit of just having had
my fourth espresso: I've a hard time believing that - the fast path in a
futex pretty much has to be either a test-and-set or a
compare-and-exchange type operation. See e.g. the demo of futex usage in
the futex(2) manpage:

    while (1) {
        /* Is the futex available? */
        if (__sync_bool_compare_and_swap(futexp, 1, 0))
            break;      /* Yes */

        /* Futex is not available; wait */
        s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
        if (s == -1 && errno != EAGAIN)
            errExit("futex-FUTEX_WAIT");
    }

I can't see how you could make use of futexes without some kind of
barrier semantics, at least on x86.
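
(The wake side of that same demo is the mirror image; I'm paraphrasing from
memory here, so check the manpage for the exact code:)

    /* Post the futex: flip 0 -> 1, and wake one waiter if we did the flip. */
    if (__sync_bool_compare_and_swap(futexp, 0, 1)) {
        s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
        if (s == -1)
            errExit("futex-FUTEX_WAKE");
    }

Both sides go through that compare-and-swap, which is a lock-prefixed
instruction on x86, hence the implied full barrier.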

Greetings,

Andres Freund



Re: backends stuck in "startup"

2017-11-21 Thread Tom Lane
and...@anarazel.de (Andres Freund) writes:
> On 2017-11-21 18:50:05 -0500, Tom Lane wrote:
>> (If Justin saw that while still on 9.6, then it'd be worth looking
>> closer.)

> Right. I took this to be referring to something before the current
> migration, but I might have overinterpreted things. There've been
> various forks/ports of pg around that had hand-coded replacements with
> futex usage, and there were definitely buggy versions going around a few
> years back.

Poking around in the archives reminded me of this thread:
https://www.postgresql.org/message-id/flat/14947.1475690...@sss.pgh.pa.us
which describes symptoms uncomfortably close to what Justin is showing.

I remember speculating that the SysV-sema implementation, because it'd
always enter the kernel, would provide some memory barrier behavior
that POSIX-sema code based on futexes might miss when taking the no-wait
path.  I'd figured that any real problems of that sort would show up
pretty quickly, but that could've been overoptimistic.  Maybe we need
to take a closer look at where LWLocks devolve to blocking on the process
semaphore and see if there's any implicit assumptions about barriers there.

regards, tom lane



Re: backends stuck in "startup"

2017-11-21 Thread Andres Freund
On 2017-11-21 18:50:05 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-11-21 18:21:16 -0500, Tom Lane wrote:
> >> Justin Pryzby  writes:
> >>> As $subject: backends are stuck in startup for minutes at a time.  I 
> >>> didn't
> >>> strace this time, but I believe last time I saw one was waiting in a 
> >>> futex.
> 
> > A futex? Hm, that was stock postgres?
> 
> Sure, since we switched to unnamed-POSIX semaphores in v10.  On Linux
> those are futexes.

> (If Justin saw that while still on 9.6, then it'd be worth looking
> closer.)

Right. I took this to be referring to something before the current
migration, but I might have overinterpreted things. There've been
various forks/ports of pg around that had hand-coded replacements with
futex usage, and there were definitely buggy versions going around a few
years back.

Greetings,

Andres Freund



Re: backends stuck in "startup"

2017-11-21 Thread Tom Lane
Andres Freund  writes:
> On 2017-11-21 18:21:16 -0500, Tom Lane wrote:
>> Justin Pryzby  writes:
>>> As $subject: backends are stuck in startup for minutes at a time.  I didn't
>>> strace this time, but I believe last time I saw one was waiting in a futex.

> A futex? Hm, that was stock postgres?

Sure, since we switched to unnamed-POSIX semaphores in v10.  On Linux
those are futexes.

(If Justin saw that while still on 9.6, then it'd be worth looking
closer.)
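
To make the connection to Justin's strace output concrete, here is a minimal
standalone illustration (not PostgreSQL code; the file name is made up): a
process blocked in sem_wait() on a process-shared unnamed semaphore shows up
in strace as a futex(addr, FUTEX_WAIT, 0, NULL) call, at least with the glibc
on a CentOS 6 box.

    /* sem_futex_demo.c -- build with: cc -pthread sem_futex_demo.c */
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* Put the semaphore in shared memory, as PG does with its segment. */
        sem_t      *sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (sem == MAP_FAILED || sem_init(sem, 1, 0) != 0)
        {
            perror("setup");
            return 1;
        }

        if (fork() == 0)
        {
            sleep(30);          /* plenty of time to strace the parent */
            sem_post(sem);      /* the "wakeup" */
            _exit(0);
        }

        printf("blocking in sem_wait(); try: strace -p %d\n", (int) getpid());
        sem_wait(sem);          /* shows up as futex(..., FUTEX_WAIT, 0, NULL) */
        puts("woken up");
        wait(NULL);
        return 0;
    }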

regards, tom lane



Re: backends stuck in "startup"

2017-11-21 Thread Andres Freund
On 2017-11-21 18:21:16 -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > As $subject: backends are stuck in startup for minutes at a time.  I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
> 
> Hm...

A futex? Hm, that was stock postgres?


> > I saved ~40 cores from backends from the most recent incident, which are all
> > essentially identical:
> 
> This one seems to be waiting for a shared LWLock (specifically, one of
> the locks for the shared-buffer lookup hashtable), which could only block
> if someone else is holding that lock exclusively, which implies that
> there's at least one backtrace that's different from this.

Hm. Wonder if that implies that there's a lot of cache pressure, or
whether it's the compounded amount of


How many different databases and roles do you have on that system? I
guess iii is a sanitized name?

Greetings,

Andres Freund



Re: backends stuck in "startup"

2017-11-21 Thread Andres Freund
Hi,

On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> I'm sorry to report this previously reported problem is happening again,
> starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> 
> As $subject: backends are stuck in startup for minutes at a time.  I didn't
> strace this time, but I believe last time I saw one was waiting in a futex.

Could you try stracing next time? I don't think it's conclusive that
they're all continually waiting rather than contending on a few locks.
As soon as you attach to one backend, a contended lock situation can
quickly turn into a standstill if you end up interrupting a process that
holds a lock.

>  - we DO use temp tables; it looks like we created up to 90 temp tables each 
> hour;

Doesn't quite seem enough to cause problems.


>  - I'm planning to disable sync seqscan for the night (although as I
> understand it, that may not matter (?))

I doubt it matters here.

Greetings,

Andres Freund



Re: backends stuck in "startup"

2017-11-21 Thread Tom Lane
Justin Pryzby  writes:
> As $subject: backends are stuck in startup for minutes at a time.  I didn't
> strace this time, but I believe last time I saw one was waiting in a futex.

Hm...

> I saved ~40 cores from backends from the most recent incident, which are all
> essentially identical:

This one seems to be waiting for a shared LWLock (specifically, one of
the locks for the shared-buffer lookup hashtable), which could only block
if someone else is holding that lock exclusively, which implies that
there's at least one backtrace that's different from this.  Unless you
can find what was holding that lock, there's not much we can learn from
this report ...
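
(For anyone lining this up with the code: the frame at bufmgr.c:1012 is
roughly this part of BufferAlloc() in v10; quoting from memory, so check the
tree:)

    /* determine its hash code and partition lock ID */
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    /* see if the block is in the buffer pool already */
    LWLockAcquire(newPartitionLock, LW_SHARED);     /* <- bufmgr.c:1012 above */
    buf_id = BufTableLookup(&newTag, newHash);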

regards, tom lane



backends stuck in "startup"

2017-11-21 Thread Justin Pryzby
I'm sorry to report this previously reported problem is happening again,
starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.

As $subject: backends are stuck in startup for minutes at a time.  I didn't
strace this time, but I believe last time I saw one was waiting in a futex.

postgres 25271 17665  0 16:54 ?00:00:00 postgres:   
a.b.c.246(59980) startup
postgres 27123 17665  0 16:55 ?00:00:00 postgres:   
a.b.c.247(59919) startup
postgres 27142 17665  0 16:55 ?00:00:00 postgres: main main 
a.b.c.100(42892) startup
postgres 27224 17665  0 16:56 ?00:00:00 postgres: main main 
a.b.c.100(42898) startup
postgres 27270 17665  0 16:56 ?00:00:00 postgres: ww ww 
a.b.c.252(39760) startup
postgres 27271 17665  0 16:56 ?00:00:00 postgres: iii iii 
a.b.c.248(37224) startup
postgres 27873 17665  0 16:58 ?00:00:00 postgres:   
a.b.c.253(59667) startup
postgres 30038 17665  0 17:00 ?00:00:00 postgres: ww ww 
a.b.c.252(39780) startup
postgres 30039 17665  0 17:00 ?00:00:00 postgres:   
a.b.c.246(60014) startup
postgres 30040 17665  0 17:00 ?00:00:00 postgres:   
a.b.c.247(59951) startup
postgres 30041 17665  0 17:00 ?00:00:00 postgres:   
a.b.c.247(59952) startup
postgres 30042 17665  0 17:00 ?00:00:00 postgres: ww ww 
a.b.c.252(39788) startup
postgres 30043 17665  0 17:00 ?00:00:00 postgres: a a 
a.b.c.245(44819) startup  
postgres 30044 17665  0 17:00 ?00:00:00 postgres: nnn nnn 
a.b.c.250(41317) startup  
postgres 30045 17665  0 17:00 ?00:00:00 postgres: mm mm 
a.b.c.249(47876) startup
postgres 30046 17665  0 17:00 ?00:00:00 postgres: nnn nnn 
a.b.c.250(41320) startup  
postgres 30048 17665  0 17:00 ?00:00:00 postgres: a a 
a.b.c.245(44834) startup  
postgres 30051 17665  0 17:00 ?00:00:00 postgres: iii iii 
a.b.c.248(37244) startup
postgres 30052 17665  0 17:00 ?00:00:00 postgres: iii iii 
a.b.c.248(37247) startup
postgres 30054 17665  0 17:00 ?00:00:00 postgres: nnn nnn 
a.b.c.250(41327) startup  
postgres 30079 17665  0 17:00 ?00:00:00 postgres: a a 
a.b.c.245(44843) startup  
postgres 30081 17665  0 17:00 ?00:00:00 postgres: ww ww 
a.b.c.252(39790) startup
postgres 30083 17665  0 17:00 ?00:00:00 postgres:   
a.b.c.251(50397) startup
postgres 30084 17665  0 17:00 ?00:00:00 postgres: main main 
a.b.c.100(43074) startup
postgres 30604 17665  0 17:02 ?00:00:00 postgres:   
a.b.c.246(60018) startup
postgres 30614 17665  0 17:02 ?00:00:00 postgres: nnn nnn 
a.b.c.250(41331) startup  
postgres 30624 17665  0 17:02 ?00:00:00 postgres: a a 
a.b.c.245(44847) startup  
postgres 30841 17665  0 17:03 ?00:00:00 postgres:   
a.b.c.247(59961) startup
postgres 31023 17665  0 17:03 ?00:00:00 postgres: postgres main [local] 
startup   
postgres 31024 17665  0 17:03 ?00:00:00 postgres: ww ww 
a.b.c.252(39796) startup

I found these older, similar reports:
https://www.postgresql.org/message-id/flat/4F9AC452.3050102%40pgexperts.com#4f9ac452.3050...@pgexperts.com
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

I saved ~40 cores from backends from the most recent incident, which are all
essentially identical:

(gdb) bt
#0  0x003f6ee0d930 in ?? ()
#1  0x7f19f3401258 in ?? ()
#2  0x006a4440 in PGSemaphoreLock (sema=0x7f19f3401258) at pg_sema.c:310
#3  0x00714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) 
at lwlock.c:1233
#4  0x006f75e5 in BufferAlloc (smgr=0x1f3bc48, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5  ReadBuffer_common (smgr=0x1f3bc48, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff6ef1c5af "")
at bufmgr.c:745
#6  0x006f800e in ReadBufferExtended (reln=0x1ee6a08, 
forkNum=MAIN_FORKNUM, blockNum=1, mode=<value optimized out>, 
strategy=<value optimized out>) at bufmgr.c:664
#7  0x004c090a in _bt_relandgetbuf (rel=0x1ee6a08, obuf=145, 
blkno=<value optimized out>, access=1) at nbtpage.c:708
#8  0x004c1284 in _bt_getroot (rel=0x1ee6a08, access=<value optimized out>) at nbtpage.c:298
#9  0x004c46eb in _bt_search (rel=0x1ee6a08, keysz=1, 
scankey=0x7fff6ef1d040, nextkey=0 '\000', bufP=0x7fff6ef1da4c, access=1, 
snapshot=0x1f1a648)
at nbtsearch.c:103
#10 0x004c5d1d in _bt_first (scan=0x1f1a4a8, dir=ForwardScanDirection) 
at nbtsearch.c:1028
#11 0x004c39dd in btgettuple (scan=0x1f1a4a8, dir=ForwardScanDirection) 
at nbtree.c:372
#12 0x004bcf1a in index_getnext_tid (scan=0x1f1a4a8, direction=<value optimized out>) at indexam.c:541
#13 0x00