Hello,
I ran into what appears to be a deadlock in the logging subsystem. It
looks like the syslogger process exited because it ran out of memory, but
before the postmaster got a chance to handle the SIGCHLD and restart it,
it handled a SIGUSR1 to start an autovacuum worker.
That also failed, and the postmaster went to log a message about it, but
it's blocked on the pipe that's normally connected to the syslogger,
presumably because the pipe is full because the syslogger is gone and
hasn't read from it. The net result is that new connections to PostgreSQL
hang, and any query that causes PostgreSQL to log appears to hang, though
queries appear to work on existing connections as long as they do not cause
the backend to log anything. I don't believe the system would ever get out
of this state.
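The key mechanics are that a pipe has a fixed capacity and a blocking
write(2) to a full pipe sleeps until someone reads. Here's a minimal
sketch (not PostgreSQL code, just an illustration) that fills a pipe with
no reader, the way the postmaster's stderr pipe fills once the syslogger
is gone; it uses a non-blocking write so the demo fails fast instead of
hanging the way the postmaster does:

```python
import os

# Create a pipe and fill it without ever reading from it, mimicking the
# stderr pipe once the syslogger (the only reader) has exited.
r, w = os.pipe()
os.set_blocking(w, False)  # non-blocking so the demo doesn't hang

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # The pipe is now full. A *blocking* write(2) here would sleep
    # indefinitely -- exactly where the postmaster is stuck in
    # write_pipe_chunks().
    pass

print("pipe capacity reached after", written, "bytes")
```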
I ran into this with version 9.2.4 on SmartOS (illumos), but it doesn't
look to me like any of the relevant code has changed in 9.6 or 10, nor is
it platform-specific. Version 10.0 still:
- restarts the syslogger in response to a SIGCHLD signal (in "reaper")
- starts autovacuum workers in response to SIGUSR1 (in "sigusr1_handler")
- masks SIGCHLD at the beginning of the SIGUSR1 handler
- potentially logs from StartAutoVacWorker (and, of course, many other
places in the postmaster)
which I think means the problem is still possible. Am I missing
something? I couldn't find anything about this issue. Is this just a
rarely seen bug?
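To make the window concrete, here's a simplified sketch of the handler
structure described above (hypothetical Python, not the actual
postmaster.c code): while the SIGUSR1 handler runs with SIGCHLD blocked,
any work it does that can block -- such as logging to a full pipe --
prevents the pending SIGCHLD from ever being delivered to the reaper:

```python
import os
import signal

observed = {}

def sigusr1_handler(signum, frame):
    # Mirror the pattern in sigusr1_handler(): block SIGCHLD first.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        # Record the mask in effect here. In the real postmaster this is
        # where StartAutovacuumWorker() runs -- and where a log write to
        # a full pipe would block with SIGCHLD still masked, so the
        # reaper that restarts the syslogger can never run.
        old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        observed["sigchld_masked"] = signal.SIGCHLD in old_mask
    finally:
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGCHLD})

signal.signal(signal.SIGUSR1, sigusr1_handler)
os.kill(os.getpid(), signal.SIGUSR1)
print(observed)
```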
Here's the raw data: my initial observation was that when I ran "psql" to
connect to the database, it would hang (for hours, it turned out). My
postmaster process was blocked here, writing to its own stderr pipe:
$ mdb core.41349
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C
fd7fffdf9df0 libc.so.1`__write+0xa()
fd7fffdfb220 write_pipe_chunks+0x142()
fd7fffdfb280 send_message_to_server_log+0x55b()
fd7fffdfb2a0 EmitErrorReport+0xe6()
fd7fffdfb390 errfinish+0x1ac()
fd7fffdfb3b0 StartAutoVacWorker+0x5c()
fd7fffdfb3e0 StartAutovacuumWorker+0x6f()
fd7fffdfb410 sigusr1_handler+0x185()
fd7fffdfb420 libc.so.1`__sighndlr+6()
fd7fffdfb4b0 libc.so.1`call_user_handler+0x1db(10, 0, fd7fffdfb520)
fd7fffdfb500 libc.so.1`sigacthandler+0x116(10, 0, fd7fffdfb520)
fd7fffdfb970 libc.so.1`__pollsys+0xa()
fd7fffdfba90 libc.so.1`pselect+0x1cb(5, fd7fffdfbaf0, 0, 0,
fd7fffdfbaa0, 0)
fd7fffdfbae0 libc.so.1`select+0x5a(5, fd7fffdfbaf0, 0, 0,
fd7fffdfdaf0)
fd7fffdffb40 ServerLoop+0xb5()
fd7fffdffbe0 PostmasterMain+0xec2()
fd7fffdffc00 main+0x23a()
fd7fffdffc10 _start+0x6c()
From inspecting kernel state, I found that the only other file descriptor
on the system that corresponds to the other side of this pipe is in the
postmaster itself, which it keeps open in case it needs to restart the
syslogger. Based on the code, I expected the syslogger subprocess to have
the pipe open, but that process appears to have exited due to a fatal error
(out of memory). (I know it exited because the process still exists in the
kernel -- it hasn't been reaped yet -- and I think it ran out of memory
based on a log message I found from around the time when the process
exited.) Then I expected the postmaster would have restarted it, but I
found that it currently has SIGCHLD both pending and masked:
$ pflags 41349
41349: /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
data model = _LP64 flags = ORPHAN|MSACCT|MSFORK
sigpend = 0x00028000,0x,0x
/1:flags = ASLEEP write(0x2,0xfd7fffdf9e10,0x65)
sigmask = 0xfebff047,0xfff7,0x03ff
The signal mask may be system-specific, but I've decoded it and confirmed
that SIGCHLD is in "sigpend" as well as "sigmask". And that makes sense
because the stack above indicates we're in "sigusr1_handler", which masks
SIGCHLD when it starts.
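For anyone who wants to check the decode: in the first 32-bit word that
pflags prints, signal N corresponds to bit N-1, and on illumos SIGCHLD is
signal 18. A quick sketch (the constants below are the values from the
pflags output above):

```python
# Decode the first word of pflags' sigpend/sigmask output, assuming
# illumos signal numbering (SIGCHLD = 18): signal N maps to bit N-1.
SIGPEND = 0x00028000
SIGMASK = 0xfebff047
SIGCHLD = 18

def has_signal(word, signum):
    """Return True if the given signal's bit is set in the word."""
    return bool(word & (1 << (signum - 1)))

print("SIGCHLD pending:", has_signal(SIGPEND, SIGCHLD))
print("SIGCHLD masked: ", has_signal(SIGMASK, SIGCHLD))
```

Both come out true, confirming that SIGCHLD is pending but blocked.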
If it wasn't clear from the above, the logging configuration looks like
this:
log_destination = 'stderr'
logging_collector = on
log_directory = '/var/pg/'
log_filename = 'postgresql.log'
Thanks,
Dave