Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-16 Thread David Pacheco
On Mon, Nov 6, 2017 at 12:35 PM, Tom Lane  wrote:

> David Pacheco  writes:
> > I ran into what appears to be a deadlock in the logging subsystem.  It
> > looks like what happened was that the syslogger process exited because it
> > ran out of memory.  But before the postmaster got a chance to handle the
> > SIGCLD to restart it, it handled a SIGUSR1 to start an autovacuum worker.
> > That also failed, and the postmaster went to log a message about it, but
> > it's blocked on the pipe that's normally connected to the syslogger,
> > presumably because the pipe is full because the syslogger is gone and
> > hasn't read from it.
>
> Ugh.
>


Should I file a bug on this issue?

Thanks,
Dave


Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-06 Thread David Pacheco
On Mon, Nov 6, 2017 at 12:35 PM, Tom Lane  wrote:

> David Pacheco  writes:
> > ... that process appears to have exited due to a fatal error
> > (out of memory).  (I know it exited because the process still exists in
> the
> > kernel -- it hasn't been reaped yet -- and I think it ran out of memory
> > based on a log message I found from around the time when the process
> > exited.)
>
> Could we see the exact log message(s) involved?  It's pretty hard to
> believe that the logger would have consumed much memory.



Thanks for the quick reply!

Based on kernel state about the dead but unreaped syslogger process, I
believe the process exited at 2017-10-27T23:46:21.258Z.  Here are all of
the entries in the PostgreSQL log from 23:19:12 until the top of the next
hour:
https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt

There's no log entry at exactly 23:46:21 or even immediately before that,
but there are a lot of "out of memory" errors and a FATAL one at 23:47:28.
Unfortunately, we haven't configured logging to include the pid, so I can't
be sure which messages came from the syslogger.
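(For reference, I believe a log_line_prefix that includes '%p', e.g.
log_line_prefix = '%p: ', would have added the pid to each line.)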

There are also many log entries for some very long SQL queries.  I'm sure
that contributed to this problem by filling up the pipe.  I was able to
extract the contents of the pipe while the system was hung, and it was more
of these giant query strings.

I think it's likely that this database instance was running in a container
with way too small a memory cap for the number of processes configured.
(This was a zone, a lightweight container, allocated with 2GB of memory and
configured with 512MB of shared_buffers and up to 200 connections.)  I
expect that the system got to a persistent state of having basically no
memory available, at which point nearly any attempt to allocate memory
could fail.  The syslogger itself may not have been using much memory.

So I'm not so much worried about the memory usage itself, but it would be
nice if this condition were handled better.  Handling out-of-memory is
obviously hard, especially when it means being unable to fork, but even
crashing would have been better for our use case.  And of course, there are
other reasons that the syslogger could exit prematurely besides being low
on memory, and those might be more recoverable.

Thanks,
Dave


[GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-06 Thread David Pacheco
Hello,

I ran into what appears to be a deadlock in the logging subsystem.  It
looks like what happened was that the syslogger process exited because it
ran out of memory.  But before the postmaster got a chance to handle the
SIGCLD to restart it, it handled a SIGUSR1 to start an autovacuum worker.
That also failed, and the postmaster went to log a message about it, but
it's blocked on the pipe that's normally connected to the syslogger,
presumably because the pipe is full because the syslogger is gone and
hasn't read from it.  The net result is that new connections to PostgreSQL
hang, and any query that causes PostgreSQL to log appears to hang, though
queries appear to work on existing connections as long as they do not cause
the backend to log anything.  I don't believe the system would ever get out
of this state.
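
To make the failure mode concrete, here's a small standalone sketch (not
PostgreSQL source, just an illustration) of why that last write hangs: a
write() to a full pipe blocks indefinitely as long as some process still
holds the read end open, and the postmaster keeps the read end of the
syslogger pipe open so that it can hand it to a restarted syslogger.

/*
 * Illustration only: a blocking write on a full pipe whose read end
 * stays open but is never read from.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fds[2];
    char buf[4096];

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* Fill the pipe without blocking. */
    (void) fcntl(fds[1], F_SETFL, O_NONBLOCK);
    while (write(fds[1], buf, sizeof(buf)) > 0)
        ;
    printf("pipe is full (errno=%d: %s)\n", errno, strerror(errno));

    /*
     * Back in blocking mode, this write never returns: the read end
     * (fds[0]) is open in this process but nothing reads from it.
     * This is analogous to the postmaster writing to the syslogger
     * pipe after the syslogger has died.
     */
    (void) fcntl(fds[1], F_SETFL, 0);
    (void) write(fds[1], buf, sizeof(buf));

    printf("not reached\n");
    return 0;
}

(If the read end were closed instead, the write would fail with EPIPE or
raise SIGPIPE rather than block, so the hang depends on that read end
staying open.)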

I ran into this with version 9.2.4 on SmartOS (illumos), but it doesn't
look to me like any of the relevant code has changed in 9.6 or 10, nor is
it platform-specific.  Version 10.0 still:

- restarts the syslogger in response to a SIGCHLD signal (in "reaper")
- starts autovacuum workers in response to SIGUSR1 (in "sigusr1_handler")
- masks SIGCHLD at the beginning of the SIGUSR1 handler
- potentially logs from StartAutoVacWorker (and, of course, many other
places in the postmaster)

which I think means the problem is still possible.  Am I missing
something?  I couldn't find anything about this issue.  Is this just a
rarely seen bug?
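
To illustrate the interaction, here's a small sketch with made-up names
(on_sigchld and on_sigusr1 standing in loosely for reaper and
sigusr1_handler; this is not the actual postmaster code, just the shape of
the hazard as I understand it):

#include <signal.h>
#include <string.h>
#include <unistd.h>

static int log_pipe_write_fd;  /* stand-in for the syslogger pipe's write end */

static void
on_sigchld(int sig)
{
    (void) sig;
    /*
     * Would reap the dead syslogger and start a new one, the way
     * reaper() does; this is the only path that can un-stick the pipe.
     */
}

static void
on_sigusr1(int sig)
{
    sigset_t block;
    static const char msg[] = "could not fork autovacuum worker\n";

    (void) sig;

    /*
     * Mask SIGCHLD for the duration of the handler, as described above
     * for the postmaster's SIGUSR1 handler.
     */
    (void) sigemptyset(&block);
    (void) sigaddset(&block, SIGCHLD);
    (void) sigprocmask(SIG_BLOCK, &block, NULL);

    /*
     * Any logging goes down the syslogger pipe.  If the syslogger is
     * dead and the pipe is full, this write blocks forever: the pending
     * SIGCHLD whose handler would restart the syslogger cannot be
     * delivered while it is masked.
     */
    (void) write(log_pipe_write_fd, msg, sizeof(msg) - 1);

    (void) sigprocmask(SIG_UNBLOCK, &block, NULL);
}

int
main(void)
{
    struct sigaction sa;
    int fds[2];

    (void) pipe(fds);
    log_pipe_write_fd = fds[1];

    memset(&sa, 0, sizeof(sa));
    (void) sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_sigchld;
    (void) sigaction(SIGCHLD, &sa, NULL);
    sa.sa_handler = on_sigusr1;
    (void) sigaction(SIGUSR1, &sa, NULL);

    for (;;)
        pause();
}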



Here's the raw data: my initial observation was that when I ran "psql" to
connect to the database, it would hang (for hours, it turned out).  My
postmaster process was blocked here, writing to its own stderr pipe:

$ mdb core.41349
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C
fd7fffdf9df0 libc.so.1`__write+0xa()
fd7fffdfb220 write_pipe_chunks+0x142()
fd7fffdfb280 send_message_to_server_log+0x55b()
fd7fffdfb2a0 EmitErrorReport+0xe6()
fd7fffdfb390 errfinish+0x1ac()
fd7fffdfb3b0 StartAutoVacWorker+0x5c()
fd7fffdfb3e0 StartAutovacuumWorker+0x6f()
fd7fffdfb410 sigusr1_handler+0x185()
fd7fffdfb420 libc.so.1`__sighndlr+6()
fd7fffdfb4b0 libc.so.1`call_user_handler+0x1db(10, 0, fd7fffdfb520)
fd7fffdfb500 libc.so.1`sigacthandler+0x116(10, 0, fd7fffdfb520)
fd7fffdfb970 libc.so.1`__pollsys+0xa()
fd7fffdfba90 libc.so.1`pselect+0x1cb(5, fd7fffdfbaf0, 0, 0,
fd7fffdfbaa0, 0)
fd7fffdfbae0 libc.so.1`select+0x5a(5, fd7fffdfbaf0, 0, 0,
fd7fffdfdaf0)
fd7fffdffb40 ServerLoop+0xb5()
fd7fffdffbe0 PostmasterMain+0xec2()
fd7fffdffc00 main+0x23a()
fd7fffdffc10 _start+0x6c()

From inspecting kernel state, I found that the only other file descriptor
on the system that corresponds to the other side of this pipe is in the
postmaster itself, which it keeps open in case it needs to restart the
syslogger.  Based on the code, I expected the syslogger subprocess to have
the pipe open, but that process appears to have exited due to a fatal error
(out of memory).  (I know it exited because the process still exists in the
kernel -- it hasn't been reaped yet -- and I think it ran out of memory
based on a log message I found from around the time when the process
exited.)  Then I expected the postmaster would have restarted it, but I
found that it currently has SIGCHLD both pending and masked:

$ pflags 41349
41349:  /opt/postgresql/9.2.4/bin/postgres -D /manatee/pg/data
data model = _LP64  flags = ORPHAN|MSACCT|MSFORK
sigpend = 0x00028000,0x,0x
 /1:flags = ASLEEP  write(0x2,0xfd7fffdf9e10,0x65)
sigmask = 0xfebff047,0xfff7,0x03ff

The signal mask may be system-specific, but I've decoded it and confirmed
that SIGCHLD is in "sigpend" as well as "sigmask".  And that makes sense
because the stack above indicates we're in "sigusr1_handler", which masks
SIGCHLD when it starts.
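
For the curious, the decoding is just checking bit (N - 1) for signal
number N in the first 32-bit word that pflags prints; assuming illumos
numbering, where SIGCHLD is 18, something like this confirms it (both
lines print "yes" with the values above):

#include <stdio.h>

int
main(void)
{
    const unsigned int sigpend = 0x00028000;  /* from the pflags output */
    const unsigned int sigmask = 0xfebff047;  /* from the pflags output */
    const int sigchld = 18;                   /* SIGCHLD on illumos */

    printf("SIGCHLD pending: %s\n",
        (sigpend & (1u << (sigchld - 1))) ? "yes" : "no");
    printf("SIGCHLD masked:  %s\n",
        (sigmask & (1u << (sigchld - 1))) ? "yes" : "no");
    return 0;
}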

If it wasn't clear from the above, the logging configuration looks like
this:

log_destination = 'stderr'
logging_collector = on
log_directory = '/var/pg/'
log_filename = 'postgresql.log'

Thanks,
Dave