On Thu, Feb 12, 2015 at 9:56 AM, Thom Brown <t...@linux.com> wrote:
> On 12 February 2015 at 13:56, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <t...@linux.com> wrote:
>> > Today I witnessed a situation which appears to have gone down like this:
>> >
>> > - The primary server starts streaming WAL data from segment 00A8 to the
>> >   standby.
>> > - The standby server starts receiving that data.
>> > - Before 00A8 is finished, the wal sender process dies on the primary, but
>> >   the archiver process continues, and 00A8 ends up being archived as usual.
>> > - The primary continues to generate WAL and cleans up old WAL files from
>> >   pg_xlog until 00A8 is gone.
>> > - The primary is restarted and the wal sender process is back up and
>> >   running.
>> > - The standby says "waiting for 00A8", which it can no longer get from the
>> >   primary.
>> > - 00A8 is in the standby's archive directory, but the standby is waiting
>> >   for the rest of the segment from the primary via streaming replication,
>> >   so it doesn't check the archive.
>> > - The standby is restarted.
>> > - The standby goes back into recovery, eventually replays 00A8, and
>> >   continues as normal.
>> >
>> > Should the standby be able to get feedback from the primary that the
>> > requested segment is no longer available, and therefore know to check its
>> > archive?
>>
>> Last time I played around with this, if the standby requested a
>> segment from the master that was no longer present there, the standby
>> would immediately get an ERROR, which it seems like would get you out
>> of trouble. I wonder why that didn't happen in your case.
>
> Yeah, I've tried recreating this like so:
>
> - Primary streams to standby as usual.
> - Kill -9 the primary, then change its port and bring it back up.
> - Create traffic on the primary until it no longer has the WAL file the
>   standby wants, but has archived it.
> - Change the port of the primary back to what the standby is trying to
>   talk to.
>
> But before it gets to that 4th point, the standby has gone to the archive
> for the rest of it:
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL: could not
> connect to the primary server: could not connect to server: Connection
> refused
>         Is the server running on host "127.0.0.1" and accepting
>         TCP/IP connections on port 5488?
>
> cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or
> directory
> 2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL: could not
> connect to the primary server: could not connect to server: Connection
> refused
>         Is the server running on host "127.0.0.1" and accepting
>         TCP/IP connections on port 5488?
>
> 2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG: restored log
> file "0000000100000006000000C8" from archive
>
> I don't suppose this is something that was buggy in 9.3.1?
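(For context on the fallback behaviour being described here: the standby
presumably has a recovery.conf with both retrieval sources configured,
roughly along the lines of the sketch below. The path and port are guesses
taken from the log lines above, not from the actual config:

    # recovery.conf on the standby (illustrative values only)
    standby_mode = 'on'
    primary_conninfo = 'host=127.0.0.1 port=5488'
    restore_command = 'cp /tmp/walarch/%f %p'

With both set, the startup process alternates between streaming and running
restore_command whenever the connection to the primary fails, which is what
the "restored log file ... from archive" line reflects.)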
I don't know. I don't remember a bug of that type, but I might have missed
it. If you see the scenario again, it might help to pull a stack trace from
the master and the standby. That might tell us whether the master or the
standby is to blame, and where the offender is wedged.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
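(In case it's useful when that happens: one quick way to grab such a
backtrace, assuming gdb is installed, is something like the following, with
<pid> replaced by the process you suspect is stuck, such as the wal sender
on the master or the startup/walreceiver process on the standby, found via
ps:

    # attach to the running process, print a backtrace, then detach
    gdb -p <pid> -batch -ex 'bt'

Running that against both ends while they appear wedged should show where
each one is waiting.)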