Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Heikki Linnakangas
On 17/06/10 02:40, Greg Stark wrote: On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Greg Stark gsst...@mit.edu wrote: TCP keepalives are for detecting broken network connections Yeah. That seems like what we have here. If you shoot the OS in the head,

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Rafael Martinez
Heikki Linnakangas wrote: We're not talking about a timeout for promoting standby to master. The problem is that the standby doesn't notice that from the master's point of view, the connection has been broken. Whether it's because of a network

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Fujii Masao
On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez r.m.guerr...@usit.uio.no wrote: I tested this yesterday and I could not get any reaction from the wal receiver even after using minimal values compared to the default values. The default values in linux for tcp_keepalive_time,

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Magnus Hagander
On Thu, Jun 17, 2010 at 09:20, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez r.m.guerr...@usit.uio.no wrote: I tested this yesterday and I could not get any reaction from the wal receiver even after using minimal values compared to the default values

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes: On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote: The real problem here is that we're sending records to the slave which might cease to exist on the master if it unexpectedly reboots.  I believe that what we need to do is make

[HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs si...@2ndquadrant.com wrote: But that change would cause the problem that Robert pointed out. http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php Presumably this means that if synchronous_commit = off on primary that SR in 9.0 will no

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Joshua D. Drake
On Wed, 2010-06-16 at 15:47 -0400, Robert Haas wrote: So, obviously at this point my slave database is corrupted beyond repair due to nothing more than an unexpected crash on the master. That's bad. What is worse is that the system only detected the corruption because the slave had crossed

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: So, obviously at this point my slave database is corrupted beyond repair due to nothing more than an unexpected crash on the master. Certainly that's true for resuming replication. From your description it sounds as though the slave would be usable

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Stefan Kaltenbrunner
On 06/16/2010 09:47 PM, Robert Haas wrote: On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs si...@2ndquadrant.com wrote: But that change would cause the problem that Robert pointed out. http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php Presumably this means that if

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: I don't know what to do about this This probably is out of the question for 9.0 based on scale of change, and maybe forever based on the impact of WAL volume, but -- if we logged before images along with the after, we could undo the work of the

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: well this is likely caused by the OS not noticing that the connections went away (linux has really long timeouts here) - maybe we should unconditionally enable keepalive on systems that support that for replication connections (if that is
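
[Editorial sketch, not part of the original message.] The suggestion above, enabling keepalive on replication connections, can be illustrated at the socket level. This is a minimal Python sketch and not PostgreSQL's code: the helper names `enable_keepalive` and `worst_case_detection_s` are hypothetical, the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` options are assumed to be available only on platforms that expose them (Linux does), and the timer values are arbitrary examples.

```python
import socket


def enable_keepalive(sock, idle_s, interval_s, probes):
    """Turn on TCP keepalive and, where the platform allows, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options use the Linux names; skip any the platform lacks.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def worst_case_detection_s(idle_s, interval_s, probes):
    """Approximate seconds of silence before an idle dead peer is reported."""
    return idle_s + interval_s * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, idle_s=30, interval_s=5, probes=3)
    print(worst_case_detection_s(30, 5, 3))
    s.close()
```

With the commonly documented Linux system defaults (7200 s idle, 75 s interval, 9 probes; an assumption not taken from this thread), the same arithmetic gives 7875 s, a little over two hours, which is consistent with the "really long timeouts" remark.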

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus
The first problem I noticed is that the slave never seems to realize that the master has gone away. Every time I crashed the master, I had to kill the wal receiver process on the slave to get it to reconnect; otherwise it just sat there waiting, either forever or at least for longer than I

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:00 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: So, obviously at this point my slave database is corrupted beyond repair due to nothing more than an unexpected crash on the master. Certainly that's true for resuming

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus j...@agliodbs.com wrote: The first problem I noticed is that the slave never seems to realize that the master has gone away.  Every time I crashed the master, I had to kill the wal receiver process on the slave to get it to reconnect; otherwise it

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: So, obviously at this point my slave database is corrupted beyond repair due to nothing more than an unexpected crash on the master. Certainly that's true for

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Magnus Hagander
On Wed, Jun 16, 2010 at 22:26, Robert Haas robertmh...@gmail.com wrote: and this just makes it more likely.  After the most recent crash, the master thought pg_current_xlog_location() was 1/86CD4000; the slave thought pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to the
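
[Editorial sketch, not part of the original message.] The two locations quoted above can be compared directly to see how far the slave had run ahead of the master. The helper names below are hypothetical, and the sketch assumes only that a location is printed as two hexadecimal fields separated by a slash. It deliberately handles just the case where the first field matches, as it does here, so that no assumption about the size of each log file is needed.

```python
def parse_xlog_location(text):
    """Split 'hi/lo' into two integers, reading each field as hexadecimal."""
    hi, lo = text.split("/")
    return int(hi, 16), int(lo, 16)


def gap_bytes(master_loc, slave_loc):
    """Bytes by which the slave is ahead of the master (negative if behind)."""
    m_hi, m_lo = parse_xlog_location(master_loc)
    s_hi, s_lo = parse_xlog_location(slave_loc)
    if m_hi != s_hi:
        # Crossing log files would need the per-file size; out of scope here.
        raise ValueError("locations are in different log files")
    return s_lo - m_lo


if __name__ == "__main__":
    gap = gap_bytes("1/86CD4000", "1/8733C000")
    print(gap, gap / (1024 * 1024))  # 6717440 bytes, about 6.4 MiB
```

In other words, the slave had received roughly 6.4 MiB of WAL that the master no longer had after its crash.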

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Rafael Martinez
Robert Haas wrote: The first problem I noticed is that the slave never seems to realize that the master has gone away. Every time I crashed the master, I had to kill the wal receiver process on the slave to get it to reconnect; otherwise it

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes: The first problem I noticed is that the slave never seems to realize that the master has gone away. Every time I crashed the master, I had to kill the wal receiver process on the slave to get it to reconnect; otherwise it just sat there waiting,

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus
On 6/16/10 1:26 PM, Robert Haas wrote: Similarly with synchronous_commit=off, I believe that the next checkpoint will still fsync WAL, but the lag might be long. That's not a showstopper. Just tell people that having synch_commit=off on the master might increase the lag to the slave, and

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Pierre C
The real problem here is that we're sending records to the slave which might cease to exist on the master if it unexpectedly reboots. I believe that what we need to do is make sure that the master only sends WAL it has already fsync'd How about this : - pg records somewhere the xlog
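
[Editorial sketch, not part of the original message.] The quoted idea, that the master should send only WAL it has already fsync'd, amounts to capping the send position at the flushed position. The following is a toy model under stated assumptions, not PostgreSQL's walsender logic: positions are plain byte offsets, and the function and parameter names are hypothetical.

```python
def next_send_range(sent_upto, written_upto, flushed_upto):
    """Return (start, end) of WAL bytes that are safe to ship, or None.

    Only bytes that are both written and flushed to disk are eligible, so a
    crash on the sender cannot lose anything the receiver already holds.
    """
    safe_upto = min(written_upto, flushed_upto)
    if safe_upto <= sent_upto:
        return None  # nothing new has reached disk yet
    return (sent_upto, safe_upto)


if __name__ == "__main__":
    # 500 bytes written, only 300 flushed: ship 100..300 and hold the rest.
    print(next_send_range(sent_upto=100, written_upto=500, flushed_upto=300))
```

The cost of this rule is the one raised elsewhere in the thread: when flushes are infrequent, the receiver lags further behind the sender.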

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: The first problem I noticed is that the slave never seems to realize that the master has gone away.  Every time I crashed the master, I had to kill the wal receiver process on the

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Greg Stark gsst...@mit.edu wrote: TCP keepalives are for detecting broken network connections Yeah. That seems like what we have here. If you shoot the OS in the head, the network connection is broken rather abruptly, without the normal packets exchanged to close the TCP connection. It

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote: It sounds like it behaves just fine except for not detecting a broken connection. Of course I meant in terms of the slave's attempts at retrieving more WAL, not in terms of it applying a second time line. TCP keepalive timeouts don't help

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:22 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: It sounds like it behaves just fine except for not detecting a broken connection. Of course I meant in terms of the slave's attempts at retrieving more WAL,

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Greg Stark gsst...@mit.edu wrote: TCP keepalives are for detecting broken network connections Yeah.  That seems like what we have here.  If you shoot the OS in the head, the network connection is broken

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Fujii Masao
On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus j...@agliodbs.com wrote: The first problem I noticed is that the slave never seems to realize that the master has gone away.  Every time I crashed the master, I had to kill