Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Tom Lane
Fujii Masao writes: > On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas wrote: >> The real problem here is that we're sending records to the slave which >> might cease to exist on the master if it unexpectedly reboots.  I >> believe that what we need to do is make sure that the master only >> sends WA

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Magnus Hagander
On Thu, Jun 17, 2010 at 09:20, Fujii Masao wrote: > On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez > wrote: >> I tested this yesterday and I could not get any reaction from the wal >> receiver even after using minimal values compared to the default values. >> >> The default values in linux fo

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Fujii Masao
On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez wrote: > I tested this yesterday and I could not get any reaction from the wal > receiver even after using minimal values compared to the default values. > > The default values in linux for tcp_keepalive_time, tcp_keepalive_intvl > and tcp_keepali
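
For readers unfamiliar with the keepalive settings discussed in this thread, here is a minimal sketch of enabling TCP keepalive on a single client socket instead of changing the system-wide defaults. It is not PostgreSQL's or the WAL receiver's code. The function name `enable_keepalive` and the timing values are assumptions chosen for illustration. On Linux, the per-socket options `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` override the sysctls `tcp_keepalive_time`, `tcp_keepalive_intvl` and `tcp_keepalive_probes` respectively; the sketch skips any option the platform does not expose.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for one socket (hypothetical helper).

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is reported dead
    """
    # SO_KEEPALIVE is the portable on/off switch.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timing options are platform-specific, so only set
    # the ones this Python build actually defines.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With these example values, a peer that vanished without closing the connection would be noticed after roughly idle + interval * count seconds of silence, instead of after the much longer system defaults.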

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Rafael Martinez
Heikki Linnakangas wrote: > > We're not talking about a timeout for promoting standby to master. The > problem is that the standby doesn't notice that from the master's point > of view, the connection has been broken. Whether it's because of a > netw

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Heikki Linnakangas
On 17/06/10 02:40, Greg Stark wrote: On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner wrote: Greg Stark wrote: TCP keepalives are for detecting broken network connections Yeah. That seems like what we have here. If you shoot the OS in the head, the network connection is broken rather ab

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Fujii Masao
On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas wrote: > On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus wrote: >>> The first problem I noticed is that the slave never seems to realize >>> that the master has gone away.  Every time I crashed the master, I had >>> to kill the wal receiver process on the

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner wrote: > Greg Stark wrote: > >> TCP keepalives are for detecting broken network connections > > Yeah.  That seems like what we have here.  If you shoot the OS in > the head, the network connection is broken rather abruptly, without > the normal pac

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:22 AM, Kevin Grittner wrote: > "Kevin Grittner" wrote: > >> It sounds like it behaves just fine except for not detecting a >> broken connection. > > Of course I meant in terms of the slave's attempts at retrieving > more WAL, not in terms of it applying a second time li

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
"Kevin Grittner" wrote: > It sounds like it behaves just fine except for not detecting a > broken connection. Of course I meant in terms of the slave's attempts at retrieving more WAL, not in terms of it applying a second time line. TCP keepalive timeouts don't help with that part of it, just

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Greg Stark wrote: > TCP keepalives are for detecting broken network connections Yeah. That seems like what we have here. If you shoot the OS in the head, the network connection is broken rather abruptly, without the normal packets exchanged to close the TCP connection. It sounds like it beh

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote: > Robert Haas writes: >> The first problem I noticed is that the slave never seems to realize >> that the master has gone away.  Every time I crashed the master, I had >> to kill the wal receiver process on the slave to get it to reconnect; >> othe

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Pierre C
The real problem here is that we're sending records to the slave which might cease to exist on the master if it unexpectedly reboots. I believe that what we need to do is make sure that the master only sends WAL it has already fsync'd How about this : - pg records somewhere the xlog position
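
The rule quoted above, that the master should send only WAL it has already fsync'd, can be illustrated with a toy model. This is a sketch under stated assumptions, not the walsender implementation. The function `sendable_range` and its parameters `write_pos`, `flush_pos` and `sent_pos` are hypothetical names, and positions are plain integers standing in for xlog locations.

```python
def sendable_range(write_pos, flush_pos, sent_pos):
    """Return the (start, end) WAL range that is safe to stream.

    write_pos: how far WAL has been written (possibly only to OS cache)
    flush_pos: how far WAL has been fsync'd to disk
    sent_pos:  how far WAL has already been sent to the standby

    The end of the range is capped at flush_pos, so the standby never
    receives records that could vanish if the master crashes.
    """
    safe_end = min(write_pos, flush_pos)
    # Never move backwards: if nothing new is durable, the range is empty.
    return (sent_pos, max(sent_pos, safe_end))


if __name__ == "__main__":
    # 100 units written, only 60 flushed, 40 already sent.
    print(sendable_range(write_pos=100, flush_pos=60, sent_pos=40))
```

In this model the standby can lag behind the master's written WAL but never gets ahead of what would survive a crash on the master.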

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus
On 6/16/10 1:26 PM, Robert Haas wrote: > Similarly with synchronous_commit=off, I believe > that the next checkpoint will still fsync WAL, but the lag might be > long. That's not a showstopper. Just tell people that having synch_commit=off on the master might increase the lag to the slave, and le

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Tom Lane
Robert Haas writes: > The first problem I noticed is that the slave never seems to realize > that the master has gone away. Every time I crashed the master, I had > to kill the wal receiver process on the slave to get it to reconnect; > otherwise it just sat there waiting, either forever or at le

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Rafael Martinez
Robert Haas wrote: > > The first problem I noticed is that the slave never seems to realize > that the master has gone away. Every time I crashed the master, I had > to kill the wal receiver process on the slave to get it to reconnect; > otherwise i

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Magnus Hagander
On Wed, Jun 16, 2010 at 22:26, Robert Haas wrote: >>> and this just >>> makes it more likely.  After the most recent crash, the master thought >>> pg_current_xlog_location() was 1/86CD4000; the slave thought >>> pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to >>> the master,

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas wrote: > Kevin Grittner wrote: >> Robert Haas wrote: >>> So, obviously at this point my slave database is corrupted >>> beyond repair due to nothing more than an unexpected crash on >>> the master. >> >> Certainly that's true for resuming replication. From your >> description it sou

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus wrote: >> The first problem I noticed is that the slave never seems to realize >> that the master has gone away.  Every time I crashed the master, I had >> to kill the wal receiver process on the slave to get it to reconnect; >> otherwise it just sat th

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:00 PM, Kevin Grittner wrote: > Robert Haas wrote: >> So, obviously at this point my slave database is corrupted beyond >> repair due to nothing more than an unexpected crash on the master. > > Certainly that's true for resuming replication.  From your > description it so

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus
> The first problem I noticed is that the slave never seems to realize > that the master has gone away. Every time I crashed the master, I had > to kill the wal receiver process on the slave to get it to reconnect; > otherwise it just sat there waiting, either forever or at least for > longer tha

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Stefan Kaltenbrunner wrote: > well this is likely caused by the OS not noticing that the > connections went away (linux has really long timeouts here) - > maybe we should unconditionally enable keepalive on systems that > support that for replication connections (if that is possible in > the cur

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas wrote: > I don't know what to do about this This probably is out of the question for 9.0 based on scale of change, and maybe forever based on the impact of WAL volume, but -- if we logged "before" images along with the "after", we could undo the work of the "over-eager" transaction

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Stefan Kaltenbrunner
On 06/16/2010 09:47 PM, Robert Haas wrote: On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs wrote: But that change would cause the problem that Robert pointed out. http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php Presumably this means that if synchronous_commit = off on primary t

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas wrote: > So, obviously at this point my slave database is corrupted beyond > repair due to nothing more than an unexpected crash on the master. Certainly that's true for resuming replication. From your description it sounds as though the slave would be usable for purposes of takin

Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Joshua D. Drake
On Wed, 2010-06-16 at 15:47 -0400, Robert Haas wrote: > So, obviously at this point my slave database is corrupted beyond > repair due to nothing more than an unexpected crash on the master. > That's bad. What is worse is that the system only detected the > corruption because the slave had crosse