Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Heikki Linnakangas

On 17/06/10 02:40, Greg Stark wrote:

On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
kevin.gritt...@wicourts.gov  wrote:

Greg Stark gsst...@mit.edu wrote:


TCP keepalives are for detecting broken network connections


Yeah.  That seems like what we have here.  If you shoot the OS in
the head, the network connection is broken rather abruptly, without
the normal packets exchanged to close the TCP connection.  It sounds
like it behaves just fine except for not detecting a broken
connection.


So I think there are two things happening here. If you shut down the
master and don't replace it then you'll get no network errors until
TCP gives up entirely. Similarly if you pull the network cable or your
switch powers off or your routing table becomes messed up, or anything
else occurs which prevents packets from getting through, then you'll
see similar breakage. You wouldn't want your database to suddenly come
up as master in such circumstances, though: you'll have to fix the
underlying problem anyway, and failing over wouldn't solve anything; it
would just create a second problem.


We're not talking about a timeout for promoting standby to master. The 
problem is that the standby doesn't notice that, from the master's point 
of view, the connection has been broken. Whether it's because of a 
network error or because the master server crashed doesn't matter; the 
standby should reconnect in either case. TCP keepalives are a perfect fit, 
as long as you can tune the keepalive time short enough, where "short 
enough" is up to the admin to decide depending on the application.


Having said that, it would probably make life easier if we implemented 
an application level heartbeat anyway. Not all OS's allow tuning keepalives.
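
For illustration, here is a minimal sketch of the receiving side of such a
heartbeat, assuming the master sends some message at least every
timeout_secs seconds (the function name is made up and this is not the
walreceiver code, just the shape of the check):

#include <stdbool.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Return true if anything arrived on the socket within timeout_secs,
 * false if the peer was silent for the whole interval (treat that as a
 * dead master and reconnect).  A readable socket also covers the case
 * where the peer closed the connection cleanly.
 */
static bool
peer_alive_within(int sockfd, int timeout_secs)
{
    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_SET(sockfd, &readfds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;

    return select(sockfd + 1, &readfds, NULL, NULL, &tv) > 0;
}

The point is that the timeout is enforced by the application, not by the
TCP stack, so it works the same on every platform.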



But there's a second case. The Postgres master just stops responding
-- perhaps it starts seeing disk errors and becomes stuck in disk-wait,
or the machine just becomes very heavily loaded and Postgres can't
get any cycles, or someone attaches to it with gdb, or one of any
number of things happen which cause it to stop sending data. In that
case replication will not see any data from the master but TCP will
never time out because the network is just fine. That's why there
needs to be an application level health check if you want to have
timeouts. You can't depend on the network layer to detect problems
at the application level.


If the PostgreSQL master stops responding, it's OK for the slave to sit 
and wait for the master to recover. Reconnecting wouldn't help.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Rafael Martinez

Heikki Linnakangas wrote:

 
 We're not talking about a timeout for promoting standby to master. The
 problem is that the standby doesn't notice that, from the master's point
 of view, the connection has been broken. Whether it's because of a
 network error or because the master server crashed doesn't matter; the
 standby should reconnect in either case. TCP keepalives are a perfect fit,
 as long as you can tune the keepalive time short enough, where "short
 enough" is up to the admin to decide depending on the application.
 


I tested this yesterday and I could not get any reaction from the wal
receiver even after using values far smaller than the defaults.

The default values in Linux for tcp_keepalive_time, tcp_keepalive_intvl
and tcp_keepalive_probes are 7200, 75 and 9. I reduced these values to
60, 3 and 3 and nothing happened; the connection continued in status
ESTABLISHED after 60+3*3 seconds.

I did not restart the network after I changed these values on the fly
via /proc. I wonder if this is the reason the connection didn't die
even with the new keepalive values after the connection was broken. I
will check this later today.

regards,
--
 Rafael Martinez, r.m.guerr...@usit.uio.no
 Center for Information Technology Services
 University of Oslo, Norway

 PGP Public Key: http://folk.uio.no/rafael/



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Fujii Masao
On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez
r.m.guerr...@usit.uio.no wrote:
 I tested this yesterday and I could not get any reaction from the wal
 receiver even after using values far smaller than the defaults.

 The default values in Linux for tcp_keepalive_time, tcp_keepalive_intvl
 and tcp_keepalive_probes are 7200, 75 and 9. I reduced these values to
 60, 3 and 3 and nothing happened; the connection continued in status
 ESTABLISHED after 60+3*3 seconds.

 I did not restart the network after I changed these values on the fly
 via /proc. I wonder if this is the reason the connection didn't die
 even with the new keepalive values after the connection was broken. I
 will check this later today.

Walreceiver uses libpq to communicate with the master, but keepalive is not
currently enabled in libpq; that is, the libpq code doesn't call anything like
setsockopt(SOL_SOCKET, SO_KEEPALIVE). So even if you change the kernel options
for keepalive, they have no effect on walreceiver.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Magnus Hagander
On Thu, Jun 17, 2010 at 09:20, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jun 17, 2010 at 4:02 PM, Rafael Martinez
 r.m.guerr...@usit.uio.no wrote:
 I tested this yesterday and I could not get any reaction from the wal
 receiver even after using values far smaller than the defaults.

 The default values in Linux for tcp_keepalive_time, tcp_keepalive_intvl
 and tcp_keepalive_probes are 7200, 75 and 9. I reduced these values to
 60, 3 and 3 and nothing happened; the connection continued in status
 ESTABLISHED after 60+3*3 seconds.

 I did not restart the network after I changed these values on the fly
 via /proc. I wonder if this is the reason the connection didn't die
 even with the new keepalive values after the connection was broken. I
 will check this later today.

 Walreceiver uses libpq to communicate with the master, but keepalive is not
 currently enabled in libpq; that is, the libpq code doesn't call anything like
 setsockopt(SOL_SOCKET, SO_KEEPALIVE). So even if you change the kernel options
 for keepalive, they have no effect on walreceiver.

Yeah, there was a patch submitted for this - I think it's on the CF
page for 9.1... I guess if we really need it, walreceiver could enable
it itself - just get the socket with PQsocket().
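
As a rough illustration of that idea (a sketch only, not committed
libpq/walreceiver code; the function name and timing values are made up,
and TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <libpq-fe.h>

static void
enable_walreceiver_keepalive(PGconn *conn)
{
    int fd = PQsocket(conn);    /* underlying socket of the libpq connection */
    int on = 1;
    int idle = 60;              /* seconds of idle time before the first probe */
    int intvl = 10;             /* seconds between probes */
    int cnt = 3;                /* unanswered probes before the connection is dropped */

    if (fd < 0)
        return;

    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
#ifdef TCP_KEEPIDLE
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
#endif
}

With per-socket settings like these, the system-wide /proc defaults that
Rafael was tuning wouldn't matter for the replication connection.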

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-17 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote:
 The real problem here is that we're sending records to the slave which
 might cease to exist on the master if it unexpectedly reboots.  I
 believe that what we need to do is make sure that the master only
 sends WAL it has already fsync'd (Tom suggested on another thread that
 this might be necessary, and I think it's now clear that it is 100%
 necessary).

 The attached patch changes walsender so that it always sends WAL up to
 LogwrtResult.Flush instead of LogwrtResult.Write.

Applied, along with some minor comment improvements of my own.

regards, tom lane



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Joshua D. Drake
On Wed, 2010-06-16 at 15:47 -0400, Robert Haas wrote:

 So, obviously at this point my slave database is corrupted beyond
 repair due to nothing more than an unexpected crash on the master.
 That's bad.  What is worse is that the system only detected the
 corruption because the slave had crossed an xlog segment boundary
 which the master had not crossed.  Had it been otherwise, when the
 slave rewound to the beginning of the current segment, it would have
 had no trouble getting back in sync with the master - but it would
 have done this after having replayed WAL that, from the master's point
 of view, doesn't exist.  In other words, the database on the slave
 would be silently corrupted.
 
 I don't know what to do about this, but I'm pretty sure we can't ship it 
 as-is.

The slave must be able to survive a master crash.

Joshua D. Drake


 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise Postgres Company
 

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering





Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 So, obviously at this point my slave database is corrupted beyond
 repair due to nothing more than an unexpected crash on the master.
 
Certainly that's true for resuming replication.  From your
description it sounds as though the slave would be usable for
purposes of taking over for an unrecoverable master.  Or am I
misunderstanding?
 
 had no trouble getting back in sync with the master - but it would
 have done this after having replayed WAL that, from the master's
 point of view, doesn't exist.  In other words, the database on the
 slave would be silently corrupted.
 
 I don't know what to do about this, but I'm pretty sure we can't
 ship it as-is.
 
I'm sure we can't.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Stefan Kaltenbrunner

On 06/16/2010 09:47 PM, Robert Haas wrote:

On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs si...@2ndquadrant.com wrote:

But that change would cause the problem that Robert pointed out.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php


Presumably this means that if synchronous_commit = off on the primary,
SR in 9.0 will no longer work correctly if the primary crashes?


I spent some time investigating this today and have come to the
conclusion that streaming replication is really, really broken in the
face of potential crashes on the master.  Using a copy of VMware
parallels provided by $EMPLOYER, I set up two Fedora 12 virtual
machines on my MacBook in a master/slave configuration.  Then I
crashed the master repeatedly using 'echo b > /proc/sysrq-trigger',
which causes an immediate reboot (without syncing the disks, closing
network connections, etc.) while running pgbench or other stuff
against it.

The first problem I noticed is that the slave never seems to realize
that the master has gone away.  Every time I crashed the master, I had
to kill the wal receiver process on the slave to get it to reconnect;
otherwise it just sat there waiting, either forever or at least for
longer than I was willing to wait.


well this is likely caused by the OS not noticing that the connections 
went away (linux has really long timeouts here) - maybe we should 
unconditionally enable keepalive on systems that support that for 
replication connections (if that is possible in the current design anyway)





More seriously, I was able to demonstrate that the problem linked in
the thread above is real: if the master crashes after streaming WAL
that it hasn't yet fsync'd, then on recovery the slave's xlog position
is ahead of the master.  So far I've only been able to reproduce this
with fsync=off, but I believe it's possible anyway, and this just
makes it more likely.  After the most recent crash, the master thought
pg_current_xlog_location() was 1/86CD4000; the slave thought
pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
the master, the slave then thought that
pg_last_xlog_receive_location() was 1/8700.  The slave didn't
think this was a problem yet, though.  When I then restarted a pgbench
run against the master, the slave pretty quickly started spewing an
endless stream of messages complaining of LOG: invalid record length
at 1/8733A828.


this is obviously bad, but with fsync=off (or synchronous_commit=off?) it is 
probably impossible to prevent...




Stefan



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 I don't know what to do about this
 
This probably is out of the question for 9.0 based on scale of
change, and maybe forever based on the impact of WAL volume, but --
if we logged before images along with the after, we could undo
the work of the over-eager transactions on the slave upon
reconnect.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote:
 
 well this is likely caused by the OS not noticing that the
 connections went away (linux has really long timeouts here) -
 maybe we should unconditionally enable keepalive on systems that
 support that for replication connections (if that is possible in
 the current design anyway)
 
Yeah, in similar situations I've had good results with a keepalive
timeout of a minute or two.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus

 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.

Yes, I've noticed this.  That was the reason for forcing walreceiver to
shut down on a restart per prior discussion and patches.  This needs to
be on the open items list ... possibly it'll be fixed by Simon's
keepalive patch?  Or is it just a tcp_keepalive issue?

 More seriously, I was able to demonstrate that the problem linked in
 the thread above is real: if the master crashes after streaming WAL
 that it hasn't yet fsync'd, then on recovery the slave's xlog position
 is ahead of the master.  So far I've only been able to reproduce this
 with fsync=off, but I believe it's possible anyway, 

... and some users will turn fsync off.  This is, in fact, one of the
primary uses for streaming replication: Durability via replicas.

 and this just
 makes it more likely.  After the most recent crash, the master thought
 pg_current_xlog_location() was 1/86CD4000; the slave thought
 pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
 the master, the slave then thought that
 pg_last_xlog_receive_location() was 1/8700.  

So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
have actually prevented the slave from being corrupted.

My question, though, is detecting out-of-sequence xlogs *enough*?  Are
there any crash conditions on the master which would cause the master to
reuse the same locations for different records, for example?  I don't
think so, but I'd like to be certain.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:00 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 So, obviously at this point my slave database is corrupted beyond
 repair due to nothing more than an unexpected crash on the master.

 Certainly that's true for resuming replication.  From your
 description it sounds as though the slave would be usable for
 purposes of taking over for an unrecoverable master.  Or am I
 misunderstanding?

It depends on what you mean.  If you can prevent the slave from ever
reconnecting to the master, then it's still safe to promote it.  But
if the master comes up and starts generating WAL again, and the slave
ever sees any of that WAL (either via SR or via the archive) then
you're toast.

In my case, the slave was irrecoverably out of sync with the master as
soon as the crash happened, but it still could have been promoted at
that point if you killed the old master.  It became corrupted as soon
as it replayed the first WAL record starting beyond 1/8700.  At
that point it's potentially got arbitrary corruption; you need a new
base backup (but this may not be immediately obvious; it may look OK
even if it isn't).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Robert Haas
On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus j...@agliodbs.com wrote:
 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.

 Yes, I've noticed this.  That was the reason for forcing walreceiver to
 shut down on a restart per prior discussion and patches.  This needs to
 be on the open items list ... possibly it'll be fixed by Simon's
 keepalive patch?  Or is it just a tcp_keepalive issue?

I think a TCP keepalive might be enough, but I have not tried to code
or test it.

 More seriously, I was able to demonstrate that the problem linked in
 the thread above is real: if the master crashes after streaming WAL
 that it hasn't yet fsync'd, then on recovery the slave's xlog position
 is ahead of the master.  So far I've only been able to reproduce this
 with fsync=off, but I believe it's possible anyway,

 ... and some users will turn fsync off.  This is, in fact, one of the
 primary uses for streaming replication: Durability via replicas.

Yep.

 and this just
 makes it more likely.  After the most recent crash, the master thought
 pg_current_xlog_location() was 1/86CD4000; the slave thought
 pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
 the master, the slave then thought that
 pg_last_xlog_receive_location() was 1/8700.

 So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
 have actually prevented the slave from being corrupted.

 My question, though, is detecting out-of-sequence xlogs *enough*?  Are
 there any crash conditions on the master which would cause the master to
 reuse the same locations for different records, for example?  I don't
 think so, but I'd like to be certain.

The real problem here is that we're sending records to the slave which
might cease to exist on the master if it unexpectedly reboots.  I
believe that what we need to do is make sure that the master only
sends WAL it has already fsync'd (Tom suggested on another thread that
this might be necessary, and I think it's now clear that it is 100%
necessary).  But I'm not sure how this will play with fsync=off - if
we never fsync, then we can't ever really send any WAL without risking
this failure mode.  Similarly with synchronous_commit=off, I believe
that the next checkpoint will still fsync WAL, but the lag might be
long.

I think we should also change the slave to panic and shut down
immediately if its xlog position is ahead of the master.  That can
never be a watertight solution because you can always advance the xlog
position on the master and mask the problem.  But I think we should
do it anyway, so that we at least have a chance of noticing that we're
hosed.  I wish I could think of something a little more watertight...
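
A hedged sketch of what that check might look like (illustrative only,
not the actual walreceiver code; in 9.0 an XLogRecPtr is an
(xlogid, xrecoff) pair compared lexicographically):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Refuse to continue if the standby has already received WAL beyond the
 * position the master currently reports.  Sketch only; the real check
 * would live in the walreceiver/startup code.
 */
static void
check_standby_not_ahead(XLogRecPtr standby_received, XLogRecPtr master_current)
{
    bool ahead = (standby_received.xlogid > master_current.xlogid) ||
                 (standby_received.xlogid == master_current.xlogid &&
                  standby_received.xrecoff > master_current.xrecoff);

    if (ahead)
        elog(PANIC,
             "standby WAL position %X/%X is ahead of master position %X/%X",
             standby_received.xlogid, standby_received.xrecoff,
             master_current.xlogid, master_current.xrecoff);
}

Even as a best-effort check it would at least turn silent corruption into
an immediate, visible failure.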

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 So, obviously at this point my slave database is corrupted
 beyond repair due to nothing more than an unexpected crash on
 the master.

 Certainly that's true for resuming replication.  From your
 description it sounds as though the slave would be usable for
 purposes of taking over for an unrecoverable master.  Or am I
 misunderstanding?
 
 It depends on what you mean.  If you can prevent the slave from
 ever reconnecting to the master, then it's still safe to promote
 it.
 
Yeah, that's what I meant.
 
 But if the master comes up and starts generating WAL again, and
 the slave ever sees any of that WAL (either via SR or via the
 archive) then you're toast.
 
Well, if it *applies* what it sees, yes.  Effectively you've got
transactions from two alternative timelines applied in the same
database, which is not going to work.  At a minimum we need some
way to reliably detect that the incoming WAL stream is starting
before some applied WAL record and isn't a match.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Magnus Hagander
On Wed, Jun 16, 2010 at 22:26, Robert Haas robertmh...@gmail.com wrote:
 and this just
 makes it more likely.  After the most recent crash, the master thought
 pg_current_xlog_location() was 1/86CD4000; the slave thought
 pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
 the master, the slave then thought that
 pg_last_xlog_receive_location() was 1/8700.

 So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
 have actually prevented the slave from being corrupted.

 My question, though, is detecting out-of-sequence xlogs *enough*?  Are
 there any crash conditions on the master which would cause the master to
 reuse the same locations for different records, for example?  I don't
 think so, but I'd like to be certain.

 The real problem here is that we're sending records to the slave which
 might cease to exist on the master if it unexpectedly reboots.  I
 believe that what we need to do is make sure that the master only
 sends WAL it has already fsync'd (Tom suggested on another thread that
 this might be necessary, and I think it's now clear that it is 100%
 necessary).  But I'm not sure how this will play with fsync=off - if
 we never fsync, then we can't ever really send any WAL without risking

Well, at this point we can just prevent streaming replication with
fsync=off if we can't think of an easy fix, and then design a proper
fix for 9.1, given how late we are in the cycle.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Rafael Martinez

Robert Haas wrote:

 
 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.
 

Hei Robert

I have seen two different behaviors in my tests.

a) If I crash the server, the wal receiver process will wait forever
and the only way to get it working again is to restart postgres on the
slave after the master is back online. I have not been able to get the
slave database corrupted (I am running with fsync=on).

b) If I kill all postgres processes on the master with kill -9, the wal
receiver will start trying to reconnect automatically and it will
succeed the moment postgres gets started on the master.

The only difference I can see at the OS level is that in a) the
connection continues to have the status ESTABLISHED forever, and in b)
it goes to status TIME_WAIT the moment postgres is down on the master.

regards,
--
 Rafael Martinez, r.m.guerr...@usit.uio.no
 Center for Information Technology Services
 University of Oslo, Norway

 PGP Public Key: http://folk.uio.no/rafael/



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.

TCP timeout is the answer there.

 More seriously, I was able to demonstrate that the problem linked in
 the thread above is real: if the master crashes after streaming WAL
 that it hasn't yet fsync'd, then on recovery the slave's xlog position
 is ahead of the master.

So indeed we'd better change walsender to not get ahead of the fsync'd
position.  And probably also warn people to not disable fsync on the
master, unless they're willing to write it off and fail over at any
system crash.

 I don't know what to do about this, but I'm pretty sure we can't ship it 
 as-is.

Doesn't seem tremendously insoluble from here ...

regards, tom lane



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Josh Berkus
On 6/16/10 1:26 PM, Robert Haas wrote:
 Similarly with synchronous_commit=off, I believe
 that the next checkpoint will still fsync WAL, but the lag might be
 long.

That's not a showstopper.  Just tell people that having synch_commit=off
on the master might increase the lag to the slave, and leave it alone.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Pierre C



The real problem here is that we're sending records to the slave which
might cease to exist on the master if it unexpectedly reboots.  I
believe that what we need to do is make sure that the master only
sends WAL it has already fsync'd


How about this:

- pg records somewhere the xlog position of the last record synced to
disk. I don't remember the variable name, let's just say xlog_synced_recptr
- pg always writes the xlog first, i.e. before writing any page it checks
that the page's xlog recptr <= xlog_synced_recptr, and if that's not the case
it has to wait before it can write the page.


Now:

- master sends messages to the slave with the xlog_synced_recptr after
each fsync

- slave gets these messages and records the master_xlog_synced_recptr
- slave doesn't write any page to disk until BOTH the slave's local WAL
copy AND the master's WAL have reached the recptr of this page (a sketch
of this rule follows below)
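
A minimal, hedged sketch of that last rule, assuming a hypothetical
master_xlog_synced_recptr fed by the messages above (the first condition
is roughly what XLogFlush() already guarantees locally; the second is the
new part that needs the master's reported flush position):

#include "postgres.h"
#include "access/xlogdefs.h"

/* Hypothetical: last flush position reported by the master. */
extern XLogRecPtr master_xlog_synced_recptr;

/* Lexicographic "a <= b" on 9.0-style (xlogid, xrecoff) pairs. */
static bool
recptr_le(XLogRecPtr a, XLogRecPtr b)
{
    return (a.xlogid < b.xlogid) ||
           (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * May the standby write this page out?  Only once both its own WAL and
 * the master's fsync'd WAL have passed the page's LSN.
 */
static bool
standby_may_write_page(XLogRecPtr page_lsn, XLogRecPtr local_flushed_lsn)
{
    return recptr_le(page_lsn, local_flushed_lsn) &&
           recptr_le(page_lsn, master_xlog_synced_recptr);
}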


If a master crashes or the slave loses connection, then the in-memory  
pages of the slave could be in a state that is in the future compared to  
the master's state when it comes up.


Therefore when a slave detects that the master has crashed, it could shoot
itself and recover from WAL, at which point the slave would no longer be in
the future relative to the master; rather it would be in the past, which is
a lot less problematic...


Of course this wouldn't speed up the failover process!...


I think we should also change the slave to panic and shut down
immediately if its xlog position is ahead of the master.  That can
never be a watertight solution because you can always advance the xlog
position on the master and mask the problem.  But I think we should
do it anyway, so that we at least have a chance of noticing that we're
hosed.  I wish I could think of something a little more watertight...


If a slave is in the future relative to the master, then the only way to  
keep using this slave could be to make it the new master...





Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.

 TCP timeout is the answer there.

If you mean TCP Keepalives, I disagree quite strongly. If you want the
application to guarantee any particular timing constraints then you
have to implement that in the application using timers and data
packets. TCP keepalives are for detecting broken network connections,
not enforcing application rules. Using TCP timeouts would have a
number of problems: on many systems they are impossible or difficult
to adjust, and worse, it would make it impossible to distinguish a
postgres master crash from a transient or permanent network outage.


 More seriously, I was able to demonstrate that the problem linked in
 the thread above is real: if the master crashes after streaming WAL
 that it hasn't yet fsync'd, then on recovery the slave's xlog position
 is ahead of the master.

 So indeed we'd better change walsender to not get ahead of the fsync'd
 position.  And probably also warn people to not disable fsync on the
 master, unless they're willing to write it off and fail over at any
 system crash.

 I don't know what to do about this, but I'm pretty sure we can't ship it 
 as-is.

 Doesn't seem tremendously insoluble from here ...

For the case of fsync=off I can't get terribly excited about the slave
being ahead of the master after a crash. After all the master is toast
anyways. It seems to me in this situation the slave should detect that
the master has failed and automatically come up in master mode. Or
perhaps it should just shut down and then refuse to come up as a slave
again on the basis that it would be unsafe precisely because it might
be ahead of the (corrupt) master. At some point we should consider
having a server set to fsync=off refuse to come back up unless it was
shut down cleanly anyways. Perhaps we should put a strongly worded
warning now.

For the case of fsync=on it does seem to me to be terribly obvious
that the master should never send records to the slave that aren't
fsynced on the master. For 9.1 the other option proposed would work as
well but would be more complex -- to send and store records
immediately but not replay them on the slave until they're either
fsynced on the master or failover occurs.

-- 
greg



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Greg Stark gsst...@mit.edu wrote:
 
 TCP keepalives are for detecting broken network connections
 
Yeah.  That seems like what we have here.  If you shoot the OS in
the head, the network connection is broken rather abruptly, without
the normal packets exchanged to close the TCP connection.  It sounds
like it behaves just fine except for not detecting a broken
connection.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 It sounds like it behaves just fine except for not detecting a
 broken connection.
 
Of course I meant in terms of the slave's attempts at retrieving
more WAL, not in terms of it applying a second time line.  TCP
keepalive timeouts don't help with that part of it, just the failure
to recognize the broken connection.  I suppose someone could argue
that's a *feature*, since it gives you two hours to manually
intervene before it does something stupid, but that hardly seems
like a solution.
 
-Kevin



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:22 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 It sounds like it behaves just fine except for not detecting a
 broken connection.

 Of course I meant in terms of the slave's attempts at retrieving
 more WAL, not in terms of it applying a second time line.  TCP
 keepalive timeouts don't help with that part of it, just the failure
 to recognize the broken connection.  I suppose someone could argue
 that's a *feature*, since it gives you two hours to manually
 intervene before it does something stupid, but that hardly seems
 like a solution

It's certainly a design goal of TCP that you should be able to
disconnect the network and reconnect it and everything should recover. If
no data was sent it should be able to withstand arbitrarily long
disconnections. TCP Keepalives break that but they should only break
it in the case where the network connection has definitely exceeded
the retry timeouts, not when it merely hasn't responded fast enough
for the application requirements.


-- 
greg



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Greg Stark
On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Greg Stark gsst...@mit.edu wrote:

 TCP keepalives are for detecting broken network connections

 Yeah.  That seems like what we have here.  If you shoot the OS in
 the head, the network connection is broken rather abruptly, without
 the normal packets exchanged to close the TCP connection.  It sounds
 like it behaves just fine except for not detecting a broken
 connection.

So I think there are two things happening here. If you shut down the
master and don't replace it then you'll get no network errors until
TCP gives up entirely. Similarly if you pull the network cable or your
switch powers off or your routing table becomes messed up, or anything
else occurs which prevents packets from getting through, then you'll
see similar breakage. You wouldn't want your database to suddenly come
up as master in such circumstances, though: you'll have to fix the
underlying problem anyway, and failing over wouldn't solve anything; it
would just create a second problem.

But there's a second case. The Postgres master just stops responding
-- perhaps it starts seeing disk errors and becomes stuck in disk-wait,
or the machine just becomes very heavily loaded and Postgres can't
get any cycles, or someone attaches to it with gdb, or one of any
number of things happen which cause it to stop sending data. In that
case replication will not see any data from the master but TCP will
never time out because the network is just fine. That's why there
needs to be an application level health check if you want to have
timeouts. You can't depend on the network layer to detect problems
at the application level.

-- 
greg



Re: [HACKERS] streaming replication breaks horribly if master crashes

2010-06-16 Thread Fujii Masao
On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus j...@agliodbs.com wrote:
 The first problem I noticed is that the slave never seems to realize
 that the master has gone away.  Every time I crashed the master, I had
 to kill the wal receiver process on the slave to get it to reconnect;
 otherwise it just sat there waiting, either forever or at least for
 longer than I was willing to wait.

 Yes, I've noticed this.  That was the reason for forcing walreceiver to
 shut down on a restart per prior discussion and patches.  This needs to
 be on the open items list ... possibly it'll be fixed by Simon's
 keepalive patch?  Or is it just a tcp_keepalive issue?

 I think a TCP keepalive might be enough, but I have not tried to code
 or test it.

The keepalive on libpq patch would help.
https://commitfest.postgresql.org/action/patch_view?id=281

 and this just
 makes it more likely.  After the most recent crash, the master thought
 pg_current_xlog_location() was 1/86CD4000; the slave thought
 pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
 the master, the slave then thought that
 pg_last_xlog_receive_location() was 1/8700.

 So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
 have actually prevented the slave from being corrupted.

 My question, though, is detecting out-of-sequence xlogs *enough*?  Are
 there any crash conditions on the master which would cause the master to
 reuse the same locations for different records, for example?  I don't
 think so, but I'd like to be certain.

 The real problem here is that we're sending records to the slave which
 might cease to exist on the master if it unexpectedly reboots.  I
 believe that what we need to do is make sure that the master only
 sends WAL it has already fsync'd (Tom suggested on another thread that
 this might be necessary, and I think it's now clear that it is 100%
 necessary).

The attached patch changes walsender so that it always sends WAL up to
LogwrtResult.Flush instead of LogwrtResult.Write.
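
In outline the change is just a matter of which shared WAL position caps
the data walsender may stream; a hedged sketch follows (the accessor names
below are placeholders standing in for the LogwrtResult fields, not the
actual xlog.c API):

#include "postgres.h"
#include "access/xlogdefs.h"

/* Hypothetical accessors for the shared LogwrtResult positions */
extern XLogRecPtr GetWrittenUpTo(void);   /* LogwrtResult.Write */
extern XLogRecPtr GetFlushedUpTo(void);   /* LogwrtResult.Flush */

/*
 * How far may walsender stream?  Before the patch: up to the write
 * pointer, which can include WAL the master has not yet fsync'd and
 * could lose in a crash.  After the patch: only up to the flush pointer,
 * so the standby can never get ahead of what the master can recover.
 */
static XLogRecPtr
walsender_send_upto(void)
{
    return GetFlushedUpTo();
}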

 But I'm not sure how this will play with fsync=off - if
 we never fsync, then we can't ever really send any WAL without risking
 this failure mode.  Similarly with synchronous_commit=off, I believe
 that the next checkpoint will still fsync WAL, but the lag might be
 long.

First of all, we should not restart the master after a crash in the
fsync=off case. That would corrupt the master database itself.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


send_after_fsync_v1.patch
Description: Binary data
