Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-21 Thread Fujii Masao
On Wed, Apr 17, 2013 at 10:11 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote:
 On Apr17, 2013, at 12:22 , Amit Kapila amit.kap...@huawei.com wrote:
  Do you mean to say that as an error has occurred, so it would not be
 able to
  flush received WAL, which could result in loss of WAL?
  I think even if error occurs, it will call flush in WalRcvDie(),
 before
  terminating WALReceiver.

 Hm, true, but for that to prevent the problem the inner processing
 loop needs to always read up to EOF before it exits and we attempt
 to send a reply. Which I don't think it necessarily does. Assume,
 that the master sends a chunk of data, waits a bit, and finally
 sends the shutdown record and exits. The slave might then receive
 the first chunk, and it might trigger sending a reply. At the time
 the reply is sent, the master has already sent the shutdown record
 and closed the connection, and we'll thus fail to reply and abort.
 Since the shutdown record has never been read from the socket,
 XLogWalRcvFlush won't flush it, and the slave ends up behind the
 master.

 Also, since XLogWalRcvProcessMsg responds to keep-alives messages,
 we might also error out of the inner processing loop if the server
 closes the socket after sending a keepalive but before we attempt
 to respond.

 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.

 If the client fails to respond, we could log a fat WARNING.

 Your explanation seems to be okay, but before discussing the exact solution,
 I think it would be better to first reproduce the actual problem and then
 discuss the solution.

I got this problem several times when I enabled WAL archiving and shut down
the master.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-21 Thread Fujii Masao
On Wed, Apr 17, 2013 at 7:49 PM, Florian Pflug f...@phlo.org wrote:
 On Apr17, 2013, at 12:22 , Amit Kapila amit.kap...@huawei.com wrote:
 Do you mean to say that as an error has occurred, so it would not be able to
 flush received WAL, which could result in loss of WAL?
 I think even if error occurs, it will call flush in WalRcvDie(), before
 terminating WALReceiver.

 Hm, true, but for that to prevent the problem the inner processing
 loop needs to always read up to EOF before it exits and we attempt
 to send a reply. Which I don't think it necessarily does. Assume,
 that the master sends a chunk of data, waits a bit, and finally
 sends the shutdown record and exits. The slave might then receive
 the first chunk, and it might trigger sending a reply. At the time
 the reply is sent, the master has already sent the shutdown record
 and closed the connection, and we'll thus fail to reply and abort.
 Since the shutdown record has never been read from the socket,
 XLogWalRcvFlush won't flush it, and the slave ends up behind the
 master.

 Also, since XLogWalRcvProcessMsg responds to keep-alives messages,
 we might also error out of the inner processing loop if the server
 closes the socket after sending a keepalive but before we attempt
 to respond.

 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.

Agreed. I've tried to fix this problem only on the walreceiver side, but
that failed. I agree that we should change walsender so that it waits
for the reply from the standby before closing the connection.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-19 Thread Martijn van Oosterhout
On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote:
 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.
 
 If the client fails to respond, we could log a fat WARNING.

ISTM the master should half close the socket, using shutdown(). That
way the client receives an EOF and can still then send its reply to the
master.  Then when the master receives that it can close() completely.
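
For illustration, a minimal sketch of such a half-close handshake on the sending
side, using plain POSIX sockets rather than any walsender internals (the function
name and error handling are simplified assumptions):

  #include <sys/socket.h>
  #include <unistd.h>

  /*
   * Hypothetical sketch: after the last WAL bytes have been sent on 'fd',
   * close only our writing direction; the peer sees EOF but can still reply.
   */
  static void
  half_close_and_drain(int fd)
  {
      char        buf[8192];
      ssize_t     n;

      shutdown(fd, SHUT_WR);      /* peer now reads EOF, socket still open */

      /* keep reading peer replies (e.g. flush positions) until it closes */
      while ((n = read(fd, buf, sizeof(buf))) > 0)
          ;                       /* a real implementation would parse them */

      close(fd);                  /* peer closed (or errored): fully close */
  }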

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-19 Thread Florian Pflug
On Apr19, 2013, at 14:46 , Martijn van Oosterhout klep...@svana.org wrote:
 On Wed, Apr 17, 2013 at 12:49:10PM +0200, Florian Pflug wrote:
 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.
 
 If the client fails to respond, we could log a fat WARNING.
 
 ISTM the master should half close the socket, using shutdown(). That
 way the client receives an EOF and can still then send its reply to the
 master.  Then when the master receives that it can close() completely.

Hm, there may be arbitrarily many reply requests within the unread
data in the socket's buffer, so waiting for just one reply won't work.
Also, to distinguish a slave which crashes while the master shuts down
from one that has received all WAL and flushed it, the slave should flush
all WAL and send a final reply before closing the socket.

So the master would, upon shutting down, close only its writing end
of the connection, and continue to receive replies until it sees EOF.
After all slaves have gone, the master would emit a WARNING for every
slave whose last logged flush position is earlier than the master's
idea of end-of-wal.

The slave would, upon seeing EOF, flush all its WAL, send a final
reply, and close() the socket.

I'm not sure that relying on TCP's half-close feature has much benefit
over using a home-grown shutdown message, though. Anyway, the basic
shutdown protocol would be the same regardless of what exactly we use
to signal a shutdown.
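
A rough sketch of the standby-side half of that protocol, reusing the
walreceiver routines discussed elsewhere in this thread (an illustrative
sketch, not a patch; walrcv_disconnect() here stands in for whatever
actually closes the connection):

  /*
   * Hypothetical: called when walrcv_receive() reports EOF from the master
   * during a shutdown, per the protocol sketched above.
   */
  static void
  WalRcvHandleMasterShutdown(void)
  {
      XLogWalRcvFlush(false);             /* flush every WAL byte we received */
      XLogWalRcvSendReply(true, false);   /* one final reply with the flush LSN */
      walrcv_disconnect();                /* then close our end of the socket */
  }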

BTW, I assume we'd only do this for smart shutdowns. 

best regards,
Florian Pflug





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-17 Thread Amit Kapila
On Monday, April 15, 2013 1:02 PM Florian Pflug wrote:
 On Apr14, 2013, at 17:56 , Fujii Masao masao.fu...@gmail.com wrote:
  At fast shutdown, after walsender sends the checkpoint record and
  closes the replication connection, walreceiver can detect the close
  of connection before receiving all WAL records. This means that,
  even if walsender sends all WAL records, walreceiver cannot always
  receive all of them.
 
 That sounds like a bug in walreceiver to me.
 
 The following code in walreceiver's main loop looks suspicious:
 
   /*
    * Process the received data, and any subsequent data we
    * can read without blocking.
    */
   for (;;)
   {
     if (len > 0)
     {
       /* Something was received from master, so reset timeout */
       ...
       XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
     }
     else if (len == 0)
       break;
     else if (len < 0)
     {
       ereport(LOG,
               (errmsg("replication terminated by primary server"),
                errdetail("End of WAL reached on timeline %u at %X/%X",
                          startpointTLI,
                          (uint32) (LogstreamResult.Write >> 32),
                          (uint32) LogstreamResult.Write)));
       ...
     }
     len = walrcv_receive(0, buf);
   }

   /* Let the master know that we received some data. */
   XLogWalRcvSendReply(false, false);

   /*
    * If we've written some records, flush them to disk and
    * let the startup process and primary server know about
    * them.
    */
   XLogWalRcvFlush(false);

 The loop at the top looks fine - it specifically avoids throwing
 an error on EOF. But the code then proceeds to XLogWalRcvSendReply()
 which doesn't seem to have the same smarts - it simply does

   if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
       PQflush(streamConn))
       ereport(ERROR,
               (errmsg("could not send data to WAL stream: %s",
                       PQerrorMessage(streamConn))));
 
 Unless I'm missing something, that certainly seems to explain
 how a standby can lag behind even after a controlled shutdown of
 the master.

Do you mean to say that as an error has occurred, so it would not be able to
flush received WAL, which could result in loss of WAL?
I think even if error occurs, it will call flush in WalRcvDie(), before
terminating WALReceiver.

With Regards,
Amit Kapila.





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-17 Thread Florian Pflug
On Apr17, 2013, at 12:22 , Amit Kapila amit.kap...@huawei.com wrote:
 Do you mean to say that as an error has occurred, so it would not be able to
 flush received WAL, which could result in loss of WAL?
 I think even if error occurs, it will call flush in WalRcvDie(), before
 terminating WALReceiver.

Hm, true, but for that to prevent the problem the inner processing
loop needs to always read up to EOF before it exits and we attempt
to send a reply. Which I don't think it necessarily does. Assume,
that the master sends a chunk of data, waits a bit, and finally
sends the shutdown record and exits. The slave might then receive
the first chunk, and it might trigger sending a reply. At the time
the reply is sent, the master has already sent the shutdown record
and closed the connection, and we'll thus fail to reply and abort.
Since the shutdown record has never been read from the socket,
XLogWalRcvFlush won't flush it, and the slave ends up behind the
master.

Also, since XLogWalRcvProcessMsg responds to keep-alive messages,
we might also error out of the inner processing loop if the server
closes the socket after sending a keepalive but before we attempt
to respond.

Fixing this on the receive side alone seems quite messy and fragile.
So instead, I think we should let the master send a shutdown message
after it has sent everything it wants to send, and wait for the client
to acknowledge it before shutting down the socket.

If the client fails to respond, we could log a fat WARNING.

best regards,
Florian Pflug





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-17 Thread Amit Kapila
On Wednesday, April 17, 2013 4:19 PM Florian Pflug wrote:
 On Apr17, 2013, at 12:22 , Amit Kapila amit.kap...@huawei.com wrote:
  Do you mean to say that as an error has occurred, so it would not be
 able to
  flush received WAL, which could result in loss of WAL?
  I think even if error occurs, it will call flush in WalRcvDie(),
 before
  terminating WALReceiver.
 
 Hm, true, but for that to prevent the problem the inner processing
 loop needs to always read up to EOF before it exits and we attempt
 to send a reply. Which I don't think it necessarily does. Assume,
 that the master sends a chunk of data, waits a bit, and finally
 sends the shutdown record and exits. The slave might then receive
 the first chunk, and it might trigger sending a reply. At the time
 the reply is sent, the master has already sent the shutdown record
 and closed the connection, and we'll thus fail to reply and abort.
 Since the shutdown record has never been read from the socket,
 XLogWalRcvFlush won't flush it, and the slave ends up behind the
 master.
 
 Also, since XLogWalRcvProcessMsg responds to keep-alives messages,
 we might also error out of the inner processing loop if the server
 closes the socket after sending a keepalive but before we attempt
 to respond.
 
 Fixing this on the receive side alone seems quite messy and fragile.
 So instead, I think we should let the master send a shutdown message
 after it has sent everything it wants to send, and wait for the client
 to acknowledge it before shutting down the socket.
 
 If the client fails to respond, we could log a fat WARNING.

Your explanation seems to be okay, but before discussing the exact solution,
I think it would be better to first reproduce the actual problem and then
discuss the solution.

With Regards,
Amit Kapila.





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-15 Thread Florian Pflug
On Apr14, 2013, at 17:56 , Fujii Masao masao.fu...@gmail.com wrote:
 At fast shutdown, after walsender sends the checkpoint record and
 closes the replication connection, walreceiver can detect the close
 of connection before receiving all WAL records. This means that,
 even if walsender sends all WAL records, walreceiver cannot always
 receive all of them.

That sounds like a bug in walreceiver to me.

The following code in walreceiver's main loop looks suspicious:

  /*
   * Process the received data, and any subsequent data we
   * can read without blocking.
   */
  for (;;)
  {
    if (len > 0)
    {
      /* Something was received from master, so reset timeout */
      ...
      XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
    }
    else if (len == 0)
      break;
    else if (len < 0)
    {
      ereport(LOG,
              (errmsg("replication terminated by primary server"),
               errdetail("End of WAL reached on timeline %u at %X/%X",
                         startpointTLI,
                         (uint32) (LogstreamResult.Write >> 32),
                         (uint32) LogstreamResult.Write)));
      ...
    }
    len = walrcv_receive(0, buf);
  }

  /* Let the master know that we received some data. */
  XLogWalRcvSendReply(false, false);

  /*
   * If we've written some records, flush them to disk and
   * let the startup process and primary server know about
   * them.
   */   
  XLogWalRcvFlush(false);

The loop at the top looks fine - it specifically avoids throwing
an error on EOF. But the code then proceeds to XLogWalRcvSendReply()
which doesn't seem to have the same smarts - it simply does

  if (PQputCopyData(streamConn, buffer, nbytes) <= 0 ||
      PQflush(streamConn))
      ereport(ERROR,
              (errmsg("could not send data to WAL stream: %s",
                      PQerrorMessage(streamConn))));

Unless I'm missing something, that certainly seems to explain
how a standby can lag behind even after a controlled shutdown of
the master.
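
One conceivable receive-side mitigation, sketched below, would be to flush
before replying, so that already-written WAL at least survives a failed reply;
note that later messages in this thread argue this is not sufficient on its
own, since WAL still unread in the socket is lost either way (a sketch, not a
tested patch):

  /*
   * Sketch: flush what we have already written *before* trying to talk to
   * the master, so an ERROR from the reply path can no longer prevent the
   * flush of WAL that was received and written.
   */
  XLogWalRcvFlush(false);

  /*
   * Let the master know that we received some data; if the master has
   * already closed the connection this may still error out, but the
   * received WAL is safely on disk by now.
   */
  XLogWalRcvSendReply(false, false);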

best regards,
Florian Pflug





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-14 Thread Fujii Masao
On Fri, Apr 12, 2013 at 5:53 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 04/11/2013 07:29 PM, Fujii Masao wrote:

 On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com
 wrote:

 You just shut down the old master and let the standby catch
 up (takas a few microseconds ;) ) before you promote it.

 After this you can start up the former master with recovery.conf
 and it will follow nicely.

 No. When you shut down the old master, it might not have been
 able to send all the WAL records to the standby.

 In what cases (other than a standby lagging too much or
 not listening at all) have you observed this ?

 I have observed
 this situation several times. So in your approach, new standby
 might fail to catch up with the master nicely.

 the page http://wiki.postgresql.org/wiki/Streaming_Replication claims this:

 * Graceful shutdown

 When smart/fast shutdown is requested, the primary waits to exit
 until XLOG records have been sent to the standby, up to the
 shutdown checkpoint record.

 Maybe you were requesting immediate shutdown ?

No. I did fast shutdown.

It's true that the master waits for the checkpoint record to be replicated to
the standby when fast shutdown is performed. But the standby cannot always
successfully receive all the WAL records which the master sent.

To ensure that all WAL records have been replicated to the standby at fast
shutdown, we should make the walsender wait for the standby to write the
checkpoint record and send back the ACK.
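
To make that concrete, a sketch of what such a wait could look like at the end
of the walsender shutdown path; sentPtr and MyWalSnd->flush exist in walsender,
while WaitForStandbyAck() and its timeout are hypothetical stand-ins, and
locking is elided (a sketch of the shape, not a patch):

  /*
   * Hypothetical: after sending the shutdown checkpoint record, wait until
   * the standby acknowledges flushing WAL up to everything we sent.
   */
  while (MyWalSnd->flush < sentPtr)
  {
      if (!WaitForStandbyAck(shutdown_ack_timeout))   /* hypothetical helper */
      {
          ereport(WARNING,
                  (errmsg("standby did not acknowledge WAL up to %X/%X before shutdown",
                          (uint32) (sentPtr >> 32), (uint32) sentPtr)));
          break;
      }
  }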

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-14 Thread Fujii Masao
On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
 On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com 
 wrote:
 
  You just shut down the old master and let the standby catch
  up (takas a few microseconds ;) ) before you promote it.
 
  After this you can start up the former master with recovery.conf
  and it will follow nicely.

 No. When you shut down the old master, it might not have been
 able to send all the WAL records to the standby. I have observed
 this situation several times. So in your approach, new standby
 might fail to catch up with the master nicely.

 It seems most of this thread is focusing on the wrong thing then. If we
 really are only talking about planned failover then we need to solve
 *that* not some ominous don't flush data too early which has
 noticeable performance and implementation complexity problems.

At least I'd like to talk about not only planned failover but also normal
failover.

 I guess youre observing that not everything is replicated because youre
 doing an immediate shutdown

No. I did fast shutdown.

At fast shutdown, after walsender sends the checkpoint record and
closes the replication connection, walreceiver can detect the close
of connection before receiving all WAL records. This means that,
even if walsender sends all WAL records, walreceiver cannot always
receive all of them.

 You could even teach the standby not to increment the timeline in that
 case since thats safe.

I don't think this is required, thanks to Heikki's recent great work on
timelines.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-14 Thread Hannu Krosing

On 04/14/2013 05:56 PM, Fujii Masao wrote:

On Fri, Apr 12, 2013 at 7:57 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:

On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

You just shut down the old master and let the standby catch
up (takas a few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf
and it will follow nicely.

No. When you shut down the old master, it might not have been
able to send all the WAL records to the standby. I have observed
this situation several times. So in your approach, new standby
might fail to catch up with the master nicely.

It seems most of this thread is focusing on the wrong thing then. If we
really are only talking about planned failover then we need to solve
*that* not some ominous don't flush data too early which has
noticeable performance and implementation complexity problems.

At least I'd like to talk about not only planned failover but also normal
failover.


I guess youre observing that not everything is replicated because youre
doing an immediate shutdown

No. I did fast shutdown.

At fast shutdown, after walsender sends the checkpoint record and
closes the replication connection, walreceiver can detect the close
of connection before receiving all WAL records. This means that,
even if walsender sends all WAL records, walreceiver cannot always
receive all of them.

Seems very much like a bug, or at least a missing mode -
synchronous shutdown - where the master will wait for an ack from the standby(s)
before closing the client connection.



You could even teach the standby not to increment the timeline in that
case since thats safe.

I don't think this is required thanks to recent Heikki's great efforts about
timelines.

Regards,







Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Hannu Krosing

On 04/11/2013 07:29 PM, Fujii Masao wrote:

On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

You just shut down the old master and let the standby catch
up (takas a few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf
and it will follow nicely.

No. When you shut down the old master, it might not have been
able to send all the WAL records to the standby.

In what cases (other than a standby lagging too much or
not listening at all) have you observed this ?

I have observed
this situation several times. So in your approach, new standby
might fail to catch up with the master nicely.

the page http://wiki.postgresql.org/wiki/Streaming_Replication claims this:

* Graceful shutdown

When smart/fast shutdown is requested, the primary waits to exit
until XLOG records have been sent to the standby, up to the
shutdown checkpoint record.

Maybe you were requesting immediate shutdown ?

Regards
Hannu Krosing





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
 On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 
  You just shut down the old master and let the standby catch
  up (takas a few microseconds ;) ) before you promote it.
 
  After this you can start up the former master with recovery.conf
  and it will follow nicely.

 No. When you shut down the old master, it might not have been
 able to send all the WAL records to the standby. I have observed
 this situation several times. So in your approach, new standby
 might fail to catch up with the master nicely.

It seems most of this thread is focusing on the wrong thing then. If we
really are only talking about planned failover then we need to solve
*that*, not some ominous "don't flush data too early" which has
noticeable performance and implementation complexity problems.

I guess you're observing that not everything is replicated because you're
doing an immediate shutdown - probably because performing the shutdown
checkpoint would take too long. This seems solvable by implementing a
recovery connection command which initiates a shutdown that just
disables future WAL inserts and returns the last LSN that has been
written. Then you can fail over as soon as that LSN has been reached
and can make the previous master follow from there on without problems.

You could even teach the standby not to increment the timeline in that
case since that's safe.

The biggest issue seems to be how to implement this without another
spinlock acquisition for every XLogInsert(), but that seems possible.
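
As a sketch of the shape such a recovery-connection command's handler could
take (the names below are illustrative assumptions rather than existing APIs,
except GetXLogWriteRecPtr(), which does exist):

  /*
   * Hypothetical handler for a "stop new WAL and report end-of-WAL" command
   * issued over a replication connection during a planned failover.
   */
  static XLogRecPtr
  StopWalInsertsAndReportEnd(void)
  {
      XLogRecPtr  endptr;

      DisableFutureXLogInserts();         /* hypothetical: reject new WAL inserts */
      endptr = GetXLogWriteRecPtr();      /* last LSN written by this server */

      /*
       * The standby is promoted once it has received and flushed 'endptr';
       * the old master can then be restarted as a standby and follow it.
       */
      return endptr;
  }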

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 11:18:01 +0530, Pavan Deolasee wrote:
 On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma a...@cybertec.at wrote:
 
  On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing ha...@2ndquadrant.com
  wrote:
   On 04/11/2013 03:52 PM, Ants Aasma wrote:
  
   On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com
   wrote:
  
   The proposed fix - halting all writes of data pages to disk and
   to WAL files while waiting ACK from standby - will tremendously
   slow down all parallel work on master.
  
   This is not what is being proposed. The proposed fix halts writes of
   only data pages that are modified within the window of WAL that is not
   yet ACKed by the slave. This means pages that were recently modified
   and where the clocksweep or checkpoint has decided to evict them. This
   only affects the checkpointer, bgwriter and backends doing allocation.
   Furthermore, for the backend clocksweep case it would be reasonable to
   just pick another buffer to evict. The slowdown for most actual cases
   will be negligible.
  
   You also need to hold back all WAL writes, including the ones by
   parallel async and locally-synced transactions. Which means that
   you have to make all locally synced transactions to wait on the
   syncrep transactions committed before them.
   After getting the ACK from slave you then have a backlog of stuff
   to write locally, which then also needs to be sent to slave. Basically
   this turns a nice smooth WAL write-and-stream pipeline into a
   chunky wait-and-write-and-wait-and-stream-and-wait :P
   This may not be a problem in slight write load cases, which is
   probably the most widely happening usecase for postgres, but it
   will harm top performance and also force people to get much
   better (and more expensive) hardware than would otherways
   be needed.
 
  Why would you need to hold back WAL writes? WAL is written on master
  first and then steamed to slave as it is done now. You would only need
  hold back dirty page evictions having a recent enough LSN to not yet
  be replicated. This holding back is already done to wait for local WAL
  flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
  it's usage count gets bumped, so it will not be considered for
  eviction for at least one clocksweep cycle. In normal circumstances
  that will be enough time to get an ACK from the slave. When WAL is
  generated at an higher rate than can be replicated this will not be
  true. In that case backends that need to bring in new pages will have
  to wait for WAL to be replicated before they can continue. That will
  hopefully include the backends that are doing the dirtying, throttling
  the WAL generation rate. This would definitely be optional behavior,
  not something turned on by default.
 
 
 I agree. I don't think the proposed change would cause much of a performance
 bottleneck since the proposal is to hold back writing of dirty pages until
 the WAL is replicated successfully to the standby. The heap pages are
 mostly written by the background threads often much later than the WAL for
 the change is written. So in all likelihood, there will be no wait
 involved. Of course, this will not be true for very frequently updated
 pages that must be written at a checkpoint.

I don't think that holds true at all. If you look at pg_stat_bgwriter in
any remotely busy cluster with a hot data set larger than shared_buffers
you'll notice that a large percentage of writes will have been done by
backends themselves.
Yes, we need to improve on this, and we are talking about it right now
in another thread, but until that's solved this argumentation seems to
fall flat on its face.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Pavan Deolasee
On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund and...@2ndquadrant.comwrote:



 I don't think that holds true at all. If you look at pg_stat_bgwriter in
 any remotely bugs cluster with a hot data set over shared_buffers you'll
 notice that a large percentage of writes will have been done by backends
 themselves.


Even if what you are saying is true, which I am sure it is, the pages that the
backend is evicting must not have been recently used by the LRU algorithm,
which means that the WAL pertaining to the last change to the page has in all
likelihood already been replicated, unless replication is lagging far behind.
Of course, if the standby is not able to keep pace with the master in a
realistic manner then we have a problem with the approach.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 16:58:44 +0530, Pavan Deolasee wrote:
 On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund and...@2ndquadrant.comwrote:
 
 
 
  I don't think that holds true at all. If you look at pg_stat_bgwriter in
  any remotely bugs cluster with a hot data set over shared_buffers you'll
  notice that a large percentage of writes will have been done by backends
  themselves.
 
 
 Even if what you are saying it true, which I am sure is, the pages that the
 backend is evicting mustn't be recently used by the LRU algorithm which
 means that the WAL pertaining to the last change to the page in most
 likelihood is already replicated, unless the replication is really lagging
 much behind. Of course, if the standby is not able to keep pace with the
 master in a realistic manner then we have a problem with the approach.

It frequently takes time in the sub-second to few-second range for usage
counts to reach zero.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Amit Kapila
On Wednesday, April 10, 2013 10:31 PM Fujii Masao wrote:
 On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas
 stho...@optionshouse.com wrote:
  On 04/10/2013 11:40 AM, Fujii Masao wrote:
 
  Strange. If this is really true, shared disk failover solution is
  fundamentally broken because the standby needs to start up with the
  shared corrupted database at the failover.
 
 
  How so? Shared disk doesn't use replication. The point I was trying
 to make
  is that replication requires synchronization between two disparate
 servers,
  and verifying they have exactly the same data is a non-trivial
 exercise.
  Even a single transaction after a failover (effectively) negates the
 old
  server because there's no easy catch up mechanism yet.
 
 Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
 if we can think that any data page which has not been replicated to the
 standby
 is not written in the master, new standby (i.e., old master) can safely
 catch up
 with new master (i.e., old standby). In this approach, of course, new
 standby
 might have some WAL records which new master doesn't have, so before
 starting up new standby, we need to remove all the WAL files in new
 standby
 and retrieve any WAL files from new master. But, what's the problem in
 his
 approach?

Consider the case where the old master crashed while flushing a data page; now
you would need the full-page image from the new master.
It might so happen that on the new master a checkpoint has already purged
(reused) the log files from that timeline, in which case it will be difficult
to get the full-page image. The user can refer to the WAL archive for that,
but I think it will not be straightforward.

One more point: what will the new behavior be when there are two transactions,
one with synchronous_commit = off and the other with it on?

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Ants Aasma
On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Consider the case old-master crashed during flushing the data page, now you
 would need full page image from new-master.
 It might so happen that in new-master Checkpoint would have purged (reused)
 the log file's from that time line, in that case
 it will be difficult to get the full page image, user can refer WAL archive
 for that, but I think it will not be straight forward.

Using a WAL archive is standard procedure when you do not wish to take
new base backups all the time. This already works now, when archive
recovery is set up the WAL will be automatically fetched from the
archive.
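
For reference, the setup being assumed is just a standard archive-recovery
configuration on the standby, along these lines (paths and connection details
are placeholders):

  # recovery.conf on the standby -- a minimal sketch, 9.x-era syntax
  standby_mode = 'on'
  primary_conninfo = 'host=master.example.com port=5432 user=replication'
  restore_command = 'cp /mnt/server/archivedir/%f "%p"'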

 One more point, what will be the new behavior when there are 2 transactions
 one has synchronous_commit =off and other with on?

Exactly the same as now - waiting for commit record replication when
reporting commit success to the client and waiting for WAL replication
before writing a data page are orthogonal features.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Sameer Thakur
Hello,
The only potential use case for this that I can see, would be for system
maintenance and a controlled failover. I agree: that's a major PITA when
doing DR testing, but I personally don't think this is the way to fix that
particular edge case.

This is the use case we are trying to address (I work with Samrat). We were
wondering why this may not be a good fix?

regards
Sameer


On Thu, Apr 11, 2013 at 3:52 PM, Ants Aasma a...@cybertec.at wrote:

 On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Consider the case old-master crashed during flushing the data page, now
 you
  would need full page image from new-master.
  It might so happen that in new-master Checkpoint would have purged
 (reused)
  the log file's from that time line, in that case
  it will be difficult to get the full page image, user can refer WAL
 archive
  for that, but I think it will not be straight forward.

 Using a WAL archive is standard procedure when you do not wish to take
 new base backups all the time. This already works now, when archive
 recovery is set up the WAL will be automatically fetched from the
 archive.

  One more point, what will be the new behavior when there are 2
 transactions
  one has synchronous_commit =off and other with on?

 Exactly the same as now - waiting for commit record replication when
 reporting commit success to the client and waiting for WAL replication
 before writing a data page are orthogonal features.

 Regards,
 Ants Aasma
 --
 Cybertec Schönig & Schönig GmbH
 Gröhrmühlgasse 26
 A-2700 Wiener Neustadt
 Web: http://www.postgresql-support.de





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Hannu Krosing

On 04/11/2013 01:26 PM, Sameer Thakur wrote:

Hello,
The only potential use case for this that I can see, would be for 
system maintenance and a controlled failover. I agree: that's a major 
PITA when doing DR testing, but I personally don't think this is the 
way to fix that particular edge case.


This is the use case we are trying to address (I work with Samrat).

You just shut down the old master and let the standby catch
up (takes a few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf
and it will follow nicely.

We were wondering why this may not be a good fix?

The proposed fix - halting all writes of data pages to disk and
to WAL files while waiting for an ACK from the standby - will tremendously
slow down all parallel work on the master.

And it just turns the "master is ahead of slave" problem around
into a "slave is ahead of master" problem :)

Regards
Hannu


regards
Sameer


On Thu, Apr 11, 2013 at 3:52 PM, Ants Aasma a...@cybertec.at 
mailto:a...@cybertec.at wrote:


On Thu, Apr 11, 2013 at 10:09 AM, Amit Kapila
amit.kap...@huawei.com mailto:amit.kap...@huawei.com wrote:
 Consider the case old-master crashed during flushing the data
page, now you
 would need full page image from new-master.
 It might so happen that in new-master Checkpoint would have
purged (reused)
 the log file's from that time line, in that case
 it will be difficult to get the full page image, user can refer
WAL archive
 for that, but I think it will not be straight forward.

Using a WAL archive is standard procedure when you do not wish to take
new base backups all the time. This already works now, when archive
recovery is set up the WAL will be automatically fetched from the
archive.

 One more point, what will be the new behavior when there are 2
transactions
 one has synchronous_commit =off and other with on?

Exactly the same as now - waiting for commit record replication when
reporting commit success to the client and waiting for WAL replication
before writing a data page are orthogonal features.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de








Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Ants Aasma
On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 The proposed fix - halting all writes of data pages to disk and
 to WAL files while waiting ACK from standby - will tremendously
 slow down all parallel work on master.

This is not what is being proposed. The proposed fix halts writes of
only data pages that are modified within the window of WAL that is not
yet ACKed by the slave. This means pages that were recently modified
and where the clocksweep or checkpoint has decided to evict them. This
only affects the checkpointer, bgwriter and backends doing allocation.
Furthermore, for the backend clocksweep case it would be reasonable to
just pick another buffer to evict. The slowdown for most actual cases
will be negligible.

 And it does just turn around master is ahead of slave problem
 into slave is ahead of master problem :)

The issue is not being ahead or behind. The issue is ensuring WAL
durability in the face of failovers before modifying data pages. This
is sufficient to guarantee no forks in the WAL stream from the point
of view of data files and with that the capability to always recover
by replaying WAL. There can still be forks from the point of view of
async commits, with most recent commits disappearing on failover, but
this is in no way different from what we have now.

I don't share the view that the disk image is extremely likely to be
corrupt after a crash. If that were the case then we should recommend
that people don't use crash recovery at all and always restore from a
backup. For errors like power supply failure, uncorrectable ECC
errors, etc. we can be pretty sure that the server was not writing
garbage into the storage system before failing. Having to do a day
long rsync run + recovery to catch up on all changes during the resync
to restore high-availability safety in those circumstances is in many
cases a larger risk.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 The proposed fix - halting all writes of data pages to disk and
 to WAL files while waiting ACK from standby - will tremendously
 slow down all parallel work on master.

 This is not what is being proposed.

The claim that it won't slow things down is lacking in credible
evidence.  Moreover, whether there's any performance hit or not, it will
certainly add complexity (and risk of bugs) in a place where we can ill
afford more.  And the bottom line remains: this is optimizing for the
wrong thing.

regards, tom lane




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Hannu Krosing

On 04/11/2013 03:52 PM, Ants Aasma wrote:

On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

The proposed fix - halting all writes of data pages to disk and
to WAL files while waiting ACK from standby - will tremendously
slow down all parallel work on master.

This is not what is being proposed. The proposed fix halts writes of
only data pages that are modified within the window of WAL that is not
yet ACKed by the slave. This means pages that were recently modified
and where the clocksweep or checkpoint has decided to evict them. This
only affects the checkpointer, bgwriter and backends doing allocation.
Furthermore, for the backend clocksweep case it would be reasonable to
just pick another buffer to evict. The slowdown for most actual cases
will be negligible.

You also need to hold back all WAL writes, including the ones by
parallel async and locally-synced transactions. Which means that
you have to make all locally synced transactions wait on the
syncrep transactions committed before them.
After getting the ACK from the slave you then have a backlog of stuff
to write locally, which then also needs to be sent to the slave. Basically
this turns a nice smooth WAL write-and-stream pipeline into a
chunky wait-and-write-and-wait-and-stream-and-wait :P
This may not be a problem under light write loads, which is
probably the most common use case for postgres, but it
will harm top performance and also force people to get much
better (and more expensive) hardware than would otherwise
be needed.



And it does just turn around master is ahead of slave problem
into slave is ahead of master problem :)

The issue is not being ahead or behind. The issue is ensuring WAL
durability in the face of failovers before modifying data pages. This
is sufficient to guarantee no forks in the WAL stream from the point
of view of data files and with that the capability to always recover
by replaying WAL.

How would this handle the case Tom pointed out, namely a short
power recycling on the master?

Instead of just continuing after booting up again, the master now
has to figure out whether it had any slaves and then try to query them
(for how long?) to see whether they have replayed any WAL the master does
not know of.

Suddenly the mere existence of streaming replica slaves has become
a problem for the master!

This will especially complicate the case of multiple slaves, each
having received WAL up to a slightly different LSN. And you do want
to have at least 2 slaves if you want both durability
and availability with syncrep.

What if one of the slaves disconnects? How should the master react to this?


Regards
Hannu Krosing




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Ants Aasma
On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 04/11/2013 03:52 PM, Ants Aasma wrote:

 On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com
 wrote:

 The proposed fix - halting all writes of data pages to disk and
 to WAL files while waiting ACK from standby - will tremendously
 slow down all parallel work on master.

 This is not what is being proposed. The proposed fix halts writes of
 only data pages that are modified within the window of WAL that is not
 yet ACKed by the slave. This means pages that were recently modified
 and where the clocksweep or checkpoint has decided to evict them. This
 only affects the checkpointer, bgwriter and backends doing allocation.
 Furthermore, for the backend clocksweep case it would be reasonable to
 just pick another buffer to evict. The slowdown for most actual cases
 will be negligible.

 You also need to hold back all WAL writes, including the ones by
 parallel async and locally-synced transactions. Which means that
 you have to make all locally synced transactions to wait on the
 syncrep transactions committed before them.
 After getting the ACK from slave you then have a backlog of stuff
 to write locally, which then also needs to be sent to slave. Basically
 this turns a nice smooth WAL write-and-stream pipeline into a
 chunky wait-and-write-and-wait-and-stream-and-wait :P
 This may not be a problem in slight write load cases, which is
 probably the most widely happening usecase for postgres, but it
 will harm top performance and also force people to get much
 better (and more expensive) hardware than would otherways
 be needed.

Why would you need to hold back WAL writes? WAL is written on the master
first and then streamed to the slave as it is done now. You would only need
to hold back dirty page evictions having a recent enough LSN to not yet
be replicated. This holding back is already done to wait for local WAL
flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
its usage count gets bumped, so it will not be considered for
eviction for at least one clocksweep cycle. In normal circumstances
that will be enough time to get an ACK from the slave. When WAL is
generated at a higher rate than can be replicated this will not be
true. In that case backends that need to bring in new pages will have
to wait for WAL to be replicated before they can continue. That will
hopefully include the backends that are doing the dirtying, throttling
the WAL generation rate. This would definitely be optional behavior,
not something turned on by default.
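
The analogy is to the existing wait for the local WAL flush before a dirty
page is written out; the optional behaviour described above would add one more
wait, on the replicated position, roughly like this (XLogWaitForReplication()
and the GUC are hypothetical, not existing APIs):

  /*
   * Sketch of the proposed ordering in a FlushBuffer()-like path.
   * recptr is the LSN of the last WAL record that touched this page.
   */
  XLogFlush(recptr);                         /* existing rule: WAL to local disk first */

  if (wait_for_replication_before_write)     /* hypothetical GUC */
      XLogWaitForReplication(recptr);        /* hypothetical: block until the
                                              * standby has ACKed up to recptr */

  /* only now is the dirty page handed to smgrwrite(), as today */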


 And it does just turn around master is ahead of slave problem
 into slave is ahead of master problem :)

 The issue is not being ahead or behind. The issue is ensuring WAL
 durability in the face of failovers before modifying data pages. This
 is sufficient to guarantee no forks in the WAL stream from the point
 of view of data files and with that the capability to always recover
 by replaying WAL.

 How would this handle the case Tom pointed out, namely a short
 power recycling on master ?

 Instead of just continuing after booting up again the master now
 has to figure out if it had any slaves and then try to query them
 (for how long?) if they had any replayed WAL the master does
 not know of.

If the master is restarted and there is no failover to the slave, then
nothing strange would happen, master does recovery, comes up and
starts streaming to the slave again. If there is a failover, then
whatever is managing the failover needs to ensure that the master does
not come up again on its own before it is reconfigured as a slave.
This is what HA cluster managers do.

 Suddenly the pure existence of streaming replica slaves has become
 a problem for master !

 This will especially complicate the case of multiple slaves each
 having received WAL to a slightly different LSN ? And you do want
 to have at least 2 slaves if you want both durability
 and availability with syncrep.

 What if the one of slaves disconnects ? how should master react to this ?

Again, WAL replication will be the same as it is now. Availability
considerations, including what to do when slaves go away, are the same
as for current sync replication. Only required change is that we can
configure the master to hold out on writing any data pages that
contain changes that might go missing in the case of a failover.

Whether the additional complexity is worth the feature is a matter of
opinion. As we have no patch yet I can't say that I know what all the
implications are, but at first glance the complexity seems rather
compartmentalized. This would only amend what the concept of a WAL
flush considers safely flushed.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Fujii Masao
On Thu, Apr 11, 2013 at 2:42 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ants Aasma a...@cybertec.at writes:
 We already rely on WAL-before-data to ensure correct recovery. What is
 proposed here is to slightly redefine it to require WAL to be
 replicated before it is considered to be flushed. This ensures that no
 data page on disk differs from the WAL that the slave has. The
 machinery to do this is already mostly there, we already wait for WAL
 flushes and we know the write location on the slave. The second
 requirement is that we never start up as master and we don't trust any
 local WAL. This is actually how pacemaker clusters work, you would
 only need to amend the RA to wipe the WAL and configure postgresql
 with restart_after_crash = false.

 It would be very helpful in restoring HA capability after failover if
 we wouldn't have to read through the whole database after a VM goes
 down and is migrated with the shared disk onto a new host.

 The problem with this is it's making an idealistic assumption that a
 crashed master didn't do anything wrong or lose/corrupt any data during
 its crash.  As soon as you realize that's an unsafe assumption, the
 whole thing becomes worthless to you.

The crash recovery relies on the same assumption. If it's really unsafe,
we should stop supporting the crash recovery. But I don't think that's
always true.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Fujii Masao
On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

 You just shut down the old master and let the standby catch
 up (takas a few microseconds ;) ) before you promote it.

 After this you can start up the former master with recovery.conf
 and it will follow nicely.

No. When you shut down the old master, it might not have been
able to send all the WAL records to the standby. I have observed
this situation several times. So in your approach, new standby
might fail to catch up with the master nicely.

Regards,

--
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Fujii Masao
On Fri, Apr 12, 2013 at 12:09 AM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 04/11/2013 03:52 PM, Ants Aasma wrote:

 On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com
 wrote:

 The proposed fix - halting all writes of data pages to disk and
 to WAL files while waiting ACK from standby - will tremendously
 slow down all parallel work on master.

 This is not what is being proposed. The proposed fix halts writes of
 only data pages that are modified within the window of WAL that is not
 yet ACKed by the slave. This means pages that were recently modified
 and where the clocksweep or checkpoint has decided to evict them. This
 only affects the checkpointer, bgwriter and backends doing allocation.
 Furthermore, for the backend clocksweep case it would be reasonable to
 just pick another buffer to evict. The slowdown for most actual cases
 will be negligible.

 You also need to hold back all WAL writes, including the ones by
 parallel async and locally-synced transactions. Which means that
 you have to make all locally synced transactions to wait on the
 syncrep transactions committed before them.
 After getting the ACK from slave you then have a backlog of stuff
 to write locally, which then also needs to be sent to slave. Basically
 this turns a nice smooth WAL write-and-stream pipeline into a
 chunky wait-and-write-and-wait-and-stream-and-wait :P
 This may not be a problem in slight write load cases, which is
 probably the most widely happening usecase for postgres, but it
 will harm top performance and also force people to get much
 better (and more expensive) hardware than would otherways
 be needed.

 Why would you need to hold back WAL writes? WAL is written on master
 first and then steamed to slave as it is done now. You would only need
 hold back dirty page evictions having a recent enough LSN to not yet
 be replicated. This holding back is already done to wait for local WAL
 flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
 it's usage count gets bumped, so it will not be considered for
 eviction for at least one clocksweep cycle. In normal circumstances
 that will be enough time to get an ACK from the slave. When WAL is
 generated at an higher rate than can be replicated this will not be
 true. In that case backends that need to bring in new pages will have
 to wait for WAL to be replicated before they can continue. That will
 hopefully include the backends that are doing the dirtying, throttling
 the WAL generation rate. This would definitely be optional behavior,
 not something turned on by default.


 And it does just turn around master is ahead of slave problem
 into slave is ahead of master problem :)

 The issue is not being ahead or behind. The issue is ensuring WAL
 durability in the face of failovers before modifying data pages. This
 is sufficient to guarantee no forks in the WAL stream from the point
 of view of data files and with that the capability to always recover
 by replaying WAL.

 How would this handle the case Tom pointed out, namely a short
 power recycling on master ?

 Instead of just continuing after booting up again the master now
 has to figure out if it had any slaves and then try to query them
 (for how long?) if they had any replayed WAL the master does
 not know of.

 If the master is restarted and there is no failover to the slave, then
 nothing strange would happen, master does recovery, comes up and
 starts streaming to the slave again. If there is a failover, then
 whatever is managing the failover needs to ensure that the master does
 not come up again on its own before it is reconfigured as a slave.
 This is what HA cluster managers do.

 Suddenly the pure existence of streaming replica slaves has become
 a problem for master !

 This will especially complicate the case of multiple slaves each
 having received WAL to a slightly different LSN ? And you do want
 to have at least 2 slaves if you want both durability
 and availability with syncrep.

 What if the one of slaves disconnects ? how should master react to this ?

 Again, WAL replication will be the same as it is now. Availability
 considerations, including what to do when slaves go away, are the same
 as for current sync replication. Only required change is that we can
 configure the master to hold out on writing any data pages that
 contain changes that might go missing in the case of a failover.

 Whether the additional complexity is worth the feature is a matter of
 opinion. As we have no patch yet I can't say that I know what all the
 implications are, but at first glance the complexity seems rather
 compartmentalized. This would only amend what the concept of a WAL
 flush considers safely flushed.

I really share the same view with you!

Regards,

-- 
Fujii Masao



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-11 Thread Pavan Deolasee
On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma a...@cybertec.at wrote:

 On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing ha...@2ndquadrant.com
 wrote:
  On 04/11/2013 03:52 PM, Ants Aasma wrote:
 
  On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com
  wrote:
 
  The proposed fix - halting all writes of data pages to disk and
  to WAL files while waiting ACK from standby - will tremendously
  slow down all parallel work on master.
 
  This is not what is being proposed. The proposed fix halts writes of
  only data pages that are modified within the window of WAL that is not
  yet ACKed by the slave. This means pages that were recently modified
  and where the clocksweep or checkpoint has decided to evict them. This
  only affects the checkpointer, bgwriter and backends doing allocation.
  Furthermore, for the backend clocksweep case it would be reasonable to
  just pick another buffer to evict. The slowdown for most actual cases
  will be negligible.
 
  You also need to hold back all WAL writes, including the ones by
  parallel async and locally-synced transactions. Which means that
  you have to make all locally synced transactions to wait on the
  syncrep transactions committed before them.
  After getting the ACK from slave you then have a backlog of stuff
  to write locally, which then also needs to be sent to slave. Basically
  this turns a nice smooth WAL write-and-stream pipeline into a
  chunky wait-and-write-and-wait-and-stream-and-wait :P
  This may not be a problem in slight write load cases, which is
  probably the most widely happening usecase for postgres, but it
  will harm top performance and also force people to get much
  better (and more expensive) hardware than would otherways
  be needed.

 Why would you need to hold back WAL writes? WAL is written on master
 first and then steamed to slave as it is done now. You would only need
 hold back dirty page evictions having a recent enough LSN to not yet
 be replicated. This holding back is already done to wait for local WAL
 flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
 it's usage count gets bumped, so it will not be considered for
 eviction for at least one clocksweep cycle. In normal circumstances
 that will be enough time to get an ACK from the slave. When WAL is
 generated at an higher rate than can be replicated this will not be
 true. In that case backends that need to bring in new pages will have
 to wait for WAL to be replicated before they can continue. That will
 hopefully include the backends that are doing the dirtying, throttling
 the WAL generation rate. This would definitely be optional behavior,
 not something turned on by default.


I agree. I don't think the proposed change would cause much of a performance
bottleneck since the proposal is to hold back writing of dirty pages until
the WAL is replicated successfully to the standby. The heap pages are
mostly written by the background processes, often much later than the WAL for
the change is written. So in all likelihood, there will be no wait
involved. Of course, this will not be true for very frequently updated
pages that must be written at a checkpoint.

But I wonder if the problem is really limited to the heap pages? Even for
something like a CLOG page, we will need to ensure that the WAL records are
replicated before the page is written to the disk. The same is true for
relation truncation. In fact, all places where the master needs to call
XLogFlush() probably need to be examined to decide if the subsequent action
has a chance to leave the database corrupt, and to ensure that the WAL is
replicated before proceeding with the change.
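
A rough sketch of what such a wrapper could look like, with WaitForStandbyAck()
and the GUC variable being hypothetical names rather than existing APIs:

    /*
     * Sketch only.  Callers that are about to write data pages (bufmgr,
     * CLOG, relation truncation, ...) would use this instead of a bare
     * XLogFlush(), so that data hitting disk never gets ahead of the WAL
     * the standby has acknowledged.
     */
    void
    XLogFlushAndReplicate(XLogRecPtr record)
    {
        XLogFlush(record);              /* existing local flush to disk */
        if (wait_for_standby_write)     /* hypothetical GUC, off by default */
            WaitForStandbyAck(record);  /* block until standby ACKs up to 'record' */
    }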

Tom has a very valid concern from the additional code complexity point of
view, though I disagree that it's always a good idea to start with a fresh
rsync. If we can avoid that with the right checks, I don't see why we should
not improve the downtime for the master. It's very likely that the standby
may not be as good a server as the master is and the user would want to
quickly switch back to the master for performance reasons. To reduce
complexity, can we do this as some sort of plugin for XLogFlush() which
gets to know that XLogFlush has been called up to the given LSN and the event
that caused the function to be called? We can then leave the handling of the
event to the implementer. This will also avoid any penalty for those who are
happy with the current mechanism and do not want any complex HA setups.
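
Something along these lines, perhaps, where the hook type and the reason
argument are made up for illustration (the reason would have to be passed in
by each caller):

    /* Sketch of a possible hook; nothing like this exists in core today. */
    typedef void (*xlog_flush_hook_type) (XLogRecPtr flushed_upto, int reason);
    xlog_flush_hook_type xlog_flush_hook = NULL;

    /* at the end of XLogFlush(), once the local flush has succeeded: */
    if (xlog_flush_hook != NULL)
        xlog_flush_hook(record, reason);  /* e.g. buffer write, CLOG write, ... */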

Thanks,
Pavan

http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Samrat Revagade
it's one of the reasons why a fresh base backup is required when starting
old master as new standby? If yes, I agree with you. I've often heard the
complaints about a backup when restarting new standby. That's really big
problem.

I think Fujii Masao is on the same page.



In case of syncrep the master just waits for confirmation from standby
before returning to client on commit.

Not just commit, you must stop any *writing* of the wal records
effectively killing any parallelism.
 Min issue is that it will make *all* backends dependant on each sync
commit, essentially serialising all backends commits, with the
serialisation *including* the latency of roundtrip to client. With current
sync streaming the other backends can continue to write wal, with proposed
approach you cannot write any records after the one waiting an ACK from
standby.



Let me rephrase the proposal in a more accurate manner:

Consider following scenario:



(1) A client sends the COMMIT command to the master server.

(2) The master writes WAL record to disk

(3) The master writes the data page related to this transaction, e.g. via
checkpoint or bgwriter.

(4) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(5) The standby receives WAL records, writes them to the disk, and then
replies with an ACK.

(6) The master returns a success indication to the client after it receives
the ACK.



If failover happens between (3) and (4), WAL and DB data in the old master are
ahead of those in the new master. After failover, the new master continues
running new transactions independently from the old master. Then WAL records
and DB data would become inconsistent between those two servers. To resolve
these inconsistencies, a backup of the new master needs to be taken onto the
new standby.


But taking a backup is not feasible for a database of several TB over a slow
WAN.

So to avoid this type of inconsistency without taking a fresh backup, we are
thinking of doing the following:



 I think that you can introduce GUC specifying whether this extra check
is required to avoid a backup when failback.

Approach:

Introduce a new GUC option specifying whether to prevent PostgreSQL from
writing DB data before the corresponding WAL records have been replicated to
the standby. That is, if this GUC option is enabled, PostgreSQL waits for the
corresponding WAL records to be not only written to the disk but also
replicated to the standby before writing DB data.
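
For illustration only, such a switch could be declared roughly as below; the
name wait_for_standby_write is made up, and a core implementation would add a
guc.c entry rather than use the extension-style API shown here:

    /* Sketch only -- hypothetical GUC, not an existing setting. */
    bool wait_for_standby_write = false;

    DefineCustomBoolVariable("wait_for_standby_write",
                             "Wait for WAL to reach the synchronous standby "
                             "before writing the corresponding data pages.",
                             NULL,                  /* long description */
                             &wait_for_standby_write,
                             false,                 /* default: off */
                             PGC_SIGHUP,
                             0,                     /* flags */
                             NULL, NULL, NULL);     /* check/assign/show hooks */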


So the process becomes as follows:

(1) A client sends the COMMIT command to the master server.

(2) The master writes the commit WAL record to the disk.

(3) The master sends WAL records continuously to standby up to the commit
WAL record.

(4) The standby receives WAL records, writes them to disk, and then replies
with an ACK.

(5) *The master then forces a write of the data page related to this
transaction.*

(6) The master returns a success indication to the client after it receives
the ACK.



While the master is waiting to force a write (point 5) for this data page,
streaming replication continues. Also, other data page writes are not
dependent on this particular page write, so data page writes are not
serialized.


Regards,

Samrat


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Samrat Revagade
(5) *The master then forces a write of the data page related to this
transaction.*

*Sorry, this is incorrect. Whenever the master writes the data page it
checks that the WAL record has been written on the standby up to that LSN.*

While master is waiting to force a write (point 5) for this data page,
streaming replication continuous.

Also other data page writes are not dependent on this particular page
write. So the commit of data pages are not serialized.

*Sorry, this is incorrect. Streaming replication continues and the master is
not waiting; whenever the master writes the data page it checks that the WAL
record has been written on the standby up to that LSN.*

Regards,

Samrat


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Amit Kapila
On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

(5) The master then forces a write of the data page related to this
transaction.
 Sorry, this is incorrect. Whenever the master writes the data page it
checks that the WAL record is written in standby till that LSN. 

 While master is waiting to force a write (point 5) for this data page,
streaming replication continuous. 
 Also other data page writes are not dependent on this particular page
write. So the commit of data pages are not serialized.

Sorry, this is incorrect. Streaming replication continuous, master is not
waiting, whenever the master writes the data page it checks that the WAL
record is written in standby till that LSN.

I am not sure it will resolve the problem completely, as your old master can
have some extra WAL beyond the new master for the same timeline. I don't
remember exactly whether the timeline switch feature takes care of this extra
WAL; Heikki can confirm this point.
Also I think this can serialize the flush of data pages in checkpoint/bgwriter,
which is currently not the case.

With Regards,
Amit Kapila.








Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Tom Lane
Amit Kapila amit.kap...@huawei.com writes:
 On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
 Sorry, this is incorrect. Streaming replication continuous, master is not
 waiting, whenever the master writes the data page it checks that the WAL
 record is written in standby till that LSN.

 I am not sure it will resolve the problem completely as your old-master can
 have some WAL extra then new-master for same timeline. I don't remember
 exactly will timeline switch feature
 take care of this extra WAL, Heikki can confirm this point?
 Also I think this can serialize flush of data pages in checkpoint/bgwriter
 which is currently not the case.

Yeah.  TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master".  This is not merely optimizing for the
wrong thing, it's positively hazardous.  After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

regards, tom lane




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Andres Freund
On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
  On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
  Sorry, this is incorrect. Streaming replication continuous, master is not
  waiting, whenever the master writes the data page it checks that the WAL
  record is written in standby till that LSN.
 
  I am not sure it will resolve the problem completely as your old-master can
  have some WAL extra then new-master for same timeline. I don't remember
  exactly will timeline switch feature
  take care of this extra WAL, Heikki can confirm this point?
  Also I think this can serialize flush of data pages in checkpoint/bgwriter
  which is currently not the case.
 
 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care. So it's mostly in catastrophic situations that this
becomes a problem, and in those you really should resync - and it's a good
idea not to use a normal rsync but rsync --checksum or similar.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 09:10 AM, Tom Lane wrote:


IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.


Totally. Out in the field, we consider the old database corrupt the 
moment we fail over. There is literally no way to verify the safety of 
any data along the broken chain, given race conditions and multiple 
potential failure points.


The only potential use case for this that I can see, would be for system 
maintenance and a controlled failover. I agree: that's a major PITA when 
doing DR testing, but I personally don't think this is the way to fix 
that particular edge case.


Maybe checksums will fix this in the long run... I don't know. DRBD has 
a handy block-level verify function for things like this, and it can 
re-sync master/slave data by comparing the commit log across the servers 
if you tell it one node should be considered incorrect.


The thing is... we have clogs, and we have WAL. If we can assume 
bidirectional communication and verification (checksum comparison?) of 
both of those components, the database *should* be able to re-sync itself.


Even if that were possible given the internals, I can't see anyone 
jumping on this before 9.4 or 9.5 unless someone sponsors the feature.


Automatic re-sync would (within available WALs) be an awesome feature, 
though...


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 09:10 AM, Tom Lane wrote:

 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.


 Totally. Out in the field, we consider the old database corrupt the moment
 we fail over.

Strange. If this is really true, the shared-disk failover solution is
fundamentally broken, because the standby needs to start up with the shared,
corrupted database at failover. Also, we cannot trust crash recovery at all
if we adopt the same logic as you suggest. I think that there are cases where
we can replay and reuse the old database even after PostgreSQL crashes.

Regards,

--
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 11:40 AM, Fujii Masao wrote:


Strange. If this is really true, shared disk failover solution is
fundamentally broken because the standby needs to start up with the
shared corrupted database at the failover.


How so? Shared disk doesn't use replication. The point I was trying to 
make is that replication requires synchronization between two disparate 
servers, and verifying they have exactly the same data is a non-trivial 
exercise. Even a single transaction after a failover (effectively) 
negates the old server because there's no easy catch up mechanism yet.


Even if this isn't necessarily true, it's the safest approach IMO.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
  On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
  Sorry, this is incorrect. Streaming replication continuous, master is not
  waiting, whenever the master writes the data page it checks that the WAL
  record is written in standby till that LSN.

  I am not sure it will resolve the problem completely as your old-master can
  have some WAL extra then new-master for same timeline. I don't remember
  exactly will timeline switch feature
  take care of this extra WAL, Heikki can confirm this point?
  Also I think this can serialize flush of data pages in checkpoint/bgwriter
  which is currently not the case.

 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.

 Agreed. Especially as in situations where you fall over in a planned
 way, e.g. for a hardware upgrade, you can avoid the need to resync with
 a littlebit of care.

That's really worth documenting.

 So its mostly in catastrophic situations this
 becomes a problem and in those you really should resync - and its a good
 idea not to use a normal rsync but a rsync --checksum or similar.

If the database is very large, rsync --checksum takes a very long time. And
I'm concerned that most of the data pages in the master have different
checksums from those in the standby because of commit hint bits. I'm not sure
how rsync --checksum can speed up the backup after failover.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 11:40 AM, Fujii Masao wrote:

 Strange. If this is really true, shared disk failover solution is
 fundamentally broken because the standby needs to start up with the
 shared corrupted database at the failover.


 How so? Shared disk doesn't use replication. The point I was trying to make
 is that replication requires synchronization between two disparate servers,
 and verifying they have exactly the same data is a non-trivial exercise.
 Even a single transaction after a failover (effectively) negates the old
 server because there's no easy catch up mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
if we can guarantee that any data page whose WAL has not been replicated to
the standby is not written on the master, the new standby (i.e., old master)
can safely catch up with the new master (i.e., old standby). In this approach,
of course, the new standby might have some WAL records which the new master
doesn't have, so before starting up the new standby, we need to remove all the
WAL files on the new standby and retrieve the WAL files from the new master.
But, what's the problem with his approach?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Ants Aasma
On Wed, Apr 10, 2013 at 7:44 PM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 11:40 AM, Fujii Masao wrote:

 Strange. If this is really true, shared disk failover solution is
 fundamentally broken because the standby needs to start up with the
 shared corrupted database at the failover.


 How so? Shared disk doesn't use replication. The point I was trying to make
 is that replication requires synchronization between two disparate servers,
 and verifying they have exactly the same data is a non-trivial exercise.
 Even a single transaction after a failover (effectively) negates the old
 server because there's no easy catch up mechanism yet.

 Even if this isn't necessarily true, it's the safest approach IMO.

We already rely on WAL-before-data to ensure correct recovery. What is
proposed here is to slightly redefine it to require WAL to be
replicated before it is considered to be flushed. This ensures that no
data page on disk is ahead of the WAL that the slave has. The
machinery to do this is already mostly there: we already wait for WAL
flushes and we know the write location on the slave. The second
requirement is that we never start up as master and we don't trust any
local WAL. This is actually how pacemaker clusters work; you would
only need to amend the RA to wipe the WAL and configure postgresql
with restart_after_crash = false.

It would be very helpful in restoring HA capability after failover if
we didn't have to read through the whole database after a VM goes
down and is migrated with the shared disk onto a new host.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 We already rely on WAL-before-data to ensure correct recovery. What is
 proposed here is to slightly redefine it to require WAL to be
 replicated before it is considered to be flushed. This ensures that no
 data page on disk differs from the WAL that the slave has. The
 machinery to do this is already mostly there, we already wait for WAL
 flushes and we know the write location on the slave. The second
 requirement is that we never start up as master and we don't trust any
 local WAL. This is actually how pacemaker clusters work, you would
 only need to amend the RA to wipe the WAL and configure postgresql
 with restart_after_crash = false.

 It would be very helpful in restoring HA capability after failover if
 we wouldn't have to read through the whole database after a VM goes
 down and is migrated with the shared disk onto a new host.

The problem with this is it's making an idealistic assumption that a
crashed master didn't do anything wrong or lose/corrupt any data during
its crash.  As soon as you realize that's an unsafe assumption, the
whole thing becomes worthless to you.

If the idea had zero implementation cost, I would say sure, let people
play with it until they find out (probably the hard way) that it's a bad
idea.  But it's going to introduce, at the very least, additional
complexity into a portion of the system that is critical and plenty
complicated enough already.  That being the case, I don't want it there
at all, not even as an option.

regards, tom lane




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Boszormenyi Zoltan

2013-04-10 18:46 keltezéssel, Fujii Masao írta:

On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-04-10 10:10:31 -0400, Tom Lane wrote:

Amit Kapila amit.kap...@huawei.com writes:

On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

Sorry, this is incorrect. Streaming replication continuous, master is not
waiting, whenever the master writes the data page it checks that the WAL
record is written in standby till that LSN.

I am not sure it will resolve the problem completely as your old-master can
have some WAL extra then new-master for same timeline. I don't remember
exactly will timeline switch feature
take care of this extra WAL, Heikki can confirm this point?
Also I think this can serialize flush of data pages in checkpoint/bgwriter
which is currently not the case.

Yeah.  TBH this entire discussion seems to be let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master.  This is not merely optimizing for the
wrong thing, it's positively hazardous.  After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

Agreed. Especially as in situations where you fall over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a littlebit of care.

It's really worth documenting that way.


So its mostly in catastrophic situations this
becomes a problem and in those you really should resync - and its a good
idea not to use a normal rsync but a rsync --checksum or similar.

If database is very large, rsync --checksum takes very long. And I'm concerned
that most of data pages in master has the different checksum from those in the
standby because of commit hint bit. I'm not sure how rsync --checksum can
speed up the backup after failover.


rsync --checksum alone may not, but rsync --inplace may speed up the backup a
lot.



Regards,




--
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Andres Freund
On 2013-04-10 20:39:25 +0200, Boszormenyi Zoltan wrote:
 2013-04-10 18:46 keltezéssel, Fujii Masao írta:
 On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
 On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
 Sorry, this is incorrect. Streaming replication continuous, master is not
 waiting, whenever the master writes the data page it checks that the WAL
 record is written in standby till that LSN.
 I am not sure it will resolve the problem completely as your old-master 
 can
 have some WAL extra then new-master for same timeline. I don't remember
 exactly will timeline switch feature
 take care of this extra WAL, Heikki can confirm this point?
 Also I think this can serialize flush of data pages in checkpoint/bgwriter
 which is currently not the case.
 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.
 Agreed. Especially as in situations where you fall over in a planned
 way, e.g. for a hardware upgrade, you can avoid the need to resync with
 a littlebit of care.
 It's really worth documenting that way.
 
 So its mostly in catastrophic situations this
 becomes a problem and in those you really should resync - and its a good
 idea not to use a normal rsync but a rsync --checksum or similar.
 If database is very large, rsync --checksum takes very long. And I'm 
 concerned
 that most of data pages in master has the different checksum from those in 
 the
 standby because of commit hint bit. I'm not sure how rsync --checksum can
 speed up the backup after failover.

It's not about speed, it's about correctness.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-09 Thread Samrat Revagade
What Samrat is proposing here is that WAL is not flushed to the OS before
it is acked by a synchronous replica so recovery won't go past the
timeline change made in failover, making it necessary to take a new
base backup to resync with the new master.

Actually we are proposing that the data page on the master is not written
until the master receives an ACK from the standby. The WAL files can be
flushed to the disk on both the master and standby before the standby sends
the ACK to the master. The end objective is the same: avoiding having to take
a base backup of the old master to resync with the new master.

Why do you think that the inconsistent data after failover happens is
problem? Because

it's one of the reasons why a fresh base backup is required when
starting old master as
new standby? If yes, I agree with you. I've often heard the complaints
about a backup
when restarting new standby. That's really big problem.

 Yes, taking a backup is a major problem when the database size is more than
several TB. It would take a very long time to ship the backup data over a slow
WAN network.

 One solution to avoid this situation is have the master send WAL records
to standby and wait for ACK from standby committing WAL files to disk and
only after that commit data page related to this transaction on master.

You mean to make the master wait the data page write until WAL has been
not only
flushed to disk but also replicated to the standby?

 Yes. The master should not write the data page before the corresponding WAL
records have been replicated to the standby. The WAL records will have been
flushed to disk on both master and standby.

 The main drawback would be increased wait time for the client due to
extra round trip to standby before master sends ACK to client. Are there
any other issues with this approach?

I think that you can introduce GUC specifying whether this extra check
is required to avoid a backup when failback

That would be a better idea. We can disable it whenever taking a fresh backup
is not a problem.


Regards,

Samrat



On Mon, Apr 8, 2013 at 10:40 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Mon, Apr 8, 2013 at 7:34 PM, Samrat Revagade
 revagade.sam...@gmail.com wrote:
 
  Hello,
 
  We have been trying to figure out possible solutions to the following
 problem in streaming replication Consider following scenario:
 
  If master receives commit command, it writes and flushes commit WAL
 records to the disk, It also writes and flushes data page related to this
 transaction.
 
  The master then sends WAL records to standby up to the commit WAL
 record. But before sending these records if failover happens then,  old
 master is ahead of  standby which is now the new master in terms of DB data
 leading to inconsistent data .

 Why do you think that the inconsistent data after failover happens is
 problem? Because
 it's one of the reasons why a fresh base backup is required when
 starting old master as
 new standby? If yes, I agree with you. I've often heard the complaints
 about a backup
 when restarting new standby. That's really big problem.

 The timeline mismatch after failover was one of the reasons why a
 backup is required.
 But, thanks to Heikki's recent work, that's solved, i.e., the timeline
 mismatch would be
 automatically resolved when starting replication in 9.3. So, the
 remaining problem is an
 inconsistent database.

  One solution to avoid this situation is have the master send WAL records
 to standby and wait for ACK from standby committing WAL files to disk and
 only after that commit data page related to this transaction on master.

 You mean to make the master wait the data page write until WAL has been
 not only
 flushed to disk but also replicated to the standby?

  The main drawback would be increased wait time for the client due to
 extra round trip to standby before master sends ACK to client. Are there
 any other issues with this approach?

 I think that you can introduce GUC specifying whether this extra check
 is required to
 avoid a backup when failback.

 Regards,

 --
 Fujii Masao



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-09 Thread Ants Aasma
On Tue, Apr 9, 2013 at 9:42 AM, Samrat Revagade
revagade.sam...@gmail.com wrote:
What Samrat is proposing here is that WAL is not flushed to the OS before
it is acked by a synchronous replica so recovery won't go past the
timeline change made in failover, making it necessary to take a new
base backup to resync with the new master.

 Actually we are proposing that the data page on the master is not committed
 till master receives ACK from the standby. The WAL files can be flushed to
 the disk on both the master and standby, before standby generates ACK to
 master. The end objective is the same of avoiding to take base backup of old
 master to resync with new master.

Sorry for misreading your e-mail. It seems like we are on the same
page here. I too have found this an annoying limitation in using
replication in an unreliable environment.

  Yes, taking backup is  major problem when the database size is more than
 several TB. It would take very long time to ship backup data over the slow
 WAN network.

For a WAN environment rsync can be a good enough answer, as only a tiny
number of pages will actually be transferred. This assumes a smallish
database and low bandwidth. For larger databases, avoiding the need to
read in the whole database for differences is an obvious win.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-09 Thread Hannu Krosing

On 04/08/2013 12:34 PM, Samrat Revagade wrote:


Hello,

We have been trying to figure out possible solutions to the following 
problem in streaming replication Consider following scenario:


If master receives commit command, it writes and flushes commit WAL 
records to the disk, It also writes and flushes data page related to 
this transaction.


No data page flushing takes place. All data page writing is delayed to 
bgWriter and/or checkpoints.


The master then sends WAL records to standby up to the commit WAL 
record. But before sending these records if failover happens then,  
old master is ahead of  standby which is now the new master in terms 
of DB data leading to inconsistent data .


The master sends WAL records to the standby continuously, not upon the commit
WAL record.


In case of syncrep the master just waits for confirmation from standby 
before returning to client on commit.


One solution to avoid this situation is have the master send WAL 
records to standby and wait for ACK from standby committing WAL files 
to disk and only after that commit data page related to this 
transaction on master.


Not just commit - you must stop any *writing* of the WAL records,
effectively killing any parallelism.


The main drawback would be increased wait time for the client due to 
extra round trip to standby before master sends ACK to client. Are 
there any other issues with this approach?



Main issue is that it will make *all* backends dependent on each sync
commit, essentially serialising all backends' commits, with the
serialisation *including* the latency of the roundtrip to the client.

With current sync streaming the other backends can continue to write WAL;
with the proposed approach you cannot write any records after the one
waiting for an ACK from the standby.



Thank you,

Samrat







[HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Samrat Revagade
Hello,

We have been trying to figure out possible solutions to the following
problem in streaming replication Consider following scenario:

If the master receives a commit command, it writes and flushes the commit WAL
records to the disk. It also writes and flushes the data page related to this
transaction.

The master then sends WAL records to the standby up to the commit WAL record.
But if failover happens before these records are sent, the old master is
ahead of the standby (which is now the new master) in terms of DB data,
leading to inconsistent data.



One solution to avoid this situation is to have the master send WAL records to
the standby, wait for an ACK from the standby after it commits the WAL files
to disk, and only after that write the data page related to this transaction
on the master.

The main drawback would be increased wait time for the client due to the extra
round trip to the standby before the master sends the ACK to the client. Are
there any other issues with this approach?


Thank you,

Samrat


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Shaun Thomas

On 04/08/2013 05:34 AM, Samrat Revagade wrote:


One solution to avoid this situation is have the master send WAL
records to standby and wait for ACK from standby committing WAL files
to disk and only after that commit data page related to this
transaction on master.


Isn't this basically what synchronous replication does in PG 9.1+?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Tom Lane
Samrat Revagade revagade.sam...@gmail.com writes:
 We have been trying to figure out possible solutions to the following
 problem in streaming replication Consider following scenario:

 If master receives commit command, it writes and flushes commit WAL records
 to the disk, It also writes and flushes data page related to this
 transaction.

 The master then sends WAL records to standby up to the commit WAL record.
 But before sending these records if failover happens then,  old master is
 ahead of  standby which is now the new master in terms of DB data leading
 to inconsistent data .

I don't exactly see the problem ... unless you're imagining that master
and slave share the same data storage or something like that.  That's
not going to work for a ton of reasons besides this one.

regards, tom lane




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Ants Aasma
On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/08/2013 05:34 AM, Samrat Revagade wrote:

 One solution to avoid this situation is have the master send WAL
 records to standby and wait for ACK from standby committing WAL files
 to disk and only after that commit data page related to this
 transaction on master.


 Isn't this basically what synchronous replication does in PG 9.1+?

Not exactly. Sync-rep ensures that commit success is not sent to the
client before a synchronous replica acks the commit record. What
Samrat is proposing here is that WAL is not flushed to the OS before
it is acked by a synchronous replica so recovery won't go past the
timeline change made in failover, making it necessary to take a new
base backup to resync with the new master. I seem to remember this
being discussed when sync rep was committed. I don't recall if the
idea was discarded only on performance grounds or whether there were
other issues too.

Thinking about it now, the requirement is that after a crash and
failover to a sync replica we should be able to reuse the datadir to
replicate from the new master without consistency problems. We should be able
to achieve that by ensuring that we don't write out pages until we
have received an ack from the sync replica and that we check for
possible timeline switches before recovering local WAL. For the first,
it seems to me that it should be enough to rework the updating of
XLogCtl->LogwrtResult.Flush so it accounts for the sync replica. For
the second part, I think Heikki's work on enabling timeline switches
over streaming connections already ensures this (I haven't checked it
out in detail), but if not, it shouldn't be too hard to add.
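
Roughly, the redefinition could look like the sketch below;
SyncRepGetAckedRecPtr() is a hypothetical helper, and the real field and
function names would differ:

    /*
     * Sketch only.  Instead of treating WAL as flushed once it is on local
     * disk, report the position that is both flushed locally and
     * acknowledged by the synchronous standby.
     */
    XLogRecPtr
    GetSafeFlushRecPtr(void)
    {
        XLogRecPtr  local_flush = GetFlushRecPtr();         /* local durability */
        XLogRecPtr  standby_ack = SyncRepGetAckedRecPtr();  /* hypothetical */

        return Min(local_flush, standby_ack);
    }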

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Andres Freund
On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
 On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas stho...@optionshouse.com wrote:
  On 04/08/2013 05:34 AM, Samrat Revagade wrote:
 
  One solution to avoid this situation is have the master send WAL
  records to standby and wait for ACK from standby committing WAL files
  to disk and only after that commit data page related to this
  transaction on master.
 
 
  Isn't this basically what synchronous replication does in PG 9.1+?
 
 Not exactly. Sync-rep ensures that commit success is not sent to the
 client before a synchronous replica acks the commit record. What
 Samrat is proposing here is that WAL is not flushed to the OS before
 it is acked by a synchronous replica so recovery won't go past the
 timeline change made in failover, making it necessary to take a new
 base backup to resync with the new master. I seem to remember this
 being discussed when sync rep was committed. I don't recall if the
 idea was discarded only on performance grounds or whether there were
 other issues too.

That's not going to work for a fair number of reasons:
* wal is streamed *from disk* not from memory
* what if the local node crashes/restarts immediately? Then the standby
  is farther ahead than the master.
* the performance implications of never writing data before flushing it
  are pretty severe
* ...

So this doesn't seem to solve anything.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Ants Aasma
On Mon, Apr 8, 2013 at 7:38 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
 Not exactly. Sync-rep ensures that commit success is not sent to the
 client before a synchronous replica acks the commit record. What
 Samrat is proposing here is that WAL is not flushed to the OS before
 it is acked by a synchronous replica so recovery won't go past the
 timeline change made in failover, making it necessary to take a new
 base backup to resync with the new master. I seem to remember this
 being discussed when sync rep was committed. I don't recall if the
 idea was discarded only on performance grounds or whether there were
 other issues too.

 Thats not going to work for a fair number of reasons:
 * wal is streamed *from disk* not from memory

Yeah, this one alone makes the do-not-flush-before-replicating
approach impractical.

 * what if the local node crashes/restarts immediately? Then the standby
   is farther ahead than the master.
 * the performance implications of never writing data before flushing it
   are pretty severe
 * ...

 So this doesn't seem to solve anything.

Yeah, delaying WAL writes until replication is successful seems
impractical, but I don't see why we couldn't optionally take into
account walsender write pointers when considering if we can write out
a page. Sure there will be some performance hit for waiting to
replicate WAL, but on the other hand having to rsync a huge database
isn't too good for performance either.
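
For example, the write-out check could take the minimum over the walsender
positions, roughly as in the sketch below (WalSndGetWritePtr() is a
hypothetical accessor and locking is glossed over):

    /*
     * Sketch only.  A data page whose LSN is newer than what every
     * connected walsender has sent must not be written out yet.
     */
    static bool
    PageSafeToWrite(XLogRecPtr page_lsn)
    {
        XLogRecPtr  min_sent = InvalidXLogRecPtr;
        int         i;

        for (i = 0; i < max_wal_senders; i++)
        {
            XLogRecPtr  sent = WalSndGetWritePtr(i);    /* hypothetical accessor */

            if (XLogRecPtrIsInvalid(sent))
                continue;               /* slot not in use */
            if (XLogRecPtrIsInvalid(min_sent) || sent < min_sent)
                min_sent = sent;
        }

        /* no walsenders connected: fall back to the plain local-flush rule */
        return XLogRecPtrIsInvalid(min_sent) || page_lsn <= min_sent;
    }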

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Fujii Masao
On Mon, Apr 8, 2013 at 7:34 PM, Samrat Revagade
revagade.sam...@gmail.com wrote:

 Hello,

 We have been trying to figure out possible solutions to the following problem 
 in streaming replication Consider following scenario:

 If master receives commit command, it writes and flushes commit WAL records 
 to the disk, It also writes and flushes data page related to this transaction.

 The master then sends WAL records to standby up to the commit WAL record. But 
 before sending these records if failover happens then,  old master is ahead 
 of  standby which is now the new master in terms of DB data leading to 
 inconsistent data .

Why do you think that the inconsistent data after failover is a problem?
Because it's one of the reasons why a fresh base backup is required when
starting the old master as a new standby? If yes, I agree with you. I've often
heard complaints about the backup required when restarting the new standby.
That's a really big problem.

The timeline mismatch after failover was one of the reasons why a backup is
required. But, thanks to Heikki's recent work, that's solved, i.e., the
timeline mismatch would be automatically resolved when starting replication
in 9.3. So, the remaining problem is an inconsistent database.

 One solution to avoid this situation is have the master send WAL records to 
 standby and wait for ACK from standby committing WAL files to disk and only 
 after that commit data page related to this transaction on master.

You mean to make the master delay the data page write until the WAL has been
not only flushed to disk but also replicated to the standby?

 The main drawback would be increased wait time for the client due to extra 
 round trip to standby before master sends ACK to client. Are there any other 
 issues with this approach?

I think that you can introduce a GUC specifying whether this extra check is
required, to avoid a backup when failing back.

Regards,

--
Fujii Masao

