Re: [HACKERS] Straightforward Synchronous Replication

2010-05-27 Thread Robert Haas
On Thu, May 27, 2010 at 9:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 * New process: WALAck (on standby)
 Reads shared memory to get last received and last applied xlog location
 and sends message to WALSync on primary. Loop/Sleep forever.

So would WALAck be polling shared memory?  That would increase latency
significantly, I think, though perhaps you have a plan for avoiding
that?

 The above needs just two parameters at user level
 synch_rep = none | recv | apply
 synch_rep_timeout = Ns
 and an additional parameter in recovery.conf to say whether a standby is
 providing the facility for sync replication (as requested by Yeb etc)
 (default = yes).

 So this is the same as having quorum = 0 or 1 (boring but simple) and
 having sync_rep_timeout_action = commit in all cases (clear behaviour in
 failure modes, without need for per-standby parameters).

This seems good, but I think we need a little more definition about
what happens with sync_rep_timeout expires.

 Yes, this is a 3rd design for sync rep, though I think it improves upon
 the things I've heard so far from other authors and also includes
 feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
 well, when 9.1 dev starts and a benchmark should be interesting also.

It's great that we have so many people who want to implement this
feature, or in one case already have.  I'm not sure whose design is
best, but I do hope that we can avoid dueling patches.  There are
plenty of other good features to work on also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Straightforward Synchronous Replication

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 10:11 -0400, Robert Haas wrote:
 On Thu, May 27, 2010 at 9:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
  * New process: WALAck (on standby)
  Reads shared memory to get last received and last applied xlog location
  and sends message to WALSync on primary. Loop/Sleep forever.
 
 So would WALAck be polling shared memory?  That would increase latency
 significantly, I think, though perhaps you have a plan for avoiding
 that?

The backends are going to be released in batches anyway, so I can't see
how polling makes a difference.

Polling means no waiting, so asynchronous action and higher throughput,
and with sufficiently high polling rate no significant loss of latency.

The other plan requires WALReceiver to wait for fsync and apply, which
seems very likely to suck badly from a latency perspective. While its
waiting it is also reducing throughout of incoming WAL. It's hard to see
how that would work well.

You could also do this by avoiding the wait in WALReceiver, but then
that becomes more like polling anyway.

  The above needs just two parameters at user level
  synch_rep = none | recv | apply
  synch_rep_timeout = Ns
  and an additional parameter in recovery.conf to say whether a standby is
  providing the facility for sync replication (as requested by Yeb etc)
  (default = yes).
 
  So this is the same as having quorum = 0 or 1 (boring but simple) and
  having sync_rep_timeout_action = commit in all cases (clear behaviour in
  failure modes, without need for per-standby parameters).
 
 This seems good, but I think we need a little more definition about
 what happens with sync_rep_timeout expires.

It commits... that is very clear: sync_rep_timeout_action = commit in
all cases. Commit is the only viable option, since abort and
wait-forever both have disadvantages pointed out for them.

  Yes, this is a 3rd design for sync rep, though I think it improves upon
  the things I've heard so far from other authors and also includes
  feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
  well, when 9.1 dev starts and a benchmark should be interesting also.
 
 It's great that we have so many people who want to implement this
 feature, or in one case already have.  I'm not sure whose design is
 best, but I do hope that we can avoid dueling patches.  There are
 plenty of other good features to work on also.

There is already a patch on SR, yet Masao is discussing another that
contains what looks to me like very close to nothing of Zoltan's work,
not even similar ideas. The dueling patches situation looks like it
already exists to me, though not of my making or encouragement. Even if
I agreed with everything one of those authors say, there would still be
two patches.

Considering a variety of design approaches seems like a good idea for an
important feature, especially when the information is thin and opinions
run high. It's unlikely that anyone is right about everything, which is
why I've amalgamated this simple proposal from everything said so far.

It's easy to add some things if we add them at the start, much harder to
retrofit them. I've shown that some things are easier than has been
said, with fewer parameters and a good case for better performance also.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Straightforward Synchronous Replication

2010-05-27 Thread Robert Haas
On Thu, May 27, 2010 at 11:50 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-05-27 at 10:11 -0400, Robert Haas wrote:
 On Thu, May 27, 2010 at 9:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
  * New process: WALAck (on standby)
  Reads shared memory to get last received and last applied xlog location
  and sends message to WALSync on primary. Loop/Sleep forever.

 So would WALAck be polling shared memory?  That would increase latency
 significantly, I think, though perhaps you have a plan for avoiding
 that?

 The backends are going to be released in batches anyway, so I can't see
 how polling makes a difference.

 Polling means no waiting, so asynchronous action and higher throughput,
 and with sufficiently high polling rate no significant loss of latency.

I guess what I'm trying to figure out is the part that says
loop/sleep forever.  That sounds like you wait 50 ms (or some other
interval), then check shared memory to see if anything has changed, if
not you do it again.  That means that up to 49.9 ms (or whatever
interval you picked) could be spent waiting before you realize that
new WAL has been applied, which I suspect will not work out very well.
 On the other hand checking it in a TIGHT loop would mean using up a
whole CPU on an idle system, so that's not practical either.  ISTM
you'd need some kind of signalling system between the startup process
and the WALAck process, so that the startup process can wake WALAck
after applying each bit of WAL (or maybe the startup process knows
about the lowest LSN that WALAck cares about, and wakes it only upon
reaching that point).

 The other plan requires WALReceiver to wait for fsync and apply, which
 seems very likely to suck badly from a latency perspective. While its
 waiting it is also reducing throughout of incoming WAL. It's hard to see
 how that would work well.

 You could also do this by avoiding the wait in WALReceiver, but then
 that becomes more like polling anyway.

I'm not sure if I understand this part, so let me try to say it
another way and you can tell me if I've got it right.  I think your
concern is that, during the time that WALReceiver is waiting for one
chunk of WAL to get fsynced, the startup process might finish applying
an earlier chunk of WAL that is of interest to the master.  The ACK
will therefore be delayed until the fsync completes and WALReceiver
can again do other things, like check whether there are any ACKs that
must be sent.  Is that it, or have I missed the boat completely?

  The above needs just two parameters at user level
  synch_rep = none | recv | apply
  synch_rep_timeout = Ns
  and an additional parameter in recovery.conf to say whether a standby is
  providing the facility for sync replication (as requested by Yeb etc)
  (default = yes).
 
  So this is the same as having quorum = 0 or 1 (boring but simple) and
  having sync_rep_timeout_action = commit in all cases (clear behaviour in
  failure modes, without need for per-standby parameters).

 This seems good, but I think we need a little more definition about
 what happens with sync_rep_timeout expires.

 It commits... that is very clear: sync_rep_timeout_action = commit in
 all cases. Commit is the only viable option, since abort and
 wait-forever both have disadvantages pointed out for them.

So, do we declare the sync server offline at that point and stop
waiting for it, or do we continue waiting for it on every transaction?
 If we declare it dead, what are the criteria for subsequently making
it alive again?

  Yes, this is a 3rd design for sync rep, though I think it improves upon
  the things I've heard so far from other authors and also includes
  feedback from Dimitri, Heikki, Yeb, Alastair. I'm happy to code this as
  well, when 9.1 dev starts and a benchmark should be interesting also.

 It's great that we have so many people who want to implement this
 feature, or in one case already have.  I'm not sure whose design is
 best, but I do hope that we can avoid dueling patches.  There are
 plenty of other good features to work on also.

 There is already a patch on SR, yet Masao is discussing another that
 contains what looks to me like very close to nothing of Zoltan's work,
 not even similar ideas. The dueling patches situation looks like it
 already exists to me, though not of my making or encouragement. Even if
 I agreed with everything one of those authors say, there would still be
 two patches.

Oh, I wasn't aware that Fujii Masao's work had progressed as far as an
actual patch yet.

 Considering a variety of design approaches seems like a good idea for an
 important feature, especially when the information is thin and opinions
 run high. It's unlikely that anyone is right about everything, which is
 why I've amalgamated this simple proposal from everything said so far.

Agreed.

 It's easy to add some things if we add them at the start, much harder to
 retrofit them. I've shown that some things are easier than has been
 said, with