Re: [HACKERS] Synchronous replication patch built on SR

2010-05-19 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 Thanks for your reply!

 On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
   
 In your design, the transaction commit on the master waits for its XID
 to be read from the XLOG_XACT_COMMIT record and replied by the standby.
 Right? This design seems not to be extensible to #2 and #3 since
 walreceiver cannot read XID from the XLOG_XACT_COMMIT record.
   
 Yes, this was my problem, too. I would have had to
 implement a custom interpreter into walreceiver to
 process the WAL records and extract the XIDs.
 

 Isn't reading the same WAL twice (by walreceiver and startup process)
 inefficient?

Yes, and I didn't implement that because it's inefficient.
I implemented a minimal communication between
StartupXLOG() and the walreceiver.

  In synchronous replication, the overhead of walreceiver
 directly affects the performance of the master. We should not assign
 such hard work to walreceiver, I think.
   

Exactly.

 But at least the supporting details, i.e. not opening another
 connection, instead being able to do duplex COPY operations in
 a server-acknowledged way is acceptable, no? :-)
 

 Though I might not understand your point (sorry), it's OK for the standby
 to send the reply to the master by using CopyData message.

I thought about the same.

  Currently
 PQputCopyData() cannot be executed in COPY OUT, but we can relax
 that.
   

And I implemented just that, in a way that upon walreceiver startup
it sends a new protocol message to the walsender by calling
PQsetDuplexCopy() (see my patch) and the walsender response is ACK.
This protocol message is intentionally not handled by the normal
backend, so plain libpq clients cannot mess up their COPY streams.

  How about
 using LSN instead of XID? That is, the transaction commit waits until
 the standby has reached its LSN. LSN is easier to use for walreceiver
 and startup process, I think.

   
 Indeed, using the LSN seems to be more appropriate for
 the walreceiver, but how would you extract the information
 that a certain LSN means a COMMITted transaction? Or
 we could release a locked transaction in case the master receives
 an LSN greater than or equal to the transaction's own LSN?
 

 Yep, we can ensure that the transaction has been replicated by
 comparing its own LSN with the smallest LSN in the latest LSNs
 of each connected synchronous standby.

   
 Sending back all the LSNs in case of long transactions would
 increase the network traffic compared to sending back only the
 XIDs, but the amount is not clear for me. What I am more
 worried about is the contention on the ProcArrayLock.
 XIDs are rarer than LSNs, no?
 

 No. For example, when WAL data sent by walsender at a time
 has two XLOG_XACT_COMMIT records, in XID approach, walreceiver
 would need to send two replies. OTOH, in LSN approach, only
 one reply which indicates the last received location would
 need to be sent.
   

I see.
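Fujii's release rule (compare each waiting commit's LSN against the smallest of the latest LSNs acked by the connected synchronous standbys) can be sketched minimally as follows; XLogRecPtr is simplified here to a plain 64-bit integer and the function names are illustrative, not taken from either patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified; the real type is richer */

/* Smallest of the latest LSNs acked by the connected sync standbys. */
XLogRecPtr
min_acked_lsn(const XLogRecPtr *acked, int nstandbys)
{
    XLogRecPtr min = acked[0];
    for (int i = 1; i < nstandbys; i++)
        if (acked[i] < min)
            min = acked[i];
    return min;
}

/* A commit may be released once every sync standby has received past it. */
bool
commit_is_replicated(XLogRecPtr commit_lsn,
                     const XLogRecPtr *acked, int nstandbys)
{
    return nstandbys > 0 && commit_lsn <= min_acked_lsn(acked, nstandbys);
}
```

With this rule the standby needs to report only its last received location, not every commit record it has seen.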

 What if the synchronous standby starts up from the very old backup?
 The transaction on the master needs to wait until a large amount of
 outstanding WAL has been applied? I think that synchronous replication
 should start with *asynchronous* replication, and should switch to the
 sync level after the gap between servers has become small enough.
 What's your opinion?

   
 It's certainly one option, which I think is partly addressed
 with the strict_sync_replication knob below.
 If strict_sync_replication = off, then the master doesn't make
 its transactions wait for the synchronous reports, and the client(s)
 can work through their WALs. IIRC, the walreceiver connects
 to the master only very late in the recovery process, no?
 

 No, the master might have a large number of WAL files which
 the standby doesn't have.
   

We can change the walreceiver so it sends similarly encapsulated
messages as the walsender does. In our patch, the walreceiver
currently sends the raw XIDs. If we add a minimal protocol
encapsulation, we can distinguish between the XIDs (or later LSNs)
and the "mark me synchronous from now on" message.

The only problem is: what should be the point when such a client
becomes synchronous from the master's POV, so the XID/LSN reports
will count and transactions are made to wait for this client?

As a side note, the async walreceivers' behaviour should be kept
so they don't send anything back and the message that
PQsetDuplexCopy() sends to the master would then only
prepare the walsender that its client will become synchronous
in the near future.

 I have added 3 new options, two GUCs in postgresql.conf and one
 setting in recovery.conf. These options are:

 1. min_sync_replication_clients = N

 where N is the number of reports for a given transaction before it's
 released as committed synchronously. 0 means completely asynchronous,
 the value is capped by the value of max_wal_senders. Anything
 in between 0 and max_wal_senders means different levels of partially
 synchronous replication.

Re: [HACKERS] Synchronous replication patch built on SR

2010-05-19 Thread Fujii Masao
On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Isn't reading the same WAL twice (by walreceiver and startup process)
 inefficient?

 Yes, and I didn't implement that because it's inefficient.

So I'd like to propose to use LSN instead of XID since LSN can
be easily handled by both walreceiver and startup process.

  Currently
 PQputCopyData() cannot be executed in COPY OUT, but we can relax
 that.


 And I implemented just that, in a way that upon walreceiver startup
 it sends a new protocol message to the walsender by calling
 PQsetDuplexCopy() (see my patch) and the walsender response is ACK.
 This protocol message is intentionally not handled by the normal
 backend, so plain libpq clients cannot mess up their COPY streams.

Is the newly-introduced message type "Set Duplex Copy" really required?
I think that the standby can send its replication mode to the master
via Query or CopyData message, which are already used in SR. For example,
how about including the mode in the handshake message START_REPLICATION?
If we do that, we would not need to introduce new libpq function
PQsetDuplexCopy(). BTW, I often got the complaints about adding
new libpq function when I implemented SR ;)

In the patch, PQputCopyData() checks the newly-introduced pg_conn field
duplexCopy. Instead, how about checking the existing field replication?
Or we can just allow PQputCopyData() to go even in COPY OUT state.

 We can change the walreceiver so it sends similarly encapsulated
 messages as the walsender does. In our patch, the walreceiver
 currently sends the raw XIDs. If we add a minimal protocol
 encapsulation, we can distinguish between the XIDs (or later LSNs)
 and the mark me synchronous from now on message.

 The only problem is: what should be the point when such a client
 becomes synchronous from the master's POV, so the XID/LSN reports
 will count and transactions are made to wait for this client?

One idea is to switch to sync when the gap of LSN becomes less
than or equal to XLOG_SEG_SIZE (currently 8MB). That is, walsender
calculates the gap from the current write WAL location on the master
and the last receive/flush/replay location on the standby. And if
the gap <= XLOG_SEG_SIZE, it instructs backends to wait for
replication from then on.
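That switching rule is easy to sketch; this is an assumed illustration only, with XLOG_SEG_SIZE taken as the 8MB figure mentioned above and XLogRecPtr simplified to a 64-bit integer:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;                           /* simplified LSN */
#define XLOG_SEG_SIZE ((XLogRecPtr) 8 * 1024 * 1024)   /* per the mail */

/*
 * The gap between what the master has written and what the standby
 * last reported; switch the standby to synchronous mode once the gap
 * fits within one WAL segment.
 */
bool
standby_should_be_sync(XLogRecPtr master_write_lsn,
                       XLogRecPtr standby_reported_lsn)
{
    XLogRecPtr gap = master_write_lsn - standby_reported_lsn;
    return gap <= XLOG_SEG_SIZE;
}
```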

 As a side note, the async walreceivers' behaviour should be kept
 so they don't send anything back and the message that
 PQsetDuplexCopy() sends to the master would then only
 prepare the walsender that its client will become synchronous
 in the near future.

I agree that walreceiver should send no replication ack if async
mode is chosen. OTOH, in sync case, walreceiver should always
send ack even if the gap is large and the master doesn't wait for
replication yet. As mentioned above, walsender needs to calculate
the gap from the ack.

 Seems s/min_sync_replication_clients/max_sync_replication_clients


 No, min indicates the minimum number of walreceiver reports
 needed before a transaction can be released from under the waiting.
 The other reports coming from walreceivers are ignored.

Hmm... when min_sync_replication_clients = 2 and there are three
synchronous standbys, the master waits for only two standbys?

The standby which the master ignores is fixed? or dynamically (or
randomly) changed?

 min_sync_replication_clients is required to prevent outside attacker
 from connecting to the master as synchronous standby, and degrading
 the performance on the master?

 ???

 Properly configured pg_hba.conf prevents outside attackers
 from connecting as replication clients, no?

Yes :)

I'd like to just know the use case of min_sync_replication_clients.
Sorry, I've not understood yet how useful this option is.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronous replication patch built on SR

2010-05-19 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
   
 Isn't reading the same WAL twice (by walreceiver and startup process)
 inefficient?
   
 Yes, and I didn't implement that because it's inefficient.
 

 So I'd like to propose to use LSN instead of XID since LSN can
 be easily handled by both walreceiver and startup process.
   

OK, I will look into replacing XIDs with LSNs.

  Currently
 PQputCopyData() cannot be executed in COPY OUT, but we can relax
 that.

   
 And I implemented just that, in a way that upon walreceiver startup
 it sends a new protocol message to the walsender by calling
 PQsetDuplexCopy() (see my patch) and the walsender response is ACK.
 This protocol message is intentionally not handled by the normal
 backend, so plain libpq clients cannot mess up their COPY streams.
 

 Is the newly-introduced message type "Set Duplex Copy" really required?
 I think that the standby can send its replication mode to the master
 via Query or CopyData message, which are already used in SR. For example,
 how about including the mode in the handshake message START_REPLICATION?
 If we do that, we would not need to introduce new libpq function
 PQsetDuplexCopy(). BTW, I often got the complaints about adding
 new libpq function when I implemented SR ;)
   

:-)

 In the patch, PQputCopyData() checks the newly-introduced pg_conn field
 duplexCopy. Instead, how about checking the existing field replication?
   

I didn't see there was such a new field. (looking...) I can see now,
it was added in the middle of the structure. Ok, we can then use it
to allow duplex COPY instead of my new field. I suppose it's non-NULL
if replication is on, right? Then the extra call is not needed.

 Or we can just allow PQputCopyData() to go even in COPY OUT state.
   

I think this may not be too useful for SQL clients, but who knows? :-)
Use cases, anyone?

 We can change the walreceiver so it sends similarly encapsulated
 messages as the walsender does. In our patch, the walreceiver
 currently sends the raw XIDs. If we add a minimal protocol
 encapsulation, we can distinguish between the XIDs (or later LSNs)
 and the "mark me synchronous from now on" message.

 The only problem is: what should be the point when such a client
 becomes synchronous from the master's POV, so the XID/LSN reports
 will count and transactions are made to wait for this client?
 

 One idea is to switch to sync when the gap of LSN becomes less
 than or equal to XLOG_SEG_SIZE (currently 8MB). That is, walsender
 calculates the gap from the current write WAL location on the master
 and the last receive/flush/replay location on the standby. And if
 the gap <= XLOG_SEG_SIZE, it instructs backends to wait for
 replication from then on.
   

This is a sensible idea.

 As a side note, the async walreceivers' behaviour should be kept
 so they don't send anything back and the message that
 PQsetDuplexCopy() sends to the master would then only
 prepare the walsender that its client will become synchronous
 in the near future.
 

 I agree that walreceiver should send no replication ack if async
 mode is chosen. OTOH, in sync case, walreceiver should always
 send ack even if the gap is large and the master doesn't wait for
 replication yet. As mentioned above, walsender needs to calculate
 the gap from the ack.
   

Agreed.

 Seems s/min_sync_replication_clients/max_sync_replication_clients

   
 No, min indicates the minimum number of walreceiver reports
 needed before a transaction can be released from under the waiting.
 The other reports coming from walreceivers are ignored.
 

 Hmm... when min_sync_replication_clients = 2 and there are three
 synchronous standbys, the master waits for only two standbys?
   

Yes. This is the idea: partially synchronous replication.
I heard anecdotes about replication solutions where,
if at least (say) 50% of the machines across the
whole cluster report back synchronously, the transaction
is considered replicated well enough.

 The standby which the master ignores is fixed? or dynamically (or
 randomly) changed?
   

It may be randomly changed, depending on who sends the reports
first. The replication servers themselves may get very busy with
large queries or they may be loaded in some other way and
be somewhat late in processing the WAL stream. The less loaded
servers answer first, and the transaction is considered properly
replicated.

 min_sync_replication_clients is required to prevent outside attacker
 from connecting to the master as synchronous standby, and degrading
 the performance on the master?
   
 ???

 Properly configured pg_hba.conf prevents outside attackers
 from connecting as replication clients, no?
 

 Yes :)

 I'd like to just know the use case of min_sync_replication_clients.
 Sorry, I've not understood yet how useful this option is.
   

I hope I answered it. :-)

Best regards,
Zoltán 

Re: [HACKERS] Synchronous replication patch built on SR

2010-05-19 Thread Fujii Masao
On Wed, May 19, 2010 at 9:58 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 In the patch, PQputCopyData() checks the newly-introduced pg_conn field
 duplexCopy. Instead, how about checking the existing field replication?

 I didn't see there was such a new field. (looking...) I can see now,
 it was added in the middle of the structure. Ok, we can then use it
 to allow duplex COPY instead of my new field. I suppose it's non-NULL
 if replication is on, right? Then the extra call is not needed.

Right. Usually the first byte of the pg_conn field seems to be also
checked as follows, but I'm not sure if that is valuable for this case.

if (conn->replication && conn->replication[0])
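That NULL-plus-first-byte guard is a common C idiom; here is a standalone illustration with a hypothetical struct standing in for libpq's pg_conn:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant slice of libpq's pg_conn. */
struct conn
{
    const char *replication;    /* NULL or "" unless a replication conn */
};

/* True only when the field is both present and non-empty. */
bool
is_replication_conn(const struct conn *conn)
{
    return conn->replication != NULL && conn->replication[0] != '\0';
}
```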

 Or we can just allow PQputCopyData() to go even in COPY OUT state.

 I think this may not be too useful for SQL clients, but who knows? :-)
 Use cases, anyone?

It's only for replication.

 Hmm... when min_sync_replication_clients = 2 and there are three
 synchronous standbys, the master waits for only two standbys?


 Yes. This is the idea: partially synchronous replication.
 I heard anecdotes about replication solutions where,
 if at least (say) 50% of the machines across the
 whole cluster report back synchronously, the transaction
 is considered replicated well enough.

Oh, I got it. I heard of such a use case for the first time.

We seem to have many ideas about the knobs to control synchronization
levels, and would need to clarify which ones should be implemented for 9.1.

 I'd like to just know the use case of min_sync_replication_clients.
 Sorry, I've not understood yet how useful this option is.


 I hope I answered it. :-)

Yep. Thanks!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-18 Thread Fujii Masao
Thanks for your reply!

On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 In your design, the transaction commit on the master waits for its XID
 to be read from the XLOG_XACT_COMMIT record and replied by the standby.
 Right? This design seems not to be extensible to #2 and #3 since
 walreceiver cannot read XID from the XLOG_XACT_COMMIT record.

 Yes, this was my problem, too. I would have had to
 implement a custom interpreter into walreceiver to
 process the WAL records and extract the XIDs.

Isn't reading the same WAL twice (by walreceiver and startup process)
inefficient? In synchronous replication, the overhead of walreceiver
directly affects the performance of the master. We should not assign
such hard work to walreceiver, I think.

 But at least the supporting details, i.e. not opening another
 connection, instead being able to do duplex COPY operations in
 a server-acknowledged way is acceptable, no? :-)

Though I might not understand your point (sorry), it's OK for the standby
to send the reply to the master by using CopyData message. Currently
PQputCopyData() cannot be executed in COPY OUT, but we can relax
that.

  How about
 using LSN instead of XID? That is, the transaction commit waits until
 the standby has reached its LSN. LSN is easier to use for walreceiver
 and startup process, I think.


 Indeed, using the LSN seems to be more appropriate for
 the walreceiver, but how would you extract the information
 that a certain LSN means a COMMITted transaction? Or
 we could release a locked transaction in case the master receives
 an LSN greater than or equal to the transaction's own LSN?

Yep, we can ensure that the transaction has been replicated by
comparing its own LSN with the smallest LSN in the latest LSNs
of each connected synchronous standby.

 Sending back all the LSNs in case of long transactions would
 increase the network traffic compared to sending back only the
 XIDs, but the amount is not clear for me. What I am more
 worried about is the contention on the ProcArrayLock.
 XIDs are rarer than LSNs, no?

No. For example, when WAL data sent by walsender at a time
has two XLOG_XACT_COMMIT records, in XID approach, walreceiver
would need to send two replies. OTOH, in LSN approach, only
one reply which indicates the last received location would
need to be sent.

 What if the synchronous standby starts up from the very old backup?
 The transaction on the master needs to wait until a large amount of
 outstanding WAL has been applied? I think that synchronous replication
 should start with *asynchronous* replication, and should switch to the
 sync level after the gap between servers has become small enough.
 What's your opinion?


 It's certainly one option, which I think is partly addressed
 with the strict_sync_replication knob below.
 If strict_sync_replication = off, then the master doesn't make
 its transactions wait for the synchronous reports, and the client(s)
 can work through their WALs. IIRC, the walreceiver connects
 to the master only very late in the recovery process, no?

No, the master might have a large number of WAL files which
the standby doesn't have.

 I have added 3 new options, two GUCs in postgresql.conf and one
 setting in recovery.conf. These options are:

 1. min_sync_replication_clients = N

 where N is the number of reports for a given transaction before it's
 released as committed synchronously. 0 means completely asynchronous,
 the value is capped by the value of max_wal_senders. Anything
 in between 0 and max_wal_senders means different levels of partially
 synchronous replication.

 2. strict_sync_replication = boolean

 where the expected number of synchronous reports from standby
 servers is further limited to the actual number of connected synchronous
 standby servers if the value of this GUC is false. This means that if
 no standby servers are connected yet then the replication is asynchronous
 and transactions are allowed to finish without waiting for synchronous
 reports. If the value of this GUC is true, then transactions wait until
 enough synchronous standbys connect and report back.
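The interaction of the two GUCs described above can be sketched as a single function; the names mirror the proposal, but the code is an illustration, not from the patch:

```c
#include <stdbool.h>

/*
 * Number of standby reports a committing transaction must collect.
 * With strict_sync_replication off, the requirement is capped by the
 * number of currently connected synchronous standbys, so commits
 * proceed asynchronously while no standby is attached; with it on,
 * commits wait for the full quota regardless.
 */
int
required_sync_acks(int min_sync_replication_clients,
                   bool strict_sync_replication,
                   int connected_sync_standbys)
{
    if (strict_sync_replication)
        return min_sync_replication_clients;
    if (connected_sync_standbys < min_sync_replication_clients)
        return connected_sync_standbys;
    return min_sync_replication_clients;
}
```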


 Why are these options necessary?

 Can these options cover more than three synchronization levels?


 I think I explained it in my mail.

 If min_sync_replication_clients == 0, then the replication is async.
 If min_sync_replication_clients == max_wal_senders then the
 replication is fully synchronous.
 If 0 < min_sync_replication_clients < max_wal_senders then
 the replication is partially synchronous, i.e. the master can wait
 only for say, 50% of the clients to report back before it's considered
 synchronous and the relevant transactions get released from the wait.
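The three cases above amount to a simple classification; a minimal sketch using the proposed GUC names:

```c
/* Replication level implied by the two settings, per the rules above. */
typedef enum
{
    REPL_ASYNC,
    REPL_PARTIAL_SYNC,
    REPL_FULL_SYNC
} ReplLevel;

ReplLevel
replication_level(int min_sync_replication_clients, int max_wal_senders)
{
    if (min_sync_replication_clients == 0)
        return REPL_ASYNC;
    if (min_sync_replication_clients >= max_wal_senders)
        return REPL_FULL_SYNC;
    return REPL_PARTIAL_SYNC;
}
```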

Seems s/min_sync_replication_clients/max_sync_replication_clients

min_sync_replication_clients is required to prevent outside attacker
from connecting to the master as synchronous standby, and degrading
the performance on the master? Other usecase?

Regards,

-- 
Fujii 

Re: [HACKERS] Synchronous replication patch built on SR

2010-05-18 Thread Fujii Masao
On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 BTW, what I'd like to see as the very first patch is to change the
 current poll loops in walreceiver and walsender to, well, not poll.
 That's a requirement for synchronous replication, is very useful on its
 own, and requires some design and implementation effort to get right.
 It would be nice to get that out of the way before/during we discuss the
 more user-visible behavior.

Yeah, we should wake up the walsender from sleep to send WAL data
as soon as it's flushed. But why do we need to change the loop of
walreceiver? Or you mean changing the poll loop in the startup process?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-18 Thread Heikki Linnakangas

On 18/05/10 07:41, Fujii Masao wrote:

On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

BTW, what I'd like to see as the very first patch is to change the
current poll loops in walreceiver and walsender to, well, not poll.
That's a requirement for synchronous replication, is very useful on its
own, and requires some design and implementation effort to get right.
It would be nice to get that out of the way before/during we discuss the
more user-visible behavior.


Yeah, we should wake up the walsender from sleep to send WAL data
as soon as it's flushed. But why do we need to change the loop of
walreceiver? Or you mean changing the poll loop in the startup process?


Yeah, changing the poll loop in the startup process is what I meant.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-16 Thread Simon Riggs
On Fri, 2010-05-14 at 15:15 -0400, Robert Haas wrote:
 On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
  If min_sync_replication_clients == 0, then the replication is async.
  If min_sync_replication_clients == max_wal_senders then the
  replication is fully synchronous.
  If 0 < min_sync_replication_clients < max_wal_senders then
  the replication is partially synchronous, i.e. the master can wait
  only for say, 50% of the clients to report back before it's considered
  synchronous and the relevant transactions get released from the wait.
 
 That's an interesting design and in some ways pretty elegant, but it
 rules out some things that people might easily want to do - for
 example, synchronous replication to the other server in the same data
 center that acts as a backup for the master; and asynchronous
 replication to a reporting server located off-site.

The design above allows the case you mention:
min_sync_replication_clients = 1
max_wal_senders = 2

It works well in failure cases, such as the case where the local backup
server goes down.

It seems exactly what we need to me, though not sure about names.

 One of the things that I think we will probably need/want to change
 eventually is the fact that the master has no real knowledge of who
 the replication slaves are.  That might be something we want to change
 in order to be able to support more configurability.  Inventing syntax
 out of whole cloth and leaving semantics to the imagination of the
 reader:
 
 CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback 
 on);
 CREATE REPLICATION SLAVE failover_server (mode synchronous,
 xid_feedback off, break_synchrep_timeout 30);

I am against labelling servers as synchronous/asynchronous. We've had
this discussion a few times since 2008.

There is significant advantage in having the user specify the level of
robustness, so that it can vary from transaction to transaction, just as
already happens at commit. That way the user gets to say what happens.
Look for threads on "transaction controlled robustness".

As alluded to above, if you label the servers you also need to say what
happens when one or more of them are down. e.g. synchronous to B AND
async to C, except when B is not available, in which case make C
synchronous. With N servers, you end up needing to specify O(N^2) rules
for what happens, so it only works neatly for 2, maybe 3 servers.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronous replication patch built on SR

2010-05-15 Thread Heikki Linnakangas
BTW, what I'd like to see as the very first patch is to change the
current poll loops in walreceiver and walsender to, well, not poll.
That's a requirement for synchronous replication, is very useful on its
own, and requires some design and implementation effort to get right.
It would be nice to get that out of the way before/during we discuss the
more user-visible behavior.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-14 Thread Fujii Masao
2010/4/29 Boszormenyi Zoltan z...@cybertec.at:
 attached is a patch that does $SUBJECT, we are submitting it for 9.1.
 I have updated it to today's CVS after the wal_level GUC went in.

I'm planning to create the synchronous replication patch for 9.0, too.
My design is outlined in the wiki. Let's work together to do the design
of it.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

The log-shipping replication has some synchronization levels as follows.
Which are you going to work on?

The transaction commit on the master
#1 doesn't wait for replication (already supported in 9.0)
#2 waits for WAL to be received by the standby
#3 waits for WAL to be received and flushed by the standby
#4 waits for WAL to be received, flushed and replayed by the standby
..etc?

I'm planning to add #2 and #3 into 9.1. #4 is useful but is outside
the scope of my development for at least 9.1. In #4, a read-only query
can easily block recovery through lock conflicts and make the
transaction commit on the master get stuck. This problem is difficult
to address within 9.1, I think. But the design and implementation
of #2 and #3 need to be easily extensible to #4.
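The four levels listed above differ only in which reported standby position a commit must wait for; a hedged sketch with simplified types (not code from either patch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified LSN */

/* The four synchronization levels listed above. */
typedef enum
{
    SYNC_NONE,      /* #1: don't wait (9.0 behaviour) */
    SYNC_RECV,      /* #2: wait until WAL is received by the standby */
    SYNC_FLUSH,     /* #3: ... received and flushed */
    SYNC_REPLAY     /* #4: ... received, flushed and replayed */
} SyncLevel;

/* Standby progress as it might be reported back to the walsender. */
typedef struct
{
    XLogRecPtr received;
    XLogRecPtr flushed;
    XLogRecPtr replayed;
} StandbyProgress;

/* Has the standby made enough progress to release this commit? */
bool
commit_satisfied(SyncLevel level, XLogRecPtr commit_lsn,
                 const StandbyProgress *p)
{
    switch (level)
    {
        case SYNC_NONE:
            return true;
        case SYNC_RECV:
            return p->received >= commit_lsn;
        case SYNC_FLUSH:
            return p->flushed >= commit_lsn;
        case SYNC_REPLAY:
            return p->replayed >= commit_lsn;
    }
    return false;
}
```

This is why an LSN-based design extends naturally from #2 to #4: only the comparison field changes, whereas an XID-based design ties the wait to parsing commit records.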

 How does it work?

 First, the walreceiver and the walsender are now able to communicate
 in a duplex way on the same connection, so while COPY OUT is
 in progress from the primary server, the standby server is able to
 issue PQputCopyData() to pass the transaction IDs that were seen
 with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
 signatures. I did it by adding a new protocol message type, with letter
 'x', that's only acknowledged by the walsender process. The regular
 backend was intentionally unchanged, so an SQL client gets a protocol
 error. A new libpq call, PQsetDuplexCopy(), sends this
 new message before sending START_REPLICATION. The primary
 makes a note of it in the walsender process' entry.

 I had to move the TransactionIdLatest(xid, nchildren, children) call
 that computes latestXid earlier in RecordTransactionCommit(), so
 it's in the critical section now, just before the
 XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata)
 call. Otherwise, there was a race condition between the primary
 and the standby server, where the standby server might have seen
 the XLOG_XACT_COMMIT record for some XIDs before the
 transaction in the primary server marked itself waiting for this XID,
 resulting in stuck transactions.

You seem to have chosen #4 as synchronization level. Right?

In your design, the transaction commit on the master waits for its XID
to be read from the XLOG_XACT_COMMIT record and replied by the standby.
Right? This design seems not to be extensible to #2 and #3 since
walreceiver cannot read XID from the XLOG_XACT_COMMIT record. How about
using LSN instead of XID? That is, the transaction commit waits until
the standby has reached its LSN. LSN is easier to use for walreceiver
and startup process, I think.

What if the synchronous standby starts up from the very old backup?
The transaction on the master needs to wait until a large amount of
outstanding WAL has been applied? I think that synchronous replication
should start with *asynchronous* replication, and should switch to the
sync level after the gap between servers has become small enough.
What's your opinion?

 I have added 3 new options, two GUCs in postgresql.conf and one
 setting in recovery.conf. These options are:

 1. min_sync_replication_clients = N

 where N is the number of reports for a given transaction before it's
 released as committed synchronously. 0 means completely asynchronous,
 the value is capped by the value of max_wal_senders. Anything
 in between 0 and max_wal_senders means different levels of partially
 synchronous replication.

 2. strict_sync_replication = boolean

 where the expected number of synchronous reports from standby
 servers is further limited to the actual number of connected synchronous
 standby servers if the value of this GUC is false. This means that if
 no standby servers are connected yet then the replication is asynchronous
 and transactions are allowed to finish without waiting for synchronous
 reports. If the value of this GUC is true, then transactions wait until
 enough synchronous standbys connect and report back.

Why are these options necessary?

Can these options cover more than three synchronization levels?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-14 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 2010/4/29 Boszormenyi Zoltan z...@cybertec.at:
   
 attached is a patch that does $SUBJECT, we are submitting it for 9.1.
 I have updated it to today's CVS after the wal_level GUC went in.
 

 I'm planning to create the synchronous replication patch for 9.0, too.
 My design is outlined in the wiki. Let's work on the design together.
 http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

 Log-shipping replication supports several synchronization levels.
 Which are you going to work on?

 The transaction commit on the master
 #1 doesn't wait for replication (already suppored in 9.0)
 #2 waits for WAL to be received by the standby
 #3 waits for WAL to be received and flushed by the standby
 #4 waits for WAL to be received, flushed and replayed by the standby
 ..etc?

 I'm planning to add #2 and #3 in 9.1. #4 is useful but is outside
 the scope of my development for at least 9.1. In #4, a read-only query
 can easily block recovery through a lock conflict and make the
 transaction commit on the master get stuck. This problem is difficult
 to address within 9.1, I think. But the design and implementation
 of #2 and #3 need to be easily extensible to #4.

   
 How does it work?

 First, the walreceiver and the walsender are now able to communicate
 in a duplex way on the same connection, so while COPY OUT is
 in progress from the primary server, the standby server is able to
 issue PQputCopyData() to pass the transaction IDs that were seen
 with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
 signatures. I did this by adding a new protocol message type, with letter
 'x', that's only acknowledged by the walsender process. The regular
 backend was intentionally left unchanged, so an SQL client gets a protocol
 error. A new libpq call, PQsetDuplexCopy(), sends this
 new message before sending START_REPLICATION. The primary
 makes a note of it in the walsender process' entry.

 I had to move the TransactionIdLatest(xid, nchildren, children) call
 that computes latestXid earlier in RecordTransactionCommit(), so
 it's in the critical section now, just before the
 XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata)
 call. Otherwise, there was a race condition between the primary
 and the standby server, where the standby server might have seen
 the XLOG_XACT_COMMIT record for some XIDs before the
 transaction in the primary server marked itself waiting for this XID,
 resulting in stuck transactions.
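The race described above can be illustrated with a toy, single-process model; every name here is invented for illustration and none of this is the patch's actual code. The point is only the ordering: the committing backend must register itself as a waiter before the commit record can become visible to the standby, or the standby's reply can be lost:

```c
#include <stdbool.h>

/* Toy model of the lost-reply hazard.  A waiter table stands in for
 * the per-transaction wait state; standby_ack() stands in for the
 * standby replying with an XID it saw in the WAL.  An ack for an
 * unregistered XID is simply dropped, which is the hazard. */
#define MAX_WAITERS 8

static unsigned int waiting_xid[MAX_WAITERS];
static bool released[MAX_WAITERS];
static int nwaiters;

static int
register_waiter(unsigned int xid)
{
    waiting_xid[nwaiters] = xid;
    released[nwaiters] = false;
    return nwaiters++;
}

/* Standby ack: releases a matching registered waiter, else is dropped. */
static void
standby_ack(unsigned int xid)
{
    for (int i = 0; i < nwaiters; i++)
        if (waiting_xid[i] == xid)
            released[i] = true;
}

/* Correct ordering: register inside the critical section, before the
 * commit record is visible to the standby.  The ack cannot be lost. */
static bool
commit_register_first(unsigned int xid)
{
    int slot = register_waiter(xid);   /* register, then "insert WAL" */
    standby_ack(xid);                  /* standby sees record, replies */
    return released[slot];
}

/* Buggy ordering: the standby's reply races ahead of registration,
 * the ack is dropped, and the commit would wait forever. */
static bool
commit_register_late(unsigned int xid)
{
    standby_ack(xid);                  /* reply arrives first... */
    int slot = register_waiter(xid);   /* ...and was already lost */
    return released[slot];
}
```

In the real patch, moving the TransactionIdLatest() call into the critical section plays the role of the "register first" ordering.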
 

 You seem to have chosen #4 as synchronization level. Right?
   

Yes.

 In your design, the transaction commit on the master waits for its XID
 to be read from the XLOG_XACT_COMMIT record and replied by the standby.
 Right? This design does not seem extensible to #2 and #3, since
 the walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.

Yes, this was my problem, too. I would have had to
implement a custom interpreter into walreceiver to
process the WAL records and extract the XIDs.

But at least the supporting details, i.e. not opening another
connection, instead being able to do duplex COPY operations in
a server-acknowledged way is acceptable, no? :-)

  How about
 using LSN instead of XID? That is, the transaction commit waits until
 the standby has reached its LSN. An LSN is easier for the walreceiver
 and the startup process to use, I think.
   

Indeed, using the LSN seems more appropriate for
the walreceiver, but how would you extract the information
that a certain LSN corresponds to a COMMITted transaction? Or
should we release a waiting transaction when the master receives
an LSN greater than or equal to the transaction's own commit LSN?

Sending back all the LSNs in the case of long transactions would
increase the network traffic compared to sending back only the
XIDs, but by how much is not clear to me. What I am more
worried about is the contention on the ProcArrayLock.
XIDs are rarer than LSNs, no?
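For comparison, the LSN-based scheme can be sketched in the same toy style; the names are again illustrative only. Because WAL positions are totally ordered, a single cumulative "flushed up to here" report releases every waiter at or below that point, so one message can acknowledge any number of commits, which is the traffic argument for LSNs:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* stand-in for PostgreSQL's LSN type */

/* Each committing backend waits on its own commit-record LSN; one
 * cumulative report from the standby releases all waiters at or below
 * the reported position.  Illustrative sketch, not the patch's code. */
#define MAX_WAITERS 8

static XLogRecPtr wait_lsn[MAX_WAITERS];
static bool lsn_released[MAX_WAITERS];
static int n_lsn_waiters;

static int
wait_for_lsn(XLogRecPtr commit_lsn)
{
    wait_lsn[n_lsn_waiters] = commit_lsn;
    lsn_released[n_lsn_waiters] = false;
    return n_lsn_waiters++;
}

/* One report releases every pending waiter up to that LSN; returns
 * how many waiters this single message released. */
static int
standby_reports_flush(XLogRecPtr flushed_upto)
{
    int n = 0;
    for (int i = 0; i < n_lsn_waiters; i++)
        if (!lsn_released[i] && wait_lsn[i] <= flushed_upto)
        {
            lsn_released[i] = true;
            n++;
        }
    return n;
}
```

Note that this also sidesteps the per-XID ProcArrayLock traffic: the waiters only compare their own commit LSN against one advancing counter.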

 What if the synchronous standby starts up from the very old backup?
 The transaction on the master needs to wait until a large amount of
 outstanding WAL has been applied? I think that synchronous replication
 should start with *asynchronous* replication, and should switch to the
 sync level once the gap between the servers has become small enough.
 What's your opinion?
   

It's certainly one option, which I think is partly addressed
by the strict_sync_replication knob below.
If strict_sync_replication = off, then the master doesn't make
its transactions wait for the synchronous reports, and the standby(s)
can work through their WAL. IIRC, the walreceiver connects
to the master only very late in the recovery process, no?

It would be nicer if it could be made automatic. I simply thought
that there may be situations where the strict behaviour may be
desired. I was thinking about the transactions executed on the
master between the standby startup and walreceiver connection.
Someone may want to ensure the synchronous behaviour
for every xact, no?

Re: [HACKERS] Synchronous replication patch built on SR

2010-05-14 Thread Robert Haas
On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 If min_sync_replication_clients == 0, then the replication is async.
 If min_sync_replication_clients == max_wal_senders, then the
 replication is fully synchronous.
 If 0 < min_sync_replication_clients < max_wal_senders, then
 the replication is partially synchronous, i.e. the master waits
 for only, say, 50% of the clients to report back before the commit is
 considered synchronous and the relevant transactions are released from the wait.

That's an interesting design and in some ways pretty elegant, but it
rules out some things that people might easily want to do - for
example, synchronous replication to the other server in the same data
center that acts as a backup for the master; and asynchronous
replication to a reporting server located off-site.

One of the things that I think we will probably need/want to change
eventually is the fact that the master has no real knowledge of who
the replication slaves are.  That might be something we want to change
in order to be able to support more configurability.  Inventing syntax
out of whole cloth and leaving semantics to the imagination of the
reader:

CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
CREATE REPLICATION SLAVE failover_server (mode synchronous,
xid_feedback off, break_synchrep_timeout 30);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronous replication patch built on SR

2010-05-14 Thread Boszormenyi Zoltan
Robert Haas írta:
 On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
   
 If min_sync_replication_clients == 0, then the replication is async.
 If min_sync_replication_clients == max_wal_senders, then the
 replication is fully synchronous.
 If 0 < min_sync_replication_clients < max_wal_senders, then
 the replication is partially synchronous, i.e. the master waits
 for only, say, 50% of the clients to report back before the commit is
 considered synchronous and the relevant transactions are released from the wait.
 

 That's an interesting design and in some ways pretty elegant, but it
 rules out some things that people might easily want to do - for
 example, synchronous replication to the other server in the same data
 center that acts as a backup for the master; and asynchronous
 replication to a reporting server located off-site.
   

No, it doesn't. :-) You didn't take into account the third knob
usable in recovery.conf:
synchronous_slave = on/off
The off-site reporting server can be an asynchronous standby,
while the on-site backup server can be synchronous. The only thing
you need to take into account is that min_sync_replication_clients
shouldn't ever exceed your actual number of synchronous standbys.
The setup these three knobs provide is pretty flexible, I think.
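For instance, the mixed scenario above (synchronous on-site backup plus asynchronous off-site reporting server) could be expressed with the three knobs roughly like this; all values are illustrative and this is only a sketch of the proposed settings, not documented syntax:

```
# postgresql.conf on the master
max_wal_senders = 2
min_sync_replication_clients = 1   # wait for one synchronous report per commit
strict_sync_replication = off      # don't block commits while no sync standby is connected

# recovery.conf on the on-site backup standby
synchronous_slave = on

# recovery.conf on the off-site reporting standby
synchronous_slave = off
```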

 One of the things that I think we will probably need/want to change
 eventually is the fact that the master has no real knowledge of who
 the replication slaves are.

The changes I made in my patch partly change that:
the server still doesn't know who the standbys are,
but there's a call that returns the number of connected
_synchronous_ standbys.

   That might be something we want to change
 in order to be able to support more configurability.  Inventing syntax
 out of whole cloth and leaving semantics to the imagination of the
 reader:

 CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
 CREATE REPLICATION SLAVE failover_server (mode synchronous,
 xid_feedback off, break_synchrep_timeout 30);

   


-- 
Bible has answers for everything. Proof:
But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil. (Matthew 5:37) - basics of digital technology.
May your kingdom come - superficial description of plate tectonics

--
Zoltán Böszörményi
Cybertec Schönig  Schönig GmbH
http://www.postgresql.at/




Re: [HACKERS] Synchronous replication patch built on SR

2010-05-01 Thread Boszormenyi Zoltan
Hi,

Bruce Momjian írta:
 Please add it to the next commit-fest:

   https://commitfest.postgresql.org/action/commitfest_view/inprogress
   

it was already added two days ago:

https://commitfest.postgresql.org/action/patch_view?id=297

Best regards,
Zoltán Böszörményi





Re: [HACKERS] Synchronous replication patch built on SR

2010-04-30 Thread Bruce Momjian

Please add it to the next commit-fest:

https://commitfest.postgresql.org/action/commitfest_view/inprogress

---

z...@cybertec.at wrote:
 Resending, my ISP lost my mail yesterday. :-(
 
 ===
 
 Hi,
 
 attached is a patch that does $SUBJECT, we are submitting it for 9.1.
 I have updated it to today's CVS after the wal_level GUC went in.
 
 How does it work?
 
 First, the walreceiver and the walsender are now able to communicate
 in a duplex way on the same connection, so while COPY OUT is
 in progress from the primary server, the standby server is able to
 issue PQputCopyData() to pass the transaction IDs that were seen
 with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
 signatures. I did this by adding a new protocol message type, with letter
 'x', that's only acknowledged by the walsender process. The regular
 backend was intentionally left unchanged, so an SQL client gets a protocol
 error. A new libpq call, PQsetDuplexCopy(), sends this
 new message before sending START_REPLICATION. The primary
 makes a note of it in the walsender process' entry.
 
 I had to move the TransactionIdLatest(xid, nchildren, children) call
 that computes latestXid earlier in RecordTransactionCommit(), so
 it's in the critical section now, just before the
 XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata)
 call. Otherwise, there was a race condition between the primary
 and the standby server, where the standby server might have seen
 the XLOG_XACT_COMMIT record for some XIDs before the
 transaction in the primary server marked itself waiting for this XID,
 resulting in stuck transactions.
 
 I have added 3 new options, two GUCs in postgresql.conf and one
 setting in recovery.conf. These options are:
 
 1. min_sync_replication_clients = N
 
 where N is the number of standby reports required for a given transaction
 before it is released as synchronously committed. 0 means completely
 asynchronous; the value is capped at max_wal_senders. Anything
 between 0 and max_wal_senders means a different level of partially
 synchronous replication.
 
 2. strict_sync_replication = boolean
 
 where the expected number of synchronous reports from standby
 servers is further capped at the actual number of connected synchronous
 standby servers when this GUC is false. This means that if no standby
 servers are connected yet, replication is effectively asynchronous and
 transactions are allowed to finish without waiting for synchronous
 reports. If this GUC is true, transactions wait until enough
 synchronous standbys have connected and reported back.
 
 3. synchronous_slave = boolean (in recovery.conf)
 
 this instructs the standby server to tell the primary that it's a
 synchronous replication server and that it will send the committed XIDs
 back to the primary.
 
 I also added a contrib module for monitoring the synchronous replication,
 but it abuses the procarray.c code by exposing the procArray pointer,
 which is ugly. It either needs to be abandoned or moved into core if or when
 this code has been discussed enough.  :-)
 
 Best regards,
 Zoltán Böszörményi

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
