On Mon, Feb 7, 2011 at 1:20 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> Here's the latest patch for sync rep.
>
> Here is a rebased version of this patch which applies to head of the
> master branch. I haven't tested it yet beyond making sure that it
> compiles and passes the regression tests -- but this fixes the bitrot.
As I mentioned yesterday that I would, I spent some time working on this. I think that there are somewhere between three and six independently useful features in this patch, plus a few random changes to the documentation that I'm not sure whether we want or not (e.g. replacing master with primary in a few places, or the other way around).

One problem with the core synchronous replication technology is that walreceiver cannot both receive WAL and write WAL at the same time. It switches back and forth between reading WAL from the network socket and flushing it to disk. The impact of that is somewhat mitigated in the current patch because it only implements the "fsync" level of replication, and chances are that the network read time is small compared to the fsync time. But it would certainly suck for the "receive" level we've talked about having in the past, because after receiving each batch of WAL, the WAL receiver wouldn't be able to send any more acknowledgments until the fsync completed, and that's bound to be slow. I'm not really sure how bad it will be in "fsync" mode; it may be tolerable, but as Simon noted in a comment, in the long run it'd certainly be nicer to have the WAL writer process running during recovery.

As a general comment on the quality of the code, I think that the overall logic is probably sound, but there are an awful lot of debugging leftovers and inconsistencies between various parts of the patch. For example, when I initially tested it, *asynchronous* replication kept breaking between the master and the standby, and I couldn't figure out why. I finally realized that there was a ten-second pause that had been inserted into the WAL receiver loop as a debugging tool, which was allowing the standby to fall far enough behind that the master was able to recycle WAL segments the standby still needed. Under ordinary circumstances, I would say that a patch like this was not mature enough to submit for review, let alone commit. For that reason, I am pretty doubtful about the chances of getting this finished for 9.1 without some substantial prolongation of the schedule.

That having been said, there is at least one part of this patch which looks to be in pretty good shape and seems independently useful regardless of what happens to the rest of it, and that is the code that sends replies from the standby back to the primary. This allows pg_stat_replication to display the write/flush/apply log positions on the standby next to the sent position on the primary, which as far as I am concerned is pure gold. Simon had this set up to happen only when synchronous replication or XID feedback was in use, but I think people are going to want it even with plain old asynchronous replication, because it provides a FAR easier way to monitor standby lag than anything we have today. I've extracted this portion of the patch, cleaned it up a bit, written docs, and attached it here.

I wasn't too sure how to control the timing of the replies. It's worth noting that you have to send them pretty frequently for the distinction between xlog written and xlog flushed to have any value. What I've done here is make it so that every time we read all available data on the socket, we send a reply. After flushing, we send another reply. And then just for the heck of it we send a reply at least every 10 seconds (configurable), which causes the last-known-apply position to eventually get updated on the master. This means the apply position can lag reality by a bit.
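To make that concrete, here is a simplified, self-contained sketch of the reply logic I just described. This is not the code from the patch: recv_wal(), flush_wal(), and send_reply() are hypothetical stand-ins, and the real loop blocks on the socket rather than spinning.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define REPLY_INTERVAL_SECS 10          /* assumed configurable, per the above */

/* Hypothetical stubs standing in for the real socket and disk routines. */
static bool recv_wal(void)   { sleep(1); return false; }  /* false = drained */
static void flush_wal(void)  { puts("fsync"); }
static void send_reply(void) { puts("reply"); }

int main(void)
{
    time_t last_reply = time(NULL);

    for (;;)
    {
        /* Drain whatever is available on the socket right now.  While
         * we're down in flush_wal() below, nothing gets read and no
         * acknowledgments go out -- the single-process bottleneck
         * described above. */
        while (recv_wal())
            ;
        send_reply();                   /* reply #1: socket drained */
        last_reply = time(NULL);

        flush_wal();
        send_reply();                   /* reply #2: after the fsync */
        last_reply = time(NULL);

        /* Reply #3: at least every REPLY_INTERVAL_SECS, so the primary's
         * last-known apply position keeps advancing even when the real
         * loop is sitting blocked waiting for new WAL to arrive. */
        if (time(NULL) - last_reply >= REPLY_INTERVAL_SECS)
        {
            send_reply();
            last_reply = time(NULL);
        }
    }
}

The thing to notice is that all three reply triggers live in one single-threaded loop, which is also exactly why the fsync holds up further acknowledgments.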
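And going back to the monitoring point above, here is roughly what I'd expect lag monitoring to look like from the client side once the view exposes the standby positions. This is just an illustrative libpq snippet; I'm assuming columns named sent_location, write_location, flush_location, and replay_location, which may not match what we finally settle on:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Connection string is illustrative; point it at the primary. */
    PGconn *conn = PQconnectdb("dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Column names assumed -- adjust to whatever the view ends up with. */
    PGresult *res = PQexec(conn,
        "SELECT application_name, sent_location, write_location, "
        "flush_location, replay_location FROM pg_stat_replication");

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        for (int i = 0; i < PQntuples(res); i++)
            printf("%s: sent=%s write=%s flush=%s replay=%s\n",
                   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
                   PQgetvalue(res, i, 2), PQgetvalue(res, i, 3),
                   PQgetvalue(res, i, 4));
    }
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}

Compare replay_location against pg_current_xlog_location() on the primary and you have a standby-lag measure without touching the standby at all.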
Simon's version adds a latch, so that the startup process can poke the WAL receiver to send a reply when the apply position moves. But this is substantially more complex and I'm not sure it's worth it. If we were implementing the "apply" level of synchronous replication, we'd clearly need that for performance not to stink. But since the patch is only implementing "fsync" anyway, it doesn't seem necessary for now.

The only real complaint I can imagine about offering this functionality all the time is that it uses extra bandwidth. I'm inclined to think that the ability to shut it off completely is sufficient answer to that complaint.

<dons asbestos underwear>

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
wal-receiver-replies.patch