I'm currently implementing commit sequence number (CSN) based snapshots and I hit a design decision that I would like to resolve before I have too much code to rewrite.
The issue is commit visibility ordering on slaves. As a couple of threads on hackers have already noted, commit order on slaves can currently differ from what is seen on the master. This arises because on the master commit visibility is determined by the order in which ProcArrayEndTransaction() acquires ProcArrayLock, while on slaves commit visibility is exactly the order of commit records in WAL. Because XLogInsert() in RecordTransactionCommit() is not interlocked with ProcArrayEndTransaction(), these orders can differ. With a mix of sync and async transactions they are in fact quite likely to differ, due to the durability wait in RecordTransactionCommit().

It's not possible to change master commit order to match WAL order, because then either async transactions must wait behind sync transactions before returning, losing the point of async; or async transactions must return without becoming visible, changing user visible semantics; or sync transactions must become visible before they become durable, again changing user visible semantics.

As it's not possible to change master commit order, the slave visibility order must change for the orders to be consistent. WAL currently doesn't have the information to reconstruct master commit order. Either we need to add a new WAL record for the commit order (only necessary when wal_level=hot_standby) or add a side channel to replication connections to communicate commit order information.

One more consideration here is the wish expressed by several hackers that commit record LSNs could be used as CSNs. One of the most interesting benefits of this is that LSNs are the same over the whole cluster, meaning that it would be relatively simple to create cluster wide consistent snapshots.

I currently see the following courses of action:

1. Do nothing about the inconsistency, use a transient global counter for master commit order and commit record LSN for slaves.
   Pro: doesn't change any semantics
   Con: we are not making any progress towards cluster wide snapshots or even serializable transactions on slaves.

2. Create a new WAL record type that is inserted when a transaction becomes visible. LSN of this record determines transaction visibility order. Async transactions can be optimized to skip this record. This record does not need to be flushed.
   Pro: cluster wide consistency, replication method agnostic
   Con: one extra WAL record insertion per writing transaction (32 bytes of WAL per tx)

3. Use a transient global counter on master, send xid-csn pairs to slave via a side channel on the replication connection.
   Pro: less overhead than WAL records
   Con: replication protocol needs (possibly invasive) changes, WAL shipping based replication can't use this mechanism, lots of extra code required.

4. Make the choice between 1 and 2 user configurable (it seems to me that it could even be changed without a restart).

Thoughts?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
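
P.S. To make the ordering problem concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not PostgreSQL code: the Wal class, the "commit" and "visible" record kinds, and the use of a list position as the LSN are all made up for illustration. Every transaction writes a visibility record here, so the skip optimization for async transactions mentioned under option 2 is not modelled. A sync transaction (xid 1) inserts its commit record first and then waits for the flush; an async transaction (xid 2) commits during that wait and becomes visible first.

```python
from dataclasses import dataclass, field


@dataclass
class Wal:
    """Toy append-only log; an LSN is simply the record's 1-based position."""
    records: list = field(default_factory=list)

    def insert(self, kind, xid):
        self.records.append((kind, xid))
        return len(self.records)  # LSN of the record just written


def simulate():
    wal = Wal()
    master_visible = []  # order in which xids become visible on the master

    # Sync transaction 1 writes its commit record, then waits for the flush.
    wal.insert("commit", 1)

    # While xid 1 waits, async transaction 2 commits and returns immediately.
    wal.insert("commit", 2)
    master_visible.append(2)
    wal.insert("visible", 2)  # hypothetical visibility record (option 2)

    # The flush completes, so xid 1 becomes visible only now.
    master_visible.append(1)
    wal.insert("visible", 1)

    return wal, master_visible


def slave_order(wal, kind):
    """Order in which a slave replaying records of `kind` makes xids visible."""
    return [xid for k, xid in wal.records if k == kind]


wal, master_visible = simulate()
print(master_visible)               # [2, 1]
print(slave_order(wal, "commit"))   # [1, 2]  differs from the master
print(slave_order(wal, "visible"))  # [2, 1]  matches the master
```

Replaying commit records reproduces the WAL order, which is what slaves see today, while replaying the hypothetical visibility records reproduces the master's order. In this toy model that is the whole trade-off behind option 2: one extra record per writing transaction in exchange for a consistent order.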