On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wonder whether we could do something involving WAL properties --- the
> current tuple visibility logic was designed before WAL existed, so it's
> not exploiting that resource at all.  I'm imagining that the kernel of a
> snapshot is just a WAL position, ie the end of WAL as of the time you
> take the snapshot (easy to get in O(1) time).  Visibility tests then
> reduce to "did this transaction commit with a WAL record located before
> the specified position?".  You'd need some index datastructure that made
> it reasonably cheap to find out the commit locations of recently
> committed transactions, where "recent" means "back to recentGlobalXmin".
> That seems possibly do-able, though I don't have a concrete design in
> mind.
I was mulling this idea over some more (the same ideas keep floating
back to the top...).  I don't think an LSN can actually work, because
there's no guarantee that the order in which the WAL records are
emitted is the same order in which the effects of the transactions
become visible to new snapshots.  For example:

1. Transaction A inserts its commit record, flushes WAL, and begins
waiting for sync rep.

2. A moment later, transaction B sets synchronous_commit=off, inserts
its commit record, requests a background WAL flush, and removes itself
from the ProcArray.

3. Transaction C takes a snapshot.

Sync rep doesn't create this problem; there's a race anyway.  The
order of acquisition for WALInsertLock needn't match that for
ProcArrayLock.  This has the more-than-slightly-odd characteristic
that you could end up with a snapshot on the master that can see A but
not B and a snapshot on the slave that can see B but not A.

But having said that an LSN can't work, I don't see why we can't just
use a 64-bit counter.  In fact, the predicate locking code already
does something much like this, using an SLRU, for serializable
transactions only.

In more detail, what I'm imagining is an array with 4 billion entries,
one per XID, probably broken up into files of say 16MB each with 2
million entries per file.  Each entry is a 64-bit value.  It is 0 if
the XID has not yet started, is still running, or has aborted.
Otherwise, it is the commit sequence number of the transaction.  For
reasons I'll explain below, I'm imagining starting the commit sequence
number counter at some very large value and having it count down from
there.  So the basic operations are:

- To take a snapshot, you just read the counter.

- To commit a transaction which has an XID, you read the counter,
stamp all your XIDs with that value, and decrement the counter.

- To find out whether an XID is visible to your snapshot, you look up
the XID in the array and get the counter value.
If the value you read is greater than your snapshot value, it's
visible.  If it's less, it's not.

Now, is this algorithm any good, and how little locking can we get
away with?  It seems to me that if we used an SLRU to store the array,
the lock contention would be even worse than it is under our current
system, wherein everybody fights over ProcArrayLock.  A system like
this is going to involve lots and lots of probes into the array (even
if we build a per-backend cache of some kind) and an SLRU will require
at least one LWLock acquire and release per probe.  Some kind of
locking is pretty much unavoidable, because you have to worry about
pages getting evicted from shared memory.

However, what if we used a set of files (like SLRU) but mapped them
separately into each backend's address space?  I think this would
allow both loads and stores from the array to be done unlocked.  One
fly in the ointment is that 8-byte stores are apparently done as two
4-byte stores on some platforms.  But if the counter runs backward, I
think even that is OK.  If you happen to read an 8-byte value as it's
being written, you'll get 4 bytes of the intended value and 4 bytes of
zeros.  The value will therefore appear to be less than what it should
be.  However, if the value was in the midst of being written, then
it's still in the midst of committing, which means that that XID
wasn't going to be visible anyway.  Accidentally reading a smaller
value doesn't change the answer.

Committing will require a lock on the counter.  Taking a snapshot can
be done unlocked if (1) 8-byte reads are atomic and either (2a) the
architecture has strong memory ordering (no store/store reordering) or
(2b) you insert a memory fence between stamping the XIDs and
decrementing the counter.  Otherwise, taking a snapshot will also
require a lock on the counter.

Once a particular XID precedes RecentGlobalXmin, you no longer care
about the associated counter value.
You just need to know that it committed; the order no longer matters.
So after a crash, assuming that you have the CLOG bits available, you
can just throw away all the array contents and start the counter over
at the highest possible value.  And, as RecentGlobalXmin advances, you
can prune away (or recycle) files that are no longer needed.  In fact,
you could put all of these files on a ramfs, or maybe use shared
memory segments for them.

All that having been said, even if I haven't made any severe
conceptual errors in the above, I'm not sure how well it will work in
practice.  On the plus side, taking a snapshot becomes O(1) rather
than O(MaxBackends) - that's good.  On the further plus side, you can
check both whether an XID has committed and whether it's visible to
your snapshot in a single, atomic action with no lock - that seems
really good.  On the minus side, checking an XID against your snapshot
now has less locality of reference.  And, rolling over into a new
segment of the array is going to require everyone to map it, and maybe
cause some disk I/O as a new file gets created.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers