On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <st...@mit.edu> wrote:
> On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> ProcArrayLock looks like a tougher nut to crack - there's simply no
>> way, with the system we have right now, that you can take a snapshot
>> without locking the list of running processes.  I'm not sure what to
>> do about that, but we're probably going to have to come up with
>> something, because it seems clear that once we eliminate the lock
>> manager LWLock contention, this is a major bottleneck.
>
> Well as Tom observed earlier the kernel of a snapshot is actually a
> LSN. A snapshot contains a set of xids which all committed before some
> LSN and none which committed after it.
>
> So if we had a record of what log sequence number the commit record
> for any given transaction is we could build the snapshot at our
> leisure without any exclusive lock. In fact we could even build it
> lazily as a kind of cache only when we actually are interested in a
> given xid.
Yeah, I've been thinking about that.  I think what we might do is set up a
new SLRU that works like CLOG, but where each entry is an LSN rather than
just two bits.  When a transaction commits, we save the commit LSN under
the entry for that XID.  We truncate away SLRU pages that contain no
still-running XIDs.  When we need to check whether an XID is visible to
our snapshot, we just look up the commit LSN and compare it with our
snapshot LSN.  If it's before and non-zero, we can see it.  If it's after
or all-zeroes, we can't.

But I'm not sure how much this would really help.  It might (subject to
working out the details) make the actual process of taking a snapshot
faster.  But it's not clear that taking snapshots more quickly will
actually help anything, because the problem is not the amount of time
spent taking the snapshot.  The problem is rather that doing so requires
acquiring and releasing an LWLock, and each of those operations requires
taking and releasing a spinlock.  And it is the spinlock contention that
is killing us.  That makes me think we need a way to take a snapshot
without taking a spinlock.  Or if we must take spinlocks, we at least
have to avoid having every backend that needs a snapshot lusting after
the *same* spinlock.

What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure.  When a
backend starts up, it would add an entry to the queue saying "I'm
running".  When it commits, it would add an entry to the queue saying
"I'm committed".  All entries would be added at the *end* of the queue,
so a backend scanning the queue to build up a snapshot wouldn't ever be
able to see commits out of order.  We would need some memory barrier
operations on weak-memory-ordering machines to make sure that the queue
writes became visible before the end-of-queue pointer bump.  The trick
is figuring out how to clean up the queue.
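The commit-LSN visibility test above is easy to sketch.  This is a toy
single-process model, not the proposed SLRU machinery: the flat array,
and the names `record_commit` and `xid_visible`, are my own illustrative
assumptions (a real SLRU would page entries in and out, and LSNs would be
XLogRecPtr values), but the comparison rule is exactly the one described:
visible iff the slot is non-zero and precedes the snapshot LSN.

```c
#include <stdint.h>

typedef uint64_t LSN;          /* toy stand-in for XLogRecPtr */
#define MAX_XIDS 1024          /* toy table size; a real SLRU is paged */

/* One commit-LSN slot per XID; 0 means "not committed (or unknown)". */
static LSN commit_lsn[MAX_XIDS];

/* At commit time, record the LSN of this XID's commit record. */
static void
record_commit(uint32_t xid, LSN lsn)
{
    commit_lsn[xid] = lsn;
}

/* An XID is visible to a snapshot iff its commit LSN is non-zero
 * (it committed) and falls before the snapshot LSN. */
static int
xid_visible(uint32_t xid, LSN snapshot_lsn)
{
    LSN lsn = commit_lsn[xid];
    return lsn != 0 && lsn < snapshot_lsn;
}
```

Note that the lookup takes no lock at all: a still-running XID simply has
an all-zeroes slot, which reads as invisible.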
Since "commit" entries exist only to guard against "running" entries
earlier in the queue, the start-of-queue pointer can be advanced whenever
it points to a "commit" entry.  Also, if it points to a "running" entry
for which there is a later "commit" entry, then the start-of-queue
pointer can be advanced over that as well.  However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans needn't
worry about those entries, we *don't* know that there isn't already a
scan in flight that still needs them.  Furthermore, if a transaction runs
for a long time, we can never advance the start-of-queue pointer past the
"running" entry for its XID, which is obviously problematic since the
queue would get very long.

To work around that problem, I think we could use Florian's idea upthread
of an RCU system.  We keep two copies of the queue around, an A copy and
a B copy.  When the currently active copy fills up, we rewrite it into
the other queue, omitting all "committed" entries and any "running"
entries that have matching "committed" entries, and then tell everyone to
start looking at that copy instead.  We would need some kind of
gymnastics to make sure that we don't flip from the A copy to the B copy
and back to the A copy while some laggardly backend is still hoping to
scan the old A copy.  A simple algorithm (there's probably a smarter one)
would be to have each backend take a spinlock while it's scanning either
copy, and to have the backend that is doing the rewrite take and release
all of those spinlocks one at a time before beginning the rewrite, thus
guaranteeing that any scans still in progress when the rewrite is
requested have completed before it's actually performed.  Any new scans
started in the meanwhile will certainly be looking at the current copy
rather than the old copy we're about to overwrite.
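The two-copy compaction step can be sketched as below.  This is a
single-process toy under assumed names (`queue_add`, `queue_rewrite`,
etc., are not PostgreSQL APIs); it shows only the rewrite rule - drop
"committed" entries and any "running" entry that has a matching
"committed" entry, then flip the active pointer - and elides the
per-backend scan spinlocks and memory barriers a real version would need
before the flip.

```c
#include <stdint.h>

#define QUEUE_CAP 8

typedef enum { ENTRY_RUNNING, ENTRY_COMMITTED } EntryKind;

typedef struct { uint32_t xid; EntryKind kind; } Entry;
typedef struct { Entry items[QUEUE_CAP]; int len; } Queue;

/* Two copies, RCU-style: readers scan *active; the rewriter compacts
 * into the inactive copy and then flips the pointer.  (A real version
 * would first drain each backend's scan spinlock and issue a memory
 * barrier before publishing the new pointer.) */
static Queue queues[2];
static Queue *active = &queues[0];

static void
queue_add(uint32_t xid, EntryKind kind)
{
    active->items[active->len++] = (Entry){ xid, kind };
}

/* Is there a "committed" entry for this xid anywhere in the queue? */
static int
has_commit(const Queue *q, uint32_t xid)
{
    for (int i = 0; i < q->len; i++)
        if (q->items[i].xid == xid && q->items[i].kind == ENTRY_COMMITTED)
            return 1;
    return 0;
}

/* Copy only "running" entries with no matching "committed" entry into
 * the other copy, then make that copy current. */
static void
queue_rewrite(void)
{
    Queue *next = (active == &queues[0]) ? &queues[1] : &queues[0];
    next->len = 0;
    for (int i = 0; i < active->len; i++) {
        Entry e = active->items[i];
        if (e.kind == ENTRY_RUNNING && !has_commit(active, e.xid))
            next->items[next->len++] = e;
    }
    active = next;   /* new scans now see the compacted copy */
}
```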
We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue.  But if that lock gets busy, it would be possible to have backends
use some sort of system of elimination buffers to combine their requests.
If ten backends each want to advertise a new XID, it will be far faster
to have one backend acquire the lock, write in all ten entries, and wake
everyone up than to have each backend insert its entry and wake up the
next one.  So what we might do is do a condition-acquire on the lock to
insert data into the buffer and, if we can't get it immediately, go find
another backend with the same problem and elect a leader to do all the
insertions.  If we're the leader, sleep on the queue-insertion lock.  If
not, sleep until the leader wakes us up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
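The leader-election combining idea can be modeled in a few lines.  This
is a toy single-process sketch, not real code: `advertise_xid` and the
`queue_lock_held` flag are my own assumed stand-ins (the flag plays the
role of a failed conditional LWLock acquire), and real backends would
sleep and be woken rather than spin through a shared array.  It shows
only the accounting: losers park their XID in a pending buffer, and the
eventual leader inserts its own entry plus everything parked, under one
lock acquisition.

```c
#include <stdint.h>

#define MAX_GROUP 16

/* XIDs parked by backends that could not get the insertion lock. */
static uint32_t pending[MAX_GROUP];
static int n_pending;

/* The shared queue being appended to. */
static uint32_t queue[MAX_GROUP];
static int queue_len;

/* Stand-in for "conditional lock acquire failed". */
static int queue_lock_held;

/* Returns 1 if we acted as leader and flushed the group, 0 if we
 * parked our XID for a future leader to insert. */
static int
advertise_xid(uint32_t xid)
{
    if (queue_lock_held) {
        pending[n_pending++] = xid;       /* join the group and (in
                                           * reality) sleep until woken */
        return 0;
    }
    queue_lock_held = 1;                  /* we are the leader */
    queue[queue_len++] = xid;             /* insert our own entry... */
    for (int i = 0; i < n_pending; i++)
        queue[queue_len++] = pending[i];  /* ...and everyone else's */
    n_pending = 0;
    queue_lock_held = 0;                  /* release and wake the group */
    return 1;
}
```

Ten contending backends thus cost one lock handoff instead of ten.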