On Fri, May 13, 2011 at 4:16 PM, Noah Misch <n...@leadboat.com> wrote:
> The key is putting a rapid hard stop to all fast-path lock acquisitions and
> then reconstructing a valid global picture of the affected lock table regions.
> Your 1024-way table of strong lock counts sounds promising. (Offhand, I do
> think they would need to be counts, not just flags.)
>
> If I'm understanding correctly, your pseudocode would look roughly like this:
>
>         if (level >= ShareUpdateExclusiveLock)
>                 ++strong_lock_counts[my_strong_lock_count_partition]
>                 sfence
>                 if (strong_lock_counts[my_strong_lock_count_partition] == 1)
>                         /* marker 1 */
>                         import_all_local_locks
>                 normal_LockAcquireEx
>         else if (level <= RowExclusiveLock)
>                 lfence
>                 if (strong_lock_counts[my_strong_lock_count_partition] == 0)
>                         /* marker 2 */
>                         local_only
>                         /* marker 3 */
>                 else
>                         normal_LockAcquireEx
>         else
>                 normal_LockAcquireEx
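To make sure I'm reading that the same way you are, here's a small standalone toy of that handshake.  None of this is actual lock-manager code: the function names and lock-level values are made up, and C11 seq_cst atomics stand in for the explicit sfence/lfence pair.  It just illustrates the counter-check protocol, not the real bookkeeping.

    #include <stdatomic.h>
    #include <stdio.h>

    #define STRONG_LOCK_PARTITIONS 1024

    /* toy stand-ins for the real lock levels */
    typedef enum
    {
        AccessShareLock = 1,
        RowExclusiveLock = 3,
        ShareUpdateExclusiveLock = 4,
        AccessExclusiveLock = 8
    } LockLevel;

    static atomic_int strong_lock_counts[STRONG_LOCK_PARTITIONS];

    static void import_fast_path_locks(int part) { printf("import partition %d\n", part); }
    static void normal_lock_acquire(int part)    { printf("normal acquire, partition %d\n", part); }
    static void fast_path_acquire(int part)      { printf("fast-path acquire, partition %d\n", part); }

    static void
    lock_acquire_sketch(int part, LockLevel level)
    {
        if (level >= ShareUpdateExclusiveLock)      /* "strong", per the pseudocode above */
        {
            /* seq_cst fetch-add stands in for "increment, then sfence";
             * the old value being 0 corresponds to the count now being 1 */
            if (atomic_fetch_add(&strong_lock_counts[part], 1) == 0)
                import_fast_path_locks(part);       /* marker 1 */
            normal_lock_acquire(part);
        }
        else if (level <= RowExclusiveLock)         /* "weak": eligible for the fast path */
        {
            /* seq_cst load stands in for the lfence before the check */
            if (atomic_load(&strong_lock_counts[part]) == 0)
                fast_path_acquire(part);            /* markers 2..3 */
            else
                normal_lock_acquire(part);
        }
        else
            normal_lock_acquire(part);              /* unreachable with these toy levels;
                                                     * kept to mirror the pseudocode */
    }

    int
    main(void)
    {
        lock_acquire_sketch(7, RowExclusiveLock);       /* takes the fast path */
        lock_acquire_sketch(7, AccessExclusiveLock);    /* bumps the counter, imports */
        lock_acquire_sketch(7, RowExclusiveLock);       /* now falls back to the normal path */
        return 0;
    }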
I think ShareUpdateExclusiveLock should be treated as neither weak nor strong. It certainly can't be treated as weak - i.e. use the fast path - because it's self-conflicting. It could be treated as strong, but since it doesn't conflict with any of the weak lock types, that would only serve to prevent fast-path lock acquisitions that otherwise could have succeeded. In particular, it would unnecessarily disable fast-path lock acquisition for any relation being vacuumed, which could be really ugly considering that one of the main workloads that would benefit from something like this is the case where lots of backends are fighting over a lock manager partition lock on a table they all want to read and/or modify. I think it's best for ShareUpdateExclusiveLock to always use the regular lock-acquisition path, but it need not worry about incrementing strong_lock_counts[] or importing local locks in so doing.

Also, I think in the step just after marker one, we'd import only those local locks whose lock tags were equal to the lock tag on which we were attempting to acquire a strong lock. The downside of this whole approach is that acquiring a strong lock becomes, at least potentially, a lot slower, because you have to scan through the whole backend array looking for fast-path locks to import (let's not use the term "local lock", which is already in use within the lock manager code). But maybe that can be optimized enough not to matter. After all, if the lock manager scaled perfectly at high concurrency, we wouldn't be thinking about this in the first place.

> At marker 1, we need to block until no code is running between markers two and
> three. You could do that with a per-backend lock (LW_SHARED by the strong
> locker, LW_EXCLUSIVE by the backend). That would probably still be a win over
> the current situation, but it would be nice to have something even cheaper.

I don't have a better idea than to use an LWLock. I have a patch floating around to speed up our LWLock implementation, but I haven't got a workload where the bottleneck is the actual speed of operation of the LWLock rather than the fact that it's contended in the first place. And the whole point of this would be to arrange things so that the LWLocks are uncontended nearly all the time.

> Then you have the actual procedure for transfer of local locks to the global
> lock manager. Using record locks in a file could work, but that's a system call
> per lock acquisition regardless of the presence of strong locks. Is that cost
> sufficiently trivial?

No, I don't think we want to go into kernel space. When I spoke of using a file, I was imagining it as an mmap'd region that one backend could write to and which, at need, another backend could mmap and grovel through. Another (perhaps simpler) option would be to just put it in shared memory. That doesn't give you as much flexibility in terms of expanding the segment, but it would be reasonable to allow space for only, dunno, say 32 locks per backend in shared memory; if you need more than that, you flush them all to the main lock table and start over. You could possibly even just make this a hack for the particular special case where we're taking a relation lock on a non-shared relation; then you'd need only 128 bytes for a 32-entry array, plus the LWLock (I think the database ID is already visible in shared memory).
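To put a rough shape on that, the per-backend structure I'm imagining is something like the following; the names, field layout, and slot count are strictly illustrative, not a worked-out design:

    #include "postgres.h"
    #include "storage/lwlock.h"

    #define FP_LOCK_SLOTS_PER_BACKEND 32    /* illustrative only */

    typedef struct FastPathBackendLocks
    {
        LWLockId    fpLock;     /* per the scheme quoted above: LW_EXCLUSIVE by
                                 * the owning backend, LW_SHARED by a strong
                                 * locker transferring entries to the main
                                 * lock table */
        Oid         relid[FP_LOCK_SLOTS_PER_BACKEND];   /* non-shared relations;
                                                         * InvalidOid = free slot */
        /* lock modes could be packed into a parallel array or a bitmask */
    } FastPathBackendLocks;

    /* one element per backend, sized by MaxBackends, in shared memory */
    extern FastPathBackendLocks *FastPathLockArray;

The 32 OIDs are where the ~128 bytes per backend comes from; the database OID doesn't need to be stored per entry if it's already visible in shared memory.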
> I wonder if, instead, we could signal all backends at
> marker 1 to dump the applicable parts of their local (memory) lock tables to
> files. Or to another shared memory region, if that didn't mean statically
> allocating the largest possible required amount. If we were willing to wait
> until all backends reach a CHECK_FOR_INTERRUPTS, they could instead make the
> global insertions directly. That might yield a decent amount of bug swatting to
> fill in missing CHECK_FOR_INTERRUPTS, though.

I've thought about this; I believe it's unworkable. If one backend goes into the tank (think: SIGSTOP, or blocking on I/O to an unreadable disk sector), this could lead to cascading failure.

> Improvements in this area would also have good synergy with efforts to reduce
> the global impact of temporary table usage. CREATE TEMP TABLE can be the
> major source of AccessExclusiveLock acquisitions. However, with the strong
> lock indicator partitioned 1024 ways or more, that shouldn't be a deal killer.

If that particular case is a problem for you, it seems like optimizing away the exclusive lock altogether might be possible. No other backend should be able to touch the table until the transaction commits anyhow, and at that point we're going to release the lock. There are possibly some sticky wickets here, but it seems at least worth thinking about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers