On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Andres is being very polite here, but the reality is that this
> approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?

> You can't hold buffer locks for a long period of time across complex
> operations. Full stop. It's a violation of the rules that are
> clearly documented in src/backend/storage/buffer/README, which have
> been in place for a very long time, and this patch is nowhere near
> important enough to warrant a revision of those rules.

The importance of this patch is a value judgement. Our users have
been screaming for this for over ten years, so to my mind it has a
fairly high importance. Also, every other database system of every
stripe worth mentioning has something approximately equivalent to
this, including ones with much less functionality generally. The fact
that we don't is a really unfortunate omission.

As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've done a
better job of limiting the amount of work that heap_insert() can do
while the locks are held. I expect to soon produce a revision where
toasting doesn't happen with the locks held. Actually, I've already
written the code; I just need to do some testing.

> We are not going to risk breaking every bit of code anywhere in the
> backend or in third-party code that takes a buffer lock. You are
> never going to convince me, or Tom, that the benefit of doing that
> is worth the risk; in fact, I have a hard time believing that you'll
> find ANY committer who thinks this approach is worth considering.

I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.

> Even if you get the code to run without apparent deadlocks, that
> doesn't mean there aren't any;

Of course it doesn't. Who said otherwise?

> Our buffer locking regimen suffers from painful complexity and
> serious maintenance difficulties as is.

That's true to a point, but it has more to do with things like how
VACUUM interacts with hio.c. Things like this:

	/*
	 * Release the file-extension lock; it's now OK for someone else to
	 * extend the relation some more.  Note that we cannot release this
	 * lock before we have buffer lock on the new page, or we risk a race
	 * condition against vacuumlazy.c --- see comments therein.
	 */
	if (needLock)
		UnlockRelationForExtension(relation, ExclusiveLock);

The btree code is different, though: it implements a well-defined
interface, with much clearer separation of concerns. As I've said
already, with trivial exceptions (think contrib), no external code
considers itself to have license to obtain locks of any sort on btree
buffers. No external code of ours - without exception - does anything
with multiple locks, or exclusive locks, on btree buffers.

I'll remind you that I'm only holding shared locks when control is
outside of the btree code (see the sketch below). Even within the
btree code, the number of access method functions that could conflict
with what I do here (those that acquire exclusive locks) is very
small once you exclude things that only exclusive-lock the meta page
(and there are also very few of those). So the surface area is quite
small.

I'm not denying that there is a cost, or that I've expanded things in
a direction I'd prefer not to have. I just think that it may well be
worth it, particularly when you consider the alternatives - this may
well be the least worst thing.
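To be concrete about that pattern, here is an illustrative sketch of
mine - not code from the patch, and value_lock_leaf() is a made-up
name - using the existing nbtree entry points as they stand today:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Illustrative sketch only, not the patch: take a shared "value lock"
 * on the unique index leaf page that the key would go on.
 * _bt_search() descends the tree and returns that leaf page with a
 * BT_READ (shared) buffer lock held; we don't need the descent stack.
 */
static Buffer
value_lock_leaf(Relation rel, int keysz, ScanKey scankey)
{
	Buffer		buf;
	BTStack		stack;

	stack = _bt_search(rel, keysz, scankey, false, &buf, BT_READ);
	_bt_freestack(stack);

	/*
	 * Control now leaves the btree code holding only this shared
	 * lock; the caller inserts the heap tuple, then re-enters the
	 * btree code to insert the index tuple (or releases the lock and
	 * restarts on conflict).
	 */
	return buf;
}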
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a
long time, no matter how they're implemented. That could take much
longer than just releasing a shared buffer lock, since for each
unique index the promise tuple must be re-found from scratch. There
are huge issues with additional complexity and bloat. Oh, and now
your lightweight locks aren't so lightweight any more.

If the value locks were made interruptible through some method, such
as the promise tuples approach, would that really make deadlocking
acceptable? Sure, your system didn't seize up - but on the other
hand, the user got a deadlock error through no fault of their own.
The former may be worse, but the latter is also inexcusable. In
general, the best solution is just to not have deadlock hazards. I
wouldn't be surprised if reasoning about deadlocking was harder with
that alternative approach to value locking, not easier.

> Moreover, we've already got performance and scalability problems
> that are attributable to every backend in the system piling up
> waiting on a single lwlock, or a group of simultaneously-held
> lwlocks. Dramatically broadening the scope of where lwlocks are
> used and for how long they're held is going to make that a whole
> lot worse.

You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.

> A further problem is that a backend which holds even one lwlock
> can't be interrupted. We've had this argument before and it seems
> that you don't think that non-interruptibility is a problem, but it
> is project policy to allow for timely interrupts in all parts of
> the backend and we're not going to change that policy for this
> patch.

I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering
trade-off. Of course I value interruptibility.

If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusively locked.
On a busy system, some backend is very frequently going to experience
a non-interruptible wait for however long it takes to write and flush
perhaps a whole segment. All other flushing backends are stuck in
non-interruptible waits, waiting for that backend to finish. I think
that the group commit stuff might have regressed worst-case
interruptibility for flushers by quite a bit; should we never have
committed that, or do you agree with my view that it's worth it?

In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked, and only those inserting into a narrow
range of values around the btree leaf page. Much of the work that
even those writers need to do will be unimpeded anyway; they'll just
block on attempting to acquire an exclusive lock on the first btree
leaf page that the value they're inserting could be on.
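For context on how long such a wait can already be, here is the rough
shape of heap_insert() today - a heavily abridged, from-memory sketch
of the current code, not a verbatim copy, with WAL record
construction, visibility map handling, and error paths elided:

#include "postgres.h"
#include "access/heapam.h"
#include "access/hio.h"
#include "storage/bufmgr.h"

/*
 * Abridged sketch of heap_insert() (heapam.c) - not verbatim.  The
 * target buffer is pinned and exclusive-locked almost immediately,
 * and only unlocked at the very end, so anyone waiting on that
 * buffer already waits non-interruptibly through nearly all of this.
 */
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
			int options, BulkInsertState bistate)
{
	HeapTuple	heaptup;
	Buffer		buffer;
	Buffer		vmbuffer = InvalidBuffer;

	/* header setup, and possibly TOASTing, before any lock is taken */
	heaptup = heap_prepare_insert(relation, tup,
								  GetCurrentTransactionId(), cid,
								  options);

	/* returns the target buffer pinned and exclusive-locked */
	buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
									   InvalidBuffer, options, bistate,
									   &vmbuffer, NULL);

	START_CRIT_SECTION();
	RelationPutHeapTuple(relation, buffer, heaptup);
	MarkBufferDirty(buffer);
	/* ... XLogInsert() of the XLOG_HEAP_INSERT record ... */
	END_CRIT_SECTION();

	UnlockReleaseBuffer(buffer);	/* lock finally released here */
	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);

	/* ... cache invalidation, pgstat counting, free heaptup ... */
	return HeapTupleGetOid(heaptup);
}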
And the additional non-interruptible wait of those inserters won't be
terribly much longer than the wait of the backend whose heap tuple
insertion takes a long time anyway - that backend already has to do
close to 100% of that work under a non-interruptible wait today (once
we eliminate heap_prepare_insert() and toasting from the locked
window). As the sketch above shows, the UnlockReleaseBuffer() call
comes right at the end of heap_insert(), and the buffer is pinned and
locked very close to the start.

-- 
Peter Geoghegan