On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Right, but what about XactLockTableWait() itself? It only acquires a
>> ShareLock on the xid of the got-there-first inserter that potentially
>> hasn't yet committed/aborted.
>
> That's an interesting point. As you pointed out in later emails, that
> case is handled for heap tuple locks, but btree uniqueness conflicts
> are a different kettle of fish.
Right. It suits my purposes to have the value locks be held for only an instant, because:

1) It will perform much better, and deadlock much less, in certain scenarios if sessions are given leeway not to block each other across multiple values in multiple unique indexes. That is, we make them "considerate", like a person with a huge shopping cart who lets another person with one item go ahead of them in the queue - and who perhaps greatly reduces their own wait in doing so, because the person with one item makes the cashier immediately say "no" to the person with all those groceries. Ahem.

I don't think that this implies any additional anomalies at read committed, and I'm reasonably confident that it doesn't regress things to any degree lock-starvation-wise: lock starvation can only come from a stream of inserters of the same value that consistently abort, just like the present situation with one unique index. (I think it's actually better with multiple unique indexes than with only one - there are more opportunities for the would-be-starved session to hear a definitive "no" answer and give up.)

2) If and when we improve on the buffer locking stuff (by adding a value locking SLRU), it will probably be considerably easier to assume that the locks are shortly held. For example, maybe it's okay that the implementation doesn't allow page splits on value-locked pages, and maybe that makes things much easier to reason about.

If you're determined to have a strict serial ordering of value locking *without serialization failures*, I think what I've already said about the interactions between row locking and value locking demonstrates that that's close to or actually impossible. Plus, it would really suck for performance if that SLRU had to actually swap value locks to and from disk, which becomes a real possibility if they're held for a long time (mere index scans aren't going to keep the cache warm, so the worst-case latency for an innocent inserter into some narrow range of values might be really bad).
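To make the "considerate" behaviour in point 1 concrete, here's a toy simulation of the release-everything-on-conflict idea - this is not PostgreSQL code, and every name in it (ValueLockTable, try_lock_all, the index names) is invented for illustration. The point it shows is that a session never sits on a value lock in one unique index while blocked on another:

```python
class ValueLockTable:
    """Toy table of which (index, value) pairs are value-locked, and by whom."""
    def __init__(self):
        self.locks = {}          # (index_name, value) -> session id

    def try_lock_all(self, session, keys):
        """Try to value-lock every (index, value) pair for this session.
        On any conflict, release everything taken so far and report the
        blocking session - the caller waits on it, then retries from scratch."""
        taken = []
        for key in keys:
            holder = self.locks.get(key)
            if holder is not None and holder != session:
                for k in taken:          # be "considerate": hold nothing while waiting
                    del self.locks[k]
                return holder
            self.locks[key] = session
            taken.append(key)
        return None                      # all value locks held (briefly)

    def release_all(self, session):
        for k in [k for k, s in self.locks.items() if s == session]:
            del self.locks[k]

table = ValueLockTable()
# Session 1 holds a value lock on one of two unique indexes.
assert table.try_lock_all(1, [("idx_email", "a@b.c")]) is None
# Session 2 needs both; it conflicts on idx_email, so it ends up holding
# *neither* lock while it waits - idx_pkey stays free for everyone else.
assert table.try_lock_all(2, [("idx_pkey", 7), ("idx_email", "a@b.c")]) == 1
assert ("idx_pkey", 7) not in table.locks
# Once session 1 is done, session 2's retry acquires both.
table.release_all(1)
assert table.try_lock_all(2, [("idx_pkey", 7), ("idx_email", "a@b.c")]) is None
```

The design choice the toy encodes is exactly the trade-off above: waiters hold nothing while blocked, which avoids cross-index deadlocks at the cost of giving up any arrival-order guarantee.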
Speaking of ease of implementation: how do you guarantee that the value lock waiters get the right to insert in serial order (if that's something that you value, which I don't at RC)? You have to fix the same "race" that already exists between acquiring a ShareLock on an xid and blocking on value lock acquisition. The only possible remedy I can see is to integrate heap and btree locking in a much more intimate, and therefore sketchy, way. You need something like LockTuple() to arbitrate ordering - but what, and how, and where, and with how many buffer locks held?

Most importantly:

3) As I've already mentioned, heavy value locks (like promise tuples or similar schemes, as opposed to actual heavyweight locks) concomitantly increase the window in which a conflict can be created for row locking. Most transactions last but an instant, so the fact that another session may already be blocked locking the would-be-duplicate row perhaps isn't that relevant; but doing all that clean-up is going to give other sessions increased opportunity to lock the row themselves, and ruin our day.

These points are about long-held value locks, though, not the cost of making their acquisition relatively expensive or inexpensive (but still more or less instantaneous) - so why mention them at all? Well, since we're blocking everyone else with our value locks, they get to have a bad day too. All the while, they're perhaps virtually predestined to find some row to lock, but the window for something to happen to that row that *unnecessarily* conflicts with the eventual row locking (as for example when an innocuous HOT update occurs) gets larger and larger as they wait longer and longer on value locks. Loosely speaking, things get multiplicatively worse - total gridlock is probably possible, with the deadlock detector only breaking the gridlock up a bit if we get lucky (unless, maybe, value locks last until transaction end... which I think is nigh on impossible anyway).
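The ordering race can be sketched as a toy, too - again not PostgreSQL code, all names invented. When the got-there-first xact ends and its xid lock is released, every waiter goes back and rechecks the unique index; whoever happens to recheck first inserts, and arrival order confers nothing:

```python
def winner_after_commit(wakeup_order, index, value):
    """Toy model of the wait-on-xid-then-recheck loop: the blocking xact
    ends, its entry goes away, and waiters recheck in whatever order the
    scheduler happens to wake them."""
    del index[value]                 # got-there-first xact committed/aborted
    for session in wakeup_order:     # xid ShareLock waiters wake in *some* order
        if value not in index:
            index[value] = session   # first re-checker gets to insert...
            return session
        # ...any later re-checker would see a fresh conflict and wait again

index = {"k": "xact-1"}              # xact-1 inserted "k" first
# Sessions s2, s3, s4 blocked in that arrival order, but the scheduler
# happens to let s4 recheck first - so s4 wins, not s2:
assert winner_after_commit(["s4", "s2", "s3"], index, "k") == "s4"
```

That's the race in miniature: without something like LockTuple() arbitrating between the xid wait and the value lock re-acquisition, there is simply no serial ordering to speak of.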
The bottom line is that long-lasting value locks - value locks that last the duration of a transaction and are acquired serially, while guaranteeing that the inserter that gets all the value locks it needs itself gets to insert - have the potential to cascade horribly, in ways that I can only really begin to reason about. That is, they do *if* we presume that they have the interactions with row locking that I believe they do, a belief that no one has taken issue with yet. Even *considering* this is largely academic, though, because without some kind of miracle guaranteeing serial ordering - a miracle that doesn't allow for serialization failures, and that also doesn't seriously slow down, for example, updates (by making them care about value locks *before* they do anything, or in the case of HOT updates *at all*) - all of this is _impossible_.

So, I say let's just do the actually-serial-ordering for value lock acquisition with serialization failures where we're > read committed. I've seriously considered what it would take to do it any other way, so that things would work how you and Andres expect for read committed, and it makes my head hurt; apart from seeming unnecessary to me, it also seems completely hopeless.

Am I being too negative here? Well, I guess that's possible. The fact is that it's really hard to judge, because all of this is really hard to reason about. That's what I really don't like about it.

>> I respectfully
>> suggest that that exact definition of upsert isn't a useful one,
>> because other snapshot isolation/MVCC systems operating within the
>> same constraints must have the same issues, and yet they manage to
>> implement something that could be called upsert that people seem happy
>> with.
>
> Yeah. I wonder how they do that.

My guess is that they have some fancy snapshot type that is used by the equivalent of ModifyTable subplans - one that is appropriately paranoid about the Halloween problem and so on.
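For clarity, the serialization-failure proposal above amounts to this trivial dispatch (a toy sketch, all names invented; SQLSTATE 40001 is the real PostgreSQL code for serialization_failure): above read committed we error out rather than pretend to order the waiters, and at read committed we just wait and retry.

```python
READ_COMMITTED, REPEATABLE_READ = "read committed", "repeatable read"

class SerializationFailure(Exception):
    """Stand-in for PostgreSQL's serialization_failure (SQLSTATE 40001)."""

def on_value_lock_conflict(isolation_level):
    """Proposed behaviour when a value lock conflict is detected."""
    if isolation_level != READ_COMMITTED:
        # No serial-ordering miracle: report the conflict honestly.
        raise SerializationFailure("could not serialize access")
    return "wait, then retry from scratch"

assert on_value_lock_conflict(READ_COMMITTED) == "wait, then retry from scratch"
try:
    on_value_lock_conflict(REPEATABLE_READ)
except SerializationFailure:
    pass                             # caller is expected to retry the xact
```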
How that might actually work is far from clear, but it's a question that I have begun to consider. As I said, a concern is that it would be in tension with the generalized, composable syntax, where we don't explicitly have a "special update".

I'd really like to hear other opinions, though.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers