On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> I mean, if we do the promise tuple thing, and there are multiple
>> unique indexes, what happens when an inserter needs to block pending
>> the outcome of another transaction? They had better go clean up the
>> promise tuples from the other unique indexes that they're trying to
>> insert into, because they cannot afford to hold value locks for a
>> long time, no matter how they're implemented.
>
> As Andres already pointed out, this is not correct.
While it wouldn't be incorrect to skip that clean-up, doing it certainly would be useful for preventing deadlocks and unnecessary contention. In a world where people expect either an insert or an update, we ought to try and reduce contention across multiple unique indexes. I can understand why that doesn't matter today, though: if you're inserting duplicates without caring whether or not there will be conflicts, that's a kind of abuse, and not worth optimizing for - it seems probable that most such transactions will commit. However, it seems much less probable that most upserters will insert. People may well wish to upsert all the time where an insert is hardly ever necessary, which is one reason why I have doubts about other proposals.

Note that today there is no guarantee that the original waiter for a duplicate-inserting xact to complete will be the first one to get a second chance, so I think it's hard to question this on correctness grounds. Even if waiters are released in FIFO order, there is no reason to assume that the first waiter will win the race with a second. Most obviously, the second waiter may never even get the chance to block on the same xid at all (so it's not really a waiter at all) and still be able to insert, if the blocking xact aborts after the second "waiter" starts its descent but before it checks uniqueness. All this, even though the second "waiter" arrived maybe minutes after the first.

What I'm talking about here is really unlikely to result in lock starvation, because the original waiter typically gets to observe the other waiter go through, and that's reason enough to give up entirely. Now, it is kind of weird that the original waiter will still end up blocking on the xid that caused it to wait in the first instance, so there should be more thought put into that - like remembering the xid and only waiting on it on a retry, or some similar scheme. Maybe you could contrive a scenario where this causes lock starvation, but I suspect you could do the same thing for the present btree insertion code.
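Just to make that last idea concrete, I mean something vaguely like the following. This is only a hypothetical, untested sketch of the "remember the xid" scheme, not something the patch does today, and the helper routine named here is made up:

    /*
     * Sketch only (needs access/transam.h and storage/lmgr.h): don't block
     * on the first conflicting xid we see; remember it and restart the
     * insertion attempt.  Only do a hard wait if a retry runs into that
     * same xid again.
     */
    TransactionId   prev_xwait = InvalidTransactionId;

    for (;;)
    {
        /* hypothetical helper: returns conflicting xid, or InvalidTransactionId */
        TransactionId   xwait = value_lock_check_unique(rel, slot);

        if (!TransactionIdIsValid(xwait))
            break;              /* no conflict - proceed with insertion */

        if (TransactionIdEquals(xwait, prev_xwait))
        {
            /* second run-in with the same xid: now actually wait it out */
            XactLockTableWait(xwait);
            prev_xwait = InvalidTransactionId;
        }
        else
        {
            /* first conflict with this xid: remember it and retry at once */
            prev_xwait = xwait;
        }
        /* release value locks and restart the descent from scratch */
    }

Again, that's only meant to illustrate the shape of the idea; the details would need real thought.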
> Just to add to what he said, we already have long-lasting value locks
> in the form of SIREAD locks. SIREAD locks can exist at different
> levels of granularity, but one of those levels is index-page-level
> granularity, where they have the function of guarding against
> concurrent insertions of values that would fall within that page,
> which just so happens to be the same thing you want to do here. The
> difference between those locks and what you're proposing here is that
> they are implemented differently. That is why those were acceptable
> and this is not.

As the implementer of this patch, I'm obligated to put some checks in unique index insertion that everyone has to care about. There is no way around that. Complexity issues aside, I think that an argument could be made for this approach *reducing* the impact on concurrency relative to other approaches, if there aren't too many unique indexes to deal with, which is the case the vast majority of the time. I mean, those other approaches necessitate doing so much more with *exclusive* locks held: inserting, maybe doing a page split, WAL-logging, all with the lock, and then either updating in place or killing the promise tuple, and WAL-logging that, with an exclusive lock held the second time around. Plus searching for everything twice.

I think that frequently killing all of those broken-promise tuples could have deleterious effects on concurrency and/or index bloat of the kind only remedied by REINDEX. Do you update the freespace map too? More exclusive locks! Or if you leave it up to VACUUM (and just set the xid to InvalidXid, which is still extra work), autovacuum has to care about a new *class* of bloat - index-only bloat. Plus lots of dead duplicates are bad for performance in btrees generally.

> As here, that is way more expensive than simply grabbing and holding
> a share-lock on the page. But we get a number of important benefits
> out of it. The backend remains interruptible while the tuple is
> locked, the protocol for granting locks is FIFO to prevent
> starvation, we don't suppress page eviction while the lock is held,
> we can simultaneously lock arbitrarily large numbers of tuples, and
> deadlocks are detected and handled cleanly. If those requirements
> were negotiable, we would surely have negotiated them away already,
> because the performance benefits would be immense.

False equivalence. We only need to lock as many unique index *values* (not tuples) as are proposed for insertion per slot (which can be reasonably bound), and only for an instant. Clearly it would be totally unacceptable if tuple-level locks made backends uninterruptible indefinitely. Of course, this is nothing like that.

>> If the value locks were made interruptible through some method, such
>> as the promise tuples approach, does that really make deadlocking
>> acceptable?
>
> Yes. It's not possible to prevent all deadlocks. It IS possible to
> make sure that they are properly detected and that precisely one of
> the transactions involved is rolled back to resolve the deadlock.

You seem to have misunderstood me here, or perhaps I was unclear. I'm referring to deadlocks that cannot really be predicted or analyzed by the user at all - see my comments below on insertion order.

>> I don't think non-interruptibility is a problem? Really, do you
>> think that this kind of inflammatory rhetoric helps anybody? I said
>> nothing of the sort. I recall saying something about an engineering
>> trade-off. Of course I value interruptibility.
>
> I don't see what's inflammatory about that statement.

The fact that you simply stated, without qualification, that I don't think non-interruptibility is a problem, obviously.

> Interruptibility is not a nice-to-have that we can trade away from
> time to time; it's essential and non-negotiable.

I seem to recall you saying something about the Linux kernel and their attitude to interruptibility. Yes, interruptibility is not just a nice-to-have; it is essential. However, without dismissing your other concerns, I have yet to hear a convincing argument as to why anything I've done here is going to make any difference to interruptibility that would be appreciable to any human. So far it's been a slippery-slope argument that could equally well be used against some facet of almost any substantial patch ever proposed. I just don't think that regressing interruptibility marginally is *necessarily* sufficient justification for rejecting an approach outright. FYI, *that's* how I value interruptibility generally.

>> In contrast, what I've proposed here is in general quite unlikely to
>> result in any I/O for the duration of the time the locks are held.
>> Only writers will be blocked. And only those inserting into a narrow
>> range of values around the btree leaf page. Much of the work that
>> even those writers need to do will be unimpeded anyway; they'll just
>> block on attempting to acquire an exclusive lock on the first btree
>> leaf page that the value they're inserting could be on.
>
> Sure, but you're talking about broadening the problem from the guy
> performing the insert to everybody who might be trying to do an
> insert that hits one of the same unique-index pages.

In general, that isn't much worse than just blocking the value directly. The number of additional values that could also be blocked is quite low, and the chances of it actually mattering that those values are locked in the still-small window in which the buffer locks are held are generally fairly low, particularly on larger tables where there is naturally a large number of possible distinct values. I will concede, however, that the impact on inserters that want to insert a non-locked value that belongs on the locked page or its child might be worse, but it's already a problem that inserted index tuples can all end up on the same page, if not to the same extent.

> Instead of holding one buffer lock, the guy performing the insert is
> now holding as many buffer locks as there are indexes. That's a
> non-trivial issue.

Actually, as many buffer locks as there are *unique* indexes. It might be a non-trivial issue, but this whole problem is decidedly non-trivial, as I'm sure we can all agree.

> For that matter, if the table has more than MAX_SIMUL_LWLOCKS
> indexes, you'll error out. In fact, if you get the number of indexes
> exactly right, you'll exceed MAX_SIMUL_LWLOCKS in
> visibilitymap_clear() and panic the whole system.

Oh, come on. We can obviously engineer a solution to that problem. I don't think I've ever seen a table with close to 100 *unique* indexes; 4 or 5 is a very high number. If we just raised an error when someone tried to do this with more than 10 unique indexes, I would guess that we'd get exactly zero complaints about it.
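To be concrete, that check could be as trivial as something like this (hypothetical, untested sketch only; the constant, variable names and error wording are all made up, and the actual cap and errcode would need more thought):

    #define MAX_VALUE_LOCKED_UNIQUE_INDEXES     10      /* hypothetical cap */

    /*
     * Sketch: refuse to even start speculative insertion on a table with an
     * unreasonable number of unique indexes, long before we could get
     * anywhere near MAX_SIMUL_LWLOCKS.
     */
    if (num_unique_indexes > MAX_VALUE_LOCKED_UNIQUE_INDEXES)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot lock values in table \"%s\"",
                        RelationGetRelationName(rel)),
                 errdetail("Tables used in this way may have at most %d unique indexes.",
                           MAX_VALUE_LOCKED_UNIQUE_INDEXES)));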
> Oh, and if different backends load the index list in different
> orders, because say the system catalog gets vacuumed between their
> respective relcache loads, then they may try to lock the indexes in
> different orders and cause an undetected deadlock.

Undetected deadlock is really not much worse than detected deadlock here. Either way, it's a bug. And it's something that any kind of implementation will need to account for. It's not okay to *unpredictably* deadlock, in a way that the user has no control over. Today, someone can do an analysis of their application and eliminate deadlocks if they need to. That might not be terribly practical much of the time, but it can be done. It certainly is practical to do it in a localized way. I wouldn't like to compromise that. So yes, you're right that I need to control for this sort of thing better than in the extant patch, and in fact this was discussed fairly early on. But it's an inherent problem.
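The obvious way to control for it is to stop trusting whatever order the relcache happens to return the index list in, and impose a canonical ordering - say ascending OID - before any value locks are taken. Roughly (again, only an untested sketch, with made-up variable names):

    /*
     * Sketch: sort the table's unique indexes by OID so that every backend
     * acquires its value locks in the same order, and so no two backends
     * can ever lock the same pair of indexes in opposite orders.
     */
    static int
    unique_index_oid_cmp(const void *a, const void *b)
    {
        Relation    index_a = *(Relation *) a;
        Relation    index_b = *(Relation *) b;

        if (RelationGetRelid(index_a) < RelationGetRelid(index_b))
            return -1;
        if (RelationGetRelid(index_a) > RelationGetRelid(index_b))
            return 1;
        return 0;
    }

    /* before taking any value locks (unique_indexes is a made-up Relation array) */
    qsort(unique_indexes, num_unique_indexes, sizeof(Relation), unique_index_oid_cmp);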
> And, drifting a bit further off-topic, even to get as far as you
> have, you've added overhead to every lwlock acquisition and release,
> even for people who never use this functionality.

If you look at the code, you'll see that I've made very modest modifications to LWLockRelease only. I would be extremely surprised if the overhead was anything other than noise - completely impossible to detect through any conventional benchmark. These are the same kind of very modest changes made for LWLockAcquireOrWait(), and you said nothing about that at the time - despite the fact that you now appear to think that that whole effort was largely a waste of time.

> That's true but somewhat misleading. Textually most of the function
> holds the buffer lock, but heap_prepare_insert(),
> CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
> XLogWrite() are the parts that do substantial amounts of computation,
> and only the last of those happens while holding the buffer lock.

I've already written modifications so that I don't have to do heap_prepare_insert() with the locks held. There is no reason to call CheckForSerializableConflictIn() with the additional locks held either - after all, "For a heap insert, we only need to check for table-level SSI locks". As for RelationGetBufferForTuple(), yes, the majority of the time it will have to do very little without acquiring an exclusive lock, because it's going to get that from the last place a heap tuple was inserted from.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers