On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Andres is being very polite here, but the reality is that this
> approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?

> You can't hold buffer locks for a long period of time across complex
> operations. Full stop. It's a violation of the rules that are
> clearly documented in src/backend/storage/buffer/README, which have
> been in place for a very long time, and this patch is nowhere near
> important enough to warrant a revision of those rules.

The importance of this patch is a value judgement. Our users have
been screaming for this for over ten years, so to my mind it has a
fairly high importance. Also, every other database system of every
stripe worth mentioning has something approximately equivalent to
this, including ones with much less functionality generally. The fact
that we don't is a really unfortunate omission.

As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've done a
better job of limiting the amount of work that heap_insert() can do
while the locks are held. I expect to soon produce a revision where
toasting doesn't happen with the locks held. Actually, I've already
written the code; I just need to do some testing.

> We are not going to risk breaking every bit of code anywhere in the
> backend or in third-party code that takes a buffer lock. You are
> never going to convince me, or Tom, that the benefit of doing that
> is worth the risk; in fact, I have a hard time believing that you'll
> find ANY committer who thinks this approach is worth considering.

I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.

> Even if you get the code to run without apparent deadlocks, that
> doesn't mean there aren't any;

Of course it doesn't. Who said otherwise?

> Our buffer locking regimen suffers from painful complexity and
> serious maintenance difficulties as is.

That's true to a point, but it has more to do with things like how
VACUUM interacts with hio.c. Things like this:

	/*
	 * Release the file-extension lock; it's now OK for someone else to
	 * extend the relation some more.  Note that we cannot release this
	 * lock before we have buffer lock on the new page, or we risk a race
	 * condition against vacuumlazy.c --- see comments therein.
	 */
	if (needLock)
		UnlockRelationForExtension(relation, ExclusiveLock);

The btree code is different, though: it implements a well-defined
interface, with much clearer separation of concerns. As I've said
already, with trivial exceptions (think contrib), no external code
considers itself to have license to obtain locks of any sort on btree
buffers. No external code of ours - without exception - does anything
with multiple locks, or exclusive locks, on btree buffers.

I'll remind you that I'm only holding shared locks when control is
outside of the btree code (see the sketch below). Even within the
btree code, the number of access method functions that could conflict
with what I do here (those that acquire exclusive locks) is very
small once you exclude things that only exclusive-lock the meta page
(and there are also very few of those). So the surface area is quite
small.

I'm not denying that there is a cost, or that I've expanded things in
a direction I'd prefer not to have. I just think that it may well be
worth it, particularly when you consider the alternatives - this may
well be the least worst thing.
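To be concrete about that pattern, here is an illustrative sketch of
mine - not code from the patch, and value_lock_leaf() is a made-up
name - using the existing nbtree entry points as they stand today:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Illustrative sketch only, not the patch: take a shared "value lock"
 * on the unique index leaf page that the key would go on.
 * _bt_search() descends the tree and returns that leaf page with a
 * BT_READ (shared) buffer lock held; we don't need the descent stack.
 */
static Buffer
value_lock_leaf(Relation rel, int keysz, ScanKey scankey)
{
	Buffer		buf;
	BTStack		stack;

	stack = _bt_search(rel, keysz, scankey, false, &buf, BT_READ);
	_bt_freestack(stack);

	/*
	 * Control now leaves the btree code holding only this shared
	 * lock; the caller inserts the heap tuple, then re-enters the
	 * btree code to insert the index tuple (or releases the lock and
	 * restarts on conflict).
	 */
	return buf;
}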
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a
long time, no matter how they're implemented. That could take much
longer than just releasing a shared buffer lock, since for each
unique index the promise tuple must be re-found from scratch. There
are huge issues with additional complexity and bloat. Oh, and now
your lightweight locks aren't so lightweight any more.

If the value locks were made interruptible through some method, such
as the promise tuples approach, would that really make deadlocking
acceptable? Sure, your system didn't seize up - but on the other
hand, the user got a deadlock error through no fault of their own.
The former may be worse, but the latter is also inexcusable. In
general, the best solution is just to not have deadlock hazards. I
wouldn't be surprised if reasoning about deadlocking was harder with
that alternative approach to value locking, not easier.

> Moreover, we've already got performance and scalability problems
> that are attributable to every backend in the system piling up
> waiting on a single lwlock, or a group of simultaneously-held
> lwlocks. Dramatically broadening the scope of where lwlocks are
> used and for how long they're held is going to make that a whole
> lot worse.

You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.

> A further problem is that a backend which holds even one lwlock
> can't be interrupted. We've had this argument before and it seems
> that you don't think that non-interruptibility is a problem, but it
> is project policy to allow for timely interrupts in all parts of
> the backend and we're not going to change that policy for this
> patch.

I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering
trade-off. Of course I value interruptibility.

If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusively locked.
On a busy system, some backend is very frequently going to experience
a non-interruptible wait for however long it takes to write and flush
perhaps a whole segment. All other flushing backends are stuck in
non-interruptible waits, waiting for that backend to finish. I think
that the group commit stuff might have regressed worst-case
interruptibility for flushers by quite a bit; should we never have
committed that, or do you agree with my view that it's worth it?

In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked, and only those inserting into a narrow
range of values around the btree leaf page. Much of the work that
even those writers need to do will be unimpeded anyway; they'll just
block on attempting to acquire an exclusive lock on the first btree
leaf page that the value they're inserting could be on.
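For context on how long such a wait can already be, here is the rough
shape of heap_insert() today - a heavily abridged, from-memory sketch
of the current code, not a verbatim copy, with WAL record
construction, visibility map handling, and error paths elided:

#include "postgres.h"
#include "access/heapam.h"
#include "access/hio.h"
#include "storage/bufmgr.h"

/*
 * Abridged sketch of heap_insert() (heapam.c) - not verbatim.  The
 * target buffer is pinned and exclusive-locked almost immediately,
 * and only unlocked at the very end, so anyone waiting on that
 * buffer already waits non-interruptibly through nearly all of this.
 */
Oid
heap_insert(Relation relation, HeapTuple tup, CommandId cid,
			int options, BulkInsertState bistate)
{
	HeapTuple	heaptup;
	Buffer		buffer;
	Buffer		vmbuffer = InvalidBuffer;

	/* header setup, and possibly TOASTing, before any lock is taken */
	heaptup = heap_prepare_insert(relation, tup,
								  GetCurrentTransactionId(), cid,
								  options);

	/* returns the target buffer pinned and exclusive-locked */
	buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
									   InvalidBuffer, options, bistate,
									   &vmbuffer, NULL);

	START_CRIT_SECTION();
	RelationPutHeapTuple(relation, buffer, heaptup);
	MarkBufferDirty(buffer);
	/* ... XLogInsert() of the XLOG_HEAP_INSERT record ... */
	END_CRIT_SECTION();

	UnlockReleaseBuffer(buffer);	/* lock finally released here */
	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);

	/* ... cache invalidation, pgstat counting, free heaptup ... */
	return HeapTupleGetOid(heaptup);
}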
And the additional non-interruptible wait of those inserters won't be
terribly much longer than the wait of the backend whose heap tuple
insertion takes a long time anyway - that backend already has to do
close to 100% of that work under a non-interruptible wait today (once
we eliminate heap_prepare_insert() and toasting from the locked
window). As the sketch above shows, the UnlockReleaseBuffer() call
comes right at the end of heap_insert(), and the buffer is pinned and
locked very close to the start.

-- 
Peter Geoghegan