Re: [PATCH] LWLock self-deadlock detection
On Wed, 30 Dec 2020 at 10:11, Andres Freund wrote:
> Hi,
>
> On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> > On 26/11/2020 04:50, Tom Lane wrote:
> > > Craig Ringer writes:
> > > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat
> > > > <ashutosh.bapat@gmail.com> wrote:
> > > > > I'd prefer to make the lock self deadlock check run for production
> > > > > builds, not just cassert builds.
> > >
> > > I'd like to register a strong objection to spending any cycles whatsoever
> > > on this.  If you're using LWLocks in a way that could deadlock, you're
> > > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > > and not have all the overhead that the heavyweight lock mechanism has.
> > > Building in deadlock checks and such is exactly the wrong direction.
> > > If you think you need that, you should be using a heavyweight lock.
> > >
> > > Maybe there's some case for a cassert-only check of this sort, but running
> > > it in production is right out.
> > >
> > > I'd also opine that checking for self-deadlock, but not any more general
> > > deadlock, seems pretty pointless.  Careless coding is far more likely
> > > to create opportunities for the latter.  (Thus, I have little use for
> > > the existing assertions of this sort, either.)
> >
> > I've made the mistake of forgetting to release an LWLock many times,
> > leading to self-deadlock. And I just reviewed a patch that did that this
> > week [1]. You usually find that mistake very quickly when you start
> > testing though, I don't think I've seen it happen in production.
>
> I think something roughly along Craig's original patch, basically adding
> assert checks against holding a lock already, makes sense. Compared to
> the other costs of running an assert build this isn't a huge cost, and
> it's helpful.
>
> I entirely concur that doing this outside of assertion builds is a
> seriously bad idea.

Yeah, given it only targets developer error that's sensible.
Re: [PATCH] LWLock self-deadlock detection
Hi,

On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> On 26/11/2020 04:50, Tom Lane wrote:
> > Craig Ringer writes:
> > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
> > > > I'd prefer to make the lock self deadlock check run for production
> > > > builds, not just cassert builds.
> >
> > I'd like to register a strong objection to spending any cycles whatsoever
> > on this.  If you're using LWLocks in a way that could deadlock, you're
> > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > and not have all the overhead that the heavyweight lock mechanism has.
> > Building in deadlock checks and such is exactly the wrong direction.
> > If you think you need that, you should be using a heavyweight lock.
> >
> > Maybe there's some case for a cassert-only check of this sort, but running
> > it in production is right out.
> >
> > I'd also opine that checking for self-deadlock, but not any more general
> > deadlock, seems pretty pointless.  Careless coding is far more likely
> > to create opportunities for the latter.  (Thus, I have little use for
> > the existing assertions of this sort, either.)
>
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

I think something roughly along Craig's original patch, basically adding
assert checks against holding a lock already, makes sense. Compared to
the other costs of running an assert build this isn't a huge cost, and
it's helpful.

I entirely concur that doing this outside of assertion builds is a
seriously bad idea.

Greetings,

Andres Freund
Re: [PATCH] LWLock self-deadlock detection
On Fri, Nov 27, 2020 at 11:08:49AM -0800, Peter Geoghegan wrote:
> On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas wrote:
>> I've made the mistake of forgetting to release an LWLock many times,
>> leading to self-deadlock. And I just reviewed a patch that did that this
>> week [1]. You usually find that mistake very quickly when you start
>> testing though, I don't think I've seen it happen in production.
>
> +1. The fact that you don't get deadlock detection with LWLocks is a
> cornerstone of the whole design. This assumption is common to all
> database systems (LWLocks are generally called latches in the database
> research community, but the concept is exactly the same).

+1.

>> So yeah, I agree it's not worth spending cycles on this. Maybe it would
>> be worth it if it's really simple to check, and you only do it after
>> waiting X seconds. (I haven't looked at this patch at all)
>
> -1 for that, unless it's only for debug builds. Even if it is
> practically free, it still means committing to the wrong priorities.
> Plus the performance validation would be very arduous as a practical
> matter.

Looking at the git history, we have a much larger history of bugs that
relate to race conditions when it comes to LWLocks. I am not really
convinced that it is worth spending time on this even for debug builds,
FWIW.
--
Michael
Re: [PATCH] LWLock self-deadlock detection
On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas wrote:
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

+1. The fact that you don't get deadlock detection with LWLocks is a
cornerstone of the whole design. This assumption is common to all
database systems (LWLocks are generally called latches in the database
research community, but the concept is exactly the same).

> So yeah, I agree it's not worth spending cycles on this. Maybe it would
> be worth it if it's really simple to check, and you only do it after
> waiting X seconds. (I haven't looked at this patch at all)

-1 for that, unless it's only for debug builds. Even if it is
practically free, it still means committing to the wrong priorities.
Plus the performance validation would be very arduous as a practical
matter.

We've made LWLocks much more scalable in the last 10 years. Why
shouldn't we expect to do the same again in the next 10 years? I
wouldn't bet against it. I might even do the opposite (predict further
improvements to LWLocks).
--
Peter Geoghegan
Re: [PATCH] LWLock self-deadlock detection
On 26/11/2020 04:50, Tom Lane wrote:
> Craig Ringer writes:
>> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
>>> I'd prefer to make the lock self deadlock check run for production
>>> builds, not just cassert builds.
>
> I'd like to register a strong objection to spending any cycles whatsoever
> on this.  If you're using LWLocks in a way that could deadlock, you're
> doing it wrong.  The entire point of that mechanism is to be Light Weight
> and not have all the overhead that the heavyweight lock mechanism has.
> Building in deadlock checks and such is exactly the wrong direction.
> If you think you need that, you should be using a heavyweight lock.
>
> Maybe there's some case for a cassert-only check of this sort, but running
> it in production is right out.
>
> I'd also opine that checking for self-deadlock, but not any more general
> deadlock, seems pretty pointless.  Careless coding is far more likely
> to create opportunities for the latter.  (Thus, I have little use for
> the existing assertions of this sort, either.)

I've made the mistake of forgetting to release an LWLock many times,
leading to self-deadlock. And I just reviewed a patch that did that this
week [1]. You usually find that mistake very quickly when you start
testing though, I don't think I've seen it happen in production.

So yeah, I agree it's not worth spending cycles on this. Maybe it would
be worth it if it's really simple to check, and you only do it after
waiting X seconds. (I haven't looked at this patch at all)

[1] https://www.postgresql.org/message-id/8e1cda9b-1cb9-a634-d383-f757bf67b...@iki.fi

- Heikki
Re: [PATCH] LWLock self-deadlock detection
Craig Ringer writes:
> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds.

I'd like to register a strong objection to spending any cycles whatsoever
on this.  If you're using LWLocks in a way that could deadlock, you're
doing it wrong.  The entire point of that mechanism is to be Light Weight
and not have all the overhead that the heavyweight lock mechanism has.
Building in deadlock checks and such is exactly the wrong direction.
If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running
it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general
deadlock, seems pretty pointless.  Careless coding is far more likely
to create opportunities for the latter.  (Thus, I have little use for
the existing assertions of this sort, either.)

regards, tom lane
Re: [PATCH] LWLock self-deadlock detection
On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
> On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer wrote:
>>> I am also seeing a pattern
>>> Assert(!LWLockHeldByMe());
>>> LWLockAcquire()
>>>
>>> at some places. Should we change LWLockAcquire to do
>>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>>
>> I'm inclined not to, at least not without benchmarking it, because
>> that'd do the check before we attempt the fast-path. cassert builds are
>> still supposed to perform decently and be suitable for day to day
>> development and I'd rather not risk a slowdown.
>>
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds. It'd print a normal LOG (with backtrace
>> if supported) then Assert(). So on an assert build we'd get a crash and
>> core, and on a non-assert build we'd carry on and self-deadlock anyway.
>>
>> That's probably the safest thing to do. We can't expect to safely ERROR
>> out of the middle of an LWLockAcquire(), not without introducing a new
>> and really hard to test code path that'll also be surprising to callers.
>> We probably don't want to PANIC the whole server for non-assert builds
>> since it might enter a panic-loop. So it's probably better to
>> self-deadlock. We could HINT that a -m immediate shutdown will be
>> necessary to recover though.
>
> I agree that it will be helpful to print something in the logs
> indicating the reason for this hang in case the hang happens in a
> production build. In your patch you have used ereport(PANIC, ) which
> may simply be replaced by an Assert() in an assert enabled build.

I'd either select between PANIC (assert build) and LOG (non assert
build), or always LOG then put an Assert() afterwards. It's strongly
desirable to get the log output before any crash because it's a pain to
get the LWLock info from a core.

> We already have Assert(!LWLockHeldByMe()) so that should be safe.
> It will be good to have -m immediate hint in LOG message. But it might
> just be better to kill -9 that process to get rid of it. That will
> cause the server to restart and not just shutdown.

Sure. Won't work on Windows though. And I am hesitant about telling
users to "kill -9" anything. pg_ctl -m immediate restart then.
Re: [PATCH] LWLock self-deadlock detection
On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer wrote:
>> I am also seeing a pattern
>> Assert(!LWLockHeldByMe());
>> LWLockAcquire()
>>
>> at some places. Should we change LWLockAcquire to do
>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>
> I'm inclined not to, at least not without benchmarking it, because that'd
> do the check before we attempt the fast-path. cassert builds are still
> supposed to perform decently and be suitable for day to day development
> and I'd rather not risk a slowdown.
>
> I'd prefer to make the lock self deadlock check run for production builds,
> not just cassert builds. It'd print a normal LOG (with backtrace if
> supported) then Assert(). So on an assert build we'd get a crash and core,
> and on a non-assert build we'd carry on and self-deadlock anyway.
>
> That's probably the safest thing to do. We can't expect to safely ERROR
> out of the middle of an LWLockAcquire(), not without introducing a new and
> really hard to test code path that'll also be surprising to callers. We
> probably don't want to PANIC the whole server for non-assert builds since
> it might enter a panic-loop. So it's probably better to self-deadlock. We
> could HINT that a -m immediate shutdown will be necessary to recover
> though.

I agree that it will be helpful to print something in the logs
indicating the reason for this hang in case the hang happens in a
production build. In your patch you have used ereport(PANIC, ) which
may simply be replaced by an Assert() in an assert enabled build. We
already have Assert(!LWLockHeldByMe()) so that should be safe. It will
be good to have -m immediate hint in LOG message. But it might just be
better to kill -9 that process to get rid of it. That will cause the
server to restart and not just shutdown.
--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] LWLock self-deadlock detection
On Tue, Nov 24, 2020 at 10:11 PM Ashutosh Bapat
<ashutosh.bapat@gmail.com> wrote:
> This looks useful. LWLockCheckSelfDeadlock() could use LWLockHeldByMe
> variant instead of copying that code with possibly a change in that
> function to return the required information.

Yes, possibly so. I was reluctant to mess with the rest of the code too
much.

> I am also seeing a pattern
> Assert(!LWLockHeldByMe());
> LWLockAcquire()
>
> at some places. Should we change LWLockAcquire to do
> Assert(!LWLockHeldByMe()) always to detect such occurrences?

I'm inclined not to, at least not without benchmarking it, because
that'd do the check before we attempt the fast-path. cassert builds are
still supposed to perform decently and be suitable for day to day
development and I'd rather not risk a slowdown.

I'd prefer to make the lock self deadlock check run for production
builds, not just cassert builds. It'd print a normal LOG (with backtrace
if supported) then Assert(). So on an assert build we'd get a crash and
core, and on a non-assert build we'd carry on and self-deadlock anyway.

That's probably the safest thing to do. We can't expect to safely ERROR
out of the middle of an LWLockAcquire(), not without introducing a new
and really hard to test code path that'll also be surprising to callers.
We probably don't want to PANIC the whole server for non-assert builds
since it might enter a panic-loop. So it's probably better to
self-deadlock. We could HINT that a -m immediate shutdown will be
necessary to recover though.

I don't think it makes sense to introduce much complexity for a feature
that's mainly there to help developers and to catch corner-case bugs.

(FWIW a variant of this patch has been in 2ndQPostgres for some time. It
helped catch an extension bug just the other day.)
Re: [PATCH] LWLock self-deadlock detection
This looks useful. LWLockCheckSelfDeadlock() could use a LWLockHeldByMe
variant instead of copying that code, with possibly a change in that
function to return the required information.

I am also seeing a pattern
Assert(!LWLockHeldByMe*())
LWLockAcquire()

at some places. Should we change LWLockAcquire to do
Assert(!LWLockHeldByMe()) always to detect such occurrences? Enhance
that pattern to print the information that your patch prints?

It looks weird that we can detect a self deadlock but not handle it. But
handling it requires remembering multiple LWLocks held and then
releasing them that many times. That may not leave LWLocks LW anymore.

On Thu, Nov 19, 2020 at 4:02 PM Craig Ringer wrote:
>
> Hi all
>
> Here's a patch I wrote a while ago to detect and report when a
> LWLockAcquire() results in a simple self-deadlock due to the caller
> already holding the LWLock.
>
> To avoid affecting hot-path performance, it only fires the check on the
> first iteration through the retry loops in LWLockAcquire() and
> LWLockWaitForVar(), and just before we sleep, once the fast-path has
> been missed.
>
> I wrote an earlier version of this when I was chasing down some hairy
> issues with background workers deadlocking on some exit paths because
> ereport(ERROR) or elog(ERROR) calls fired when a LWLock was held would
> cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock
> when it tried to acquire the same lock.
>
> But it's an easy enough mistake to make and a seriously annoying one to
> track down, so I figured I'd post it for consideration. Maybe someone
> else will get some use out of it even if nobody likes the idea of
> merging it.
>
> As written the check runs only for --enable-cassert builds or when
> LOCK_DEBUG is defined.
--
Best Wishes,
Ashutosh Bapat
[PATCH] LWLock self-deadlock detection
Hi all

Here's a patch I wrote a while ago to detect and report when a
LWLockAcquire() results in a simple self-deadlock due to the caller
already holding the LWLock.

To avoid affecting hot-path performance, it only fires the check on the
first iteration through the retry loops in LWLockAcquire() and
LWLockWaitForVar(), and just before we sleep, once the fast-path has
been missed.

I wrote an earlier version of this when I was chasing down some hairy
issues with background workers deadlocking on some exit paths because
ereport(ERROR) or elog(ERROR) calls fired when a LWLock was held would
cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock
when it tried to acquire the same lock.

But it's an easy enough mistake to make and a seriously annoying one to
track down, so I figured I'd post it for consideration. Maybe someone
else will get some use out of it even if nobody likes the idea of
merging it.

As written the check runs only for --enable-cassert builds or when
LOCK_DEBUG is defined.

From 0ec0beb294f4c5ed35dbb35260f53b069563638f Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Thu, 19 Nov 2020 14:24:55 +0800
Subject: [PATCH] LWLock self-deadlock detection

On cassert builds, if we miss the LWLock acquire fast-path and have to
wait for a LWLock, check to make sure that the acquiring backend doesn't
already hold the lock.

This detects conditions that would otherwise cause a backend to deadlock
indefinitely with interrupts disabled. A backend in this state won't
respond to SIGTERM or SIGQUIT and will prevent postgres from finishing
shutdown.

While a LWLock self-deadlock is always the result of a programming
mistake, it's not always easy to foresee all possible locking states,
especially when dealing with exception handling and with exit callbacks.
---
 src/backend/storage/lmgr/lwlock.c | 87 +++
 1 file changed, 87 insertions(+)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..a2d6dec557 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -313,6 +313,85 @@ LOG_LWDEBUG(const char *where, LWLock *lock, const char *msg)
 #define LOG_LWDEBUG(a,b,c) ((void)0)
 #endif							/* LOCK_DEBUG */
 
+#if defined(USE_ASSERT_CHECKING) || defined(LOCK_DEBUG)
+#ifndef LWLOCK_SELF_DEADLOCK_DETECTION
+#define LWLOCK_SELF_DEADLOCK_DETECTION
+#endif
+#endif
+
+#ifdef LWLOCK_SELF_DEADLOCK_DETECTION
+static void
+LWLockReportSelfDeadlock(LWLock *lock, LWLockMode mode, int held_lockno)
+{
+	StringInfoData locklist;
+	int			i;
+
+	LOG_LWDEBUG("LWLockReportSelfDeadlock", lock, "acquire self-deadlock detected");
+
+	initStringInfo(&locklist);
+	for (i = 0; i < num_held_lwlocks; i++)
+	{
+		if (i > 0)
+			appendStringInfoString(&locklist, ", ");
+		appendStringInfo(&locklist, "%s (%p) excl=%d",
+						 T_NAME(held_lwlocks[i].lock),
+						 held_lwlocks[i].lock,
+						 held_lwlocks[i].mode == LW_EXCLUSIVE);
+	}
+
+	/*
+	 * It might be safe to bail out here on a non-cassert build but more care
+	 * is needed before anything like that is enabled. LWLockAcquire() doesn't
+	 * know if it's safe to `elog(FATAL)` out from the current callsite. So we
+	 * PANIC. This introduces some risk of buggy code causing a server
+	 * crash/restart cycle where it would've previously just deadlocked a
+	 * single backend, but that's part of why we only enable this for assert
+	 * builds.
+	 */
+	ereport(PANIC,
+			(errmsg_internal("self-deadlock detected on acquire of lwlock %s (%p) for mode %s: lock already held by this backend in mode %s",
+							 T_NAME(lock), lock,
+							 mode == LW_EXCLUSIVE
+							 ? "LW_EXCLUSIVE" : "LW_SHARED",
+							 held_lwlocks[held_lockno].mode == LW_EXCLUSIVE
+							 ? "LW_EXCLUSIVE" : "LW_SHARED"),
+			 errdetail("backend already holds %d lwlocks: %s",
+					   num_held_lwlocks, locklist.data)));
+}
+
+/*
+ * A bug in code that uses this particular LWLock could cause the lock-holder
+ * to be ourselves, in which case we'll self-deadlock forever as an unkillable
+ * process.
+ *
+ * Call this in any LWLock acquire retry loop once the fast-path acquire has
+ * failed.
+ *
+ * held_lockno should be initialized to 0 outside the wait loop so that this
+ * check only runs on the first iteration of any wait/retry loop.
+ */
+inline static void
+LWLockCheckSelfDeadlock(LWLock *lock, LWLockMode mode, int *const held_lockno)
+{
+	for (; *held_lockno < num_held_lwlocks; (*held_lockno)++)
+	{
+		if (unlikely(held_lwlocks[*held_lockno].lock == lock))
+			LWLockReportSelfDeadlock(lock, mode, *held_lockno);
+	}
+}
+#else
+/*
+ * Making LWLockCheckSelfDeadlock an empty static inline instead of a macro
+ * silences compiler whinging about held_lockno being unused while optimising
+ * away without fuss.
+ */
+inline static void
+LWLockCheckSelfDeadlock(LWLock *lock, LWLockM