Re: [PATCH] LWLock self-deadlock detection
On Wed, 30 Dec 2020 at 10:11, Andres Freund wrote:
> Hi,
>
> On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> > On 26/11/2020 04:50, Tom Lane wrote:
> > > Craig Ringer writes:
> > > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat
> > > > <ashutosh.bapat@gmail.com> wrote:
> > > > > I'd prefer to make the lock self deadlock check run for production
> > > > > builds, not just cassert builds.
> > >
> > > I'd like to register a strong objection to spending any cycles whatsoever
> > > on this.  If you're using LWLocks in a way that could deadlock, you're
> > > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > > and not have all the overhead that the heavyweight lock mechanism has.
> > > Building in deadlock checks and such is exactly the wrong direction.
> > > If you think you need that, you should be using a heavyweight lock.
> > >
> > > Maybe there's some case for a cassert-only check of this sort, but running
> > > it in production is right out.
> > >
> > > I'd also opine that checking for self-deadlock, but not any more general
> > > deadlock, seems pretty pointless.  Careless coding is far more likely
> > > to create opportunities for the latter.  (Thus, I have little use for
> > > the existing assertions of this sort, either.)
> >
> > I've made the mistake of forgetting to release an LWLock many times,
> > leading to self-deadlock. And I just reviewed a patch that did that this
> > week [1]. You usually find that mistake very quickly when you start
> > testing though, I don't think I've seen it happen in production.
>
> I think something roughly along Craig's original patch, basically adding
> assert checks against holding a lock already, makes sense. Compared to
> the other costs of running an assert build this isn't a huge cost, and
> it's helpful.
>
> I entirely concur that doing this outside of assertion builds is a
> seriously bad idea.

Yeah, given it only targets developer error that's sensible.
Re: [PATCH] LWLock self-deadlock detection
Hi,

On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> On 26/11/2020 04:50, Tom Lane wrote:
> > Craig Ringer writes:
> > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
> > > > I'd prefer to make the lock self deadlock check run for production
> > > > builds, not just cassert builds.
> >
> > I'd like to register a strong objection to spending any cycles whatsoever
> > on this.  If you're using LWLocks in a way that could deadlock, you're
> > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > and not have all the overhead that the heavyweight lock mechanism has.
> > Building in deadlock checks and such is exactly the wrong direction.
> > If you think you need that, you should be using a heavyweight lock.
> >
> > Maybe there's some case for a cassert-only check of this sort, but running
> > it in production is right out.
> >
> > I'd also opine that checking for self-deadlock, but not any more general
> > deadlock, seems pretty pointless.  Careless coding is far more likely
> > to create opportunities for the latter.  (Thus, I have little use for
> > the existing assertions of this sort, either.)
>
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

I think something roughly along Craig's original patch, basically adding
assert checks against holding a lock already, makes sense. Compared to
the other costs of running an assert build this isn't a huge cost, and
it's helpful.

I entirely concur that doing this outside of assertion builds is a
seriously bad idea.

Greetings,

Andres Freund
Re: [PATCH] LWLock self-deadlock detection
On Fri, Nov 27, 2020 at 11:08:49AM -0800, Peter Geoghegan wrote:
> On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas wrote:
>> I've made the mistake of forgetting to release an LWLock many times,
>> leading to self-deadlock. And I just reviewed a patch that did that this
>> week [1]. You usually find that mistake very quickly when you start
>> testing though, I don't think I've seen it happen in production.
>
> +1. The fact that you don't get deadlock detection with LWLocks is a
> cornerstone of the whole design. This assumption is common to all
> database systems (LWLocks are generally called latches in the database
> research community, but the concept is exactly the same).

+1.

>> So yeah, I agree it's not worth spending cycles on this. Maybe it would
>> be worth it if it's really simple to check, and you only do it after
>> waiting X seconds. (I haven't looked at this patch at all)
>
> -1 for that, unless it's only for debug builds. Even if it is
> practically free, it still means committing to the wrong priorities.
> Plus the performance validation would be very arduous as a practical
> matter.

Looking at the git history, we have a much larger history of bugs that
relate to race conditions when it comes to LWLocks. I am not really
convinced that it is worth spending time on this even for debug builds,
FWIW.
--
Michael
Re: [PATCH] LWLock self-deadlock detection
On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas wrote:
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

+1. The fact that you don't get deadlock detection with LWLocks is a
cornerstone of the whole design. This assumption is common to all
database systems (LWLocks are generally called latches in the database
research community, but the concept is exactly the same).

> So yeah, I agree it's not worth spending cycles on this. Maybe it would
> be worth it if it's really simple to check, and you only do it after
> waiting X seconds. (I haven't looked at this patch at all)

-1 for that, unless it's only for debug builds. Even if it is
practically free, it still means committing to the wrong priorities.
Plus the performance validation would be very arduous as a practical
matter.

We've made LWLocks much more scalable in the last 10 years. Why
shouldn't we expect to do the same again in the next 10 years? I
wouldn't bet against it. I might even do the opposite (predict further
improvements to LWLocks).
--
Peter Geoghegan
Re: [PATCH] LWLock self-deadlock detection
On 26/11/2020 04:50, Tom Lane wrote:
> Craig Ringer writes:
>> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
>>> I'd prefer to make the lock self deadlock check run for production
>>> builds, not just cassert builds.
>
> I'd like to register a strong objection to spending any cycles whatsoever
> on this.  If you're using LWLocks in a way that could deadlock, you're
> doing it wrong.  The entire point of that mechanism is to be Light Weight
> and not have all the overhead that the heavyweight lock mechanism has.
> Building in deadlock checks and such is exactly the wrong direction.
> If you think you need that, you should be using a heavyweight lock.
>
> Maybe there's some case for a cassert-only check of this sort, but running
> it in production is right out.
>
> I'd also opine that checking for self-deadlock, but not any more general
> deadlock, seems pretty pointless.  Careless coding is far more likely
> to create opportunities for the latter.  (Thus, I have little use for
> the existing assertions of this sort, either.)

I've made the mistake of forgetting to release an LWLock many times,
leading to self-deadlock. And I just reviewed a patch that did that this
week [1]. You usually find that mistake very quickly when you start
testing though, I don't think I've seen it happen in production.

So yeah, I agree it's not worth spending cycles on this. Maybe it would
be worth it if it's really simple to check, and you only do it after
waiting X seconds. (I haven't looked at this patch at all)

[1] https://www.postgresql.org/message-id/8e1cda9b-1cb9-a634-d383-f757bf67b...@iki.fi

- Heikki
Re: [PATCH] LWLock self-deadlock detection
Craig Ringer writes:
> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds.

I'd like to register a strong objection to spending any cycles whatsoever
on this.  If you're using LWLocks in a way that could deadlock, you're
doing it wrong.  The entire point of that mechanism is to be Light Weight
and not have all the overhead that the heavyweight lock mechanism has.
Building in deadlock checks and such is exactly the wrong direction.
If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running
it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general
deadlock, seems pretty pointless.  Careless coding is far more likely
to create opportunities for the latter.  (Thus, I have little use for
the existing assertions of this sort, either.)

regards, tom lane
Re: [PATCH] LWLock self-deadlock detection
On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:
> On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer wrote:
>>> I am also seeing a pattern
>>> Assert(!LWLockHeldByMe());
>>> LWLockAcquire()
>>>
>>> at some places. Should we change LWLockAcquire to do
>>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>>
>> I'm inclined not to, at least not without benchmarking it, because
>> that'd do the check before we attempt the fast-path. cassert builds are
>> still supposed to perform decently and be suitable for day to day
>> development and I'd rather not risk a slowdown.
>>
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds. It'd print a normal LOG (with backtrace
>> if supported) then Assert(). So on an assert build we'd get a crash and
>> core, and on a non-assert build we'd carry on and self-deadlock anyway.
>>
>> That's probably the safest thing to do. We can't expect to safely ERROR
>> out of the middle of an LWLockAcquire(), not without introducing a new
>> and really hard to test code path that'll also be surprising to callers.
>> We probably don't want to PANIC the whole server for non-assert builds
>> since it might enter a panic-loop. So it's probably better to
>> self-deadlock. We could HINT that a -m immediate shutdown will be
>> necessary to recover though.
>
> I agree that it will be helpful to print something in the logs
> indicating the reason for this hang in case the hang happens in a
> production build. In your patch you have used ereport(PANIC, ) which
> may simply be replaced by an Assert() in an assert enabled build.

I'd either select between PANIC (assert build) and LOG (non assert
build), or always LOG then put an Assert() afterwards. It's strongly
desirable to get the log output before any crash because it's a pain to
get the LWLock info from a core.

> We already have Assert(!LWLockHeldByMe()) so that should be safe.
> It will be good to have -m immediate hint in LOG message. But it might
> just be better to kill -9 that process to get rid of it. That will
> cause the server to restart and not just shutdown.

Sure. Won't work on Windows though. And I am hesitant about telling
users to "kill -9" anything. pg_ctl -m immediate restart then.
Re: [PATCH] LWLock self-deadlock detection
On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer wrote:
>> I am also seeing a pattern
>> Assert(!LWLockHeldByMe());
>> LWLockAcquire()
>>
>> at some places. Should we change LWLockAcquire to do
>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>
> I'm inclined not to, at least not without benchmarking it, because that'd
> do the check before we attempt the fast-path. cassert builds are still
> supposed to perform decently and be suitable for day to day development
> and I'd rather not risk a slowdown.
>
> I'd prefer to make the lock self deadlock check run for production builds,
> not just cassert builds. It'd print a normal LOG (with backtrace if
> supported) then Assert(). So on an assert build we'd get a crash and core,
> and on a non-assert build we'd carry on and self-deadlock anyway.
>
> That's probably the safest thing to do. We can't expect to safely ERROR
> out of the middle of an LWLockAcquire(), not without introducing a new and
> really hard to test code path that'll also be surprising to callers. We
> probably don't want to PANIC the whole server for non-assert builds since
> it might enter a panic-loop. So it's probably better to self-deadlock. We
> could HINT that a -m immediate shutdown will be necessary to recover
> though.

I agree that it will be helpful to print something in the logs
indicating the reason for this hang in case the hang happens in a
production build. In your patch you have used ereport(PANIC, ) which
may simply be replaced by an Assert() in an assert enabled build. We
already have Assert(!LWLockHeldByMe()) so that should be safe. It will
be good to have -m immediate hint in LOG message. But it might just be
better to kill -9 that process to get rid of it. That will cause the
server to restart and not just shutdown.
--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] LWLock self-deadlock detection
On Tue, Nov 24, 2020 at 10:11 PM Ashutosh Bapat
<ashutosh.bapat@gmail.com> wrote:
> This looks useful. LWLockCheckSelfDeadlock() could use LWLockHeldByMe
> variant instead of copying that code with possibly a change in that
> function to return the required information.

Yes, possibly so. I was reluctant to mess with the rest of the code too
much.

> I am also seeing a pattern
> Assert(!LWLockHeldByMe());
> LWLockAcquire()
>
> at some places. Should we change LWLockAcquire to do
> Assert(!LWLockHeldByMe()) always to detect such occurrences?

I'm inclined not to, at least not without benchmarking it, because
that'd do the check before we attempt the fast-path. cassert builds are
still supposed to perform decently and be suitable for day to day
development and I'd rather not risk a slowdown.

I'd prefer to make the lock self deadlock check run for production
builds, not just cassert builds. It'd print a normal LOG (with backtrace
if supported) then Assert(). So on an assert build we'd get a crash and
core, and on a non-assert build we'd carry on and self-deadlock anyway.

That's probably the safest thing to do. We can't expect to safely ERROR
out of the middle of an LWLockAcquire(), not without introducing a new
and really hard to test code path that'll also be surprising to callers.
We probably don't want to PANIC the whole server for non-assert builds
since it might enter a panic-loop. So it's probably better to
self-deadlock. We could HINT that a -m immediate shutdown will be
necessary to recover though.

I don't think it makes sense to introduce much complexity for a feature
that's mainly there to help developers and to catch corner-case bugs.

(FWIW a variant of this patch has been in 2ndQPostgres for some time. It
helped catch an extension bug just the other day.)
Re: [PATCH] LWLock self-deadlock detection
This looks useful. LWLockCheckSelfDeadlock() could use a LWLockHeldByMe
variant instead of copying that code, with possibly a change in that
function to return the required information.

I am also seeing a pattern
Assert(!LWLockHeldByMe*())
LWLockAcquire()

at some places. Should we change LWLockAcquire to do
Assert(!LWLockHeldByMe()) always to detect such occurrences? Enhance
that pattern to print the information that your patch prints?

It looks weird that we can detect a self deadlock but not handle it. But
handling it requires remembering multiple LWLocks held and then
releasing them that many times. That may not leave LWLocks LW anymore.

On Thu, Nov 19, 2020 at 4:02 PM Craig Ringer wrote:
>
> Hi all
>
> Here's a patch I wrote a while ago to detect and report when a
> LWLockAcquire() results in a simple self-deadlock due to the caller
> already holding the LWLock.
>
> To avoid affecting hot-path performance, it only fires the check on the
> first iteration through the retry loops in LWLockAcquire() and
> LWLockWaitForVar(), and just before we sleep, once the fast-path has
> been missed.
>
> I wrote an earlier version of this when I was chasing down some hairy
> issues with background workers deadlocking on some exit paths because
> ereport(ERROR) or elog(ERROR) calls fired when a LWLock was held would
> cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock
> when it tried to acquire the same lock.
>
> But it's an easy enough mistake to make and a seriously annoying one to
> track down, so I figured I'd post it for consideration. Maybe someone
> else will get some use out of it even if nobody likes the idea of
> merging it.
>
> As written the check runs only for --enable-cassert builds or when
> LOCK_DEBUG is defined.
--
Best Wishes,
Ashutosh Bapat
[PATCH] LWLock self-deadlock detection
Hi all

Here's a patch I wrote a while ago to detect and report when a
LWLockAcquire() results in a simple self-deadlock due to the caller
already holding the LWLock.

To avoid affecting hot-path performance, it only fires the check on the
first iteration through the retry loops in LWLockAcquire() and
LWLockWaitForVar(), and just before we sleep, once the fast-path has
been missed.

I wrote an earlier version of this when I was chasing down some hairy
issues with background workers deadlocking on some exit paths because
ereport(ERROR) or elog(ERROR) calls fired when a LWLock was held would
cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock
when it tried to acquire the same lock.

But it's an easy enough mistake to make and a seriously annoying one to
track down, so I figured I'd post it for consideration. Maybe someone
else will get some use out of it even if nobody likes the idea of
merging it.

As written the check runs only for --enable-cassert builds or when
LOCK_DEBUG is defined.

From 0ec0beb294f4c5ed35dbb35260f53b069563638f Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Thu, 19 Nov 2020 14:24:55 +0800
Subject: [PATCH] LWLock self-deadlock detection

On cassert builds, if we miss the LWLock acquire fast-path and have to
wait for a LWLock, check to make sure that the acquiring backend doesn't
already hold the lock.

This detects conditions that would otherwise cause a backend to deadlock
indefinitely with interrupts disabled. A backend in this state won't
respond to SIGTERM or SIGQUIT and will prevent postgres from finishing
shutdown.

While a LWLock self-deadlock is always the result of a programming
mistake, it's not always easy to foresee all possible locking states,
especially when dealing with exception handling and with exit callbacks.
---
 src/backend/storage/lmgr/lwlock.c | 87 +++
 1 file changed, 87 insertions(+)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..a2d6dec557 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -313,6 +313,85 @@ LOG_LWDEBUG(const char *where, LWLock *lock, const char *msg)
 #define LOG_LWDEBUG(a,b,c) ((void)0)
 #endif							/* LOCK_DEBUG */
 
+#if defined(USE_ASSERT_CHECKING) || defined(LOCK_DEBUG)
+#ifndef LWLOCK_SELF_DEADLOCK_DETECTION
+#define LWLOCK_SELF_DEADLOCK_DETECTION
+#endif
+#endif
+
+#ifdef LWLOCK_SELF_DEADLOCK_DETECTION
+static void
+LWLockReportSelfDeadlock(LWLock *lock, LWLockMode mode, int held_lockno)
+{
+	StringInfoData locklist;
+	int			i;
+
+	LOG_LWDEBUG("LWLockReportSelfDeadlock", lock, "acquire self-deadlock detected");
+
+	initStringInfo(&locklist);
+	for (i = 0; i < num_held_lwlocks; i++)
+	{
+		if (i > 0)
+			appendStringInfoString(&locklist, ", ");
+		appendStringInfo(&locklist, "%s (%p) excl=%d",
+						 T_NAME(held_lwlocks[i].lock),
+						 held_lwlocks[i].lock,
+						 held_lwlocks[i].mode == LW_EXCLUSIVE);
+	}
+
+	/*
+	 * It might be safe to bail out here on a non-cassert build but more care
+	 * is needed before anything like that is enabled. LWLockAcquire() doesn't
+	 * know if it's safe to `elog(FATAL)` out from the current callsite. So we
+	 * PANIC. This introduces some risk of buggy code causing a server
+	 * crash/restart cycle where it would've previously just deadlocked a
+	 * single backend, but that's part of why we only enable this for assert
+	 * builds.
+	 */
+	ereport(PANIC,
+			(errmsg_internal("self-deadlock detected on acquire of lwlock %s (%p) for mode %s: lock already held by this backend in mode %s",
+							 T_NAME(lock), lock,
+							 mode == LW_EXCLUSIVE
+							 ? "LW_EXCLUSIVE" : "LW_SHARED",
+							 held_lwlocks[held_lockno].mode == LW_EXCLUSIVE
+							 ? "LW_EXCLUSIVE" : "LW_SHARED"),
+			 errdetail("backend already holds %d lwlocks: %s",
+					   num_held_lwlocks, locklist.data)));
+}
+
+/*
+ * A bug in code that uses this particular LWLock could cause the lock-holder
+ * to be ourselves, in which case we'll self-deadlock forever as an unkillable
+ * process.
+ *
+ * Call this in any LWLock acquire retry loop once the fast-path acquire has
+ * failed.
+ *
+ * held_lockno should be initialized to 0 outside the wait loop so that this
+ * check only runs on the first iteration of any wait/retry loop.
+ */
+inline static void
+LWLockCheckSelfDeadlock(LWLock *lock, LWLockMode mode, int *const held_lockno)
+{
+	for (; *held_lockno < num_held_lwlocks; (*held_lockno)++)
+	{
+		if (unlikely(held_lwlocks[*held_lockno].lock == lock))
+			LWLockReportSelfDeadlock(lock, mode, *held_lockno);
+	}
+}
+#else
+/*
+ * Making LWLockCheckSelfDeadlock an empty static inline instead of a macro
+ * silences compiler whinging about held_lockno being unused while optimising
+ * away without fuss.
+ */
+inline static void
+LWLockCheckSelfDeadlock(LWLock *lock, LWLockM