FWIW, yesterday we had one more reproduction of the stuck spinlock panic that does not look like an actual stuck spinlock.
I don’t see any valuable diagnostic information. The reproduction happened on a hot standby. There’s a message in the primary's logs from the same time, but it does not seem to be related:
"process 3918804 acquired ShareLock on transaction 909261926 after 2716.594 ms"
PostgreSQL 14.11
The VM hosting this node does not seem heavily loaded; according to monitoring, there were just 2 busy backends before the panic shutdown.

> On 16 Apr 2024, at 20:54, Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
>> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <and...@anarazel.de> wrote:
>>> Here's a patch implementing this approach. I confirmed that before we
>>> trigger
>>> the stuck spinlock logic very quickly and after we don't. However, if most
>>> sleeps are interrupted, it can delay the stuck spinlock detection a good
>>> bit. But that seems much better than triggering it too quickly.
>>
>> +1 for doing something about this. I'm not sure if it goes far enough,
>> but it definitely seems much better than doing nothing.
>
> One thing I started to be worried about is whether a patch ought to prevent
> the timeout used by perform_spin_delay() from increasing when
> interrupted. Otherwise a few signals can trigger quite long waits.
>
> But as a I can't quite see a way to make this accurate in the backbranches, I
> suspect something like what I posted is still a good first version.
>

What kind of inaccuracy do you see? The code in perform_spin_delay() does not seem to differ much across REL_11_STABLE..REL_12_STABLE.
The only difference I see is how the random number is generated.

Thanks!

Best regards, Andrey Borodin.