FWIW, yesterday we had one more reproduction of the stuck spinlock panic, which 
does not actually appear to be a stuck spinlock.

I don’t see any valuable diagnostic information. The reproduction happened on a 
hot standby running PostgreSQL 14.11. There’s a message in the primary’s log at 
the same time, but it does not seem to be related:
"process 3918804 acquired ShareLock on transaction 909261926 after 2716.594 ms"
The VM hosting this node does not seem heavily loaded: according to monitoring, 
there were just 2 busy backends before the panic shutdown.


> On 16 Apr 2024, at 20:54, Andres Freund <and...@anarazel.de> wrote:
> 
> Hi,
> 
> On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
>> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <and...@anarazel.de> wrote:
>>> Here's a patch implementing this approach. I confirmed that before we trigger
>>> the stuck spinlock logic very quickly and after we don't. However, if most
>>> sleeps are interrupted, it can delay the stuck spinlock detection a good
>>> bit. But that seems much better than triggering it too quickly.
>> 
>> +1 for doing something about this. I'm not sure if it goes far enough,
>> but it definitely seems much better than doing nothing.
> 
> One thing I started to be worried about is whether a patch ought to prevent
> the timeout used by perform_spin_delay() from increasing when
> interrupted. Otherwise a few signals can trigger quite long waits.
> 
> But as I can't quite see a way to make this accurate in the backbranches, I
> suspect something like what I posted is still a good first version.
> 


What kind of inaccuracy do you see?
The code in perform_spin_delay() does not seem to be much different across 
REL_11_STABLE..REL_12_STABLE.
The only difference I see is how the random number is generated.
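
For reference, here is a simplified sketch of the backoff logic as I read it in
src/backend/storage/lmgr/s_lock.c (paraphrased, not verbatim; IIRC
NUM_DELAYS = 1000, MIN_DELAY_USEC = 1 ms and MAX_DELAY_USEC = 1 s):

    void
    perform_spin_delay(SpinDelayStatus *status)
    {
        SPIN_DELAY();                   /* CPU-specific pause */

        /* block the process every spins_per_delay tries */
        if (++(status->spins) >= spins_per_delay)
        {
            /* after NUM_DELAYS sleeps, PANIC with "stuck spinlock" */
            if (++(status->delays) > NUM_DELAYS)
                s_lock_stuck(status->file, status->line, status->func);

            if (status->cur_delay == 0) /* first time to delay? */
                status->cur_delay = MIN_DELAY_USEC;

            pg_usleep(status->cur_delay);   /* may return early on a signal */

            /*
             * Increase delay by a random fraction between 1X and 2X.  AFAICS
             * this is the spot that differs between branches: older branches
             * compute the fraction with random(), newer ones with
             * pg_prng_double().
             */
            status->cur_delay += (int) (status->cur_delay *
                                        pg_prng_double(&pg_global_prng_state) + 0.5);
            /* wrap back to minimum delay when max is exceeded */
            if (status->cur_delay > MAX_DELAY_USEC)
                status->cur_delay = MIN_DELAY_USEC;

            status->spins = 0;
        }
    }

A sleep interrupted by a signal still counts against NUM_DELAYS, and cur_delay
still grows, so a few signals can both hasten the stuck-spinlock verdict and
inflate the later sleeps, which, as I understand it, is the interaction you are
worried about.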

Thanks!


Best regards, Andrey Borodin.
