Re: the s_lock_stuck on perform_spin_delay

Andres Freund Thu, 04 Jan 2024 15:06:50 -0800

Hi,

On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
>
> [...]
>
> I notice this issue actually because of the patch "Cache relation
> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> following code.
> +             sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
> +
> +             /* Upgrade to exclusive lock so we can create a mapping. */
> +             LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>   operation is needed. it may take a long time.
>
> Our internal testing system found more misuses on our own PG version.


> I think a experienced engineer like Thomas can make this mistake and the
> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> to say just don't do it.

I don't follow this argument - just ignoring the problem, which emitting a
WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
impact, because now the system will not recover from the bug without explicit
operator intervention.  During that time the system might be essentially
unresponsive, because all backends end up contending for some spinlock, which
makes investigating such issues very hard.


I think we should add infrastructure to detect bugs like this during
development, but not PANICing when this happens in production seems completely
non-viable.

Greetings,

Andres Freund

Re: the s_lock_stuck on perform_spin_delay

Reply via email to