On Thu, Jan 4, 2024 at 6:06 PM Andres Freund <and...@anarazel.de> wrote:
> I think we should add infrastructure to detect bugs like this during
> development, but not PANICing when this happens in production seems
> completely non-viable.
I mean +1 for the infrastructure, but "completely non-viable"? Why? I've
only very rarely seen this PANIC occur, and in the few cases where I've
seen it, it was entirely unclear that the problem was due to a bug where
somebody failed to release a spinlock. It seemed more likely that the
machine was just not really functioning, and the PANIC was a symptom of
processes not getting scheduled rather than a PG bug.

And every time I tell a user that they might need to use a debugger to,
say, set VacuumCostActive = false, or to get a backtrace, or for any other
reason, I have to tell them to make sure to detach the debugger in under
60 seconds, because in the unlikely event that they attach while the
process is holding a spinlock, failure to detach in under 60 seconds will
take their production system down for no reason. Now, if you're about to
say that people shouldn't need to use a debugger on their production
instance, I entirely agree ... but in the world I inhabit, that's often
the only way to solve a customer problem, and it probably will continue
to be until we have much better ways of getting backtraces without using
a debugger than is currently the case.

Have you seen real cases where this PANIC prevents a hang? If yes, did
that PANIC trace back to a bug in PostgreSQL? And why didn't the user
just keep hitting the same bug over and over, PANICing in an endless
loop?

I feel like this is one of those things that has just been this way
forever and we don't question it, because it's become an article of faith
that it's something we have to have. But I have a very hard time
explaining why it's even a net positive, let alone the unquestionable
good that you seem to think it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com