Hi, I've found rr [1] very useful to debug issues in postgres. The ability to hit a bug, and then e.g. identify a pointer with problematic contents, set a watchpoint on its contents, and reverse-continue is extremely powerful.
Unfortunately, when running postgres, it currently occasionally triggers spurious stack-too-deep errors. That turns out to be because it has to use an alternative stack in some corner cases (IIUC when a signal arrives while already in a signal handler). That corner case can unfortunately be hit from within postmaster, and at least can lead to sigusr1_handler() being called with an alternative stack set. Unfortunately that means that processes that postmaster fork()s while using that alternative stack will continue their live using that alternative stack. Which then subsequently means that our stack depth checks always trigger. I've not seen this trigger for normal backends (which makes sense, they're not started from a signal handler), but for bgworkers. In particular parallel workers are prone to hit the issue. I've locally fixed the issue by computing the stack base address anew for postmaster children. Currently in InitPostmasterChild(). I'd like to get that change upstream. The rr hackers have fixed a number of other issues that could be hit with postgres, but they couldn't see a good way to address the potential for a different signal stack in this edge case. And it doesn't seem crazy to me to compute the stack base again in postmaster children: It's cheap enough and it's extremely unlikely that postmaster uses up a crazy amount of stack. I also don't find it too crazy to guard against forks in signal handlers leading to a different stack base address. It's a pretty odd thing to do. Tom, while imo not a fix of the right magnitude here: Are you planning / hoping to work again on your postmaster latch patch? I think it'd be really good if we could restructure the postmaster code to do far far less in signal handlers. And the postmaster latch patch seems like a big step in that direction. I think we mostly dropped it due to the release schedule last time round? Greetings, Andres Freund [1] https://github.com/mozilla/rr/