From: "Tom Lane" <t...@sss.pgh.pa.us>
[ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]

I must add one thing. After some client processes closed the connection without any hang, their server processes were stuck with a stack trace like this (I'll look for and show the exact stack trace tomorrow):

WaitLatchOrSocket
SyncRepWaitForLSN
CommitTransaction
CommitTransactionCommand
ProcessCatchupEvent
...
shmem_exit
proc_exit_prepare
proc_exit
PostgresMain
...

The process appears to be hanging during session termination. So, it's not the problem only during client request wait.


Another line of thought is that we've been way too uncritical about
shoving different kinds of events into the SIGUSR1 multiplexor.
It might be a good idea to separate "high level" interrupts from
"low level" ones, using say SIGUSR1 for the former and SIGUSR2
for the latter.  However, that doesn't sound very back-patchable,
even assuming that we can come up with a clean division.

This seems to be one step in the right direction. There are two issues in the current implementation:

[Issue 1]
[ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
This is (partly) because the latch wakeup and other processing use the same SIGUSR1 in normal backend, autovacuum launcher/worker, and the background worker with database access. On the other hand, other background daemon processes properly use SIGUSR1 only for latch wakeup, and SIGUSR2 for other tasks.

[Issue 2]
WaitLatch() returns prematurely due to the sinval catchup signal, even though the target event (e.g. reply from standby) hasn't occurred and called SetLatch() yet. This is because procsignal_sigusr1_handler() always calls latch_sigusr1_handler().
This is probably not related to the cause of the hang.

So, as you suggest, I think it would be a good idea to separate latch_sigusr1_handler() call into a different function solely for it like other daemon processes, and leave the rest in procsignal_sigusr1_handler() and rename function to procsignal_sigusr2_handler() or procsignal_handler(). Originally, it's not natural that the current procsignal_sigusr1_handler() contains latch_sigusr1_handler() call, because latch processing is not based on the procsignal mechanism (SetLatch() doesn't call SendProcSignal()).

I'll try the fix tomorrow if possible. What kind of problems do you hink of for back-patching?

Regards
MauMau




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to