On Mon, May 21, 2012 at 9:22 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Sat, May 19, 2012 at 1:23 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> I've been testing the crash recovery of REL9_2_BETA1, using the same
>> method I posted in the "Scaling XLog insertion" thread. I have the
>> checkpointer occasionally throw a FATAL error,
>
> We should also fix this problem? If yes, could you show us the self-contained
> test case to reproduce it?

It doesn't need to be fixed. I intentionally apply a patch to cause this
FATAL error. In order to exercise the crash recovery, I need a way to
cause crashes.

>
>> However, sometimes the automatic recovery never initiates. It looks
>> like the postmaster is waiting for the archiver to exit before it
>> starts recovery, and the archiver is waiting for something, I don't
>> really know what.
>
> Could you get the backtrace of the waiting archiver?
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

Backtrace of the archiver:

#0  0x0000003da12ca32f in poll () from /lib64/libc.so.6
#1  0x00000000005ef8b9 in WaitLatchOrSocket (latch=0xafe238, wakeEvents=25, sock=-1, timeout=60000) at pg_latch.c:278
#2  0x00000000005f2b84 in PgArchiverMain () at pgarch.c:423
#3  pgarch_start () at pgarch.c:179
#4  0x00000000005fbd72 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2383
#5  <signal handler called>
#6  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#7  0x00000000005f8edc in ServerLoop () at postmaster.c:1316
#8  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at postmaster.c:1116
#9  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at main.c:199

I also attached gdb to the archiver when it was working normally, and
I get the exact same backtrace.

It looks to me like the SIGQUIT from the postmaster is simply getting
lost. And from what little I understand of signal handling, this is a
known race with system(3). The archive_command, child of the archiver,
exits before it can receive the signal sent to the entire archiver
process group, so it doesn't set its exit status to show it was
signalled. But the signal sent directly to the archiver reaches it
while it is still ignoring SIGQUITs.

I caught a hang in motion with strace. Here, 22448 is the archiver,
and 29475 is the archive_command that was in progress when the crash
happened.

[pid 22448] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29475
[pid 22448] --- SIGQUIT (Quit) @ 0 (0) ---
[pid 22448] rt_sigaction(SIGINT, {0x1, [], SA_RESTORER|SA_RESTART, 0x3da12302f0}, <unfinished ...>
[pid 22448] <... rt_sigaction resumed> NULL, 8) = 0
[pid 22448] rt_sigaction(SIGQUIT, {0x5f29a0, [], SA_RESTORER|SA_RESTART, 0x3da12302f0}, <unfinished ...>

And just in case it is useful, a backtrace for the postmaster during the hang:

#0  0x0000003da12cc3a3 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000007226ba in pg_usleep (microsec=<value optimized out>) at pgsleep.c:43
#2  0x00000000005f9306 in ServerLoop () at postmaster.c:1305
#3  0x00000000005fa327 in PostmasterMain (argc=3, argv=0xb7c1720) at postmaster.c:1116
#4  0x000000000059c6ee in main (argc=3, argv=<value optimized out>) at main.c:199

If the SIGQUIT is getting lost in a race, could it just be blocked
during the system(3) call? I don't know what happens if you call
system(3) with SIGQUIT being blocked.

Or maybe the postmaster should not be infinitely patient, but send
another round of signals after a brief delay.

Cheers,

Jeff
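
For illustration only, here is a minimal sketch of the parent-side half of the race described above: a SIGQUIT that reaches the caller while it is inside system(3) is discarded rather than handled. It is written in Python rather than the archiver's C code, and assumes a POSIX system where os.system() is a thin wrapper over the C library's system(3), which sets SIGINT and SIGQUIT to "ignore" in the caller for the duration of the call. The function name count_sigquit_deliveries is hypothetical, and the child shell signalling its own parent merely stands in for the postmaster's signal arriving at the wrong moment.

```python
import os
import signal
import time


def count_sigquit_deliveries():
    """Return (status, during, after): the child's wait status and the number
    of SIGQUITs the handler saw during and after a system(3) call."""
    received = []
    # Install a handler that just records each delivery.
    previous = signal.signal(
        signal.SIGQUIT, lambda signum, frame: received.append(signum)
    )
    try:
        # The child shell sends SIGQUIT to its parent (this process) and
        # then exits normally. While system(3) is running, the parent's
        # SIGQUIT disposition is "ignore", so the signal is dropped and
        # the handler above is never called.
        status = os.system("kill -QUIT $PPID")
        during = len(received)

        # The same signal sent once system(3) has returned is handled,
        # because the original disposition has been restored.
        os.kill(os.getpid(), signal.SIGQUIT)
        time.sleep(0.01)  # give the interpreter a chance to run the handler
        after = len(received) - during
    finally:
        # Put back whatever disposition was in place before the demo.
        signal.signal(signal.SIGQUIT, previous)
    return status, during, after


if __name__ == "__main__":
    print(count_sigquit_deliveries())
```

If the assumptions hold, the first count should be zero and the second one, and the child's wait status gives no hint that a signal was involved, which matches the WIFEXITED/WEXITSTATUS == 0 result in the strace output. The sketch does not reproduce the process-group signalling or the timing window itself.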