From: "Jeff Janes" <jeff.ja...@gmail.com>
--------------------------------------------------
I've implemented the Min to Max change and did some more testing.  Now I
have a different but related problem (which I also saw before, but less
often than the select() one).  The 5-second clock doesn't get turned off.
So after all processes end and a new startup process is launched, if that
startup process doesn't report back to the postmaster soon enough, it gets
SIGKILLed.

postmaster.c near line 1681


       if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
           now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.
--------------------------------------------------


"AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly. I sent the attached patch during the original discussion. The fragment below is the relevant part:


--- 1663,1688 ----
    TouchSocketLockFiles();
    last_touch_time = now;
   }
+
+   /*
+    * When the postmaster gets an immediate shutdown request
+    * or some child terminates abnormally (FatalError case),
+    * it sends SIGQUIT to all children except the
+    * syslogger and dead_end ones, then waits for them to terminate.
+    * If some children do not terminate within a certain amount of time,
+    * the postmaster sends SIGKILL to them and waits again.
+    * This resolves, for example, the hang situation where
+    * a backend gets stuck in the call chain:
+    * free() acquires some lock -> <received SIGQUIT> ->
+    * quickdie() -> ereport() -> gettext() -> malloc() -> <lock acquisition>
+    */
+   if (AbortStartTime > 0 &&  /* SIGKILL only once */
+    (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+    now - AbortStartTime >= 10)
+   {
+    SignalAllChildren(SIGKILL);
+    AbortStartTime = 0;
+   }
  }
 }
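
To show why the "AbortStartTime > 0" test matters, here is a minimal standalone sketch of the guard from the fragment above. It is not postmaster code: the enum, the globals, and SignalAllChildrenStub() are hypothetical stand-ins, and the 10-second value is simply the literal from the fragment. The point is only that zeroing AbortStartTime after the kill disarms the timer, so later passes through the loop do nothing.

```c
#include <stdbool.h>
#include <time.h>

/* Assumption: 10 seconds, taken from the literal in the patch fragment. */
#define SIGKILL_CHILDREN_AFTER_SECS 10

/* Hypothetical stand-ins for postmaster state, not the real definitions. */
typedef enum
{
    NoShutdown,
    SmartShutdown,
    FastShutdown,
    ImmediateShutdown
} ShutdownMode;

static ShutdownMode Shutdown = NoShutdown;
static bool   FatalError = false;
static bool   SendStop = false;
static time_t AbortStartTime = 0;   /* 0 means "timer not armed" */
static int    sigkill_rounds = 0;   /* counts calls to the stub below */

/* Stub standing in for SignalAllChildren(SIGKILL); it only counts calls. */
static void
SignalAllChildrenStub(void)
{
    sigkill_rounds++;
}

/* One pass of the check, as it would run on each server loop iteration. */
static void
maybe_sigkill_children(time_t now)
{
    if (AbortStartTime > 0 &&       /* SIGKILL only once */
        (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
        now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
    {
        SignalAllChildrenStub();
        AbortStartTime = 0;         /* disarm so later passes do nothing */
    }
}
```

With FatalError set and AbortStartTime = 1000, a call at now = 1005 does nothing, a call at now = 1010 signals once and resets AbortStartTime to 0, and a call at now = 1015 does nothing again. Without the reset, "now - AbortStartTime" would stay above the threshold and every later pass would signal again, which is also why a newly launched startup process could be hit.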


Regards
MauMau

Attachment: reliable_immediate_shutdown.patch
Description: Binary data
