Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
On Sat, Oct 5, 2013 at 7:54 PM, Alvaro Herrera wrote: > MauMau escribió: > > >>From: "Alvaro Herrera" > > >>>Can you please send a fixup patch to what's already committed? > > >> > > >>OK, I'll send a patch against HEAD, which will be a few lines. Am I > > >>understanding the meaning of "fixup patch"? > > > > > >Yep, thanks. > > > > Please use the attached patch. Thanks. > > Pushed, thanks. > > Thanks everyone. I'm now unable to break it. Cheers, Jeff
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
MauMau escribió: > >>From: "Alvaro Herrera" > >>>Can you please send a fixup patch to what's already committed? > >> > >>OK, I'll send a patch against HEAD, which will be a few lines. Am I > >>understanding the meaning of "fixup patch"? > > > >Yep, thanks. > > Please use the attached patch. Thanks. Pushed, thanks. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
From: "Alvaro Herrera" >Can you please send a fixup patch to what's already committed? OK, I'll send a patch against HEAD, which will be a few lines. Am I understanding the meaning of "fixup patch"? Yep, thanks. Please use the attached patch. Thanks. Regards MauMau reliable_immediate_shutdown_fixup.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
MauMau escribió: > From: "Alvaro Herrera" > >Can you please send a fixup patch to what's already committed? > > OK, I'll send a patch against HEAD, which will be a few lines. Am I > understanding the meaning of "fixup patch"? Yep, thanks. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
From: "Alvaro Herrera" Can you please send a fixup patch to what's already committed? OK, I'll send a patch against HEAD, which will be a few lines. Am I understanding the meaning of "fixup patch"? Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
MauMau escribió: > "AbortStartTime > 0" is also necessary to avoid sending SIGKILL > repeatedly. I sent the attached patch during the original > discussion. The below fragment is relevant: Can you please send a fixup patch to what's already committed? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
From: "Jeff Janes" -- I've implemented the Min to Max change and did some more testing. Now I have a different but related problem (which I also saw before, but less often than the select() one). The 5 second clock doesn't get turned off. So after all processes end, and a new startup is launched, if that startup doesn't report back to the postmaster soon enough, it gets SIGKILLED. postmaster.c near line 1681 if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) && now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS) It seems like this needs to have an additional and-test of pmState, but which states to test I don't really know. I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so I think that this is on the right path but testing an enum for inequality feels wrong. -- "AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly. I sent the attached patch during the original discussion. The below fragment is relevant: --- 1663,1688 TouchSocketLockFiles(); last_touch_time = now; } + + /* +* When postmaster got an immediate shutdown request +* or some child terminated abnormally (FatalError case), +* postmaster sends SIGQUIT to all children except +* syslogger and dead_end ones, then wait for them to terminate. +* If some children didn't terminate within a certain amount of time, +* postmaster sends SIGKILL to them and wait again. +* This resolves, for example, the hang situation where +* a backend gets stuck in the call chain: +* free() acquires some lock -> -> +* quickdie() -> ereport() -> gettext() -> malloc() -> acquisition> +*/ + if (AbortStartTime > 0 && /* SIGKILL only once */ +(Shutdown == ImmediateShutdown || (FatalError && !SendStop)) && +now - AbortStartTime >= 10) + { +SignalAllChildren(SIGKILL); +AbortStartTime = 0; + } } } Regards MauMau reliable_immediate_shutdown.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
Noah Misch escribió: > On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote: > > I think the problem is here, where there should be a Max rather than a Min: > > > > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af > > Author: Alvaro Herrera > > Date: Fri Jun 28 17:20:53 2013 -0400 > > > > Send SIGKILL to children if they don't die quickly in immediate shutdown > > > > ... > > > > + /* remaining time, but at least 1 second */ > > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS - > > + (time(NULL) - AbortStartTime), 1); > > Agreed; good catch. Yeah, thanks. Should be a Max(). The current coding presumably makes it use one second most of the time, instead of whatever the remaining time is ... until the abort time is past, in which case it causes the whole thing to break down as reported. It might very well be that I used Max() there initially and changed to Min() at the last minute before commit in a moment of brain fade. > > But I don't understand the logic behind this anyway. Why sleep at least 1 > > second? If time is up, it is up, why not use zero as the minimum? > > Offhand, clamping to zero does make more sense to me. It looks like Alvaro > added that bit in his pre-commit edits. Alvaro? Sadly, I don't have the developing branch for this feature anymore, so I have to go from memory. IIRC my thinking here is that if I make select terminate immediately (timeout 0) then the time arithmetic in ServerLoop() might lead us to decide not to send SIGKILL at that time, causing one more iteration of that loop. Thinking about it again, that argument doesn't seem to hold much water; but the time variables being in integer seconds led me to add that. I will fix it to Max( ..., 0). -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 HEAD: select() failed in postmaster
On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote: > I've been getting some failures after an immediate shutdown or crash, > during severe IO stress, with the message: > > LOG: XX000: select() failed in postmaster: Invalid argument > LOCATION: ServerLoop, postmaster.c:1560 > > It is trying to sleep for -1 seconds. > > I think the problem is here, where there should be a Max rather than a Min: > > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af > Author: Alvaro Herrera > Date: Fri Jun 28 17:20:53 2013 -0400 > > Send SIGKILL to children if they don't die quickly in immediate shutdown > > ... > > + /* remaining time, but at least 1 second */ > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS - > + (time(NULL) - AbortStartTime), 1); Agreed; good catch. > But I don't understand the logic behind this anyway. Why sleep at least 1 > second? If time is up, it is up, why not use zero as the minimum? Offhand, clamping to zero does make more sense to me. It looks like Alvaro added that bit in his pre-commit edits. Alvaro? Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers