Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-21 Thread Jeff Janes
On Sat, Oct 5, 2013 at 7:54 PM, Alvaro Herrera alvhe...@2ndquadrant.comwrote:

 MauMau escribió:
  From: Alvaro Herrera alvhe...@2ndquadrant.com
  Can you please send a fixup patch to what's already committed?
  
  OK, I'll send a patch against HEAD, which will be a few lines.  Am I
  understanding the meaning of fixup patch?
  
  Yep, thanks.
 
  Please use the attached patch.  Thanks.

 Pushed, thanks.


Thanks everyone.  I'm now unable to break it.

Cheers,

Jeff


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-05 Thread Alvaro Herrera
MauMau escribió:
 From: Alvaro Herrera alvhe...@2ndquadrant.com
 Can you please send a fixup patch to what's already committed?
 
 OK, I'll send a patch against HEAD, which will be a few lines.  Am I
 understanding the meaning of fixup patch?
 
 Yep, thanks.
 
 Please use the attached patch.  Thanks.

Pushed, thanks.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-04 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com
Can you please send a fixup patch to what's already committed?

OK, I'll send a patch against HEAD, which will be a few lines.  Am I
understanding the meaning of fixup patch?


Yep, thanks.


Please use the attached patch.  Thanks.

Regards
MauMau


reliable_immediate_shutdown_fixup.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-03 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

Can you please send a fixup patch to what's already committed?


OK, I'll send a patch against HEAD, which will be a few lines.  Am I 
understanding the meaning of fixup patch?


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-03 Thread Alvaro Herrera
MauMau escribió:
 From: Alvaro Herrera alvhe...@2ndquadrant.com
 Can you please send a fixup patch to what's already committed?
 
 OK, I'll send a patch against HEAD, which will be a few lines.  Am I
 understanding the meaning of fixup patch?

Yep, thanks.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-10-01 Thread Alvaro Herrera
MauMau escribió:

 AbortStartTime  0 is also necessary to avoid sending SIGKILL
 repeatedly. I sent the attached patch during the original
 discussion.  The below fragment is relevant:

Can you please send a fixup patch to what's already committed?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-09-13 Thread MauMau

From: Jeff Janes jeff.ja...@gmail.com
--
I've implemented the Min to Max change and did some more testing.  Now I
have a different  but related problem (which I also saw before, but less
often than the select() one).  The 5 second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681


   if ((Shutdown = ImmediateShutdown || (FatalError  !SendStop)) 
   now - AbortStartTime = SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in  (pmStatePM_RUN) and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.
--


AbortStartTime  0 is also necessary to avoid sending SIGKILL repeatedly. 
I sent the attached patch during the original discussion.  The below 
fragment is relevant:



--- 1663,1688 
TouchSocketLockFiles();
last_touch_time = now;
   }
+
+   /*
+* When postmaster got an immediate shutdown request
+* or some child terminated abnormally (FatalError case),
+* postmaster sends SIGQUIT to all children except
+* syslogger and dead_end ones, then wait for them to terminate.
+* If some children didn't terminate within a certain amount of time,
+* postmaster sends SIGKILL to them and wait again.
+* This resolves, for example, the hang situation where
+* a backend gets stuck in the call chain:
+* free() acquires some lock - received SIGQUIT -
+* quickdie() - ereport() - gettext() - malloc() - lock 
acquisition

+*/
+   if (AbortStartTime  0   /* SIGKILL only once */
+(Shutdown == ImmediateShutdown || (FatalError  !SendStop)) 
+now - AbortStartTime = 10)
+   {
+SignalAllChildren(SIGKILL);
+AbortStartTime = 0;
+   }
  }
 }


Regards
MauMau


reliable_immediate_shutdown.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.4 HEAD: select() failed in postmaster

2013-09-12 Thread Jeff Janes
On Wednesday, September 11, 2013, Alvaro Herrera wrote:

 Noah Misch escribió:
  On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:

   I think the problem is here, where there should be a Max rather than a
 Min:
  
   commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
   Author: Alvaro Herrera alvhe...@alvh.no-ip.org
   Date:   Fri Jun 28 17:20:53 2013 -0400
  
   Send SIGKILL to children if they don't die quickly in immediate
 shutdown
  
   ...
  
   +   /* remaining time, but at least 1 second */
   +   timeout-tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
   + (time(NULL) - AbortStartTime), 1);
 
  Agreed; good catch.

 Yeah, thanks.  Should be a Max().  The current coding presumably makes
 it use one second most of the time, instead of whatever the remaining
 time is ... until the abort time is past, in which case it causes the
 whole thing to break down as reported.

 It might very well be that I used Max() there initially and changed to
 Min() at the last minute before commit in a moment of brain fade.


I've implemented the Min to Max change and did some more testing.  Now I
have a different  but related problem (which I also saw before, but less
often than the select() one).  The 5 second clock doesn't get turned off.
 So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681


if ((Shutdown = ImmediateShutdown || (FatalError  !SendStop)) 
now - AbortStartTime = SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in  (pmStatePM_RUN) and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.

Alternatively perhaps FatalError can get cleared when startup is launched,
rather than when WAL replay begins.  But I assume it was done the way it is
for a reason, even though I don't know that reason.

Cheers,

Jeff





Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-09-11 Thread Noah Misch
On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:
 I've been getting some failures after an immediate shutdown or crash,
 during severe IO stress, with the message:
 
 LOG:  XX000: select() failed in postmaster: Invalid argument
 LOCATION:  ServerLoop, postmaster.c:1560
 
 It is trying to sleep for -1 seconds.
 
 I think the problem is here, where there should be a Max rather than a Min:
 
 commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
 Author: Alvaro Herrera alvhe...@alvh.no-ip.org
 Date:   Fri Jun 28 17:20:53 2013 -0400
 
 Send SIGKILL to children if they don't die quickly in immediate shutdown
 
 ...
 
 +   /* remaining time, but at least 1 second */
 +   timeout-tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
 + (time(NULL) - AbortStartTime), 1);

Agreed; good catch.

 But I don't understand the logic behind this anyway.  Why sleep at least 1
 second?  If time is up, it is up, why not use zero as the minimum?

Offhand, clamping to zero does make more sense to me.  It looks like Alvaro
added that bit in his pre-commit edits.  Alvaro?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 HEAD: select() failed in postmaster

2013-09-11 Thread Alvaro Herrera
Noah Misch escribió:
 On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:

  I think the problem is here, where there should be a Max rather than a Min:
  
  commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
  Author: Alvaro Herrera alvhe...@alvh.no-ip.org
  Date:   Fri Jun 28 17:20:53 2013 -0400
  
  Send SIGKILL to children if they don't die quickly in immediate shutdown
  
  ...
  
  +   /* remaining time, but at least 1 second */
  +   timeout-tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
  + (time(NULL) - AbortStartTime), 1);
 
 Agreed; good catch.

Yeah, thanks.  Should be a Max().  The current coding presumably makes
it use one second most of the time, instead of whatever the remaining
time is ... until the abort time is past, in which case it causes the
whole thing to break down as reported.

It might very well be that I used Max() there initially and changed to
Min() at the last minute before commit in a moment of brain fade.

  But I don't understand the logic behind this anyway.  Why sleep at least 1
  second?  If time is up, it is up, why not use zero as the minimum?
 
 Offhand, clamping to zero does make more sense to me.  It looks like Alvaro
 added that bit in his pre-commit edits.  Alvaro?

Sadly, I don't have the developing branch for this feature anymore, so I
have to go from memory.  IIRC my thinking here is that if I make select
terminate immediately (timeout 0) then the time arithmetic in
ServerLoop() might lead us to decide not to send SIGKILL at that time,
causing one more iteration of that loop.  Thinking about it again, that
argument doesn't seem to hold much water; but the time variables being
in integer seconds led me to add that.

I will fix it to Max( ..., 0).

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.4 HEAD: select() failed in postmaster

2013-09-10 Thread Jeff Janes
I've been getting some failures after an immediate shutdown or crash,
during severe IO stress, with the message:

LOG:  XX000: select() failed in postmaster: Invalid argument
LOCATION:  ServerLoop, postmaster.c:1560

It is trying to sleep for -1 seconds.

I think the problem is here, where there should be a Max rather than a Min:

commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date:   Fri Jun 28 17:20:53 2013 -0400

Send SIGKILL to children if they don't die quickly in immediate shutdown

...

+   /* remaining time, but at least 1 second */
+   timeout-tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
+ (time(NULL) - AbortStartTime), 1);


But I don't understand the logic behind this anyway.  Why sleep at least 1
second?  If time is up, it is up, why not use zero as the minimum?

Cheers,

Jeff