Hi,

On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > <lars.ellenb...@linbit.com> wrote:
> > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > >> On Tue, Oct 18, 2011 at 12:19 PM,  <renayama19661...@ybb.ne.jp> wrote:
> > >> > Hi,
> > >> >
> > >> > We sometimes fail in a stop of attrd.
> > >> >
> > >> > Step1. start a cluster in 2 nodes
> > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > >> > Step3. stop the second node after time passed a 
> > >> > little.(/etc/init.d/heartbeat
> > >> > stop.)
> > >> >
> > >> > The attrd catches the TERM signal, but does not stop.
> > >>
> > >> There's no evidence that it actually catches it, only that it is sent.
> > >> I've seen it before but never figured out why it occurs.
> > >
> > > I had it once tracked down almost to where it occurs, but then got 
> > > distracted.
> > > Yes the signal was delivered.
> > >
> > > I *think* it had to do with attrd doing a blocking read,
> > > or looping in some internal message delivery function too often.
> > >
> > > I had a quick look at the code again now, to try and remember,
> > > but I'm not sure.
> > >
> > > I *may* be that, because
> > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > >
> > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > >        IPC_INTR:
> > >                if ( allow_intr){
> > >                        goto startwait;
> > >
> > > Depending on the frequency of deliverd signals, it may cause this goto
> > > startwait loop to never exit, because the timeout always starts again
> > > from the full passed in timeout.
> > >
> > > If only one signal is deliverd, it may still take 120 seconds
> > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > handler only raises a flag for the next mainloop iteration.
> > >
> > > If a (non-fatal) signal is delivered every few seconds,
> > > then the goto loop will never timeout.
> > >
> > > Please someone check this for plausibility ;-)
> > 
> > Most plausible explanation I've heard so far... still odd that only
> > attrd is affected.
> > So what do we do about it?
> 
> Reproduce, and confirm that this is what people are seeing.
> 
> Make attrd non-blocking?
> 
> Fix the ipc layer to not restart the full timeout,
> but only the remaining partial time?

Lars and I made a quick patch for cluster-glue (attached).
Hideo-san, is there a way for you to verify if it helps? The
patch is not perfect and under unfavourable circumstances it may
still take a long time for the caller to exit, but it'd be good
to know if this is the right spot.

Cheers,

Dejan

> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
# HG changeset patch
# User Lars Ellenberg <lars.ellenb...@linbit.com>
# Date 1321275721 -3600
# Node ID 8b50bf0dd4cdf8d0a405416da98711080b2abeb9
# Parent  569bdebf736185d77782f49d5c760007cfc6b3e8
Medium: clplumbing: don't restart timeouts forever if signals are repeatedly sent

diff -r 569bdebf7361 -r 8b50bf0dd4cd lib/clplumbing/cl_msg.c
--- a/lib/clplumbing/cl_msg.c	Mon Nov 14 11:31:51 2011 +0100
+++ b/lib/clplumbing/cl_msg.c	Mon Nov 14 14:02:01 2011 +0100
@@ -1802,12 +1802,13 @@ static struct ha_msg*
 msgfromIPC_ll(IPC_Channel * ch, int flag, unsigned int timeout, int *rc_out)
 {
 	int		rc;
+	int		sig_cnt = 0;
 	IPC_Message*	ipcmsg;
 	struct ha_msg*	hmsg;
 	int		need_auth = flag & MSG_NEEDAUTH;
 	int		allow_intr = flag & MSG_ALLOWINTR;
 	
- startwait:
+	do {
 	if(timeout > 0) {
 	    rc = cl_ipc_wait_timeout(ch, ch->ops->waitin, timeout);
 	} else {
@@ -1832,17 +1833,17 @@ msgfromIPC_ll(IPC_Channel * ch, int flag
 		return NULL;
 		
 	case IPC_INTR:
-		if ( allow_intr){
-			goto startwait;
-		}else{
+		if (!allow_intr || sig_cnt++ >= 20) {
 			return NULL;
+		} else {
+			break;
 		}
 		
 	case IPC_OK:
 		break;
 	}
-	
-	
+	} while (rc != IPC_OK);
+
 	ipcmsg = NULL;
 	rc = ch->ops->recv(ch, &ipcmsg);
 #if 0
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Reply via email to