On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
<lars.ellenb...@linbit.com> wrote:
> On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
>> On Tue, Oct 18, 2011 at 12:19 PM, <renayama19661...@ybb.ne.jp> wrote:
>> > Hi,
>> >
>> > Stopping attrd sometimes fails.
>> >
>> > Step1. start a cluster in 2 nodes
>> > Step2. stop the first node (/etc/init.d/heartbeat stop)
>> > Step3. stop the second node a little later
>> >        (/etc/init.d/heartbeat stop)
>> >
>> > The attrd catches the TERM signal, but does not stop.
>>
>> There's no evidence that it actually catches it, only that it is sent.
>> I've seen it before but never figured out why it occurs.
>
> I once had it tracked down almost to where it occurs, but then got
> distracted. Yes, the signal was delivered.
>
> I *think* it had to do with attrd doing a blocking read,
> or looping in some internal message delivery function too often.
>
> I had a quick look at the code again now, to try and remember,
> but I'm not sure.
>
> It *may* be that, because
> xmlfromIPC(IPC_Channel * ch, int timeout) calls
>     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
>
> And MSG_ALLOWINTR will cause msgfromIPC_ll() to
>     IPC_INTR:
>         if ( allow_intr){
>             goto startwait;
>
> Depending on the frequency of delivered signals, it may cause this goto
> startwait loop to never exit, because the timeout always starts again
> from the full passed-in timeout.
>
> If only one signal is delivered, it may still take 120 seconds
> (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> handler only raises a flag for the next mainloop iteration.
>
> If a (non-fatal) signal is delivered every few seconds,
> then the goto loop will never time out.
>
> Please someone check this for plausibility ;-)
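
For anyone who wants to check the plausibility of the above, here is a rough model of the two loop behaviours. It is a sketch only and not the heartbeat/Pacemaker IPC code: time is simulated as integer seconds, a "signal" arrives every signal_period_s seconds, and the names next_signal_in(), wait_restarting() and wait_with_deadline() are made up for illustration. The first function restarts with the full timeout after each interruption, as the goto startwait pattern is suspected to do; the second keeps one fixed deadline and only waits for the remaining time.

```c
#include <assert.h>

/* Hypothetical model: seconds until the next simulated signal strictly
 * after 'now', or -1 if no signals are delivered at all. */
static int next_signal_in(int now, int signal_period_s)
{
    if (signal_period_s <= 0) {
        return -1;
    }
    return signal_period_s - (now % signal_period_s);
}

/* Restarts with the FULL timeout after every interruption.
 * Returns the simulated time at which the wait finally timed out,
 * or -1 if it was still waiting when give_up_s was reached. */
static int wait_restarting(int timeout_s, int signal_period_s, int give_up_s)
{
    int now = 0;

    assert(timeout_s > 0);
    while (now < give_up_s) {
        int sig = next_signal_in(now, signal_period_s);
        if (sig < 0 || sig >= timeout_s) {
            return now + timeout_s; /* uninterrupted wait: timed out */
        }
        now += sig;                 /* interrupted: start over from scratch */
    }
    return -1;
}

/* Same wait, but against one fixed deadline computed up front.
 * Each retry only waits for whatever time is left. */
static int wait_with_deadline(int timeout_s, int signal_period_s, int give_up_s)
{
    const int deadline = timeout_s;
    int now = 0;

    assert(timeout_s > 0);
    while (now < give_up_s) {
        int remaining = deadline - now;
        int sig = next_signal_in(now, signal_period_s);
        if (sig < 0 || sig >= remaining) {
            return deadline;        /* deadline reached despite interruptions */
        }
        now += sig;                 /* interrupted: retry for the remainder */
    }
    return -1;
}
```

With a 120 second timeout (the MAX_IPC_DELAY value mentioned above) and a signal every 5 seconds, wait_restarting(120, 5, 3600) is still waiting after a simulated hour, while wait_with_deadline(120, 5, 3600) times out at 120. This only models the timeout restart; it does not address the other point, that a single signal may sit unprocessed until the wait returns to the mainloop.
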
Most plausible explanation I've heard so far... still odd that only
attrd is affected.

So what do we do about it?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker