On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote: > Hi Dejan, > Hi Lars, > > In our environment, the problem recurred with the patch of Mr. Lars. > After a problem occurred, I sent TERM signal, but attrd does not seem to > receive TERM at all.
If you are able to reproduce, you could try to find out what exactly attrd is doing. various ways to try to do that: cat /proc/<pid-of-attrd>/stack # if your platform supports that strace it, ltrace it, attach with gdb and provide a stack trace, or even start to single step it, cause attrd to core dump, and analyse the core. > The reconsideration of the patch is necessary for the solution to problem. > > > Best Regards, > Hideo Yamauchi. > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp > <renayama19661...@ybb.ne.jp> wrote: > > > Hi Dejan, > > Hi Lars, > > > > I understood it. > > I try the operation of the patch in our environment. > > > > To Alan: Will you try a patch? > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic <deja...@fastmail.fm> wrote: > > > > > Hi, > > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote: > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote: > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg > > > > > <lars.ellenb...@linbit.com> wrote: > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote: > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM, <renayama19661...@ybb.ne.jp> > > > > > >> wrote: > > > > > >> > Hi, > > > > > >> > > > > > > >> > We sometimes fail in a stop of attrd. > > > > > >> > > > > > > >> > Step1. start a cluster in 2 nodes > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.) > > > > > >> > Step3. stop the second node after time passed a > > > > > >> > little.(/etc/init.d/heartbeat > > > > > >> > stop.) > > > > > >> > > > > > > >> > The attrd catches the TERM signal, but does not stop. > > > > > >> > > > > > >> There's no evidence that it actually catches it, only that it is > > > > > >> sent. > > > > > >> I've seen it before but never figured out why it occurs. > > > > > > > > > > > > I had it once tracked down almost to where it occurs, but then got > > > > > > distracted. > > > > > > Yes the signal was delivered. > > > > > > > > > > > > I *think* it had to do with attrd doing a blocking read, > > > > > > or looping in some internal message delivery function too often. > > > > > > > > > > > > I had a quick look at the code again now, to try and remember, > > > > > > but I'm not sure. > > > > > > > > > > > > I *may* be that, because > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc); > > > > > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to > > > > > > IPC_INTR: > > > > > > if ( allow_intr){ > > > > > > goto startwait; > > > > > > > > > > > > Depending on the frequency of deliverd signals, it may cause this > > > > > > goto > > > > > > startwait loop to never exit, because the timeout always starts > > > > > > again > > > > > > from the full passed in timeout. > > > > > > > > > > > > If only one signal is deliverd, it may still take 120 seconds > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal > > > > > > handler only raises a flag for the next mainloop iteration. > > > > > > > > > > > > If a (non-fatal) signal is delivered every few seconds, > > > > > > then the goto loop will never timeout. > > > > > > > > > > > > Please someone check this for plausibility ;-) > > > > > > > > > > Most plausible explanation I've heard so far... still odd that only > > > > > attrd is affected. > > > > > So what do we do about it? > > > > > > > > Reproduce, and confirm that this is what people are seeing. > > > > > > > > Make attrd non-blocking? > > > > > > > > Fix the ipc layer to not restart the full timeout, > > > > but only the remaining partial time? > > > > > > Lars and I made a quick patch for cluster-glue (attached). > > > Hideo-san, is there a way for you to verify if it helps? The > > > patch is not perfect and under unfavourable circumstances it may > > > still take a long time for the caller to exit, but it'd be good > > > to know if this is the right spot. > > > > > > Cheers, > > > > > > Dejan > > > > > > > -- > > > > : Lars Ellenberg > > > > : LINBIT | Your Way to High Availability > > > > : DRBD/HA support and consulting http://www.linbit.com > > > > > > > > _______________________________________________ > > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > > > > > Project Home: http://www.clusterlabs.org > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > > > Bugs: > > > > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker > > > > > > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org