On Sun, Jan 15, 2012 at 1:57 AM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi Lars,
>>
>> I attach strace file when a problem reappeared at the end of last year.
>> I used glue which applied your patch for confirmation.
>>
>> It is the file which I picked with attrd by strace -p command right before I
>> stop Heartbeat.
>>
>> Finally SIGTERM caught it, but attrd did not stop.
>> The attrd stopped afterwards when I sent SIGKILL.
>
> The strace reveals something interesting:
>
> This poll looks like the mainloop poll,
> but some ->prepare() has modified the timeout to be 0,
> so we proceed directly to ->check() and then ->dispatch().
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
>> recv(4, 0x95af308, 576, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
> ...
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0) = ? ERESTART_RESTARTBLOCK (To be restarted)
>> --- SIGTERM (Terminated) @ 0 (0) ---
>> sigreturn() = ? (mask now [])
>
> Ok. signal received, trigger set.
> Still finishing this mainloop iteration, though.
>
> These recv(),poll() look like invocations of G_CH_prepare_int().
> Does not matter much, though.
>
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout)
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout)
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
>
> Now we proceed to the next mainloop poll:
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
>
> Note the -1 (infinity timeout!)
>
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file
> descriptors.
>
>
> I suggest this:
>
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
>
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
>
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>      crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +        *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +        *timeout = 5000; /* arbitrary */
>
>      return trig->trigger;
>  }
>
>
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.
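
To make the proposed change concrete, here is a minimal standalone sketch of
the patched prepare logic, without GLib. The names sketch_trigger_t,
sketch_trigger_prepare and SKETCH_MAX_WAIT_MS are made up for illustration,
and plain bool/int stand in for gboolean/gint. It is not the Pacemaker code,
only a model of the diff quoted above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for crm_trigger_t (the real struct embeds a GSource). */
typedef struct {
    bool trigger;   /* true when the trigger has been set and work is pending */
} sketch_trigger_t;

/* Upper bound on how long the poll may block, in milliseconds.
 * The value is arbitrary, as in the quoted patch. */
enum { SKETCH_MAX_WAIT_MS = 5000 };

/* Modeled on the shape of a prepare callback: returns true if the source is
 * ready, and writes the longest time (ms) the following poll may block. */
static bool
sketch_trigger_prepare(const sketch_trigger_t *trig, int *timeout_ms)
{
    assert(trig != NULL && timeout_ms != NULL);

    if (trig->trigger) {
        /* Work is pending: do not let the poll stage delay it. */
        *timeout_ms = 0;
    } else {
        /* Nothing pending yet: keep the wait finite, so a trigger set
         * right after this call is seen on a later iteration. */
        *timeout_ms = SKETCH_MAX_WAIT_MS;
    }
    return trig->trigger;
}
```

With the flag set, the sketch asks for a zero timeout; otherwise it caps the
wait at 5000 ms, so a signal that arrives just after prepare has run would be
picked up on a later iteration instead of waiting indefinitely. Whether this
callback is the right place for the fix is a separate question.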

Interesting, but are you sure you're in the right function? Trigger and
signal events don't have a file descriptor, so wouldn't these polls be for
the IPC-related sources, and wouldn't those be setting their own timeout?

>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org