On Mon, Feb 22, 2010 at 08:46:23PM +0100, Andrew Beekhof wrote:
> On Mon, Feb 22, 2010 at 5:10 PM, Lars Ellenberg
> <lars.ellenb...@linbit.com> wrote:
> > On Mon, Feb 22, 2010 at 01:00:29PM +0100, Markus M. wrote:
> >> Hello,
> >>
> >> sometimes "heartbeat stop" seems to hang (latest packets from
> >> clusterlabs.org, RHEL5 x86_64, 2-node cluster with only one node
> >> running).
> >>
> >> The last lines from ha-debug are like this:
> >>
> >> Feb 22 12:52:48 dbprod21 ccm: [24053]: info: client (pid=24058) removed
> >> from ccm
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_ha_control: Disconnected
> >> from Heartbeat
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_cib_control:
> >> Disconnecting CIB
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: info: cib_process_readwrite: We are
> >> now in R/O mode
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: crmd_cib_connection_destroy:
> >> Connection to the CIB terminated...
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: send_ipc_message: IPC Channel
> >> to 24058 is not connected
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_exit: Performing A_EXIT_0
> >> - gracefully exiting the CRMd
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: send_via_callback_channel:
> >> Delivery of reply to client 24058/d9c9c281-4f38-46d8-b83e-54135f6c75e9
> >> failed
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: free_mem: Dropping
> >> I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
> >> Feb 22 12:52:48 dbprod21 cib: [24054]: WARN: do_local_notify: A-Sync reply
> >> to crmd failed: reply failed
> >> Feb 22 12:52:48 dbprod21 crmd: [24058]: info: do_exit: [crmd] stopped (0)
> >> Feb 22 12:52:48 dbprod21 heartbeat: [24040]: info: killing
> >> /usr/lib64/heartbeat/attrd process group 24057 with signal 15
> >
> > Yep.
> > I've seen this too,
> > a few times.
> > Apparently attrd sometimes "ignores" a term signal.
> > Once an additional signal comes in, or any message is processed, i.e.
> > once the mainloop() actually _processes_ the signal,
> > it is handled and attrd dies.
> >
> > Unfortunately I don't see exactly where this signal is "lost", though.
> > The signal is delivered, the flag is raised, mainloop should recognize
> > and handle it...
> > And I don't yet have a reliable way to reproduce it, either.
> > If you have, let us know!
> >
> > Maybe the following helps (sorry, patch is likely not whitespace clean)
> >
> > diff -r 1a6d0f690c3e lib/common/mainloop.c
> > --- a/lib/common/mainloop.c Thu Feb 18 22:36:49 2010 +0100
> > +++ b/lib/common/mainloop.c Mon Feb 22 17:09:31 2010 +0100
> > @@ -191,7 +191,12 @@
> >      CRM_ASSERT(sizeof(crm_signal_t) > sizeof(GSource));
> >      source = g_source_new(&crm_signal_funcs, sizeof(crm_signal_t));
> >
> > -    crm_signals[sig] = (crm_signal_t*)mainloop_setup_trigger(source, G_PRIORITY_HIGH, NULL, NULL);
> > +    crm_signals[sig] = (crm_signal_t*)mainloop_setup_trigger(source,
> > +        /* TERM is higher priority than other signals,
> > +         * signals are higher priority than other ipc.
> > +         * yes, minus: smaller is "higher". */
> > +        G_PRIORITY_HIGH - (sig == SIGTERM ? 2 : 1),
> > +        NULL, NULL);
> >      CRM_ASSERT(crm_signals[sig] != NULL);
> >
> >      crm_signals[sig]->handler = dispatch;
>
> I've applied a similar patch to stable.
> Also in stable is a patch that waits up to 2.5 minutes for post-crmd
> clients to terminate.
> So either way we should have this resolved.

Thanks.

I likely have to port the 2½ minute patch over to heartbeat proper...
Now that will be fun ;-)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
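
For readers less familiar with GLib priorities, the arithmetic in the quoted patch can be sketched in plain C. This is only a minimal sketch, not Pacemaker code: SKETCH_PRIORITY_HIGH mirrors the value of GLib's G_PRIORITY_HIGH (-100) so the example builds without GLib, and signal_source_priority(), struct sketch_source and first_to_dispatch() are hypothetical names made up for illustration. It also assumes that ordinary IPC sources sit at G_PRIORITY_HIGH, which the patch comment implies but the thread does not state outright.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Assumption: same numeric value as GLib's G_PRIORITY_HIGH (-100),
 * redefined here so the sketch compiles without GLib headers. */
#define SKETCH_PRIORITY_HIGH (-100)

/* Same arithmetic as the quoted patch: SIGTERM gets the smallest
 * (that is, most urgent) number, other signals come next. */
int signal_source_priority(int sig)
{
    return SKETCH_PRIORITY_HIGH - (sig == SIGTERM ? 2 : 1);
}

/* Hypothetical stand-in for a ready main-loop source. */
struct sketch_source {
    const char *name;
    int priority;
};

/* Return the index of the ready source with the numerically lowest
 * priority, which is the one a GLib-style loop would dispatch first. */
size_t first_to_dispatch(const struct sketch_source *src, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (src[i].priority < src[best].priority) {
            best = i;
        }
    }
    return best;
}
```

With an IPC source at SKETCH_PRIORITY_HIGH, a SIGHUP source and a SIGTERM source all ready at once, first_to_dispatch() picks the SIGTERM entry, which is the ordering the patch is trying to enforce. Whether that ordering actually explains the "lost" signal is a separate question that the thread leaves open.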