Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Tue, Jan 17, 2012 at 10:11 AM, Lars Ellenberg wrote:
> On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote:
>> >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>> >>>
>> >>> iiuc, mainloop does something similar to (oversimplified):
>> >>>        timeout = -1; /* infinity */
>> >>>        for s in all GSource
>> >>>                tmp_timeout = -1;
>> >>>                s->prepare(s, &tmp_timeout)
>> >>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>> >>>                        timeout = tmp_timeout;
>> >>>
>> >>>        poll(GSource fd set, n, timeout);
>> >>
>> >> I'm looking at the glib code again now, and it still looks to me like
>> >> the trigger and signal sources do not appear in this fd set.
>> >> Their setup functions would have to have called g_source_add_poll()
>> >> somewhere, which they don't.
>> >>
>> >> So I'm still not seeing why it's a trigger or signal source's fault
>> >> that glib is doing a never-ending call to poll().
>> >> poll() is going to get called regardless of whether our prepare
>> >> function returns true or not.
>> >>
>> >> Looking closer, crm_trigger_prepare() returning TRUE results in:
>> >>        ready_source->flags |= G_SOURCE_READY;
>> >>
>> >> which in turn causes:
>> >>        context->timeout = 0;
>> >>
>> >> which is essentially what adding
>> >>        if (trig->trigger)
>> >>                *timeout = 0;
>> >>
>> >> to crm_trigger_prepare() was intended to achieve.
>> >>
>> >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
>> >> and could therefore cause poll() to block forever) have a sane timeout
>> >> in their prepare functions?
>
> Probably should, but they usually have not.
> The reasoning probably is, each GSource is responsible for *itself* only.

Well no, because this forces trigger to care about whether there is a
fd based GSource too and what timeout, if any, is set.

> That is why first all sources are prepared.
>
> If no non-fd, non-pollable source feels the need to reduce the
> *timeout to something finite in its prepare(), so be it.
So something that doesn't use poll at all should set a timeout for poll?
That doesn't sound right :-)

> Besides, what is sane? 1 second? 5? 120? 240?
>
> That's why G_CH_prepare_int() sets the *timeout to 1000,
> and why I suggest setting it to 0 if prepare already knows that the
> trigger is set, and to some finite amount otherwise, to avoid getting
> stuck in poll in case no timeout or other source is active which also
> sets some finite timeout.
>
> BTW, if you have an *idle* source, prepare should set timeout to 0.
>
> For those interested, it is all described at
> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>
> "For idle sources, the prepare and check functions always return TRUE to
> indicate that the source is always ready to be processed. The prepare
> function also returns a timeout value of 0 to ensure that the poll()
> call doesn't block (since that would be time wasted which could have
> been spent running the idle function)."
>
> "... timeout sources ... returns a timeout value to ensure that the
> poll() call doesn't block too long ..."
>
> "... file descriptor sources ... timeout to -1 to indicate that it does
> not mind how long the poll() call blocks ... "
>
>> >> Or is it because the signal itself is interrupting some essential part
>> >> of G_CH_prepare_int() and friends?
>
> In the provided strace, it looks like the SIGTERM
> is delivered while calling some G_CH_prepare_int,
> the ->prepare() used by G_main_add_IPC_Channel.
>
> Since the signal sources are of higher priority,
> we are probably past those already in this iteration,
> so we will only notice the trigger in the next check(),
> after the poll.
>
> So it is vital for any non-pollable source such as signals
> to set a finite timeout in their prepare(),
> even if we also mark that signal siginterrupt().
>
>> >>> for s in all GSource
>> >>>         if s->check(s)
>> >>>                 s->dispatch(s, ...)
>> >>>
>> >>> And at some stage it also orders by priority, of course.
>> >>> >> >>> Also compare with the comment above /* Sigh... */ in glue >> >>> G_SIG_prepare(). >> >>> >> >>> BTW, the mentioned race between signal delivery and mainloop already >> >>> doing the poll stage could potentially be solved by using >> >>> cl_signal_set_interrupt(SIGTERM, 1), > > As I just wrote above, that race is not solved at all. > Only the (necessarily set) finite timeout of the poll > would be shortened in that case. > >> >> But I can't escape the feeling that calling this just masks the >> >> underlying "why is there a never-ending call to poll() in the first >> >> place" issue. >> >> G_CH_prepare_int() and friends /should/ be setting timeouts so that >> >> poll() can return and any sources created by g_idle_source_new() can >> >> execute. >> > >> > Actually, thinking further, I'm pretty convinced that
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Tue, Jan 17, 2012 at 09:52:35AM +1100, Andrew Beekhof wrote:
> On Mon, Jan 16, 2012 at 11:42 PM, Andrew Beekhof wrote:
> > On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof wrote:
> >> On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof wrote:
> >>> I know I could just apply the patch and be done, but I'd like to
> >>> understand this so it works for the right reason.
>
> Ok, done:
>    https://github.com/beekhof/pacemaker/commit/2a6b296
>
> If I'm adding voodoo, I at least want the reason well documented so it
> can be removed again if the reason goes away.

That about sums it up, then ;-)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote: > >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs > >>> > >>> iiuc, mainloop does something similar to (oversimplified): > >>> timeout = -1; /* infinity */ > >>> for s in all GSource > >>> tmp_timeout = -1; > >>> s->prepare(s, &tmp_timeout) > >>> if (tmp_timeout >= 0 && tmp_timeout < timeout) > >>> timeout = tmp_timeout; > >>> > >>> poll(GSource fd set, n, timeout); > >> > >> I'm looking at the glib code again now, and it still looks to me like > >> the trigger and signal sources do not appear in this fd set. > >> Their setup functions would have to have called g_source_add_poll() > >> somewhere, which they don't. > >> > >> So I'm still not seeing why its a trigger or signal sources' fault > >> that glib is doing a never ending call to poll(). > >> poll() is going to get called regardless of whether our prepare > >> function returns true or not. > >> > >> Looking closer, crm_trigger_prepare() returning TRUE results in: > >> ready_source->flags |= G_SOURCE_READY; > >> > >> which in turn causes: > >> context->timeout = 0; > >> > >> which is essentially what adding > >> if (trig->trigger) > >> *timeout = 0; > >> > >> to crm_trigger_prepare() was intended to achieve. > >> > >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() > >> and could therefor cause poll() to block forever) have a sane timeout > >> in their prepare functions? Probably should, but they usually have not. The reasoning probably is, each GSource is responsible for *itself* only. That is why first all sources are prepared. If no non-fd, non-pollable source feels the need to reduce the *timeout to something finite in its prepare(), so be it. Besides, what is sane? 1 second? 5? 120? 240? 
That's why G_CH_prepare_int() sets the *timeout to 1000,
and why I suggest setting it to 0 if prepare already knows that the
trigger is set, and to some finite amount otherwise, to avoid getting
stuck in poll in case no timeout or other source is active which also
sets some finite timeout.

BTW, if you have an *idle* source, prepare should set timeout to 0.

For those interested, it is all described at
http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

"For idle sources, the prepare and check functions always return TRUE to
indicate that the source is always ready to be processed. The prepare
function also returns a timeout value of 0 to ensure that the poll()
call doesn't block (since that would be time wasted which could have
been spent running the idle function)."

"... timeout sources ... returns a timeout value to ensure that the
poll() call doesn't block too long ..."

"... file descriptor sources ... timeout to -1 to indicate that it does
not mind how long the poll() call blocks ... "

> >> Or is it because the signal itself is interrupting some essential part
> >> of G_CH_prepare_int() and friends?

In the provided strace, it looks like the SIGTERM
is delivered while calling some G_CH_prepare_int,
the ->prepare() used by G_main_add_IPC_Channel.

Since the signal sources are of higher priority,
we are probably past those already in this iteration,
so we will only notice the trigger in the next check(),
after the poll.

So it is vital for any non-pollable source such as signals
to set a finite timeout in their prepare(),
even if we also mark that signal siginterrupt().

> >>> for s in all GSource
> >>>         if s->check(s)
> >>>                 s->dispatch(s, ...)
> >>>
> >>> And at some stage it also orders by priority, of course.
> >>>
> >>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
> >>>
> >>> BTW, the mentioned race between signal delivery and mainloop already
> >>> doing the poll stage could potentially be solved by using
> >>> cl_signal_set_interrupt(SIGTERM, 1),

As I just wrote above, that race is not solved at all.
Only the (necessarily set) finite timeout of the poll
would be shortened in that case.

> >> But I can't escape the feeling that calling this just masks the
> >> underlying "why is there a never-ending call to poll() in the first
> >> place" issue.
> >> G_CH_prepare_int() and friends /should/ be setting timeouts so that
> >> poll() can return and any sources created by g_idle_source_new() can
> >> execute.
> >
> > Actually, thinking further, I'm pretty convinced that poll() with an
> > infinite timeout is the default mode of operation for mainloops with
> > cluster-glue's IPC and FD sources.
> > And that this is not a good thing :)

Well, if there are *only* pollable sources, it is.
If there are any other sources, they should have set their limit
on what they think is an acceptable timeout in their prepare().

> Far too late, brain shutting down.

;-)

> ...not a good thing, because it breaks the idle
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Jan 16, 2012 at 11:42 PM, Andrew Beekhof wrote: > On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof wrote: >> On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof wrote: >>> I know I could just apply the patch and be done, but I'd like to >>> understand this so it works for the right reason. Ok, done: https://github.com/beekhof/pacemaker/commit/2a6b296 If I'm adding voodoo, I at least want the reason well documented so it can be removed again if the reason goes away. >>> On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg >>> wrote: On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote: > > Now we proceed to the next mainloop poll: > > > >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, > >> {fd=5, events=POLLIN|POLLPRI}], 3, -1 > > > > Note the -1 (infinity timeout!) > > > > So even though the trigger was (presumably) set, > > and the ->prepare() should have returned true, > > the mainloop waits forever for "something" to happen on those file > > descriptors. > > > > > > I suggest this: > > > > crm_trigger_prepare should set *timeout = 0, if trigger is set. > > > > Also think about this race: crm_trigger_prepare was already > > called, only then the signal came in... > > > > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c > > index 2e8b1d0..fd17b87 100644 > > --- a/lib/common/mainloop.c > > +++ b/lib/common/mainloop.c > > @@ -33,6 +33,13 @@ static gboolean > > crm_trigger_prepare(GSource * source, gint * timeout) > > { > > crm_trigger_t *trig = (crm_trigger_t *) source; > > + /* Do not delay signal processing by the mainloop poll stage */ > > + if (trig->trigger) > > + *timeout = 0; > > + /* To avoid races between signal delivery and the mainloop poll > > stage, > > + * make sure we always have a finite timeout. Unit: milliseconds. > > */ > > + else > > + *timeout = 5000; /* arbitrary */ > > > > return trig->trigger; > > } > > > > > > This scenario does not let the blocked IPC off the hook, though. 
> > That is still possible, both for blocking send and blocking receive, > > so that should probably be fixed as well, somehow. > > I'm not sure how likely this "stuck in blocking IPC" is, though. > > Interesting, are you sure you're in the right function though? > trigger and signal events don't have a file descriptor... wouldn't > these polls be for the IPC related sources and wouldn't they be > setting their own timeout? http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs iiuc, mainloop does something similar to (oversimplified): timeout = -1; /* infinity */ for s in all GSource tmp_timeout = -1; s->prepare(s, &tmp_timeout) if (tmp_timeout >= 0 && tmp_timeout < timeout) timeout = tmp_timeout; poll(GSource fd set, n, timeout); >>> >>> I'm looking at the glib code again now, and it still looks to me like >>> the trigger and signal sources do not appear in this fd set. >>> Their setup functions would have to have called g_source_add_poll() >>> somewhere, which they don't. >>> >>> So I'm still not seeing why its a trigger or signal sources' fault >>> that glib is doing a never ending call to poll(). >>> poll() is going to get called regardless of whether our prepare >>> function returns true or not. >>> >>> Looking closer, crm_trigger_prepare() returning TRUE results in: >>> ready_source->flags |= G_SOURCE_READY; >>> >>> which in turn causes: >>> context->timeout = 0; >>> >>> which is essentially what adding >>> if (trig->trigger) >>> *timeout = 0; >>> >>> to crm_trigger_prepare() was intended to achieve. >>> >>> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() >>> and could therefor cause poll() to block forever) have a sane timeout >>> in their prepare functions? >>> Or is it because the signal itself is interrupting some essential part >>> of G_CH_prepare_int() and friends? >>> for s in all GSource if s->check(s) s->dispatch(s, ...) And at some stage it also orders by priority, of course. 
Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare(). BTW, the mentioned race between signal delivery and mainloop already doing the poll stage could potentially be solved by using >>> >>> Again, since nothing related to the signal source ever appears in the >>> call to poll(), I'm not seeing where the race comes from. >>> Or am I missing something obvious? >>> cl_signal_set_interrupt(SIGTERM, 1), >>> >>> This, combined with >>> >>> /* >>>
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof wrote: > On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof wrote: >> I know I could just apply the patch and be done, but I'd like to >> understand this so it works for the right reason. >> >> On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg >> wrote: >>> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote: > Now we proceed to the next mainloop poll: > >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, >> {fd=5, events=POLLIN|POLLPRI}], 3, -1 > > Note the -1 (infinity timeout!) > > So even though the trigger was (presumably) set, > and the ->prepare() should have returned true, > the mainloop waits forever for "something" to happen on those file > descriptors. > > > I suggest this: > > crm_trigger_prepare should set *timeout = 0, if trigger is set. > > Also think about this race: crm_trigger_prepare was already > called, only then the signal came in... > > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c > index 2e8b1d0..fd17b87 100644 > --- a/lib/common/mainloop.c > +++ b/lib/common/mainloop.c > @@ -33,6 +33,13 @@ static gboolean > crm_trigger_prepare(GSource * source, gint * timeout) > { > crm_trigger_t *trig = (crm_trigger_t *) source; > + /* Do not delay signal processing by the mainloop poll stage */ > + if (trig->trigger) > + *timeout = 0; > + /* To avoid races between signal delivery and the mainloop poll > stage, > + * make sure we always have a finite timeout. Unit: milliseconds. */ > + else > + *timeout = 5000; /* arbitrary */ > > return trig->trigger; > } > > > This scenario does not let the blocked IPC off the hook, though. > That is still possible, both for blocking send and blocking receive, > so that should probably be fixed as well, somehow. > I'm not sure how likely this "stuck in blocking IPC" is, though. Interesting, are you sure you're in the right function though? trigger and signal events don't have a file descriptor... 
wouldn't these polls be for the IPC related sources and wouldn't they be setting their own timeout? >>> >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs >>> >>> iiuc, mainloop does something similar to (oversimplified): >>> timeout = -1; /* infinity */ >>> for s in all GSource >>> tmp_timeout = -1; >>> s->prepare(s, &tmp_timeout) >>> if (tmp_timeout >= 0 && tmp_timeout < timeout) >>> timeout = tmp_timeout; >>> >>> poll(GSource fd set, n, timeout); >> >> I'm looking at the glib code again now, and it still looks to me like >> the trigger and signal sources do not appear in this fd set. >> Their setup functions would have to have called g_source_add_poll() >> somewhere, which they don't. >> >> So I'm still not seeing why its a trigger or signal sources' fault >> that glib is doing a never ending call to poll(). >> poll() is going to get called regardless of whether our prepare >> function returns true or not. >> >> Looking closer, crm_trigger_prepare() returning TRUE results in: >> ready_source->flags |= G_SOURCE_READY; >> >> which in turn causes: >> context->timeout = 0; >> >> which is essentially what adding >> if (trig->trigger) >> *timeout = 0; >> >> to crm_trigger_prepare() was intended to achieve. >> >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() >> and could therefor cause poll() to block forever) have a sane timeout >> in their prepare functions? >> Or is it because the signal itself is interrupting some essential part >> of G_CH_prepare_int() and friends? >> >>> >>> for s in all GSource >>> if s->check(s) >>> s->dispatch(s, ...) >>> >>> And at some stage it also orders by priority, of course. >>> >>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare(). 
>>> >>> BTW, the mentioned race between signal delivery and mainloop already >>> doing the poll stage could potentially be solved by using >> >> Again, since nothing related to the signal source ever appears in the >> call to poll(), I'm not seeing where the race comes from. >> Or am I missing something obvious? >> >>> cl_signal_set_interrupt(SIGTERM, 1), >> >> This, combined with >> >> /* >> * If we don't set this on, then the mainloop poll(2) call >> * will never be interrupted by this signal - which sort of >> * defeats the whole purpose of a signal handler in a >> * mainloop program >> */ >> cl_signal_set_interrupt(signal, TRUE); >> >> looks more relevant. >>
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof wrote: > I know I could just apply the patch and be done, but I'd like to > understand this so it works for the right reason. > > On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg > wrote: >> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote: >>> > Now we proceed to the next mainloop poll: >>> > >>> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, >>> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1 >>> > >>> > Note the -1 (infinity timeout!) >>> > >>> > So even though the trigger was (presumably) set, >>> > and the ->prepare() should have returned true, >>> > the mainloop waits forever for "something" to happen on those file >>> > descriptors. >>> > >>> > >>> > I suggest this: >>> > >>> > crm_trigger_prepare should set *timeout = 0, if trigger is set. >>> > >>> > Also think about this race: crm_trigger_prepare was already >>> > called, only then the signal came in... >>> > >>> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c >>> > index 2e8b1d0..fd17b87 100644 >>> > --- a/lib/common/mainloop.c >>> > +++ b/lib/common/mainloop.c >>> > @@ -33,6 +33,13 @@ static gboolean >>> > crm_trigger_prepare(GSource * source, gint * timeout) >>> > { >>> > crm_trigger_t *trig = (crm_trigger_t *) source; >>> > + /* Do not delay signal processing by the mainloop poll stage */ >>> > + if (trig->trigger) >>> > + *timeout = 0; >>> > + /* To avoid races between signal delivery and the mainloop poll >>> > stage, >>> > + * make sure we always have a finite timeout. Unit: milliseconds. */ >>> > + else >>> > + *timeout = 5000; /* arbitrary */ >>> > >>> > return trig->trigger; >>> > } >>> > >>> > >>> > This scenario does not let the blocked IPC off the hook, though. >>> > That is still possible, both for blocking send and blocking receive, >>> > so that should probably be fixed as well, somehow. >>> > I'm not sure how likely this "stuck in blocking IPC" is, though. 
>>> >>> Interesting, are you sure you're in the right function though? >>> trigger and signal events don't have a file descriptor... wouldn't >>> these polls be for the IPC related sources and wouldn't they be >>> setting their own timeout? >> >> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs >> >> iiuc, mainloop does something similar to (oversimplified): >> timeout = -1; /* infinity */ >> for s in all GSource >> tmp_timeout = -1; >> s->prepare(s, &tmp_timeout) >> if (tmp_timeout >= 0 && tmp_timeout < timeout) >> timeout = tmp_timeout; >> >> poll(GSource fd set, n, timeout); > > I'm looking at the glib code again now, and it still looks to me like > the trigger and signal sources do not appear in this fd set. > Their setup functions would have to have called g_source_add_poll() > somewhere, which they don't. > > So I'm still not seeing why its a trigger or signal sources' fault > that glib is doing a never ending call to poll(). > poll() is going to get called regardless of whether our prepare > function returns true or not. > > Looking closer, crm_trigger_prepare() returning TRUE results in: > ready_source->flags |= G_SOURCE_READY; > > which in turn causes: > context->timeout = 0; > > which is essentially what adding > if (trig->trigger) > *timeout = 0; > > to crm_trigger_prepare() was intended to achieve. > > Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() > and could therefor cause poll() to block forever) have a sane timeout > in their prepare functions? > Or is it because the signal itself is interrupting some essential part > of G_CH_prepare_int() and friends? > >> >> for s in all GSource >> if s->check(s) >> s->dispatch(s, ...) >> >> And at some stage it also orders by priority, of course. >> >> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare(). 
>> >> BTW, the mentioned race between signal delivery and mainloop already >> doing the poll stage could potentially be solved by using > > Again, since nothing related to the signal source ever appears in the > call to poll(), I'm not seeing where the race comes from. > Or am I missing something obvious? > >> cl_signal_set_interrupt(SIGTERM, 1), > > This, combined with > > /* > * If we don't set this on, then the mainloop poll(2) call > * will never be interrupted by this signal - which sort of > * defeats the whole purpose of a signal handler in a > * mainloop program > */ > cl_signal_set_interrupt(signal, TRUE); > > looks more relevant. > But I can't escape the feeling that calling this just masks the > underlying "why is there a never-ending call to poll() in the first > place" issue. > G_CH_prepare_int() and friends /sh
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
I know I could just apply the patch and be done, but I'd like to understand this so it works for the right reason. On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg wrote: > On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote: >> > Now we proceed to the next mainloop poll: >> > >> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, >> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1 >> > >> > Note the -1 (infinity timeout!) >> > >> > So even though the trigger was (presumably) set, >> > and the ->prepare() should have returned true, >> > the mainloop waits forever for "something" to happen on those file >> > descriptors. >> > >> > >> > I suggest this: >> > >> > crm_trigger_prepare should set *timeout = 0, if trigger is set. >> > >> > Also think about this race: crm_trigger_prepare was already >> > called, only then the signal came in... >> > >> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c >> > index 2e8b1d0..fd17b87 100644 >> > --- a/lib/common/mainloop.c >> > +++ b/lib/common/mainloop.c >> > @@ -33,6 +33,13 @@ static gboolean >> > crm_trigger_prepare(GSource * source, gint * timeout) >> > { >> > crm_trigger_t *trig = (crm_trigger_t *) source; >> > + /* Do not delay signal processing by the mainloop poll stage */ >> > + if (trig->trigger) >> > + *timeout = 0; >> > + /* To avoid races between signal delivery and the mainloop poll stage, >> > + * make sure we always have a finite timeout. Unit: milliseconds. */ >> > + else >> > + *timeout = 5000; /* arbitrary */ >> > >> > return trig->trigger; >> > } >> > >> > >> > This scenario does not let the blocked IPC off the hook, though. >> > That is still possible, both for blocking send and blocking receive, >> > so that should probably be fixed as well, somehow. >> > I'm not sure how likely this "stuck in blocking IPC" is, though. >> >> Interesting, are you sure you're in the right function though? >> trigger and signal events don't have a file descriptor... 
wouldn't >> these polls be for the IPC related sources and wouldn't they be >> setting their own timeout? > > http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs > > iiuc, mainloop does something similar to (oversimplified): > timeout = -1; /* infinity */ > for s in all GSource > tmp_timeout = -1; > s->prepare(s, &tmp_timeout) > if (tmp_timeout >= 0 && tmp_timeout < timeout) > timeout = tmp_timeout; > > poll(GSource fd set, n, timeout); I'm looking at the glib code again now, and it still looks to me like the trigger and signal sources do not appear in this fd set. Their setup functions would have to have called g_source_add_poll() somewhere, which they don't. So I'm still not seeing why its a trigger or signal sources' fault that glib is doing a never ending call to poll(). poll() is going to get called regardless of whether our prepare function returns true or not. Looking closer, crm_trigger_prepare() returning TRUE results in: ready_source->flags |= G_SOURCE_READY; which in turn causes: context->timeout = 0; which is essentially what adding if (trig->trigger) *timeout = 0; to crm_trigger_prepare() was intended to achieve. Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() and could therefor cause poll() to block forever) have a sane timeout in their prepare functions? Or is it because the signal itself is interrupting some essential part of G_CH_prepare_int() and friends? > > for s in all GSource > if s->check(s) > s->dispatch(s, ...) > > And at some stage it also orders by priority, of course. > > Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare(). > > BTW, the mentioned race between signal delivery and mainloop already > doing the poll stage could potentially be solved by using Again, since nothing related to the signal source ever appears in the call to poll(), I'm not seeing where the race comes from. Or am I missing something obvious? 
> cl_signal_set_interrupt(SIGTERM, 1), This, combined with /* * If we don't set this on, then the mainloop poll(2) call * will never be interrupted by this signal - which sort of * defeats the whole purpose of a signal handler in a * mainloop program */ cl_signal_set_interrupt(signal, TRUE); looks more relevant. But I can't escape the feeling that calling this just masks the underlying "why is there a never-ending call to poll() in the first place" issue. G_CH_prepare_int() and friends /should/ be setting timeouts so that poll() can return and any sources created by g_idle_source_new() can execute. > which would mean we can condense the prepare to > if (trig->trigger) > *timeout = 0; >
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote: > > Now we proceed to the next mainloop poll: > > > >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, > >> events=POLLIN|POLLPRI}], 3, -1 > > > > Note the -1 (infinity timeout!) > > > > So even though the trigger was (presumably) set, > > and the ->prepare() should have returned true, > > the mainloop waits forever for "something" to happen on those file > > descriptors. > > > > > > I suggest this: > > > > crm_trigger_prepare should set *timeout = 0, if trigger is set. > > > > Also think about this race: crm_trigger_prepare was already > > called, only then the signal came in... > > > > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c > > index 2e8b1d0..fd17b87 100644 > > --- a/lib/common/mainloop.c > > +++ b/lib/common/mainloop.c > > @@ -33,6 +33,13 @@ static gboolean > > crm_trigger_prepare(GSource * source, gint * timeout) > > { > > crm_trigger_t *trig = (crm_trigger_t *) source; > > + /* Do not delay signal processing by the mainloop poll stage */ > > + if (trig->trigger) > > + *timeout = 0; > > + /* To avoid races between signal delivery and the mainloop poll stage, > > + * make sure we always have a finite timeout. Unit: milliseconds. */ > > + else > > + *timeout = 5000; /* arbitrary */ > > > > return trig->trigger; > > } > > > > > > This scenario does not let the blocked IPC off the hook, though. > > That is still possible, both for blocking send and blocking receive, > > so that should probably be fixed as well, somehow. > > I'm not sure how likely this "stuck in blocking IPC" is, though. > > Interesting, are you sure you're in the right function though? > trigger and signal events don't have a file descriptor... wouldn't > these polls be for the IPC related sources and wouldn't they be > setting their own timeout? 
http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

iiuc, mainloop does something similar to (oversimplified):
        timeout = -1; /* infinity */
        for s in all GSource
                tmp_timeout = -1;
                s->prepare(s, &tmp_timeout)
                if (tmp_timeout >= 0 && tmp_timeout < timeout)
                        timeout = tmp_timeout;

        poll(GSource fd set, n, timeout);

        for s in all GSource
                if s->check(s)
                        s->dispatch(s, ...)

And at some stage it also orders by priority, of course.

Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().

BTW, the mentioned race between signal delivery and mainloop already
doing the poll stage could potentially be solved by using
        cl_signal_set_interrupt(SIGTERM, 1),
which would mean we can condense the prepare to
        if (trig->trigger)
                *timeout = 0;
        return trig->trigger;

Glue (and heartbeat) code base is not that, let's say, involved,
because someone had been paranoid.
But because someone had been paranoid for a reason ;-)

Cheers,

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
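
For readers following the pseudocode above, here is a minimal C sketch of the timeout aggregation step. It is not GLib's actual implementation, and every name in it (sketch_source, trigger_prepare, fd_prepare, sketch_poll_timeout) is hypothetical. Unlike the oversimplified pseudocode, it treats -1 explicitly as "infinite" when taking the minimum, and it forces the timeout to 0 when a prepare() reports ready, which is the behaviour Andrew describes from reading the glib code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a GSource: only the prepare() hook is modelled. */
typedef struct sketch_source {
    /* Returns nonzero if the source is ready; may set *timeout_ms
     * (-1 means "I do not care how long poll() blocks"). */
    int (*prepare)(struct sketch_source *self, int *timeout_ms);
    int trigger;      /* set asynchronously, e.g. from a signal handler */
    int fallback_ms;  /* finite fallback timeout, or -1 for none */
} sketch_source;

/* Trigger-style source: 0 if the trigger is set, else the fallback. */
static int trigger_prepare(sketch_source *self, int *timeout_ms)
{
    if (self->trigger)
        *timeout_ms = 0;
    else
        *timeout_ms = self->fallback_ms;
    return self->trigger;
}

/* fd-style source: never ready in prepare(), happy to block forever. */
static int fd_prepare(sketch_source *self, int *timeout_ms)
{
    (void)self;
    *timeout_ms = -1;
    return 0;
}

/* Combine the per-source timeouts into the value handed to poll(). */
static int sketch_poll_timeout(sketch_source **sources, size_t n)
{
    int timeout = -1; /* infinity */

    for (size_t i = 0; i < n; i++) {
        int tmp = -1;
        int ready = sources[i]->prepare(sources[i], &tmp);

        if (ready)
            tmp = 0; /* a ready source means poll() must not block */
        if (tmp >= 0 && (timeout < 0 || tmp < timeout))
            timeout = tmp; /* keep the smallest finite timeout */
    }
    return timeout;
}
```

With only an fd-style source plus a trigger source that is unset and has no fallback, this yields -1, matching the infinite poll() seen in the strace. A 5000 ms fallback yields 5000, and a set trigger yields 0.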
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Sun, Jan 15, 2012 at 1:57 AM, Lars Ellenberg wrote: > On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote: >> Hi Lars, >> >> I attach strace file when a problem reappeared at the end of last year. >> I used glue which applied your patch for confirmation. >> >> It is the file which I picked with attrd by strace -p command right before I >> stop Heartbeat. >> >> Finally SIGTERM caught it, but attrd did not stop. >> The attrd stopped afterwards when I sent SIGKILL. > > The strace reveals something interesting: > > This poll looks like the mainloop poll, > but some ->prepare() has modified the timeout to be 0, > so we proceed directly to ->check() and then ->dispatch(). > >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, >> events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}]) > >> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632 >> recv(4, 0x95af308, 576, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily >> unavailable) > ... >> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily >> unavailable) >> poll([{fd=7, events=0}], 1, 0) = ? ERESTART_RESTARTBLOCK (To be >> restarted) >> --- SIGTERM (Terminated) @ 0 (0) --- >> sigreturn() = ? (mask now []) > > Ok. signal received, trigger set. > Still finishing this mainloop iteration, though. > > These recv(),poll() look like invocations of G_CH_prepare_int(). > Does not matter much, though. 
> >> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily >> unavailable) >> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) >> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily >> unavailable) >> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout) > >> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634 > > Now we proceed to the next mainloop poll: > >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, >> events=POLLIN|POLLPRI}], 3, -1 > > Note the -1 (infinity timeout!) > > So even though the trigger was (presumably) set, > and the ->prepare() should have returned true, > the mainloop waits forever for "something" to happen on those file > descriptors. > > > I suggest this: > > crm_trigger_prepare should set *timeout = 0, if trigger is set. > > Also think about this race: crm_trigger_prepare was already > called, only then the signal came in... > > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c > index 2e8b1d0..fd17b87 100644 > --- a/lib/common/mainloop.c > +++ b/lib/common/mainloop.c > @@ -33,6 +33,13 @@ static gboolean > crm_trigger_prepare(GSource * source, gint * timeout) > { > crm_trigger_t *trig = (crm_trigger_t *) source; > + /* Do not delay signal processing by the mainloop poll stage */ > + if (trig->trigger) > + *timeout = 0; > + /* To avoid races between signal delivery and the mainloop poll stage, > + * make sure we always have a finite timeout. Unit: milliseconds. */ > + else > + *timeout = 5000; /* arbitrary */ > > return trig->trigger; > } > > > This scenario does not let the blocked IPC off the hook, though. > That is still possible, both for blocking send and blocking receive, > so that should probably be fixed as well, somehow. > I'm not sure how likely this "stuck in blocking IPC" is, though. Interesting, are you sure you're in the right function though? trigger and signal events don't have a file descriptor... 
wouldn't these polls be for the IPC related sources and wouldn't they be setting their own timeout?
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Lars,

Thank you for your comments and suggestion.

> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5,
> > events=POLLIN|POLLPRI}], 3, -1
>
> Note the -1 (infinity timeout!)
>
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file
> descriptors.
>
>
> I suggest this:
>
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
>
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
>
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>      crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +        *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +        *timeout = 5000; /* arbitrary */
>
>      return trig->trigger;
>  }
>
>
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.

I will apply the correction you suggested and continue investigating the problem.
I will report back if I get any new information.

Best Regards,
Hideo Yamauchi.
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Lars,
>
> I attach strace file when a problem reappeared at the end of last year.
> I used glue which applied your patch for confirmation.
>
> It is the file which I picked with attrd by strace -p command right before I
> stop Heartbeat.
>
> Finally SIGTERM caught it, but attrd did not stop.
> The attrd stopped afterwards when I sent SIGKILL.

The strace reveals something interesting:

This poll looks like the mainloop poll,
but some ->prepare() has modified the timeout to be 0,
so we proceed directly to ->check() and then ->dispatch().

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8,
> events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
> recv(4, 0x95af308, 576, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily
> unavailable)
...
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily
> unavailable)
> poll([{fd=7, events=0}], 1, 0) = ? ERESTART_RESTARTBLOCK (To be
> restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn() = ? (mask now [])

Ok. signal received, trigger set.
Still finishing this mainloop iteration, though.

These recv(),poll() look like invocations of G_CH_prepare_int().
Does not matter much, though.

> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily
> unavailable)
> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout)
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily
> unavailable)
> poll([{fd=7, events=0}], 1, 0) = 0 (Timeout)

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634

Now we proceed to the next mainloop poll:

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5,
> events=POLLIN|POLLPRI}], 3, -1

Note the -1 (infinity timeout!)

So even though the trigger was (presumably) set,
and the ->prepare() should have returned true,
the mainloop waits forever for "something" to happen on those file
descriptors.


I suggest this:

crm_trigger_prepare should set *timeout = 0, if trigger is set.

Also think about this race: crm_trigger_prepare was already
called, only then the signal came in...

diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
index 2e8b1d0..fd17b87 100644
--- a/lib/common/mainloop.c
+++ b/lib/common/mainloop.c
@@ -33,6 +33,13 @@ static gboolean
 crm_trigger_prepare(GSource * source, gint * timeout)
 {
     crm_trigger_t *trig = (crm_trigger_t *) source;
+    /* Do not delay signal processing by the mainloop poll stage */
+    if (trig->trigger)
+        *timeout = 0;
+    /* To avoid races between signal delivery and the mainloop poll stage,
+     * make sure we always have a finite timeout. Unit: milliseconds. */
+    else
+        *timeout = 5000; /* arbitrary */

     return trig->trigger;
 }


This scenario does not let the blocked IPC off the hook, though.
That is still possible, both for blocking send and blocking receive,
so that should probably be fixed as well, somehow.
I'm not sure how likely this "stuck in blocking IPC" is, though.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
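For readers who want to try the behaviour of the proposed prepare function without building against glib, here is a minimal standalone sketch. It is an assumption-laden model: the `gboolean`, `gint` and `GSource` typedefs are stand-ins so the file compiles on its own, `crm_trigger_t` is reduced to the one field the diff touches, and `trigger_prepare_sketch` is a hypothetical name, not the function in lib/common/mainloop.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the glib types, so this compiles without glib (assumption). */
typedef int gboolean;
typedef int gint;
typedef struct { int unused; } GSource;

/* Reduced, hypothetical version of crm_trigger_t: only the flag the diff reads. */
typedef struct {
    GSource  source;  /* must come first so the GSource* cast below is valid */
    gboolean trigger; /* raised by the signal handler */
} crm_trigger_t;

/* Mirrors the logic of the diff above; the timeout is in milliseconds. */
static gboolean trigger_prepare_sketch(GSource *source, gint *timeout)
{
    crm_trigger_t *trig = (crm_trigger_t *) source;

    assert(source != NULL && timeout != NULL);
    if (trig->trigger) {
        /* Do not delay signal processing by the poll stage */
        *timeout = 0;
    } else {
        /* Bound the wait in case the signal arrives right after prepare */
        *timeout = 5000; /* arbitrary, as in the diff */
    }
    return trig->trigger;
}
```

The `else` branch is what addresses the race mentioned above: if the signal lands after prepare has run, the poll stage still returns after at most five seconds and the next iteration sees the flag.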
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Lars, Hi Dejan, I got ltrace file when a problem occurred. I attach ltrace file. The investigation in gdb continues it and performs it. If there is suggestion of any improvement, please tell me. Best Regards, Hideo Yamauchi. --- On Tue, 2012/1/10, renayama19661...@ybb.ne.jp wrote: > Hi Lars, > > I attach strace file when a problem reappeared at the end of last year. > I used glue which applied your patch for confirmation. > > It is the file which I picked with attrd by strace -p command right before I > stop Heartbeat. > > Finally SIGTERM caught it, but attrd did not stop. > The attrd stopped afterwards when I sent SIGKILL. > > * I acquire the information such as ltrace from now on. > > Best Regards, > Hideo Yamauchi. > > > --- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp > wrote: > > > Hi Lars, > > > > > If you are able to reproduce, > > > you could try to find out what exactly attrd is doing. > > > > > > various ways to try to do that: > > > cat /proc//stack # if your platform supports that > > > strace it, > > > ltrace it, > > > attach with gdb and provide a stack trace, or even start to single step > > > it, > > > cause attrd to core dump, and analyse the core. > > > > All right. > > I investigate the cause a little more. > > > > Give me the time for investigation a little more. > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Fri, 2011/12/30, Lars Ellenberg wrote: > > > > > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp > > > wrote: > > > > Hi Dejan, > > > > Hi Lars, > > > > > > > > In our environment, the problem recurred with the patch of Mr. Lars. > > > > After a problem occurred, I sent TERM signal, but attrd does not seem to > > > > receive TERM at all. > > > > > > If you are able to reproduce, > > > you could try to find out what exactly attrd is doing. 
> > > > > > various ways to try to do that: > > > cat /proc//stack # if your platform supports that > > > strace it, > > > ltrace it, > > > attach with gdb and provide a stack trace, or even start to single step > > > it, > > > cause attrd to core dump, and analyse the core. > > > > > > > The reconsideration of the patch is necessary for the solution to > > > > problem. > > > > > > > > > > > > Best Regards, > > > > Hideo Yamauchi. > > > > > > > > > > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp > > > > wrote: > > > > > > > > > Hi Dejan, > > > > > Hi Lars, > > > > > > > > > > I understood it. > > > > > I try the operation of the patch in our environment. > > > > > > > > > > To Alan: Will you try a patch? > > > > > > > > > > Best Regards, > > > > > Hideo Yamauchi. > > > > > > > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote: > > > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote: > > > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg > > > > > > > > wrote: > > > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof > > > > > > > > > wrote: > > > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM, > > > > > > > > >> wrote: > > > > > > > > >> > Hi, > > > > > > > > >> > > > > > > > > > >> > We sometimes fail in a stop of attrd. > > > > > > > > >> > > > > > > > > > >> > Step1. start a cluster in 2 nodes > > > > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.) > > > > > > > > >> > Step3. stop the second node after time passed a > > > > > > > > >> > little.(/etc/init.d/heartbeat > > > > > > > > >> > stop.) > > > > > > > > >> > > > > > > > > > >> > The attrd catches the TERM signal, but does not stop. > > > > > > > > >> > > > > > > > > >> There's no evidence that it actually catches it, only that > > > > > > > > >> it is sent. 
> > > > > > > > >> I've seen it before but never figured out why it occurs. > > > > > > > > > > > > > > > > > > I had it once tracked down almost to where it occurs, but > > > > > > > > > then got distracted. > > > > > > > > > Yes the signal was delivered. > > > > > > > > > > > > > > > > > > I *think* it had to do with attrd doing a blocking read, > > > > > > > > > or looping in some internal message delivery function too > > > > > > > > > often. > > > > > > > > > > > > > > > > > > I had a quick look at the code again now, to try and remember, > > > > > > > > > but I'm not sure. > > > > > > > > > > > > > > > > > > I *may* be that, because > > > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls > > > > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, > > > > > > > > > &ipc_rc); > > > > > > > > > > > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to > > > > > > > > > IPC_INTR: > > > > > > > > > if ( allow_intr){ > > > > > > > > > goto startwait; > > > > > > > > > > > > > > > > > > Depending on the frequency of deliverd signals, it may cause > > > > > > > > > this got
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Lars, I attach strace file when a problem reappeared at the end of last year. I used glue which applied your patch for confirmation. It is the file which I picked with attrd by strace -p command right before I stop Heartbeat. Finally SIGTERM caught it, but attrd did not stop. The attrd stopped afterwards when I sent SIGKILL. * I acquire the information such as ltrace from now on. Best Regards, Hideo Yamauchi. --- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp wrote: > Hi Lars, > > > If you are able to reproduce, > > you could try to find out what exactly attrd is doing. > > > > various ways to try to do that: > > cat /proc//stack # if your platform supports that > > strace it, > > ltrace it, > > attach with gdb and provide a stack trace, or even start to single step it, > > cause attrd to core dump, and analyse the core. > > All right. > I investigate the cause a little more. > > Give me the time for investigation a little more. > > Best Regards, > Hideo Yamauchi. > > --- On Fri, 2011/12/30, Lars Ellenberg wrote: > > > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote: > > > Hi Dejan, > > > Hi Lars, > > > > > > In our environment, the problem recurred with the patch of Mr. Lars. > > > After a problem occurred, I sent TERM signal, but attrd does not seem to > > > receive TERM at all. > > > > If you are able to reproduce, > > you could try to find out what exactly attrd is doing. > > > > various ways to try to do that: > > cat /proc//stack # if your platform supports that > > strace it, > > ltrace it, > > attach with gdb and provide a stack trace, or even start to single step it, > > cause attrd to core dump, and analyse the core. > > > > > The reconsideration of the patch is necessary for the solution to problem. > > > > > > > > > Best Regards, > > > Hideo Yamauchi. > > > > > > > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp > > > wrote: > > > > > > > Hi Dejan, > > > > Hi Lars, > > > > > > > > I understood it. 
> > > > I try the operation of the patch in our environment. > > > > > > > > To Alan: Will you try a patch? > > > > > > > > Best Regards, > > > > Hideo Yamauchi. > > > > > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic wrote: > > > > > > > > > Hi, > > > > > > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote: > > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote: > > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg > > > > > > > wrote: > > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote: > > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM, > > > > > > > >> wrote: > > > > > > > >> > Hi, > > > > > > > >> > > > > > > > > >> > We sometimes fail in a stop of attrd. > > > > > > > >> > > > > > > > > >> > Step1. start a cluster in 2 nodes > > > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.) > > > > > > > >> > Step3. stop the second node after time passed a > > > > > > > >> > little.(/etc/init.d/heartbeat > > > > > > > >> > stop.) > > > > > > > >> > > > > > > > > >> > The attrd catches the TERM signal, but does not stop. > > > > > > > >> > > > > > > > >> There's no evidence that it actually catches it, only that it > > > > > > > >> is sent. > > > > > > > >> I've seen it before but never figured out why it occurs. > > > > > > > > > > > > > > > > I had it once tracked down almost to where it occurs, but then > > > > > > > > got distracted. > > > > > > > > Yes the signal was delivered. > > > > > > > > > > > > > > > > I *think* it had to do with attrd doing a blocking read, > > > > > > > > or looping in some internal message delivery function too often. > > > > > > > > > > > > > > > > I had a quick look at the code again now, to try and remember, > > > > > > > > but I'm not sure. 
> > > > > > > > > > > > > > > > I *may* be that, because > > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls > > > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, > > > > > > > > &ipc_rc); > > > > > > > > > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to > > > > > > > > IPC_INTR: > > > > > > > > if ( allow_intr){ > > > > > > > > goto startwait; > > > > > > > > > > > > > > > > Depending on the frequency of deliverd signals, it may cause > > > > > > > > this goto > > > > > > > > startwait loop to never exit, because the timeout always starts > > > > > > > > again > > > > > > > > from the full passed in timeout. > > > > > > > > > > > > > > > > If only one signal is deliverd, it may still take 120 seconds > > > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the > > > > > > > > signal > > > > > > > > handler only raises a flag for the next mainloop iteration. > > > > > > > > > > > > > > > > If a (non-fatal) signal is delivered every few seconds, > > > > > > > > then the goto loop will never timeout. > > > > > > > > > > > > > > > > Please someone check th
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Lars, > If you are able to reproduce, > you could try to find out what exactly attrd is doing. > > various ways to try to do that: > cat /proc//stack # if your platform supports that > strace it, > ltrace it, > attach with gdb and provide a stack trace, or even start to single step it, > cause attrd to core dump, and analyse the core. All right. I investigate the cause a little more. Give me the time for investigation a little more. Best Regards, Hideo Yamauchi. --- On Fri, 2011/12/30, Lars Ellenberg wrote: > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote: > > Hi Dejan, > > Hi Lars, > > > > In our environment, the problem recurred with the patch of Mr. Lars. > > After a problem occurred, I sent TERM signal, but attrd does not seem to > > receive TERM at all. > > If you are able to reproduce, > you could try to find out what exactly attrd is doing. > > various ways to try to do that: > cat /proc//stack # if your platform supports that > strace it, > ltrace it, > attach with gdb and provide a stack trace, or even start to single step it, > cause attrd to core dump, and analyse the core. > > > The reconsideration of the patch is necessary for the solution to problem. > > > > > > Best Regards, > > Hideo Yamauchi. > > > > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp > > wrote: > > > > > Hi Dejan, > > > Hi Lars, > > > > > > I understood it. > > > I try the operation of the patch in our environment. > > > > > > To Alan: Will you try a patch? > > > > > > Best Regards, > > > Hideo Yamauchi. 
> > > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic wrote: > > > > > > > Hi, > > > > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote: > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote: > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg > > > > > > wrote: > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote: > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM, > > > > > > >> wrote: > > > > > > >> > Hi, > > > > > > >> > > > > > > > >> > We sometimes fail in a stop of attrd. > > > > > > >> > > > > > > > >> > Step1. start a cluster in 2 nodes > > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.) > > > > > > >> > Step3. stop the second node after time passed a > > > > > > >> > little.(/etc/init.d/heartbeat > > > > > > >> > stop.) > > > > > > >> > > > > > > > >> > The attrd catches the TERM signal, but does not stop. > > > > > > >> > > > > > > >> There's no evidence that it actually catches it, only that it is > > > > > > >> sent. > > > > > > >> I've seen it before but never figured out why it occurs. > > > > > > > > > > > > > > I had it once tracked down almost to where it occurs, but then > > > > > > > got distracted. > > > > > > > Yes the signal was delivered. > > > > > > > > > > > > > > I *think* it had to do with attrd doing a blocking read, > > > > > > > or looping in some internal message delivery function too often. > > > > > > > > > > > > > > I had a quick look at the code again now, to try and remember, > > > > > > > but I'm not sure. 
> > > > > > > > > > > > > > I *may* be that, because > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls > > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc); > > > > > > > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to > > > > > > > IPC_INTR: > > > > > > > if ( allow_intr){ > > > > > > > goto startwait; > > > > > > > > > > > > > > Depending on the frequency of deliverd signals, it may cause this > > > > > > > goto > > > > > > > startwait loop to never exit, because the timeout always starts > > > > > > > again > > > > > > > from the full passed in timeout. > > > > > > > > > > > > > > If only one signal is deliverd, it may still take 120 seconds > > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal > > > > > > > handler only raises a flag for the next mainloop iteration. > > > > > > > > > > > > > > If a (non-fatal) signal is delivered every few seconds, > > > > > > > then the goto loop will never timeout. > > > > > > > > > > > > > > Please someone check this for plausibility ;-) > > > > > > > > > > > > Most plausible explanation I've heard so far... still odd that only > > > > > > attrd is affected. > > > > > > So what do we do about it? > > > > > > > > > > Reproduce, and confirm that this is what people are seeing. > > > > > > > > > > Make attrd non-blocking? > > > > > > > > > > Fix the ipc layer to not restart the full timeout, > > > > > but only the remaining partial time? > > > > > > > > Lars and I made a quick patch for cluster-glue (attached). > > > > Hideo-san, is there a way for you to verify if it helps? The > > > > patch is not perfect and under unfavourable circumstances it may > > > > still take a long time for the caller to exit, but it'd be good > > > > to know if this is the right
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote: > Hi Dejan, > Hi Lars, > > In our environment, the problem recurred with the patch of Mr. Lars. > After a problem occurred, I sent TERM signal, but attrd does not seem to > receive TERM at all. If you are able to reproduce, you could try to find out what exactly attrd is doing. various ways to try to do that: cat /proc//stack # if your platform supports that strace it, ltrace it, attach with gdb and provide a stack trace, or even start to single step it, cause attrd to core dump, and analyse the core. > The reconsideration of the patch is necessary for the solution to problem. > > > Best Regards, > Hideo Yamauchi. > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp > wrote: > > > Hi Dejan, > > Hi Lars, > > > > I understood it. > > I try the operation of the patch in our environment. > > > > To Alan: Will you try a patch? > > > > Best Regards, > > Hideo Yamauchi. > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic wrote: > > > > > Hi, > > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote: > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote: > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg > > > > > wrote: > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote: > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM, > > > > > >> wrote: > > > > > >> > Hi, > > > > > >> > > > > > > >> > We sometimes fail in a stop of attrd. > > > > > >> > > > > > > >> > Step1. start a cluster in 2 nodes > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.) > > > > > >> > Step3. stop the second node after time passed a > > > > > >> > little.(/etc/init.d/heartbeat > > > > > >> > stop.) > > > > > >> > > > > > > >> > The attrd catches the TERM signal, but does not stop. > > > > > >> > > > > > >> There's no evidence that it actually catches it, only that it is > > > > > >> sent. 
> > > > > >> I've seen it before but never figured out why it occurs. > > > > > > > > > > > > I had it once tracked down almost to where it occurs, but then got > > > > > > distracted. > > > > > > Yes the signal was delivered. > > > > > > > > > > > > I *think* it had to do with attrd doing a blocking read, > > > > > > or looping in some internal message delivery function too often. > > > > > > > > > > > > I had a quick look at the code again now, to try and remember, > > > > > > but I'm not sure. > > > > > > > > > > > > I *may* be that, because > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls > > > > > > msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc); > > > > > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to > > > > > > IPC_INTR: > > > > > > if ( allow_intr){ > > > > > > goto startwait; > > > > > > > > > > > > Depending on the frequency of deliverd signals, it may cause this > > > > > > goto > > > > > > startwait loop to never exit, because the timeout always starts > > > > > > again > > > > > > from the full passed in timeout. > > > > > > > > > > > > If only one signal is deliverd, it may still take 120 seconds > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal > > > > > > handler only raises a flag for the next mainloop iteration. > > > > > > > > > > > > If a (non-fatal) signal is delivered every few seconds, > > > > > > then the goto loop will never timeout. > > > > > > > > > > > > Please someone check this for plausibility ;-) > > > > > > > > > > Most plausible explanation I've heard so far... still odd that only > > > > > attrd is affected. > > > > > So what do we do about it? > > > > > > > > Reproduce, and confirm that this is what people are seeing. > > > > > > > > Make attrd non-blocking? > > > > > > > > Fix the ipc layer to not restart the full timeout, > > > > but only the remaining partial time? > > > > > > Lars and I made a quick patch for cluster-glue (attached). 
> > > Hideo-san, is there a way for you to verify if it helps? The > > > patch is not perfect and under unfavourable circumstances it may > > > still take a long time for the caller to exit, but it'd be good > > > to know if this is the right spot. > > > > > > Cheers, > > > > > > Dejan > > > > > > > -- > > > > : Lars Ellenberg > > > > : LINBIT | Your Way to High Availability > > > > : DRBD/HA support and consulting http://www.linbit.com > > > > > > > > ___ > > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > > > > > Project Home: http://www.clusterlabs.org > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > > > Bugs: > > > > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker > > > > > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://os
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Dejan,
Hi Lars,

In our environment, the problem recurred even with Lars's patch applied.
After the problem occurred, I sent a TERM signal, but attrd does not seem
to receive TERM at all.
The patch needs to be reconsidered before it can solve the problem.

Best Regards,
Hideo Yamauchi.

--- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp wrote:

> Hi Dejan,
> Hi Lars,
>
> Understood.
> I will try the patch in our environment.
>
> To Alan: Will you try the patch, too?
>
> Best Regards,
> Hideo Yamauchi.
>
> --- On Tue, 2011/11/15, Dejan Muhamedagic wrote:
>
> > Hi,
> >
> > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg wrote:
> > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > >> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> > > > >> > Hi,
> > > > >> >
> > > > >> > We sometimes fail in a stop of attrd.
> > > > >> >
> > > > >> > Step1. start a cluster in 2 nodes
> > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > > > >> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> > > > >> >
> > > > >> > The attrd catches the TERM signal, but does not stop.
> > > > >>
> > > > >> There's no evidence that it actually catches it, only that it is sent.
> > > > >> I've seen it before but never figured out why it occurs.
> > > > >
> > > > > I had it once tracked down almost to where it occurs, but then got
> > > > > distracted.
> > > > > Yes, the signal was delivered.
> > > > >
> > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > or looping in some internal message delivery function too often.
> > > > >
> > > > > I had a quick look at the code again now, to try and remember,
> > > > > but I'm not sure.
> > > > >
> > > > > It *may* be that, because
> > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > >     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > > >
> > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > >     IPC_INTR:
> > > > >         if ( allow_intr){
> > > > >             goto startwait;
> > > > >
> > > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > > startwait loop to never exit, because the timeout always starts again
> > > > > from the full passed-in timeout.
> > > > >
> > > > > If only one signal is delivered, it may still take 120 seconds
> > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > > handler only raises a flag for the next mainloop iteration.
> > > > >
> > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > then the goto loop will never time out.
> > > > >
> > > > > Please someone check this for plausibility ;-)
> > > >
> > > > Most plausible explanation I've heard so far... still odd that only
> > > > attrd is affected.
> > > > So what do we do about it?
> > >
> > > Reproduce, and confirm that this is what people are seeing.
> > >
> > > Make attrd non-blocking?
> > >
> > > Fix the ipc layer to not restart the full timeout,
> > > but only the remaining partial time?
> >
> > Lars and I made a quick patch for cluster-glue (attached).
> > Hideo-san, is there a way for you to verify if it helps? The
> > patch is not perfect and under unfavourable circumstances it may
> > still take a long time for the caller to exit, but it'd be good
> > to know if this is the right spot.
> >
> > Cheers,
> >
> > Dejan
> >
> > > --
> > > : Lars Ellenberg
> > > : LINBIT | Your Way to High Availability
> > > : DRBD/HA support and consulting http://www.linbit.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Dejan,
Hi Lars,

Understood.
I will try the patch in our environment.

To Alan: Will you try the patch, too?

Best Regards,
Hideo Yamauchi.

--- On Tue, 2011/11/15, Dejan Muhamedagic wrote:

> Hi,
>
> On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg wrote:
> > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > >> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> > > >> > Hi,
> > > >> >
> > > >> > We sometimes fail in a stop of attrd.
> > > >> >
> > > >> > Step1. start a cluster in 2 nodes
> > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > > >> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> > > >> >
> > > >> > The attrd catches the TERM signal, but does not stop.
> > > >>
> > > >> There's no evidence that it actually catches it, only that it is sent.
> > > >> I've seen it before but never figured out why it occurs.
> > > >
> > > > I had it once tracked down almost to where it occurs, but then got
> > > > distracted.
> > > > Yes, the signal was delivered.
> > > >
> > > > I *think* it had to do with attrd doing a blocking read,
> > > > or looping in some internal message delivery function too often.
> > > >
> > > > I had a quick look at the code again now, to try and remember,
> > > > but I'm not sure.
> > > >
> > > > It *may* be that, because
> > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > >     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > >
> > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > >     IPC_INTR:
> > > >         if ( allow_intr){
> > > >             goto startwait;
> > > >
> > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > startwait loop to never exit, because the timeout always starts again
> > > > from the full passed-in timeout.
> > > >
> > > > If only one signal is delivered, it may still take 120 seconds
> > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > handler only raises a flag for the next mainloop iteration.
> > > >
> > > > If a (non-fatal) signal is delivered every few seconds,
> > > > then the goto loop will never time out.
> > > >
> > > > Please someone check this for plausibility ;-)
> > >
> > > Most plausible explanation I've heard so far... still odd that only
> > > attrd is affected.
> > > So what do we do about it?
> >
> > Reproduce, and confirm that this is what people are seeing.
> >
> > Make attrd non-blocking?
> >
> > Fix the ipc layer to not restart the full timeout,
> > but only the remaining partial time?
>
> Lars and I made a quick patch for cluster-glue (attached).
> Hideo-san, is there a way for you to verify if it helps? The
> patch is not perfect and under unfavourable circumstances it may
> still take a long time for the caller to exit, but it'd be good
> to know if this is the right spot.
>
> Cheers,
>
> Dejan
>
> > --
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi,

On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg wrote:
> > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > >> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> > >> > Hi,
> > >> >
> > >> > We sometimes fail in a stop of attrd.
> > >> >
> > >> > Step1. start a cluster in 2 nodes
> > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > >> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> > >> >
> > >> > The attrd catches the TERM signal, but does not stop.
> > >>
> > >> There's no evidence that it actually catches it, only that it is sent.
> > >> I've seen it before but never figured out why it occurs.
> > >
> > > I had it once tracked down almost to where it occurs, but then got
> > > distracted.
> > > Yes, the signal was delivered.
> > >
> > > I *think* it had to do with attrd doing a blocking read,
> > > or looping in some internal message delivery function too often.
> > >
> > > I had a quick look at the code again now, to try and remember,
> > > but I'm not sure.
> > >
> > > It *may* be that, because
> > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > >     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > >
> > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > >     IPC_INTR:
> > >         if ( allow_intr){
> > >             goto startwait;
> > >
> > > Depending on the frequency of delivered signals, it may cause this goto
> > > startwait loop to never exit, because the timeout always starts again
> > > from the full passed-in timeout.
> > >
> > > If only one signal is delivered, it may still take 120 seconds
> > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > handler only raises a flag for the next mainloop iteration.
> > >
> > > If a (non-fatal) signal is delivered every few seconds,
> > > then the goto loop will never time out.
> > >
> > > Please someone check this for plausibility ;-)
> >
> > Most plausible explanation I've heard so far... still odd that only
> > attrd is affected.
> > So what do we do about it?
>
> Reproduce, and confirm that this is what people are seeing.
>
> Make attrd non-blocking?
>
> Fix the ipc layer to not restart the full timeout,
> but only the remaining partial time?

Lars and I made a quick patch for cluster-glue (attached).
Hideo-san, is there a way for you to verify if it helps? The
patch is not perfect and under unfavourable circumstances it may
still take a long time for the caller to exit, but it'd be good
to know if this is the right spot.

Cheers,

Dejan

> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com

# HG changeset patch
# User Lars Ellenberg
# Date 1321275721 -3600
# Node ID 8b50bf0dd4cdf8d0a405416da98711080b2abeb9
# Parent 569bdebf736185d77782f49d5c760007cfc6b3e8
Medium: clplumbing: don't restart timeouts forever if signals are repeatedly sent

diff -r 569bdebf7361 -r 8b50bf0dd4cd lib/clplumbing/cl_msg.c
--- a/lib/clplumbing/cl_msg.c Mon Nov 14 11:31:51 2011 +0100
+++ b/lib/clplumbing/cl_msg.c Mon Nov 14 14:02:01 2011 +0100
@@ -1802,12 +1802,13 @@ static struct ha_msg*
 msgfromIPC_ll(IPC_Channel * ch, int flag, unsigned int timeout, int *rc_out)
 {
     int rc;
+    int sig_cnt = 0;
     IPC_Message* ipcmsg;
     struct ha_msg* hmsg;
     int need_auth = flag & MSG_NEEDAUTH;
     int allow_intr = flag & MSG_ALLOWINTR;
 
- startwait:
+    do {
     if(timeout > 0) {
         rc = cl_ipc_wait_timeout(ch, ch->ops->waitin, timeout);
     } else {
@@ -1832,17 +1833,17 @@ msgfromIPC_ll(IPC_Channel * ch, int flag
         return NULL;
 
     case IPC_INTR:
-        if ( allow_intr){
-            goto startwait;
-        }else{
+        if (!allow_intr || sig_cnt++ >= 20) {
             return NULL;
+        } else {
+            break;
         }
 
     case IPC_OK:
         break;
     }
-
-
+    } while (rc != IPC_OK);
+
     ipcmsg = NULL;
     rc = ch->ops->recv(ch, &ipcmsg);
 #if 0
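For readers following the patch: a minimal, standalone sketch of the bounded-retry idea it introduces. This is not cluster-glue code. The names `wait_rc`, `wait_fn`, `wait_bounded` and `always_interrupted` are made up for illustration, and the stub stands in for the real blocking wait. It shows why the caller can still take a long time to exit: every retry starts again with the full timeout, so a cap of 20 interruptions allows up to 21 waits (with a 120-second timeout, an upper bound of about 21 x 120 s = 2520 s).

```c
#include <assert.h>

enum wait_rc { WAIT_OK, WAIT_INTR, WAIT_TIMEOUT };

/* Hypothetical stand-in for the blocking wait; not a cluster-glue API. */
typedef enum wait_rc (*wait_fn)(void *ctx, unsigned int timeout_ms);

/*
 * Retry an interruptible wait, giving up once more than max_intr
 * interruptions have been seen. Each attempt is given the full timeout
 * again, as in the patch; only the number of restarts is bounded.
 */
enum wait_rc wait_bounded(wait_fn do_wait, void *ctx,
                          unsigned int timeout_ms, int max_intr)
{
    int sig_cnt = 0;
    enum wait_rc rc;

    assert(do_wait != 0 && max_intr >= 0);
    do {
        rc = do_wait(ctx, timeout_ms);
        if (rc == WAIT_INTR && sig_cnt++ >= max_intr)
            return WAIT_INTR;   /* interruption budget used up */
    } while (rc == WAIT_INTR);
    return rc;                  /* WAIT_OK or WAIT_TIMEOUT */
}

/* Illustration helper: a wait that is always interrupted; counts calls. */
enum wait_rc always_interrupted(void *ctx, unsigned int timeout_ms)
{
    (void)timeout_ms;
    ++*(int *)ctx;
    return WAIT_INTR;
}
```

Calling `wait_bounded(always_interrupted, &calls, 120000, 20)` with `calls` initialised to 0 performs 21 attempts before giving up, which matches the `sig_cnt++ >= 20` test in the patch.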
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg wrote:
> > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> >> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> >> > Hi,
> >> >
> >> > We sometimes fail in a stop of attrd.
> >> >
> >> > Step1. start a cluster in 2 nodes
> >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> >> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> >> >
> >> > The attrd catches the TERM signal, but does not stop.
> >>
> >> There's no evidence that it actually catches it, only that it is sent.
> >> I've seen it before but never figured out why it occurs.
> >
> > I had it once tracked down almost to where it occurs, but then got
> > distracted.
> > Yes, the signal was delivered.
> >
> > I *think* it had to do with attrd doing a blocking read,
> > or looping in some internal message delivery function too often.
> >
> > I had a quick look at the code again now, to try and remember,
> > but I'm not sure.
> >
> > It *may* be that, because
> > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> >     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> >
> > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> >     IPC_INTR:
> >         if ( allow_intr){
> >             goto startwait;
> >
> > Depending on the frequency of delivered signals, it may cause this goto
> > startwait loop to never exit, because the timeout always starts again
> > from the full passed-in timeout.
> >
> > If only one signal is delivered, it may still take 120 seconds
> > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > handler only raises a flag for the next mainloop iteration.
> >
> > If a (non-fatal) signal is delivered every few seconds,
> > then the goto loop will never time out.
> >
> > Please someone check this for plausibility ;-)
>
> Most plausible explanation I've heard so far... still odd that only
> attrd is affected.
> So what do we do about it?

Reproduce, and confirm that this is what people are seeing.

Make attrd non-blocking?

Fix the ipc layer to not restart the full timeout,
but only the remaining partial time?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
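The last option above (restart only the remaining partial time) can be sketched with plain POSIX calls. This is an illustration under stated assumptions, not the cluster-glue IPC layer: `wait_readable_deadline` and `now_ms` are hypothetical names, and the wait is modelled with `poll()` on a file descriptor. The point is that the deadline is computed once, and an `EINTR` only causes a re-wait for whatever time is left.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current monotonic time in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until poll() reports an event on fd or timeout_ms has elapsed in
 * total, no matter how many signals interrupt the wait.
 * Returns 1 if poll() reported an event, 0 on timeout, -1 on error.
 */
int wait_readable_deadline(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;   /* fixed once, never restarted */

    for (;;) {
        long long remaining = deadline - now_ms();
        int rc;

        if (remaining < 0)
            remaining = 0;
        rc = poll(&pfd, 1, (int)remaining);
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
        /* EINTR: loop again, waiting only for the remaining time. */
    }
}
```

With this shape, a signal arriving every few seconds shortens each individual `poll()` call but cannot push the overall deadline further out.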
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg wrote:
> On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
>> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
>> > Hi,
>> >
>> > We sometimes fail in a stop of attrd.
>> >
>> > Step1. start a cluster in 2 nodes
>> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
>> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
>> >
>> > The attrd catches the TERM signal, but does not stop.
>>
>> There's no evidence that it actually catches it, only that it is sent.
>> I've seen it before but never figured out why it occurs.
>
> I had it once tracked down almost to where it occurs, but then got distracted.
> Yes, the signal was delivered.
>
> I *think* it had to do with attrd doing a blocking read,
> or looping in some internal message delivery function too often.
>
> I had a quick look at the code again now, to try and remember,
> but I'm not sure.
>
> It *may* be that, because
> xmlfromIPC(IPC_Channel * ch, int timeout) calls
>     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
>
> And MSG_ALLOWINTR will cause msgfromIPC_ll() to
>     IPC_INTR:
>         if ( allow_intr){
>             goto startwait;
>
> Depending on the frequency of delivered signals, it may cause this goto
> startwait loop to never exit, because the timeout always starts again
> from the full passed-in timeout.
>
> If only one signal is delivered, it may still take 120 seconds
> (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> handler only raises a flag for the next mainloop iteration.
>
> If a (non-fatal) signal is delivered every few seconds,
> then the goto loop will never time out.
>
> Please someone check this for plausibility ;-)

Most plausible explanation I've heard so far... still odd that only
attrd is affected.
So what do we do about it?
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> > Hi,
> >
> > We sometimes fail in a stop of attrd.
> >
> > Step1. start a cluster in 2 nodes
> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> >
> > The attrd catches the TERM signal, but does not stop.
>
> There's no evidence that it actually catches it, only that it is sent.
> I've seen it before but never figured out why it occurs.

I had it once tracked down almost to where it occurs, but then got
distracted.
Yes, the signal was delivered.

I *think* it had to do with attrd doing a blocking read,
or looping in some internal message delivery function too often.

I had a quick look at the code again now, to try and remember,
but I'm not sure.

It *may* be that, because
xmlfromIPC(IPC_Channel * ch, int timeout) calls
    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

And MSG_ALLOWINTR will cause msgfromIPC_ll() to
    IPC_INTR:
        if ( allow_intr){
            goto startwait;

Depending on the frequency of delivered signals, it may cause this goto
startwait loop to never exit, because the timeout always starts again
from the full passed-in timeout.

If only one signal is delivered, it may still take 120 seconds
(MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
handler only raises a flag for the next mainloop iteration.

If a (non-fatal) signal is delivered every few seconds,
then the goto loop will never time out.

Please someone check this for plausibility ;-)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
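The arithmetic behind this hypothesis can be checked with a small model. The sketch below is not attrd or cluster-glue code; the two functions and their names are hypothetical, and signals are simply assumed to arrive at fixed multiples of `signal_interval_ms`. It compares a wait that restarts its full timeout on every interruption with one that keeps a fixed deadline.

```c
#include <assert.h>

/*
 * Model of the "goto startwait" behaviour: every interruption restarts
 * the full timeout. Returns the time in ms at which the wait finally
 * times out, or -1 if it is still waiting after max_ms.
 */
long wait_restarting_full_timeout(long timeout_ms, long signal_interval_ms,
                                  long max_ms)
{
    long now = 0;

    assert(timeout_ms > 0 && signal_interval_ms > 0);
    while (now < max_ms) {
        long deadline = now + timeout_ms;   /* full timeout again */
        long next_signal = (now / signal_interval_ms + 1) * signal_interval_ms;

        if (next_signal >= deadline)
            return deadline;                /* no signal in time: timed out */
        now = next_signal;                  /* interrupted: start over */
    }
    return -1;
}

/* Model of a wait that only ever waits for the remaining time. */
long wait_with_fixed_deadline(long timeout_ms, long signal_interval_ms)
{
    long now = 0;
    const long deadline = timeout_ms;

    assert(timeout_ms > 0 && signal_interval_ms > 0);
    while (now < deadline) {
        long next_signal = (now / signal_interval_ms + 1) * signal_interval_ms;
        now = next_signal < deadline ? next_signal : deadline;
    }
    return now;
}
```

In this model, a 120000 ms timeout with a signal every 5000 ms never expires in the restarting variant (it is still waiting after a simulated hour), while the fixed-deadline variant gives up at exactly 120000 ms.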
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Andrew,
Hi Alan,

We are working hard to reproduce the problem and to collect evidence of it.
However, we have not obtained any evidence yet.

I will wait for the information from Alan.

Best Regards,
Hideo Yamauchi.

--- On Wed, 2011/11/2, Andrew Beekhof wrote:

> On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> > Hi,
> >
> > We sometimes fail in a stop of attrd.
> >
> > Step1. start a cluster in 2 nodes
> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> >
> > The attrd catches the TERM signal, but does not stop.
>
> There's no evidence that it actually catches it, only that it is sent.
> I've seen it before but never figured out why it occurs.
>
> >
> > (snip)
> > Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> > Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> > Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > (snip)
> >
> > We try the reproduction of the phenomenon, but do not reappear very much.
> >
> > The same phenomenon is reported by the next email.
> > However, the argument of the problem is over on the way.
> >
> > * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
> >
> > The phenomenon occurred by the next combination.
> > * pacemaker-1.0.11
> > * resource-agents-3.9.2
> > * cluster-glue-1.0.7
> > * heartbeat-3.0.5
> >
> > I registered these contents with Bugzilla.
> > * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
> >
> > Best Regards,
> > Hideo Yamauchi.
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On Tue, Oct 18, 2011 at 12:19 PM, wrote:
> Hi,
>
> We sometimes fail in a stop of attrd.
>
> Step1. start a cluster in 2 nodes
> Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
>
> The attrd catches the TERM signal, but does not stop.

There's no evidence that it actually catches it, only that it is sent.
I've seen it before but never figured out why it occurs.

>
> (snip)
> Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> (snip)
>
> We try the reproduction of the phenomenon, but do not reappear very much.
>
> The same phenomenon is reported by the next email.
> However, the argument of the problem is over on the way.
>
> * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
>
> The phenomenon occurred by the next combination.
> * pacemaker-1.0.11
> * resource-agents-3.9.2
> * cluster-glue-1.0.7
> * heartbeat-3.0.5
>
> I registered these contents with Bugzilla.
> * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
>
> Best Regards,
> Hideo Yamauchi.
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
On 10/20/2011 07:30 PM, renayama19661...@ybb.ne.jp wrote:
> Hi Alan,
>
> Thank you for your comment.
>
> We are also trying to reproduce the problem and will send a report.
> However, the problem has not reappeared so far.

I gather that the folks on the test team for my project have it happen
fairly often when they're in a certain stage of testing. I expect to get
some hb_report output from them in a week or two. I have put in a link to
Andrew's bug system from ours so that hopefully when the time comes we
will be able to remember what to do ;-)

We had not narrowed it down to attrd being the component that didn't
stop - but looking at the logs for what they did report, it seemed like
the likely suspect. I had already decided that it looked like the most
likely candidate before I saw your email.

They had put in a workaround of just killing everything - which of course
works ;-). At the place where it hung, all the resources were already
stopped, so it was safe - just a bit of overkill (beyond the minimum
necessary).

--
Alan Robertson
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi Alan,

Thank you for your comment.

We are also trying to reproduce the problem and will send a report.
However, the problem has not reappeared so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson wrote:

> Hi,
>
> I've seen a very similar problem in a recent release. In fact, I'm in the
> process of reproducing it so that it can be properly logged and so on. When
> I get the right data for the bug report, I'll attach it to the bug.
>
> FWIW: I'm pretty sure that the signal was properly received by attrd. I
> haven't looked at the attrd code, but my guess is that either it didn't issue
> the correct function call for exiting from mainloop - or that the mainloop
> code didn't actually exit. FWIW - it probably doesn't matter at all what the
> priority for signal handling is - since attrd consumes nearly no CPU. Too
> bad it doesn't log receiving the signal or beginning the process of exiting...
>
> Another random thought - I suppose attrd could be clobbering some memory
> which mainloop needs to properly process an exit. Doesn't seem likely - but
> neither of the above options seem very likely either.
>
> An historical note on an early bug that had similar symptoms (but affected
> every process - not just attrd).
>
> First - what caused such a problem (a very long time ago):
> There is a window between the checking for signals and going to sleep in
> the poll call such that a signal might be ignored for a while.
>
> The glib mainloop code has three entry points called each time a signal
> is received:
>     prepare, check, dispatch.
>
> There is a poll call which occurs between the prepare and check steps. If a
> signal comes in after the prepare call returns, but before the code goes to
> sleep in the poll system call, it will be ignored until
> the poll system call returns. It will get caught on the next iteration of
> the loop.
>
> The fix was fairly simple - the signal handling code instructs the mainloop
> infrastructure to call poll with an argument which prevents it from staying
> asleep longer than a second.
>
> Then the code processes the signal correctly.
>
> On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
> > Hi,
> >
> > We sometimes fail in a stop of attrd.
> >
> > Step1. start a cluster in 2 nodes
> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
> >
> > The attrd catches the TERM signal, but does not stop.
> >
> > (snip)
> > Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> > Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN:
Gmain_timeout_dispatch: > > Dispatch function for send local status was delayed 1030 ms (> 1010 ms) > > before > > being called (GSource: 0xd27dd0) > > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: > > started at 431584254 should have started at 431584151 > > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: > > Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) > > before > > being called (GSource: 0xd28010) > > Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: > > started at 431584254 should have started at 431584151 > > Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working > > on > > write child took 1010 ms (> 100 ms) > > Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on > > Heartbeat API channel took 1010 ms (> 100 ms) > > Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: > > Dispatch function for
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.
Hi,

I've seen a very similar problem in a recent release. In fact, I'm in the process of reproducing it so that it can be properly logged and so on. When I get the right data for the bug report, I'll attach it to the bug.

FWIW: I'm pretty sure that the signal was properly received by attrd. I haven't looked at the attrd code, but my guess is that either it didn't issue the correct function call for exiting from mainloop - or that the mainloop code didn't actually exit. FWIW - it probably doesn't matter at all what the priority for signal handling is - since attrd consumes nearly no CPU. Too bad it doesn't log receiving the signal or beginning the process of exiting...

Another random thought - I suppose attrd could be clobbering some memory which mainloop needs to properly process an exit. Doesn't seem likely - but neither of the above options seems very likely either.

An historical note on an early bug that had similar symptoms (but affected every process - not just attrd).

First - what caused such a problem (a very long time ago):
There is a window between checking for signals and going to sleep in the poll call during which a signal might be ignored for a while.

The glib mainloop code calls three entry points for each source on every iteration of the loop: prepare, check, dispatch.

There is a poll call which occurs between the prepare and check steps. If a signal comes in after the prepare call returns, but before the code goes to sleep in the poll system call, it will be ignored until the poll system call returns. It will get caught on the next iteration of the loop.

The fix was fairly simple - the signal handling code instructs the mainloop infrastructure to call poll with an argument which prevents it from staying asleep longer than a second.

Then the code processes the signal correctly.

On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
> Hi,
>
> We sometimes fail in a stop of attrd.
>
> Step1. start a cluster in 2 nodes
> Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat stop.)
>
> The attrd catches the TERM signal, but does not stop.
>
> (snip)
> Oct 5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> Oct 5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> Oct 5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> Oct 5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> Oct 5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct 5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct 5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> Oct 5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct 5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> (snip)