
Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof

On Tue, Jan 17, 2012 at 10:11 AM, Lars Ellenberg
 wrote:
> On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote:
>> >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>> >>>
>> >>> iiuc, mainloop does something similar to (oversimplified):
>> >>>        timeout = -1; /* infinity */
>> >>>        for s in all GSource
>> >>>                tmp_timeout = -1;
>> >>>                s->prepare(s, &tmp_timeout)
>> >>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>> >>>                        timeout = tmp_timeout;
>> >>>
>> >>>        poll(GSource fd set, n, timeout);
>> >>
>> >> I'm looking at the glib code again now, and it still looks to me like
>> >> the trigger and signal sources do not appear in this fd set.
>> >> Their setup functions would have to have called g_source_add_poll()
>> >> somewhere, which they don't.
>> >>
>> >> So I'm still not seeing why it's a trigger or signal sources' fault
>> >> that glib is doing a never ending call to poll().
>> >> poll() is going to get called regardless of whether our prepare
>> >> function returns true or not.
>> >>
>> >> Looking closer, crm_trigger_prepare() returning TRUE results in:
>> >>                  ready_source->flags |= G_SOURCE_READY;
>> >>
>> >> which in turn causes:
>> >>          context->timeout = 0;
>> >>
>> >> which is essentially what adding
>> >>       if (trig->trigger)
>> >>               *timeout = 0;
>> >>
>> >> to crm_trigger_prepare() was intended to achieve.
>> >>
>> >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
>> >> and could therefore cause poll() to block forever) have a sane timeout
>> >> in their prepare functions?
>
> Probably should, but they usually have not.
> The reasoning probably is, each GSource is responsible for *itself* only.

Well no, because this forces the trigger to care about whether there is
an fd-based GSource too and what timeout, if any, it sets.

>
> That is why first all sources are prepared.
>
> If no non-fd, non-pollable source feels the need to reduce the
> *timeout to something finite in its prepare(), so be it.

So something that doesn't use poll at all should set a timeout for
poll? That doesn't sound right :-)

>
> Besides, what is sane? 1 second? 5? 120? 240?
>
> That's why G_CH_prepare_int() sets the *timeout to 1000,
> and why I suggest to set it to 0 if prepare already knows that the
> trigger is set, and to some finite amount to avoid getting stuck in
> poll, in case no timeout or other source is active which also
> sets some finite timeout.
>
> BTW, if you have an *idle* source, prepare should set timeout to 0.
>
> For those interested, all described below
> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>
> "For idle sources, the prepare and check functions always return TRUE to
> indicate that the source is always ready to be processed. The prepare
> function also returns a timeout value of 0 to ensure that the poll()
> call doesn't block (since that would be time wasted which could have
> been spent running the idle function)."
>
> "... timeout sources ... returns a timeout value to ensure that the
> poll() call doesn't block too long ..."
>
> "... file descriptor sources ... timeout to -1 to indicate that it does
> not mind how long the poll() call blocks ... "
>
>> >> Or is it because the signal itself is interrupting some essential part
>> >> of G_CH_prepare_int() and friends?
>
> In the provided strace, it looks like the SIGTERM
> is delivered while calling some G_CH_prepare_int,
> the ->prepare() used by G_main_add_IPC_Channel.
>
> Since the signal sources are of higher priority,
> we probably are past those already in this iteration,
> we will only notice the trigger in the next check(),
> after the poll.
>
> So it is vital for any non-pollable source such as signals
> to set a finite timeout in their prepare(),
> even if we also mark that signal siginterrupt().
>
>> >>>        for s in all GSource
>> >>>                if s->check(s)
>> >>>                        s->dispatch(s, ...)
>> >>>
>> >>> And at some stage it also orders by priority, of course.
>> >>>
>> >>> Also compare with the comment above /* Sigh... */ in glue 
>> >>> G_SIG_prepare().
>> >>>
>> >>> BTW, the mentioned race between signal delivery and mainloop already
>> >>> doing the poll stage could potentially be solved by using
>> >>> cl_signal_set_interrupt(SIGTERM, 1),
>
> As I just wrote above, that race is not solved at all.
> Only the (necessarily set) finite timeout of the poll
> would be shortened in that case.
>
>> >> But I can't escape the feeling that calling this just masks the
>> >> underlying "why is there a never-ending call to poll() in the first
>> >> place" issue.
>> >> G_CH_prepare_int() and friends /should/ be setting timeouts so that
>> >> poll() can return and any sources created by g_idle_source_new() can
>> >> execute.
>> >
>> > Actually, thinking further, I'm pretty convinced that

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Lars Ellenberg

On Tue, Jan 17, 2012 at 09:52:35AM +1100, Andrew Beekhof wrote:
> On Mon, Jan 16, 2012 at 11:42 PM, Andrew Beekhof  wrote:
> > On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof  wrote:
> >> On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof  
> >> wrote:
> >>> I know I could just apply the patch and be done, but I'd like to
> >>> understand this so it works for the right reason.
> 
> Ok, done:
> 
> https://github.com/beekhof/pacemaker/commit/2a6b296
> 
> If I'm adding voodoo, I at least want the reason well documented so it
> can be removed again if the reason goes away.

That about sums it up, then ;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Lars Ellenberg

On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote:
> >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
> >>>
> >>> iiuc, mainloop does something similar to (oversimplified):
> >>>        timeout = -1; /* infinity */
> >>>        for s in all GSource
> >>>                tmp_timeout = -1;
> >>>                s->prepare(s, &tmp_timeout)
> >>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
> >>>                        timeout = tmp_timeout;
> >>>
> >>>        poll(GSource fd set, n, timeout);
> >>
> >> I'm looking at the glib code again now, and it still looks to me like
> >> the trigger and signal sources do not appear in this fd set.
> >> Their setup functions would have to have called g_source_add_poll()
> >> somewhere, which they don't.
> >>
> >> So I'm still not seeing why it's a trigger or signal sources' fault
> >> that glib is doing a never ending call to poll().
> >> poll() is going to get called regardless of whether our prepare
> >> function returns true or not.
> >>
> >> Looking closer, crm_trigger_prepare() returning TRUE results in:
> >>                  ready_source->flags |= G_SOURCE_READY;
> >>
> >> which in turn causes:
> >>          context->timeout = 0;
> >>
> >> which is essentially what adding
> >>       if (trig->trigger)
> >>               *timeout = 0;
> >>
> >> to crm_trigger_prepare() was intended to achieve.
> >>
> >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
> >> and could therefore cause poll() to block forever) have a sane timeout
> >> in their prepare functions?

Probably should, but they usually have not.
The reasoning probably is, each GSource is responsible for *itself* only.

That is why first all sources are prepared.

If no non-fd, non-pollable source feels the need to reduce the
*timeout to something finite in its prepare(), so be it.

Besides, what is sane? 1 second? 5? 120? 240?

That's why G_CH_prepare_int() sets the *timeout to 1000,
and why I suggest to set it to 0 if prepare already knows that the
trigger is set, and to some finite amount to avoid getting stuck in
poll, in case no timeout or other source is active which also
sets some finite timeout.

BTW, if you have an *idle* source, prepare should set timeout to 0.

For those interested, all described below
http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

"For idle sources, the prepare and check functions always return TRUE to
indicate that the source is always ready to be processed. The prepare
function also returns a timeout value of 0 to ensure that the poll()
call doesn't block (since that would be time wasted which could have
been spent running the idle function)."

"... timeout sources ... returns a timeout value to ensure that the
poll() call doesn't block too long ..."

"... file descriptor sources ... timeout to -1 to indicate that it does
not mind how long the poll() call blocks ... "

> >> Or is it because the signal itself is interrupting some essential part
> >> of G_CH_prepare_int() and friends?

In the provided strace, it looks like the SIGTERM
is delivered while calling some G_CH_prepare_int,
the ->prepare() used by G_main_add_IPC_Channel.

Since the signal sources are of higher priority,
we probably are past those already in this iteration,
so we will only notice the trigger in the next check(),
after the poll.

So it is vital for any non-pollable source such as signals
to set a finite timeout in their prepare(),
even if we also mark that signal siginterrupt().
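
The ordering problem described above can be replayed in a small sketch.
This is only a model under stated assumptions: the "signal" is simulated by
setting a flag between the prepare step and poll(), and the helper name
trigger_noticed_after_poll() is made up for illustration. The pipe stands in
for an IPC descriptor on which nothing ever arrives.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical model of the race: the trigger becomes set only after
 * prepare() has already run, so prepare() could not report it. */
static bool trigger_noticed_after_poll(int timeout_ms)
{
    int fds[2];
    bool trigger = false;
    bool ready;
    struct pollfd pfd;
    int rc;

    assert(timeout_ms >= 0);      /* -1 would block forever in this model */
    if (pipe(fds) != 0)
        return false;

    /* prepare(): the trigger is not set yet, so the source is not ready */
    ready = trigger;

    /* the "signal" arrives now, after prepare() but before poll() */
    trigger = true;

    /* poll(): nothing is ever written to the pipe, so only the finite
     * timeout lets this call return */
    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);

    /* check(): the first chance to notice the trigger */
    if (rc == 0 && !ready)
        ready = trigger;

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

With a finite timeout the trigger is picked up one poll interval late; with
-1 the call in this model would never return, which matches the strace.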

> >>>        for s in all GSource
> >>>                if s->check(s)
> >>>                        s->dispatch(s, ...)
> >>>
> >>> And at some stage it also orders by priority, of course.
> >>>
> >>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
> >>>
> >>> BTW, the mentioned race between signal delivery and mainloop already
> >>> doing the poll stage could potentially be solved by using
> >>> cl_signal_set_interrupt(SIGTERM, 1),

As I just wrote above, that race is not solved at all.
Only the (necessarily set) finite timeout of the poll
would be shortened in that case.

> >> But I can't escape the feeling that calling this just masks the
> >> underlying "why is there a never-ending call to poll() in the first
> >> place" issue.
> >> G_CH_prepare_int() and friends /should/ be setting timeouts so that
> >> poll() can return and any sources created by g_idle_source_new() can
> >> execute.
> >
> > Actually, thinking further, I'm pretty convinced that poll() with an
> > infinite timeout is the default mode of operation for mainloops with
> > cluster-glue's IPC and FD sources.
> > And that this is not a good thing :)

Well, if there are *only* pollable sources, it is.
If there are any other sources, they should have set
their limit on what they think is an acceptable timeout
in their prepare().

> Far too late, brain shutting down.

 ;-)

> ...not a good thing, because it breaks the idle

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof

On Mon, Jan 16, 2012 at 11:42 PM, Andrew Beekhof  wrote:
> On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof  wrote:
>> On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof  wrote:
>>> I know I could just apply the patch and be done, but I'd like to
>>> understand this so it works for the right reason.

Ok, done:

https://github.com/beekhof/pacemaker/commit/2a6b296

If I'm adding voodoo, I at least want the reason well documented so it
can be removed again if the reason goes away.

>>> On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg
>>>  wrote:
>>>> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
>>>>> > Now we proceed to the next mainloop poll:
>>>>> >
>>>>> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI},
>>>>> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1
>>>>> >
>>>>> > Note the -1 (infinity timeout!)
>>>>> >
>>>>> > So even though the trigger was (presumably) set,
>>>>> > and the ->prepare() should have returned true,
>>>>> > the mainloop waits forever for "something" to happen on those file
>>>>> > descriptors.
>>>>> >
>>>>> >
>>>>> > I suggest this:
>>>>> >
>>>>> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
>>>>> >
>>>>> > Also think about this race: crm_trigger_prepare was already
>>>>> > called, only then the signal came in...
>>>>> >
>>>>> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
>>>>> > index 2e8b1d0..fd17b87 100644
>>>>> > --- a/lib/common/mainloop.c
>>>>> > +++ b/lib/common/mainloop.c
>>>>> > @@ -33,6 +33,13 @@ static gboolean
>>>>> >  crm_trigger_prepare(GSource * source, gint * timeout)
>>>>> >  {
>>>>> >     crm_trigger_t *trig = (crm_trigger_t *) source;
>>>>> > +    /* Do not delay signal processing by the mainloop poll stage */
>>>>> > +    if (trig->trigger)
>>>>> > +           *timeout = 0;
>>>>> > +    /* To avoid races between signal delivery and the mainloop poll stage,
>>>>> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
>>>>> > +    else
>>>>> > +           *timeout = 5000; /* arbitrary */
>>>>> >
>>>>> >     return trig->trigger;
>>>>> >  }
>>>>> >
>>>>> >
>>>>> > This scenario does not let the blocked IPC off the hook, though.
>>>>> > That is still possible, both for blocking send and blocking receive,
>>>>> > so that should probably be fixed as well, somehow.
>>>>> > I'm not sure how likely this "stuck in blocking IPC" is, though.
>>>>>
>>>>> Interesting, are you sure you're in the right function though?
>>>>> trigger and signal events don't have a file descriptor... wouldn't
>>>>> these polls be for the IPC related sources and wouldn't they be
>>>>> setting their own timeout?
>>>>
>>>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>>>>
>>>> iiuc, mainloop does something similar to (oversimplified):
>>>>        timeout = -1; /* infinity */
>>>>        for s in all GSource
>>>>                tmp_timeout = -1;
>>>>                s->prepare(s, &tmp_timeout)
>>>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>>>>                        timeout = tmp_timeout;
>>>>
>>>>        poll(GSource fd set, n, timeout);
>>>
>>> I'm looking at the glib code again now, and it still looks to me like
>>> the trigger and signal sources do not appear in this fd set.
>>> Their setup functions would have to have called g_source_add_poll()
>>> somewhere, which they don't.
>>>
>>> So I'm still not seeing why it's a trigger or signal sources' fault
>>> that glib is doing a never ending call to poll().
>>> poll() is going to get called regardless of whether our prepare
>>> function returns true or not.
>>>
>>> Looking closer, crm_trigger_prepare() returning TRUE results in:
>>>                  ready_source->flags |= G_SOURCE_READY;
>>>
>>> which in turn causes:
>>>          context->timeout = 0;
>>>
>>> which is essentially what adding
>>>       if (trig->trigger)
>>>               *timeout = 0;
>>>
>>> to crm_trigger_prepare() was intended to achieve.
>>>
>>> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
>>> and could therefore cause poll() to block forever) have a sane timeout
>>> in their prepare functions?
>>> Or is it because the signal itself is interrupting some essential part
>>> of G_CH_prepare_int() and friends?
>>>

>>>>        for s in all GSource
>>>>                if s->check(s)
>>>>                        s->dispatch(s, ...)
>>>>
>>>> And at some stage it also orders by priority, of course.
>>>>
>>>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
>>>>
>>>> BTW, the mentioned race between signal delivery and mainloop already
>>>> doing the poll stage could potentially be solved by using
>>>
>>> Again, since nothing related to the signal source ever appears in the
>>> call to poll(), I'm not seeing where the race comes from.
>>> Or am I missing something obvious?
>>>
>>>> cl_signal_set_interrupt(SIGTERM, 1),
>>>
>>> This, combined with
>>>
>>>                /*
>>>

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof

On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof  wrote:
> On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof  wrote:
>> I know I could just apply the patch and be done, but I'd like to
>> understand this so it works for the right reason.
>>
>> On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg
>>  wrote:
>>> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
>>>> > Now we proceed to the next mainloop poll:
>>>> >
>>>> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI},
>>>> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1
>>>> >
>>>> > Note the -1 (infinity timeout!)
>>>> >
>>>> > So even though the trigger was (presumably) set,
>>>> > and the ->prepare() should have returned true,
>>>> > the mainloop waits forever for "something" to happen on those file
>>>> > descriptors.
>>>> >
>>>> >
>>>> > I suggest this:
>>>> >
>>>> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
>>>> >
>>>> > Also think about this race: crm_trigger_prepare was already
>>>> > called, only then the signal came in...
>>>> >
>>>> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
>>>> > index 2e8b1d0..fd17b87 100644
>>>> > --- a/lib/common/mainloop.c
>>>> > +++ b/lib/common/mainloop.c
>>>> > @@ -33,6 +33,13 @@ static gboolean
>>>> >  crm_trigger_prepare(GSource * source, gint * timeout)
>>>> >  {
>>>> >     crm_trigger_t *trig = (crm_trigger_t *) source;
>>>> > +    /* Do not delay signal processing by the mainloop poll stage */
>>>> > +    if (trig->trigger)
>>>> > +           *timeout = 0;
>>>> > +    /* To avoid races between signal delivery and the mainloop poll stage,
>>>> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
>>>> > +    else
>>>> > +           *timeout = 5000; /* arbitrary */
>>>> >
>>>> >     return trig->trigger;
>>>> >  }
>>>> >
>>>> >
>>>> > This scenario does not let the blocked IPC off the hook, though.
>>>> > That is still possible, both for blocking send and blocking receive,
>>>> > so that should probably be fixed as well, somehow.
>>>> > I'm not sure how likely this "stuck in blocking IPC" is, though.
>>>>
>>>> Interesting, are you sure you're in the right function though?
>>>> trigger and signal events don't have a file descriptor... wouldn't
>>>> these polls be for the IPC related sources and wouldn't they be
>>>> setting their own timeout?
>>>
>>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>>>
>>> iiuc, mainloop does something similar to (oversimplified):
>>>        timeout = -1; /* infinity */
>>>        for s in all GSource
>>>                tmp_timeout = -1;
>>>                s->prepare(s, &tmp_timeout)
>>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>>>                        timeout = tmp_timeout;
>>>
>>>        poll(GSource fd set, n, timeout);
>>
>> I'm looking at the glib code again now, and it still looks to me like
>> the trigger and signal sources do not appear in this fd set.
>> Their setup functions would have to have called g_source_add_poll()
>> somewhere, which they don't.
>>
>> So I'm still not seeing why it's a trigger or signal sources' fault
>> that glib is doing a never ending call to poll().
>> poll() is going to get called regardless of whether our prepare
>> function returns true or not.
>>
>> Looking closer, crm_trigger_prepare() returning TRUE results in:
>>                  ready_source->flags |= G_SOURCE_READY;
>>
>> which in turn causes:
>>          context->timeout = 0;
>>
>> which is essentially what adding
>>       if (trig->trigger)
>>               *timeout = 0;
>>
>> to crm_trigger_prepare() was intended to achieve.
>>
>> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
>> and could therefore cause poll() to block forever) have a sane timeout
>> in their prepare functions?
>> Or is it because the signal itself is interrupting some essential part
>> of G_CH_prepare_int() and friends?
>>
>>>
>>>        for s in all GSource
>>>                if s->check(s)
>>>                        s->dispatch(s, ...)
>>>
>>> And at some stage it also orders by priority, of course.
>>>
>>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
>>>
>>> BTW, the mentioned race between signal delivery and mainloop already
>>> doing the poll stage could potentially be solved by using
>>
>> Again, since nothing related to the signal source ever appears in the
>> call to poll(), I'm not seeing where the race comes from.
>> Or am I missing something obvious?
>>
>>> cl_signal_set_interrupt(SIGTERM, 1),
>>
>> This, combined with
>>
>>                /*
>>                 * If we don't set this on, then the mainloop poll(2) call
>>                 * will never be interrupted by this signal - which sort of
>>                 * defeats the whole purpose of a signal handler in a
>>                 * mainloop program
>>                 */
>>                cl_signal_set_interrupt(signal, TRUE);
>>
>> looks more relevant.
>>

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof

On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof  wrote:
> I know I could just apply the patch and be done, but I'd like to
> understand this so it works for the right reason.
>
> On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg
>  wrote:
>> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
>>> > Now we proceed to the next mainloop poll:
>>> >
>>> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, 
>>> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1
>>> >
>>> > Note the -1 (infinity timeout!)
>>> >
>>> > So even though the trigger was (presumably) set,
>>> > and the ->prepare() should have returned true,
>>> > the mainloop waits forever for "something" to happen on those file 
>>> > descriptors.
>>> >
>>> >
>>> > I suggest this:
>>> >
>>> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
>>> >
>>> > Also think about this race: crm_trigger_prepare was already
>>> > called, only then the signal came in...
>>> >
>>> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
>>> > index 2e8b1d0..fd17b87 100644
>>> > --- a/lib/common/mainloop.c
>>> > +++ b/lib/common/mainloop.c
>>> > @@ -33,6 +33,13 @@ static gboolean
>>> >  crm_trigger_prepare(GSource * source, gint * timeout)
>>> >  {
>>> >     crm_trigger_t *trig = (crm_trigger_t *) source;
>>> > +    /* Do not delay signal processing by the mainloop poll stage */
>>> > +    if (trig->trigger)
>>> > +           *timeout = 0;
>>> > +    /* To avoid races between signal delivery and the mainloop poll 
>>> > stage,
>>> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
>>> > +    else
>>> > +           *timeout = 5000; /* arbitrary */
>>> >
>>> >     return trig->trigger;
>>> >  }
>>> >
>>> >
>>> > This scenario does not let the blocked IPC off the hook, though.
>>> > That is still possible, both for blocking send and blocking receive,
>>> > so that should probably be fixed as well, somehow.
>>> > I'm not sure how likely this "stuck in blocking IPC" is, though.
>>>
>>> Interesting, are you sure you're in the right function though?
>>> trigger and signal events don't have a file descriptor... wouldn't
>>> these polls be for the IPC related sources and wouldn't they be
>>> setting their own timeout?
>>
>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>>
>> iiuc, mainloop does something similar to (oversimplified):
>>        timeout = -1; /* infinity */
>>        for s in all GSource
>>                tmp_timeout = -1;
>>                s->prepare(s, &tmp_timeout)
>>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>>                        timeout = tmp_timeout;
>>
>>        poll(GSource fd set, n, timeout);
>
> I'm looking at the glib code again now, and it still looks to me like
> the trigger and signal sources do not appear in this fd set.
> Their setup functions would have to have called g_source_add_poll()
> somewhere, which they don't.
>
> So I'm still not seeing why it's a trigger or signal sources' fault
> that glib is doing a never ending call to poll().
> poll() is going to get called regardless of whether our prepare
> function returns true or not.
>
> Looking closer, crm_trigger_prepare() returning TRUE results in:
>                  ready_source->flags |= G_SOURCE_READY;
>
> which in turn causes:
>          context->timeout = 0;
>
> which is essentially what adding
>       if (trig->trigger)
>               *timeout = 0;
>
> to crm_trigger_prepare() was intended to achieve.
>
> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
> and could therefore cause poll() to block forever) have a sane timeout
> in their prepare functions?
> Or is it because the signal itself is interrupting some essential part
> of G_CH_prepare_int() and friends?
>
>>
>>        for s in all GSource
>>                if s->check(s)
>>                        s->dispatch(s, ...)
>>
>> And at some stage it also orders by priority, of course.
>>
>> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
>>
>> BTW, the mentioned race between signal delivery and mainloop already
>> doing the poll stage could potentially be solved by using
>
> Again, since nothing related to the signal source ever appears in the
> call to poll(), I'm not seeing where the race comes from.
> Or am I missing something obvious?
>
>> cl_signal_set_interrupt(SIGTERM, 1),
>
> This, combined with
>
>                /*
>                 * If we don't set this on, then the mainloop poll(2) call
>                 * will never be interrupted by this signal - which sort of
>                 * defeats the whole purpose of a signal handler in a
>                 * mainloop program
>                 */
>                cl_signal_set_interrupt(signal, TRUE);
>
> looks more relevant.
> But I can't escape the feeling that calling this just masks the
> underlying "why is there a never-ending call to poll() in the first
> place" issue.
> G_CH_prepare_int() and friends /sh

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof

I know I could just apply the patch and be done, but I'd like to
understand this so it works for the right reason.

On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg
 wrote:
> On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
>> > Now we proceed to the next mainloop poll:
>> >
>> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, 
>> >> {fd=5, events=POLLIN|POLLPRI}], 3, -1
>> >
>> > Note the -1 (infinity timeout!)
>> >
>> > So even though the trigger was (presumably) set,
>> > and the ->prepare() should have returned true,
>> > the mainloop waits forever for "something" to happen on those file 
>> > descriptors.
>> >
>> >
>> > I suggest this:
>> >
>> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
>> >
>> > Also think about this race: crm_trigger_prepare was already
>> > called, only then the signal came in...
>> >
>> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
>> > index 2e8b1d0..fd17b87 100644
>> > --- a/lib/common/mainloop.c
>> > +++ b/lib/common/mainloop.c
>> > @@ -33,6 +33,13 @@ static gboolean
>> >  crm_trigger_prepare(GSource * source, gint * timeout)
>> >  {
>> >     crm_trigger_t *trig = (crm_trigger_t *) source;
>> > +    /* Do not delay signal processing by the mainloop poll stage */
>> > +    if (trig->trigger)
>> > +           *timeout = 0;
>> > +    /* To avoid races between signal delivery and the mainloop poll stage,
>> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
>> > +    else
>> > +           *timeout = 5000; /* arbitrary */
>> >
>> >     return trig->trigger;
>> >  }
>> >
>> >
>> > This scenario does not let the blocked IPC off the hook, though.
>> > That is still possible, both for blocking send and blocking receive,
>> > so that should probably be fixed as well, somehow.
>> > I'm not sure how likely this "stuck in blocking IPC" is, though.
>>
>> Interesting, are you sure you're in the right function though?
>> trigger and signal events don't have a file descriptor... wouldn't
>> these polls be for the IPC related sources and wouldn't they be
>> setting their own timeout?
>
> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
>
> iiuc, mainloop does something similar to (oversimplified):
>        timeout = -1; /* infinity */
>        for s in all GSource
>                tmp_timeout = -1;
>                s->prepare(s, &tmp_timeout)
>                if (tmp_timeout >= 0 && tmp_timeout < timeout)
>                        timeout = tmp_timeout;
>
>        poll(GSource fd set, n, timeout);

I'm looking at the glib code again now, and it still looks to me like
the trigger and signal sources do not appear in this fd set.
Their setup functions would have to have called g_source_add_poll()
somewhere, which they don't.

So I'm still not seeing why it's a trigger or signal sources' fault
that glib is doing a never ending call to poll().
poll() is going to get called regardless of whether our prepare
function returns true or not.

Looking closer, crm_trigger_prepare() returning TRUE results in:
  ready_source->flags |= G_SOURCE_READY;

which in turn causes:
  context->timeout = 0;

which is essentially what adding
      if (trig->trigger)
              *timeout = 0;

to crm_trigger_prepare() was intended to achieve.

Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll()
and could therefore cause poll() to block forever) have a sane timeout
in their prepare functions?
Or is it because the signal itself is interrupting some essential part
of G_CH_prepare_int() and friends?
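
The way the per-source results end up as one poll() timeout, as described
above, can be modelled roughly as follows. This is a sketch of the behaviour
discussed in this thread, not a copy of the glib code: struct prepare_result
and poll_timeout_for() are hypothetical names, and -1 is treated as "no
preference" when taking the minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What one source reported from its prepare() call. */
struct prepare_result {
    bool ready;        /* prepare() returned TRUE */
    int  timeout_ms;   /* value written to *timeout, -1 = no preference */
};

/* Combine all prepare() results into the timeout passed to poll(). */
static int poll_timeout_for(const struct prepare_result *res, size_t n)
{
    int timeout = -1;              /* infinity unless some source objects */
    size_t i;

    assert(res != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (res[i].ready)
            return 0;              /* a ready source forces a non-blocking poll */
        if (res[i].timeout_ms >= 0 &&
            (timeout < 0 || res[i].timeout_ms < timeout))
            timeout = res[i].timeout_ms;
    }
    return timeout;
}
```

If every source is an fd or IPC source that reports -1 and none is ready,
the result is -1, which is the infinite poll() seen in the strace.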

>
>        for s in all GSource
>                if s->check(s)
>                        s->dispatch(s, ...)
>
> And at some stage it also orders by priority, of course.
>
> Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
>
> BTW, the mentioned race between signal delivery and mainloop already
> doing the poll stage could potentially be solved by using

Again, since nothing related to the signal source ever appears in the
call to poll(), I'm not seeing where the race comes from.
Or am I missing something obvious?

> cl_signal_set_interrupt(SIGTERM, 1),

This, combined with

/*
 * If we don't set this on, then the mainloop poll(2) call
 * will never be interrupted by this signal - which sort of
 * defeats the whole purpose of a signal handler in a
 * mainloop program
 */
cl_signal_set_interrupt(signal, TRUE);

looks more relevant.
But I can't escape the feeling that calling this just masks the
underlying "why is there a never-ending call to poll() in the first
place" issue.
G_CH_prepare_int() and friends /should/ be setting timeouts so that
poll() can return and any sources created by g_idle_source_new() can
execute.
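
What "interrupting poll()" means here can be tried out in isolation. The
sketch below assumes that cl_signal_set_interrupt(sig, TRUE) has the same
effect as installing the handler without SA_RESTART (that is, like
siginterrupt(sig, 1)); that equivalence is an assumption, not something
checked against the glue source. SIGALRM, the 20 ms timer and the helper
name poll_is_interrupted() are arbitrary choices for the demonstration.

```c
#define _XOPEN_SOURCE 700   /* for sigaction(), setitimer(), poll() */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Returns true if a signal arriving while poll() is blocking made the
 * call return early with EINTR. */
static bool poll_is_interrupted(void)
{
    struct sigaction sa;
    struct itimerval timer;
    int rc;

    got_signal = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return false;

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 20 * 1000;   /* fire once, 20 ms from now */
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return false;

    /* No descriptors at all: only the timeout or a signal ends this call.
     * The 2000 ms timeout is a safety net so the sketch cannot hang. */
    rc = poll(NULL, 0, 2000);
    return rc == -1 && errno == EINTR && got_signal == 1;
}
```

This only covers a signal that arrives while poll() is already blocking; a
signal delivered just before poll() is entered interrupts nothing.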

> which would mean we can condense the prepare to
>        if (trig->trigger)
>                *timeout = 0;
>

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Lars Ellenberg

On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
> > Now we proceed to the next mainloop poll:
> >
> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
> >> events=POLLIN|POLLPRI}], 3, -1
> >
> > Note the -1 (infinity timeout!)
> >
> > So even though the trigger was (presumably) set,
> > and the ->prepare() should have returned true,
> > the mainloop waits forever for "something" to happen on those file 
> > descriptors.
> >
> >
> > I suggest this:
> >
> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
> >
> > Also think about this race: crm_trigger_prepare was already
> > called, only then the signal came in...
> >
> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> > index 2e8b1d0..fd17b87 100644
> > --- a/lib/common/mainloop.c
> > +++ b/lib/common/mainloop.c
> > @@ -33,6 +33,13 @@ static gboolean
> >  crm_trigger_prepare(GSource * source, gint * timeout)
> >  {
> >     crm_trigger_t *trig = (crm_trigger_t *) source;
> > +    /* Do not delay signal processing by the mainloop poll stage */
> > +    if (trig->trigger)
> > +           *timeout = 0;
> > +    /* To avoid races between signal delivery and the mainloop poll stage,
> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
> > +    else
> > +           *timeout = 5000; /* arbitrary */
> >
> >     return trig->trigger;
> >  }
> >
> >
> > This scenario does not let the blocked IPC off the hook, though.
> > That is still possible, both for blocking send and blocking receive,
> > so that should probably be fixed as well, somehow.
> > I'm not sure how likely this "stuck in blocking IPC" is, though.
> 
> Interesting, are you sure you're in the right function though?
> trigger and signal events don't have a file descriptor... wouldn't
> these polls be for the IPC related sources and wouldn't they be
> setting their own timeout?

http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

iiuc, mainloop does something similar to (oversimplified):
        timeout = -1; /* infinity */
        for s in all GSource
                tmp_timeout = -1;
                s->prepare(s, &tmp_timeout)
                if (tmp_timeout >= 0 && (timeout < 0 || tmp_timeout < timeout))
                        timeout = tmp_timeout;

        poll(GSource fd set, n, timeout);

        for s in all GSource
                if s->check(s)
                        s->dispatch(s, ...)

And at some stage it also orders by priority, of course.
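
The timeout selection described above can be sketched in plain C. This is only an illustration, not glib's code: src_t, loop_timeout(), prepare_fd() and prepare_trigger() are made-up names, and it assumes that a prepare() returning true forces a zero timeout while a source that never touches *timeout leaves the loop free to block forever.

```c
#include <stddef.h>

/* Hypothetical stand-in for a GSource; not the glib type. */
typedef struct src {
    int (*prepare)(struct src *self, int *timeout);
    int pending;    /* only used by prepare_trigger() below */
} src_t;

/* Returns what would be passed to poll(): -1 means "block forever". */
static int loop_timeout(src_t **srcs, size_t n)
{
    int timeout = -1;

    for (size_t i = 0; i < n; i++) {
        int tmp = -1;
        int ready = srcs[i]->prepare(srcs[i], &tmp);

        if (ready)
            timeout = 0;    /* something is ready: do not block */
        else if (tmp >= 0 && (timeout < 0 || tmp < timeout))
            timeout = tmp;  /* keep the smallest finite timeout */
    }
    return timeout;
}

/* Behaves like an fd-based source that never sets a timeout. */
static int prepare_fd(src_t *self, int *timeout)
{
    (void)self;
    (void)timeout;
    return 0;
}

/* Behaves like a trigger source with a 5000 ms safety net. */
static int prepare_trigger(src_t *self, int *timeout)
{
    *timeout = self->pending ? 0 : 5000;
    return self->pending;
}
```

With only prepare_fd() sources registered, loop_timeout() yields -1, which matches the infinite poll seen in the strace. Adding a source like prepare_trigger() caps the wait at 5000 ms, or at 0 once its flag is set.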

Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().

BTW, the mentioned race between signal delivery and mainloop already
doing the poll stage could potentially be solved by using
cl_signal_set_interrupt(SIGTERM, 1),
which would mean we can condense the prepare to
        if (trig->trigger)
                *timeout = 0;
        return trig->trigger;
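
To illustrate what interruption buys here: a handler installed without SA_RESTART makes a blocking poll() fail with EINTR when the signal arrives, so the loop gets to run the prepare functions again. The standalone demo below assumes that cl_signal_set_interrupt(sig, 1) has that effect (an assumption, not checked against the glue source here). All function names in it are made up for illustration.

```c
#define _XOPEN_SOURCE 700   /* sigaction/setitimer under strict -std=c99 */
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t got_signal = 0;

/* Like a trigger-style handler: only raise a flag. */
static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Install on_signal() for sig with SA_RESTART cleared. */
static int install_interrupting_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART */
    return sigaction(sig, &sa, NULL);
}

/* Arm a one-shot SIGALRM after delay_ms, then block in poll() for up to
 * timeout_ms. Returns 1 if poll() was cut short by the signal, 0 if not,
 * -1 on setup failure. */
static int poll_interrupted_by_alarm(int delay_ms, int timeout_ms)
{
    struct itimerval it;
    int rc;

    got_signal = 0;
    memset(&it, 0, sizeof(it));
    it.it_value.tv_sec = delay_ms / 1000;
    it.it_value.tv_usec = (delay_ms % 1000) * 1000;

    if (install_interrupting_handler(SIGALRM) != 0)
        return -1;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0)
        return -1;

    rc = poll(NULL, 0, timeout_ms);     /* no fds: just wait */
    return rc == -1 && errno == EINTR && got_signal;
}
```

Calling poll_interrupted_by_alarm(50, 5000) should come back after roughly 50 ms rather than 5 seconds. Note that the demo only covers a signal that arrives while poll() is already blocking.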

The glue (and heartbeat) code base is not that, let's say, involved
merely because someone was paranoid,
but because someone was paranoid for a reason ;-)

Cheers,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-15 Thread Andrew Beekhof

On Sun, Jan 15, 2012 at 1:57 AM, Lars Ellenberg
 wrote:
> On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi Lars,
>>
>> I attach strace file when a problem reappeared at the end of last year.
>> I used glue which applied your patch for confirmation.
>>
>> It is the file which I picked with attrd by strace -p command right before I 
>> stop Heartbeat.
>>
>> Finally SIGTERM caught it, but attrd did not stop.
>> The attrd stopped afterwards when I sent SIGKILL.
>
> The strace reveals something interesting:
>
> This poll looks like the mainloop poll,
> but some ->prepare() has modified the timeout to be 0,
> so we proceed directly to ->check() and then ->dispatch().
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
>> events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
>> recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
>> unavailable)
> ...
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be 
>> restarted)
>> --- SIGTERM (Terminated) @ 0 (0) ---
>> sigreturn()                             = ? (mask now [])
>
> Ok. signal received, trigger set.
> Still finishing this mainloop iteration, though.
>
> These recv(),poll() look like invocations of G_CH_prepare_int().
> Does not matter much, though.
>
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
>
> Now we proceed to the next mainloop poll:
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
>> events=POLLIN|POLLPRI}], 3, -1
>
> Note the -1 (infinity timeout!)
>
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file 
> descriptors.
>
>
> I suggest this:
>
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
>
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
>
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>     crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +           *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +           *timeout = 5000; /* arbitrary */
>
>     return trig->trigger;
>  }
>
>
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.

Interesting, are you sure you're in the right function though?
trigger and signal events don't have a file descriptor... wouldn't
these polls be for the IPC related sources and wouldn't they be
setting their own timeout?

>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-15 Thread renayama19661014

Hi Lars,

Thank you for your comments and suggestion.

> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
> > events=POLLIN|POLLPRI}], 3, -1
> 
> Note the -1 (infinity timeout!)
> 
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file 
> descriptors.
> 
> 
> I suggest this:
> 
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
> 
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
> 
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>  crm_trigger_t *trig = (crm_trigger_t *) source;
> +/* Do not delay signal processing by the mainloop poll stage */
> +if (trig->trigger)
> +*timeout = 0;
> +/* To avoid races between signal delivery and the mainloop poll stage,
> + * make sure we always have a finite timeout. Unit: milliseconds. */
> +else
> +*timeout = 5000; /* arbitrary */
>  
>  return trig->trigger;
>  }
> 
> 
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.

I will apply your suggested correction and continue investigating the problem.

I will report back if I learn anything new.

Best Regards,
Hideo Yamauchi.

--- On Sat, 2012/1/14, Lars Ellenberg  wrote:

> On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
> > Hi Lars,
> > 
> > I attach strace file when a problem reappeared at the end of last year.
> > I used glue which applied your patch for confirmation.
> > 
> > It is the file which I picked with attrd by strace -p command right before 
> > I stop Heartbeat.
> > 
> > Finally SIGTERM caught it, but attrd did not stop.
> > The attrd stopped afterwards when I sent SIGKILL.
> 
> The strace reveals something interesting:
> 
> This poll looks like the mainloop poll,
> but some ->prepare() has modified the timeout to be 0,
> so we proceed directly to ->check() and then ->dispatch().
> 
> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
> > events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
> 
> > times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
> > recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
> > unavailable)
> ...
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> > unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be 
> > restarted)
> > --- SIGTERM (Terminated) @ 0 (0) ---
> > sigreturn()                             = ? (mask now [])
> 
> Ok. signal received, trigger set.
> Still finishing this mainloop iteration, though.
> 
> These recv(),poll() look like invocations of G_CH_prepare_int().
> Does not matter much, though.
> 
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> > unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> > unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
> 
> > times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
> 
> Now we proceed to the next mainloop poll:
> 
> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
> > events=POLLIN|POLLPRI}], 3, -1
> 
> Note the -1 (infinity timeout!)
> 
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file 
> descriptors.
> 
> 
> I suggest this:
> 
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
> 
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
> 
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>      crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +        *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +        *timeout = 5000; /* arbitrary */
>  
>      return trig->trigger;
>  }
> 
> 
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for b

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-14 Thread Lars Ellenberg

On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Lars,
> 
> I attach strace file when a problem reappeared at the end of last year.
> I used glue which applied your patch for confirmation.
> 
> It is the file which I picked with attrd by strace -p command right before I 
> stop Heartbeat.
> 
> Finally SIGTERM caught it, but attrd did not stop.
> The attrd stopped afterwards when I sent SIGKILL.

The strace reveals something interesting:

This poll looks like the mainloop poll,
but some ->prepare() has modified the timeout to be 0,
so we proceed directly to ->check() and then ->dispatch().

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
> events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
> recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
> unavailable)
...
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> unavailable)
> poll([{fd=7, events=0}], 1, 0)  = ? ERESTART_RESTARTBLOCK (To be 
> restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn() = ? (mask now [])

Ok. signal received, trigger set.
Still finishing this mainloop iteration, though.

These recv(),poll() look like invocations of G_CH_prepare_int().
Does not matter much, though.

> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> unavailable)
> poll([{fd=7, events=0}], 1, 0)  = 0 (Timeout)
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
> unavailable)
> poll([{fd=7, events=0}], 1, 0)  = 0 (Timeout)

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634

Now we proceed to the next mainloop poll:

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
> events=POLLIN|POLLPRI}], 3, -1

Note the -1 (infinite timeout!)

So even though the trigger was (presumably) set,
and the ->prepare() should have returned true,
the mainloop waits forever for "something" to happen on those file descriptors.

I suggest this:

crm_trigger_prepare should set *timeout = 0, if trigger is set.

Also think about this race: crm_trigger_prepare was already
called, only then the signal came in...

diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
index 2e8b1d0..fd17b87 100644
--- a/lib/common/mainloop.c
+++ b/lib/common/mainloop.c
@@ -33,6 +33,13 @@ static gboolean
 crm_trigger_prepare(GSource * source, gint * timeout)
 {
     crm_trigger_t *trig = (crm_trigger_t *) source;
+    /* Do not delay signal processing by the mainloop poll stage */
+    if (trig->trigger)
+        *timeout = 0;
+    /* To avoid races between signal delivery and the mainloop poll stage,
+     * make sure we always have a finite timeout. Unit: milliseconds. */
+    else
+        *timeout = 5000; /* arbitrary */
 
     return trig->trigger;
 }

This scenario does not let the blocked IPC off the hook, though.
That is still possible, both for blocking send and blocking receive,
so that should probably be fixed as well, somehow.
I'm not sure how likely this "stuck in blocking IPC" is, though.
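
For the blocked-IPC case, one option raised earlier in this thread was to resume with only the remaining time after an interruption, instead of restarting the full timeout. A minimal sketch of that idea in plain C follows. It is not the glue IPC code, and now_ms() and poll_until_deadline() are illustrative names only.

```c
#define _XOPEN_SOURCE 700   /* clock_gettime/poll under strict -std=c99 */
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Milliseconds on the monotonic clock (unaffected by wall-clock changes). */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* poll() that survives EINTR without restarting the full timeout.
 * timeout_ms < 0 means "wait forever", as with poll() itself. */
static int poll_until_deadline(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        int wait_ms = timeout_ms;
        int rc;

        if (timeout_ms >= 0) {
            long long left = deadline - now_ms();
            wait_ms = left > 0 ? (int)left : 0;
        }

        rc = poll(fds, nfds, wait_ms);
        if (rc >= 0 || errno != EINTR)
            return rc;      /* ready, timed out, or a real error */
        /* EINTR: go around again, but only for the time that is left */
    }
}
```

With a fixed deadline, a signal every few seconds can no longer keep the wait alive indefinitely. The caller still only learns about the signal once the wait returns, so this bounds the delay rather than removing it.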

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-11 Thread renayama19661014

Hi Lars,
Hi Dejan,

I captured an ltrace file when the problem occurred.
I attach the ltrace file.

I am continuing the investigation with gdb.

If you have any suggestions for improvement, please let me know.

Best Regards,
Hideo Yamauchi.


--- On Tue, 2012/1/10, renayama19661...@ybb.ne.jp  
wrote:

> Hi Lars,
> 
> I attach strace file when a problem reappeared at the end of last year.
> I used glue which applied your patch for confirmation.
> 
> It is the file which I picked with attrd by strace -p command right before I 
> stop Heartbeat.
> 
> Finally SIGTERM caught it, but attrd did not stop.
> The attrd stopped afterwards when I sent SIGKILL.
> 
>  * I acquire the information such as ltrace from now on.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> --- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp  
> wrote:
> 
> > Hi Lars,
> > 
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > > 
> > > various ways to try to do that:
> > > cat /proc//stack   # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step 
> > > it,
> > > cause attrd to core dump, and analyse the core.
> > 
> > All right.
> > I investigate the cause a little more.
> > 
> > Give me the time for investigation a little more.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Fri, 2011/12/30, Lars Ellenberg  wrote:
> > 
> > > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp 
> > > wrote:
> > > > Hi Dejan,
> > > > Hi Lars,
> > > > 
> > > > In our environment, the problem recurred with the patch of Mr. Lars.
> > > > After a problem occurred, I sent TERM signal, but attrd does not seem to
> > > > receive TERM at all.
> > > 
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > > 
> > > various ways to try to do that:
> > > cat /proc//stack   # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step 
> > > it,
> > > cause attrd to core dump, and analyse the core.
> > > 
> > > > The reconsideration of the patch is necessary for the solution to 
> > > > problem.
> > > > 
> > > > 
> > > > Best Regards,
> > > > Hideo Yamauchi.
> > > > 
> > > > 
> > > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
> > > >  wrote:
> > > > 
> > > > > Hi Dejan,
> > > > > Hi Lars,
> > > > > 
> > > > > I understood it.
> > > > > I try the operation of the patch in our environment.
> > > > > 
> > > > > To Alan: Will you try a patch?
> > > > > 
> > > > > Best Regards,
> > > > > Hideo Yamauchi.
> > > > > 
> > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > > > > >  wrote:
> > > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof 
> > > > > > > > > wrote:
> > > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,  
> > > > > > > > >>  wrote:
> > > > > > > > >> > Hi,
> > > > > > > > >> >
> > > > > > > > >> > We sometimes fail in a stop of attrd.
> > > > > > > > >> >
> > > > > > > > >> > Step1. start a cluster in 2 nodes
> > > > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > > > > > > > >> > Step3. stop the second node after time passed a 
> > > > > > > > >> > little.(/etc/init.d/heartbeat
> > > > > > > > >> > stop.)
> > > > > > > > >> >
> > > > > > > > >> > The attrd catches the TERM signal, but does not stop.
> > > > > > > > >>
> > > > > > > > >> There's no evidence that it actually catches it, only that 
> > > > > > > > >> it is sent.
> > > > > > > > >> I've seen it before but never figured out why it occurs.
> > > > > > > > >
> > > > > > > > > I had it once tracked down almost to where it occurs, but 
> > > > > > > > > then got distracted.
> > > > > > > > > Yes the signal was delivered.
> > > > > > > > >
> > > > > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > > > > or looping in some internal message delivery function too 
> > > > > > > > > often.
> > > > > > > > >
> > > > > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > > > > but I'm not sure.
> > > > > > > > >
> > > > > > > > > I *may* be that, because
> > > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, 
> > > > > > > > > &ipc_rc);
> > > > > > > > >
> > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > > > > >        IPC_INTR:
> > > > > > > > >                if ( allow_intr){
> > > > > > > > >                        goto startwait;
> > > > > > > > >
> > > > > > > > > Depending on the frequency of deliverd signals, it may cause 
> > > > > > > > > this got

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-09 Thread renayama19661014

Hi Lars,

I attach the strace output from when the problem reappeared at the end of last year.
To confirm, I used glue with your patch applied.

I captured it by attaching to attrd with strace -p right before I
stopped Heartbeat.

In the end attrd received SIGTERM, but it did not stop.
attrd stopped afterwards, when I sent SIGKILL.

 * I will collect further information, such as ltrace output, from now on.

Best Regards,
Hideo Yamauchi.


--- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp  
wrote:

> Hi Lars,
> 
> > If you are able to reproduce,
> > you could try to find out what exactly attrd is doing.
> > 
> > various ways to try to do that:
> > cat /proc//stack   # if your platform supports that
> > strace it,
> > ltrace it,
> > attach with gdb and provide a stack trace, or even start to single step it,
> > cause attrd to core dump, and analyse the core.
> 
> All right.
> I investigate the cause a little more.
> 
> Give me the time for investigation a little more.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> --- On Fri, 2011/12/30, Lars Ellenberg  wrote:
> 
> > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
> > > Hi Dejan,
> > > Hi Lars,
> > > 
> > > In our environment, the problem recurred with the patch of Mr. Lars.
> > > After a problem occurred, I sent TERM signal, but attrd does not seem to
> > > receive TERM at all.
> > 
> > If you are able to reproduce,
> > you could try to find out what exactly attrd is doing.
> > 
> > various ways to try to do that:
> > cat /proc//stack   # if your platform supports that
> > strace it,
> > ltrace it,
> > attach with gdb and provide a stack trace, or even start to single step it,
> > cause attrd to core dump, and analyse the core.
> > 
> > > The reconsideration of the patch is necessary for the solution to problem.
> > > 
> > > 
> > > Best Regards,
> > > Hideo Yamauchi.
> > > 
> > > 
> > > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
> > >  wrote:
> > > 
> > > > Hi Dejan,
> > > > Hi Lars,
> > > > 
> > > > I understood it.
> > > > I try the operation of the patch in our environment.
> > > > 
> > > > To Alan: Will you try a patch?
> > > > 
> > > > Best Regards,
> > > > Hideo Yamauchi.
> > > > 
> > > > --- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > > > >  wrote:
> > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,  
> > > > > > > >>  wrote:
> > > > > > > >> > Hi,
> > > > > > > >> >
> > > > > > > >> > We sometimes fail in a stop of attrd.
> > > > > > > >> >
> > > > > > > >> > Step1. start a cluster in 2 nodes
> > > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > > > > > > >> > Step3. stop the second node after time passed a 
> > > > > > > >> > little.(/etc/init.d/heartbeat
> > > > > > > >> > stop.)
> > > > > > > >> >
> > > > > > > >> > The attrd catches the TERM signal, but does not stop.
> > > > > > > >>
> > > > > > > >> There's no evidence that it actually catches it, only that it 
> > > > > > > >> is sent.
> > > > > > > >> I've seen it before but never figured out why it occurs.
> > > > > > > >
> > > > > > > > I had it once tracked down almost to where it occurs, but then 
> > > > > > > > got distracted.
> > > > > > > > Yes the signal was delivered.
> > > > > > > >
> > > > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > > > or looping in some internal message delivery function too often.
> > > > > > > >
> > > > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > > > but I'm not sure.
> > > > > > > >
> > > > > > > > I *may* be that, because
> > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, 
> > > > > > > > &ipc_rc);
> > > > > > > >
> > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > > > >        IPC_INTR:
> > > > > > > >                if ( allow_intr){
> > > > > > > >                        goto startwait;
> > > > > > > >
> > > > > > > > Depending on the frequency of deliverd signals, it may cause 
> > > > > > > > this goto
> > > > > > > > startwait loop to never exit, because the timeout always starts 
> > > > > > > > again
> > > > > > > > from the full passed in timeout.
> > > > > > > >
> > > > > > > > If only one signal is deliverd, it may still take 120 seconds
> > > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the 
> > > > > > > > signal
> > > > > > > > handler only raises a flag for the next mainloop iteration.
> > > > > > > >
> > > > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > > > then the goto loop will never timeout.
> > > > > > > >
> > > > > > > > Please someone check th

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-04 Thread renayama19661014

Hi Lars,

> If you are able to reproduce,
> you could try to find out what exactly attrd is doing.
> 
> various ways to try to do that:
> cat /proc//stack   # if your platform supports that
> strace it,
> ltrace it,
> attach with gdb and provide a stack trace, or even start to single step it,
> cause attrd to core dump, and analyse the core.

All right.
I will investigate the cause a little more.

Please give me a little more time for the investigation.

Best Regards,
Hideo Yamauchi.

--- On Fri, 2011/12/30, Lars Ellenberg  wrote:

> On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
> > Hi Dejan,
> > Hi Lars,
> > 
> > In our environment, the problem recurred with the patch of Mr. Lars.
> > After a problem occurred, I sent TERM signal, but attrd does not seem to
> > receive TERM at all.
> 
> If you are able to reproduce,
> you could try to find out what exactly attrd is doing.
> 
> various ways to try to do that:
> cat /proc//stack   # if your platform supports that
> strace it,
> ltrace it,
> attach with gdb and provide a stack trace, or even start to single step it,
> cause attrd to core dump, and analyse the core.
> 
> > The reconsideration of the patch is necessary for the solution to problem.
> > 
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
> >  wrote:
> > 
> > > Hi Dejan,
> > > Hi Lars,
> > > 
> > > I understood it.
> > > I try the operation of the patch in our environment.
> > > 
> > > To Alan: Will you try a patch?
> > > 
> > > Best Regards,
> > > Hideo Yamauchi.
> > > 
> > > --- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:
> > > 
> > > > Hi,
> > > > 
> > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > > >  wrote:
> > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,   
> > > > > > >> wrote:
> > > > > > >> > Hi,
> > > > > > >> >
> > > > > > >> > We sometimes fail in a stop of attrd.
> > > > > > >> >
> > > > > > >> > Step1. start a cluster in 2 nodes
> > > > > > >> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > > > > > >> > Step3. stop the second node after time passed a 
> > > > > > >> > little.(/etc/init.d/heartbeat
> > > > > > >> > stop.)
> > > > > > >> >
> > > > > > >> > The attrd catches the TERM signal, but does not stop.
> > > > > > >>
> > > > > > >> There's no evidence that it actually catches it, only that it is 
> > > > > > >> sent.
> > > > > > >> I've seen it before but never figured out why it occurs.
> > > > > > >
> > > > > > > I had it once tracked down almost to where it occurs, but then 
> > > > > > > got distracted.
> > > > > > > Yes the signal was delivered.
> > > > > > >
> > > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > > or looping in some internal message delivery function too often.
> > > > > > >
> > > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > > but I'm not sure.
> > > > > > >
> > > > > > > I *may* be that, because
> > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > > > > >
> > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > > >        IPC_INTR:
> > > > > > >                if ( allow_intr){
> > > > > > >                        goto startwait;
> > > > > > >
> > > > > > > Depending on the frequency of deliverd signals, it may cause this 
> > > > > > > goto
> > > > > > > startwait loop to never exit, because the timeout always starts 
> > > > > > > again
> > > > > > > from the full passed in timeout.
> > > > > > >
> > > > > > > If only one signal is deliverd, it may still take 120 seconds
> > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > > > > handler only raises a flag for the next mainloop iteration.
> > > > > > >
> > > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > > then the goto loop will never timeout.
> > > > > > >
> > > > > > > Please someone check this for plausibility ;-)
> > > > > > 
> > > > > > Most plausible explanation I've heard so far... still odd that only
> > > > > > attrd is affected.
> > > > > > So what do we do about it?
> > > > > 
> > > > > Reproduce, and confirm that this is what people are seeing.
> > > > > 
> > > > > Make attrd non-blocking?
> > > > > 
> > > > > Fix the ipc layer to not restart the full timeout,
> > > > > but only the remaining partial time?
> > > > 
> > > > Lars and I made a quick patch for cluster-glue (attached).
> > > > Hideo-san, is there a way for you to verify if it helps? The
> > > > patch is not perfect and under unfavourable circumstances it may
> > > > still take a long time for the caller to exit, but it'd be good
> > > > to know if this is the right

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-12-29 Thread Lars Ellenberg

On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Dejan,
> Hi Lars,
> 
> In our environment, the problem recurred with the patch of Mr. Lars.
> After a problem occurred, I sent TERM signal, but attrd does not seem to
> receive TERM at all.

If you are able to reproduce,
you could try to find out what exactly attrd is doing.

Various ways to try to do that:
cat /proc/<pid>/stack   # if your platform supports that
strace it,
ltrace it,
attach with gdb and provide a stack trace, or even start to single-step it,
cause attrd to core dump, and analyse the core.

> The reconsideration of the patch is necessary for the solution to problem.
> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
>  wrote:
> 
> > Hi Dejan,
> > Hi Lars,
> > 
> > Understood.
> > I will test the patch in our environment.
> > 
> > To Alan: Will you try the patch?
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:
> > 
> > > Hi,
> > > 
> > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > >  wrote:
> > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,   
> > > > > >> wrote:
> > > > > >> > Hi,
> > > > > >> >
> > > > > > >> > attrd sometimes fails to stop.
> > > > > > >> >
> > > > > > >> > Step 1. Start a two-node cluster.
> > > > > > >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > > > > > >> > Step 3. Stop the second node a little later
> > > > > > >> > (/etc/init.d/heartbeat stop).
> > > > > > >> >
> > > > > > >> > attrd catches the TERM signal, but does not stop.
> > > > > >>
> > > > > >> There's no evidence that it actually catches it, only that it is 
> > > > > >> sent.
> > > > > >> I've seen it before but never figured out why it occurs.
> > > > > >
> > > > > > I had it once tracked down almost to where it occurs, but then got 
> > > > > > distracted.
> > > > > > Yes the signal was delivered.
> > > > > >
> > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > or looping in some internal message delivery function too often.
> > > > > >
> > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > but I'm not sure.
> > > > > >
> > > > > > > It *may* be that, because
> > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > > > >
> > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > >        IPC_INTR:
> > > > > >                if ( allow_intr){
> > > > > >                        goto startwait;
> > > > > >
> > > > > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > > > > startwait loop to never exit, because the timeout always starts again
> > > > > > > from the full passed-in timeout.
> > > > > > >
> > > > > > > If only one signal is delivered, it may still take 120 seconds
> > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > > > > handler only raises a flag for the next mainloop iteration.
> > > > > > >
> > > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > > then the goto loop will never time out.
> > > > > >
> > > > > > Please someone check this for plausibility ;-)
> > > > > 
> > > > > Most plausible explanation I've heard so far... still odd that only
> > > > > attrd is affected.
> > > > > So what do we do about it?
> > > > 
> > > > Reproduce, and confirm that this is what people are seeing.
> > > > 
> > > > Make attrd non-blocking?
> > > > 
> > > > Fix the ipc layer to not restart the full timeout,
> > > > but only the remaining partial time?
> > > 
> > > Lars and I made a quick patch for cluster-glue (attached).
> > > Hideo-san, is there a way for you to verify if it helps? The
> > > patch is not perfect and under unfavourable circumstances it may
> > > still take a long time for the caller to exit, but it'd be good
> > > to know if this is the right spot.
> > > 
> > > Cheers,
> > > 
> > > Dejan
> > > 
> > > > -- 
> > > > : Lars Ellenberg
> > > > : LINBIT | Your Way to High Availability
> > > > : DRBD/HA support and consulting http://www.linbit.com
> > > > 
> > > > ___
> > > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > 
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: 
> > > > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > >
> > 
> 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-12-21 Thread renayama19661014

Hi Dejan,
Hi Lars,

In our environment, the problem recurred even with Lars's patch applied.
After the problem occurred, I sent a TERM signal, but attrd does not seem to
receive TERM at all.

The patch needs to be reconsidered to solve the problem.


Best Regards,
Hideo Yamauchi.


--- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp  
wrote:

> Hi Dejan,
> Hi Lars,
> 
> Understood.
> I will test the patch in our environment.
> 
> To Alan: Will you try the patch?
> 
> Best Regards,
> Hideo Yamauchi.
> 
> --- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:
> 
> > Hi,
> > 
> > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > >  wrote:
> > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > >> On Tue, Oct 18, 2011 at 12:19 PM,   
> > > > >> wrote:
> > > > >> > Hi,
> > > > >> >
> > > > > >> > attrd sometimes fails to stop.
> > > > > >> >
> > > > > >> > Step 1. Start a two-node cluster.
> > > > > >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > > > > >> > Step 3. Stop the second node a little later
> > > > > >> > (/etc/init.d/heartbeat stop).
> > > > > >> >
> > > > > >> > attrd catches the TERM signal, but does not stop.
> > > > >>
> > > > >> There's no evidence that it actually catches it, only that it is 
> > > > >> sent.
> > > > >> I've seen it before but never figured out why it occurs.
> > > > >
> > > > > I had it once tracked down almost to where it occurs, but then got 
> > > > > distracted.
> > > > > Yes the signal was delivered.
> > > > >
> > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > or looping in some internal message delivery function too often.
> > > > >
> > > > > I had a quick look at the code again now, to try and remember,
> > > > > but I'm not sure.
> > > > >
> > > > > > It *may* be that, because
> > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > > >
> > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > >        IPC_INTR:
> > > > >                if ( allow_intr){
> > > > >                        goto startwait;
> > > > >
> > > > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > > > startwait loop to never exit, because the timeout always starts again
> > > > > > from the full passed-in timeout.
> > > > > >
> > > > > > If only one signal is delivered, it may still take 120 seconds
> > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > > > handler only raises a flag for the next mainloop iteration.
> > > > > >
> > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > then the goto loop will never time out.
> > > > >
> > > > > Please someone check this for plausibility ;-)
> > > > 
> > > > Most plausible explanation I've heard so far... still odd that only
> > > > attrd is affected.
> > > > So what do we do about it?
> > > 
> > > Reproduce, and confirm that this is what people are seeing.
> > > 
> > > Make attrd non-blocking?
> > > 
> > > Fix the ipc layer to not restart the full timeout,
> > > but only the remaining partial time?
> > 
> > Lars and I made a quick patch for cluster-glue (attached).
> > Hideo-san, is there a way for you to verify if it helps? The
> > patch is not perfect and under unfavourable circumstances it may
> > still take a long time for the caller to exit, but it'd be good
> > to know if this is the right spot.
> > 
> > Cheers,
> > 
> > Dejan
> > 
> > > -- 
> > > : Lars Ellenberg
> > > : LINBIT | Your Way to High Availability
> > > : DRBD/HA support and consulting http://www.linbit.com
> > > 
> >
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-14 Thread renayama19661014

Hi Dejan,
Hi Lars,

Understood.
I will test the patch in our environment.

To Alan: Will you try the patch?

Best Regards,
Hideo Yamauchi.

--- On Tue, 2011/11/15, Dejan Muhamedagic  wrote:

> Hi,
> 
> On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > >  wrote:
> > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > >> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> > > >> > Hi,
> > > >> >
> > > > >> > attrd sometimes fails to stop.
> > > > >> >
> > > > >> > Step 1. Start a two-node cluster.
> > > > >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > > > >> > Step 3. Stop the second node a little later
> > > > >> > (/etc/init.d/heartbeat stop).
> > > > >> >
> > > > >> > attrd catches the TERM signal, but does not stop.
> > > >>
> > > >> There's no evidence that it actually catches it, only that it is sent.
> > > >> I've seen it before but never figured out why it occurs.
> > > >
> > > > I had it once tracked down almost to where it occurs, but then got 
> > > > distracted.
> > > > Yes the signal was delivered.
> > > >
> > > > I *think* it had to do with attrd doing a blocking read,
> > > > or looping in some internal message delivery function too often.
> > > >
> > > > I had a quick look at the code again now, to try and remember,
> > > > but I'm not sure.
> > > >
> > > > It *may* be that, because
> > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > >
> > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > >        IPC_INTR:
> > > >                if ( allow_intr){
> > > >                        goto startwait;
> > > >
> > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > startwait loop to never exit, because the timeout always starts again
> > > > from the full passed-in timeout.
> > > >
> > > > If only one signal is delivered, it may still take 120 seconds
> > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > handler only raises a flag for the next mainloop iteration.
> > > >
> > > > If a (non-fatal) signal is delivered every few seconds,
> > > > then the goto loop will never time out.
> > > >
> > > > Please someone check this for plausibility ;-)
> > > 
> > > Most plausible explanation I've heard so far... still odd that only
> > > attrd is affected.
> > > So what do we do about it?
> > 
> > Reproduce, and confirm that this is what people are seeing.
> > 
> > Make attrd non-blocking?
> > 
> > Fix the ipc layer to not restart the full timeout,
> > but only the remaining partial time?
> 
> Lars and I made a quick patch for cluster-glue (attached).
> Hideo-san, is there a way for you to verify if it helps? The
> patch is not perfect and under unfavourable circumstances it may
> still take a long time for the caller to exit, but it'd be good
> to know if this is the right spot.
> 
> Cheers,
> 
> Dejan
> 
> > -- 
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
> > 
> 


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-14 Thread Dejan Muhamedagic

Hi,

On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> >  wrote:
> > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > >> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> > >> > Hi,
> > >> >
> > > >> > attrd sometimes fails to stop.
> > > >> >
> > > >> > Step 1. Start a two-node cluster.
> > > >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > > >> > Step 3. Stop the second node a little later
> > > >> > (/etc/init.d/heartbeat stop).
> > > >> >
> > > >> > attrd catches the TERM signal, but does not stop.
> > >>
> > >> There's no evidence that it actually catches it, only that it is sent.
> > >> I've seen it before but never figured out why it occurs.
> > >
> > > I had it once tracked down almost to where it occurs, but then got 
> > > distracted.
> > > Yes the signal was delivered.
> > >
> > > I *think* it had to do with attrd doing a blocking read,
> > > or looping in some internal message delivery function too often.
> > >
> > > I had a quick look at the code again now, to try and remember,
> > > but I'm not sure.
> > >
> > > It *may* be that, because
> > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > >
> > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > >        IPC_INTR:
> > >                if ( allow_intr){
> > >                        goto startwait;
> > >
> > > Depending on the frequency of delivered signals, it may cause this goto
> > > startwait loop to never exit, because the timeout always starts again
> > > from the full passed-in timeout.
> > >
> > > If only one signal is delivered, it may still take 120 seconds
> > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > handler only raises a flag for the next mainloop iteration.
> > >
> > > If a (non-fatal) signal is delivered every few seconds,
> > > then the goto loop will never time out.
> > >
> > > Please someone check this for plausibility ;-)
> > 
> > Most plausible explanation I've heard so far... still odd that only
> > attrd is affected.
> > So what do we do about it?
> 
> Reproduce, and confirm that this is what people are seeing.
> 
> Make attrd non-blocking?
> 
> Fix the ipc layer to not restart the full timeout,
> but only the remaining partial time?

Lars and I made a quick patch for cluster-glue (attached).
Hideo-san, is there a way for you to verify if it helps? The
patch is not perfect and under unfavourable circumstances it may
still take a long time for the caller to exit, but it'd be good
to know if this is the right spot.
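
To put a rough number on "a long time": the sketch below assumes only the
values quoted in this thread (the attached patch stops retrying once sig_cnt
has reached 20, and MAX_IPC_DELAY is 120 seconds). worst_case_block_s is a
hypothetical helper for illustration, not cluster-glue code.

```c
#include <assert.h>

/* Values as quoted in this thread; treat them as assumptions. */
enum { SIG_LIMIT = 20, MAX_IPC_DELAY_S = 120 };

/*
 * Upper bound (in seconds) on how long the patched loop can block when
 * every wait is interrupted just before its timeout would expire.
 * The check "sig_cnt++ >= limit" lets `limit` interruptions retry and
 * returns on the next one, so at most limit + 1 waits are started,
 * each of which restarts the full timeout.
 */
long worst_case_block_s(long timeout_s, int limit)
{
    assert(timeout_s >= 0 && limit >= 0);
    return (long)(limit + 1) * timeout_s;
}
```

With the numbers above this gives 21 * 120 = 2520 seconds, i.e. 42 minutes,
which fits the caveat that the caller may still take a long time to exit.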

Cheers,

Dejan

> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
# HG changeset patch
# User Lars Ellenberg 
# Date 1321275721 -3600
# Node ID 8b50bf0dd4cdf8d0a405416da98711080b2abeb9
# Parent  569bdebf736185d77782f49d5c760007cfc6b3e8
Medium: clplumbing: don't restart timeouts forever if signals are repeatedly sent

diff -r 569bdebf7361 -r 8b50bf0dd4cd lib/clplumbing/cl_msg.c
--- a/lib/clplumbing/cl_msg.c	Mon Nov 14 11:31:51 2011 +0100
+++ b/lib/clplumbing/cl_msg.c	Mon Nov 14 14:02:01 2011 +0100
@@ -1802,12 +1802,13 @@ static struct ha_msg*
 msgfromIPC_ll(IPC_Channel * ch, int flag, unsigned int timeout, int *rc_out)
 {
 	int		rc;
+	int		sig_cnt = 0;
 	IPC_Message*	ipcmsg;
 	struct ha_msg*	hmsg;
 	int		need_auth = flag & MSG_NEEDAUTH;
 	int		allow_intr = flag & MSG_ALLOWINTR;
 	
- startwait:
+	do {
 	if(timeout > 0) {
 	rc = cl_ipc_wait_timeout(ch, ch->ops->waitin, timeout);
 	} else {
@@ -1832,17 +1833,17 @@ msgfromIPC_ll(IPC_Channel * ch, int flag
 		return NULL;
 		
 	case IPC_INTR:
-		if ( allow_intr){
-			goto startwait;
-		}else{
+		if (!allow_intr || sig_cnt++ >= 20) {
 			return NULL;
+		} else {
+			break;
 		}
 		
 	case IPC_OK:
 		break;
 	}
-	
-	
+	} while (rc != IPC_OK);
+
 	ipcmsg = NULL;
 	rc = ch->ops->recv(ch, &ipcmsg);
 #if 0

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-14 Thread Lars Ellenberg

On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
>  wrote:
> > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> >> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> >> > Hi,
> >> >
> >> > attrd sometimes fails to stop.
> >> >
> >> > Step 1. Start a two-node cluster.
> >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> >> > Step 3. Stop the second node a little later
> >> > (/etc/init.d/heartbeat stop).
> >> >
> >> > attrd catches the TERM signal, but does not stop.
> >>
> >> There's no evidence that it actually catches it, only that it is sent.
> >> I've seen it before but never figured out why it occurs.
> >
> > I had it once tracked down almost to where it occurs, but then got 
> > distracted.
> > Yes the signal was delivered.
> >
> > I *think* it had to do with attrd doing a blocking read,
> > or looping in some internal message delivery function too often.
> >
> > I had a quick look at the code again now, to try and remember,
> > but I'm not sure.
> >
> > It *may* be that, because
> > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> >
> > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> >        IPC_INTR:
> >                if ( allow_intr){
> >                        goto startwait;
> >
> > Depending on the frequency of delivered signals, it may cause this goto
> > startwait loop to never exit, because the timeout always starts again
> > from the full passed-in timeout.
> >
> > If only one signal is delivered, it may still take 120 seconds
> > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > handler only raises a flag for the next mainloop iteration.
> >
> > If a (non-fatal) signal is delivered every few seconds,
> > then the goto loop will never time out.
> >
> > Please someone check this for plausibility ;-)
> 
> Most plausible explanation I've heard so far... still odd that only
> attrd is affected.
> So what do we do about it?

Reproduce, and confirm that this is what people are seeing.

Make attrd non-blocking?

Fix the ipc layer to not restart the full timeout,
but only the remaining partial time?
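
One way the "remaining partial time" idea could look, as a minimal sketch.
It uses plain poll() and a monotonic clock instead of the cluster-glue IPC
API, whose internals are not shown in this thread; wait_readable_deadline
and wait_readable_selfcheck are hypothetical names used only here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds on the monotonic clock (not affected by wall-clock changes). */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until poll() reports an event on fd, or until timeout_ms has
 * elapsed in total, no matter how often a signal interrupts the wait.
 * Returns 1 if an event is pending, 0 on timeout, -1 on error.
 */
int wait_readable_deadline(int fd, int timeout_ms)
{
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;   /* fixed once, never restarted */

    for (;;) {
        long long remaining = deadline - now_ms();
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc;

        if (remaining < 0) {
            remaining = 0;
        }
        rc = poll(&pfd, 1, (int)remaining);
        if (rc > 0) {
            return 1;
        }
        if (rc == 0) {
            return 0;
        }
        if (errno != EINTR) {
            return -1;
        }
        /* Interrupted by a signal: retry, but only for the time left. */
        if (now_ms() >= deadline) {
            return 0;
        }
    }
}

/* Small self-check: an empty pipe times out, a pipe with data is ready. */
int wait_readable_selfcheck(void)
{
    int p[2];
    int ok;

    if (pipe(p) != 0) {
        return -1;
    }
    ok = wait_readable_deadline(p[0], 50) == 0;        /* nothing written yet */
    ok = ok && write(p[1], "x", 1) == 1;
    ok = ok && wait_readable_deadline(p[0], 50) == 1;  /* data is pending */
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

The design point is that the deadline is computed once before the loop, so a
signal arriving every few seconds can shorten each individual poll() call but
cannot push the overall timeout further out.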

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-13 Thread Andrew Beekhof

On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
 wrote:
> On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
>> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
>> > Hi,
>> >
>> > attrd sometimes fails to stop.
>> >
>> > Step 1. Start a two-node cluster.
>> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
>> > Step 3. Stop the second node a little later
>> > (/etc/init.d/heartbeat stop).
>> >
>> > attrd catches the TERM signal, but does not stop.
>>
>> There's no evidence that it actually catches it, only that it is sent.
>> I've seen it before but never figured out why it occurs.
>
> I had it once tracked down almost to where it occurs, but then got distracted.
> Yes the signal was delivered.
>
> I *think* it had to do with attrd doing a blocking read,
> or looping in some internal message delivery function too often.
>
> I had a quick look at the code again now, to try and remember,
> but I'm not sure.
>
> It *may* be that, because
> xmlfromIPC(IPC_Channel * ch, int timeout) calls
>    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
>
> And MSG_ALLOWINTR will cause msgfromIPC_ll() to
>        IPC_INTR:
>                if ( allow_intr){
>                        goto startwait;
>
> Depending on the frequency of delivered signals, it may cause this goto
> startwait loop to never exit, because the timeout always starts again
> from the full passed-in timeout.
>
> If only one signal is delivered, it may still take 120 seconds
> (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> handler only raises a flag for the next mainloop iteration.
>
> If a (non-fatal) signal is delivered every few seconds,
> then the goto loop will never time out.
>
> Please someone check this for plausibility ;-)

Most plausible explanation I've heard so far... still odd that only
attrd is affected.
So what do we do about it?


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-06 Thread Lars Ellenberg

On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> > Hi,
> >
> > attrd sometimes fails to stop.
> >
> > Step 1. Start a two-node cluster.
> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
> >
> > attrd catches the TERM signal, but does not stop.
> 
> There's no evidence that it actually catches it, only that it is sent.
> I've seen it before but never figured out why it occurs.

I had it once tracked down almost to where it occurs, but then got distracted.
Yes the signal was delivered.

I *think* it had to do with attrd doing a blocking read,
or looping in some internal message delivery function too often.

I had a quick look at the code again now, to try and remember,
but I'm not sure.

It *may* be that, because
xmlfromIPC(IPC_Channel * ch, int timeout) calls
   msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

And MSG_ALLOWINTR will cause msgfromIPC_ll() to
       IPC_INTR:
               if ( allow_intr){
                       goto startwait;

Depending on the frequency of delivered signals, it may cause this goto
startwait loop to never exit, because the timeout always starts again
from the full passed-in timeout.

If only one signal is delivered, it may still take 120 seconds
(MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
handler only raises a flag for the next mainloop iteration.

If a (non-fatal) signal is delivered every few seconds,
then the goto loop will never time out.

Please someone check this for plausibility ;-)
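
For anyone checking the plausibility, the effect can be worked through with
a small deterministic simulation. This is only a sketch of the reasoning
above, not the cl_msg.c code: simulate_wait is a hypothetical function, and
it assumes a signal arriving at a fixed period.

```c
#include <assert.h>

/*
 * Simulate a blocking wait that is interrupted by a signal every
 * signal_period_ms. Returns the simulated time (ms) at which the wait
 * reports a timeout, or -1 if the pending deadline has moved past
 * horizon_ms without ever being reached.
 *
 * restart_full != 0: each interruption restarts the full timeout
 *                    (the "goto startwait" behaviour described above).
 * restart_full == 0: each interruption resumes with the remaining time.
 *
 * A signal landing exactly on the deadline is treated as a timeout.
 */
long simulate_wait(long timeout_ms, long signal_period_ms,
                   long horizon_ms, int restart_full)
{
    long now = 0;
    long deadline = timeout_ms;

    assert(timeout_ms >= 0 && signal_period_ms > 0);

    while (deadline <= horizon_ms) {
        long next_signal = (now / signal_period_ms + 1) * signal_period_ms;

        if (deadline <= next_signal) {
            return deadline;              /* timeout fires first */
        }
        now = next_signal;                /* wait interrupted by the signal */
        if (restart_full) {
            deadline = now + timeout_ms;  /* full timeout starts again */
        }
    }
    return -1;
}
```

With a 120000 ms timeout and a signal every 5000 ms, the restart-full variant
returns -1 for any horizon, while the remaining-time variant returns 120000.
If the signal period is longer than the timeout, both variants time out at
120000 ms.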

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-03 Thread renayama19661014

Hi Andrew,
Hi Alan,

We are working hard to reproduce the phenomenon and to collect evidence of the
problem.
However, we have not obtained any evidence yet.
I will wait for information from Alan.

Best Regards,
Hideo Yamauchi.



--- On Wed, 2011/11/2, Andrew Beekhof  wrote:

> On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> > Hi,
> >
> > attrd sometimes fails to stop.
> >
> > Step 1. Start a two-node cluster.
> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
> >
> > attrd catches the TERM signal, but does not stop.
> 
> There's no evidence that it actually catches it, only that it is sent.
> I've seen it before but never figured out why it occurs.
> 
> >
> > (snip)
> > Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> > Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> > Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> > Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> > (snip)
> >
> > We have tried to reproduce the phenomenon, but it rarely recurs.
> >
> > The same phenomenon is reported in the following email.
> > However, that discussion stopped partway through.
> >
> >  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
> >
> > The phenomenon occurred with the following combination.
> >  * pacemaker-1.0.11
> >  * resource-agents-3.9.2
> >  * cluster-glue-1.0.7
> >  * heartbeat-3.0.5
> >
> > I registered this report in Bugzilla.
> >  * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> 


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-02 Thread Andrew Beekhof

On Tue, Oct 18, 2011 at 12:19 PM,   wrote:
> Hi,
>
> attrd sometimes fails to stop.
>
> Step 1. Start a two-node cluster.
> Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
>
> attrd catches the TERM signal, but does not stop.

There's no evidence that it actually catches it, only that it is sent.
I've seen it before but never figured out why it occurs.

>
> (snip)
> Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to 12238 is not connected
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel: Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations (4123.00us average, 0% utilization) in the last 10min
> Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC channel took 1010 ms (> 100 ms)
> Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431583547 should have started at 431583444
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431584254 should have started at 431584151
> Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on write child took 1010 ms (> 100 ms)
> Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on Heartbeat API channel took 1010 ms (> 100 ms)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd27dd0)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 1030 ms (> 1010 ms) before being called (GSource: 0xd28010)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch: started at 431607988 should have started at 431607885
> (snip)
>
> We are trying to reproduce the problem, but it rarely recurs.
>
> The same problem is reported in the following email.
> However, that discussion ended partway through.
>
>  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
>
> The problem occurred with the following combination of versions.
>  * pacemaker-1.0.11
>  * resource-agents-3.9.2
>  * cluster-glue-1.0.7
>  * heartbeat-3.0.5
>
> I have registered this in Bugzilla.
>  * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
>
> Best Regards,
> Hideo Yamauchi.
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-21 Thread Alan Robertson


On 10/20/2011 07:30 PM, renayama19661...@ybb.ne.jp wrote:

> Hi Alan,
>
> Thank you for your comment.
>
> We are also trying to reproduce the problem and will send a report.
> However, it has not recurred so far.

I gather that the folks on the test team for my project have it happen 
fairly often when they're in a certain stage of testing.  I expect to 
get some hb_report output from them in a week or two.  I have put in a 
link to Andrew's bug system from ours so that hopefully when the time 
comes we will be able to remember what to do ;-)


We had not narrowed it down to attrd being the component that didn't 
stop - but looking at the logs for what they did report, it seemed like 
the likely suspect.  I had already decided that it looked like the most 
likely candidate before I saw your email.


They had put in a workaround of just killing everything - which of 
course works ;-).  At the place where it hung, all the resources were 
already stopped, so it was safe - just a bit of overkill (beyond the 
minimum necessary).



--
Alan Robertson

"Openness is the foundation and preservative of friendship...  Let me claim from you 
at all times your undisguised opinions." - William Wilberforce


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-20 Thread renayama19661014

Hi Alan,

Thank you for your comment.

We are also trying to reproduce the problem and will send a report.
However, it has not recurred so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson  wrote:

> Hi,
> 
> I've seen a very similar problem in a recent release.  In fact, I'm in the 
> process of reproducing it so that it can be properly logged and so on.  When 
> I get the right data for the bug report, I'll attach it to the bug.
> 
> FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
> haven't looked at the attrd code, but my guess is that either it didn't issue 
> the correct function call for exiting from mainloop - or that the mainloop 
> code didn't actually exit.  FWIW - it probably doesn't matter at all what the 
> priority for signal handling is - since attrd consumes nearly no CPU.  Too 
> bad it doesn't log receiving the signal or beginning the process of exiting...
> 
> Another random thought - I suppose attrd could be clobbering some memory 
> which mainloop needs to properly process an exit.  Doesn't seem likely - but 
> neither of the above options seem very likely either.
> 
> 
> 
> An historical note on an early bug that had similar symptoms (but affected 
> every process - not just attrd).
> 
> First - what caused such a problem (a very long time ago):
>     There is a window between checking for signals and going to sleep in 
> the poll call during which a signal might be ignored for a while.
> 
>     The glib mainloop code has three entry points called each time a signal 
> is received:
>             prepare, check, dispatch.
> 
> There is a poll call which occurs between the prepare and check steps.  If a 
> signal comes in after the prepare call returns, but before the code goes to 
> sleep in the poll system call, it will be ignored until
> the poll system call returns.  It will get caught on the next iteration of 
> the loop.
> 
> The fix was fairly simple - the signal handling code instructs the mainloop 
> infrastructure to call poll with an argument which prevents it from staying 
> asleep longer than a second.
> 
> Then the code processes the signal correctly.
> 
> 
> On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
> > Hi,
> > 
> > Stopping attrd sometimes fails.
> > 
> > Step 1. Start a cluster with 2 nodes.
> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
> > 
> > The attrd catches the TERM signal, but does not stop.
> > 
> > (snip)
> > Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel 
> > to
> > 12238 is not connected
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
> > Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 
> > failed
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply 
> > to
> > crmd failed: reply failed
> > Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
> > /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 
> > operations
> > (4123.00us average, 0% utilization) in the last 10min
> > Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> > channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> > channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) 
> > before
> > being called (GSource: 0xd28010)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431583547 should have started at 431583444
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for send local status was delayed 1030 ms (>  1010 ms) 
> > before
> > being called (GSource: 0xd27dd0)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431584254 should have started at 431584151
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) 
> > before
> > being called (GSource: 0xd28010)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431584254 should have started at 431584151
> > Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working 
> > on
> > write child took 1010 ms (>  100 ms)
> > Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
> > Heartbeat API channel took 1010 ms (>  100 ms)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-19 Thread Alan Robertson


Hi,

I've seen a very similar problem in a recent release.  In fact, I'm in 
the process of reproducing it so that it can be properly logged and so 
on.  When I get the right data for the bug report, I'll attach it to the 
bug.


FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
haven't looked at the attrd code, but my guess is that either it didn't 
issue the correct function call for exiting from mainloop - or that the 
mainloop code didn't actually exit.  FWIW - it probably doesn't matter 
at all what the priority for signal handling is - since attrd consumes 
nearly no CPU.  Too bad it doesn't log receiving the signal or beginning 
the process of exiting...


Another random thought - I suppose attrd could be clobbering some memory 
which mainloop needs to properly process an exit.  Doesn't seem likely - 
but neither of the above options seem very likely either.




An historical note on an early bug that had similar symptoms (but 
affected every process - not just attrd).


First - what caused such a problem (a very long time ago):
There is a window between checking for signals and going to 
sleep in the poll call during which a signal might be ignored for a while.

The glib mainloop code has three entry points called each time a 
signal is received:

prepare, check, dispatch.

There is a poll call which occurs between the prepare and check steps.  
If a signal comes in after the prepare call returns, but before the code 
goes to sleep in the poll system call, it will be ignored until
the poll system call returns.  It will get caught on the next iteration 
of the loop.


The fix was fairly simple - the signal handling code instructs the 
mainloop infrastructure to call poll with an argument which prevents it 
from staying asleep longer than a second.


Then the code processes the signal correctly.
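
The race and the fix described above can be sketched in plain C. This is only an illustration under stated assumptions, not the heartbeat or cluster-glue code: the names on_term, signal_prepare and run_loop, and the inject_late_signal switch, are made up for this example, and poll() is called with no file descriptors so that it acts purely as a bounded sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t keeps the write async-signal-safe. */
static volatile sig_atomic_t got_term = 0;

static void on_term(int signo)
{
    (void)signo;
    got_term = 1;
}

/* "prepare" step (hypothetical name): reports whether the signal is
 * already pending, and caps the poll timeout at max_sleep_ms so that a
 * signal arriving later is still noticed within that time. */
static int signal_prepare(int *timeout_ms, int max_sleep_ms)
{
    if (*timeout_ms < 0 || *timeout_ms > max_sleep_ms)
        *timeout_ms = max_sleep_ms;
    return got_term != 0;
}

/* Hypothetical driver loop. Returns the iteration on which the signal
 * was dispatched, or -1 if none was seen in max_iterations rounds.
 * With inject_late_signal set, SIGTERM is raised in the window between
 * prepare and poll on the first iteration. */
static int run_loop(int max_sleep_ms, int inject_late_signal, int max_iterations)
{
    struct sigaction sa;
    int i;

    assert(max_sleep_ms >= 0);
    got_term = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;

    for (i = 0; i < max_iterations; i++) {
        int timeout_ms = -1;            /* -1 would mean "sleep forever" */

        if (signal_prepare(&timeout_ms, max_sleep_ms))
            timeout_ms = 0;             /* already pending: do not sleep */

        if (inject_late_signal && i == 0)
            raise(SIGTERM);             /* arrives after prepare, before poll */

        poll(NULL, 0, timeout_ms);      /* no fds: just a bounded sleep */

        if (got_term)                   /* "check" and "dispatch" */
            return i;
    }
    return -1;
}
```

With a cap of 1000 ms this corresponds to the "no longer than a second" behaviour: a signal that lands in the window is handled once poll() times out, instead of waiting for some unrelated event to wake the loop. Without the cap, timeout_ms would stay at -1 and the poll() call in this sketch would never return.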


On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:

Hi,

Stopping attrd sometimes fails.

Step 1. Start a cluster with 2 nodes.
Step 2. Stop the first node (/etc/init.d/heartbeat stop).
Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).

The attrd catches the TERM signal, but does not stop.

(snip)
Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to
12238 is not connected
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to
crmd failed: reply failed
Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
/usr/lib64/heartbeat/attrd process group 12237 with signal 15
Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations
(4123.00us average, 0% utilization) in the last 10min
Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
channel took 1010 ms (>  100 ms)
Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
channel took 1010 ms (>  100 ms)
Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431583547 should have started at 431583444
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
being called (GSource: 0xd27dd0)
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431584254 should have started at 431584151
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431584254 should have started at 431584151
Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on
write child took 1010 ms (>  100 ms)
Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
Heartbeat API channel took 1010 ms (>  100 ms)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
being called (GSource: 0xd27dd0)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431607988 should have started at 431607885
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431607988 should have started at 431607885
(snip)
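
As a reading aid for the Gmain_timeout_dispatch lines above: the "started at" and "should have started at" values differ by 103 in every pair, which matches the reported 1030 ms delay if each unit is a 10 ms clock tick. The 100 ticks per second rate is an assumption inferred from those numbers, not something the log states, and the helper names below are made up for this sketch.

```c
#include <assert.h>

/* Assumption: the logged start times are clock ticks at 100 per second. */
#define ASSUMED_TICKS_PER_SEC 100L

/* Convert a tick count to milliseconds under the assumed tick rate. */
static long ticks_to_ms(long ticks)
{
    return ticks * 1000L / ASSUMED_TICKS_PER_SEC;
}

/* Delay between the scheduled and the actual start, in milliseconds. */
static long delay_ms(long started_at, long should_have_started_at)
{
    assert(started_at >= should_have_started_at);
    return ticks_to_ms(started_at - should_have_started_at);
}
```

For example, delay_ms(431583547L, 431583444L) gives 1030 under this assumption, the same figure the WARN line reports.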
