Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Lars Ellenberg
On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
  Now we proceed to the next mainloop poll:
 
  poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
  events=POLLIN|POLLPRI}], 3, -1
 
  Note the -1 (infinity timeout!)
 
  So even though the trigger was (presumably) set,
  and the ->prepare() should have returned true,
  the mainloop waits forever for something to happen on those file 
  descriptors.
 
 
  I suggest this:
 
  crm_trigger_prepare should set *timeout = 0, if trigger is set.
 
  Also think about this race: crm_trigger_prepare was already
  called, only then the signal came in...
 
  diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
  index 2e8b1d0..fd17b87 100644
  --- a/lib/common/mainloop.c
  +++ b/lib/common/mainloop.c
  @@ -33,6 +33,13 @@ static gboolean
   crm_trigger_prepare(GSource * source, gint * timeout)
   {
      crm_trigger_t *trig = (crm_trigger_t *) source;
  +    /* Do not delay signal processing by the mainloop poll stage */
  +    if (trig->trigger)
  +           *timeout = 0;
  +    /* To avoid races between signal delivery and the mainloop poll stage,
  +     * make sure we always have a finite timeout. Unit: milliseconds. */
  +    else
  +           *timeout = 5000; /* arbitrary */
 
      return trig->trigger;
   }
 
 
  This scenario does not let the blocked IPC off the hook, though.
  That is still possible, both for blocking send and blocking receive,
  so that should probably be fixed as well, somehow.
  I'm not sure how likely this stuck in blocking IPC is, though.
 
 Interesting, are you sure you're in the right function though?
 trigger and signal events don't have a file descriptor... wouldn't
 these polls be for the IPC related sources and wouldn't they be
 setting their own timeout?

http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

iiuc, mainloop does something similar to (oversimplified):
        timeout = -1; /* infinity */
        for s in all GSource
                tmp_timeout = -1;
                s->prepare(s, &tmp_timeout)
                if (tmp_timeout >= 0 && (timeout < 0 || tmp_timeout < timeout))
                        timeout = tmp_timeout;

        poll(GSource fd set, n, timeout);

        for s in all GSource
                if s->check(s)
                        s->dispatch(s, ...)

And at some stage it also orders by priority, of course.
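
To make the pseudo-code above concrete, here is a minimal C sketch of just
the timeout aggregation step. It does not use glib at all: struct source,
prepare_fn, simple_prepare() and compute_poll_timeout() are hypothetical
names made up for illustration, and the real glib code additionally deals
with priorities and with sources that report themselves ready.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GSource: only the prepare callback matters here. */
struct source;
typedef int (*prepare_fn)(struct source *s, int *timeout_ms);

struct source {
    prepare_fn prepare;
    int wanted_timeout_ms;  /* -1 means "I do not care how long poll() blocks" */
};

/* A prepare callback that only reports the timeout this source wants. */
static int simple_prepare(struct source *s, int *timeout_ms)
{
    *timeout_ms = s->wanted_timeout_ms;
    return 0;  /* not ready yet */
}

/* Smallest non-negative timeout requested by any source, or -1 (block forever). */
static int compute_poll_timeout(struct source *sources, size_t n)
{
    int timeout = -1;

    for (size_t i = 0; i < n; i++) {
        int tmp = -1;

        assert(sources[i].prepare != NULL);
        sources[i].prepare(&sources[i], &tmp);
        if (tmp >= 0 && (timeout < 0 || tmp < timeout))
            timeout = tmp;
    }
    return timeout;
}
```

With only sources that leave the timeout at -1, the result stays -1 and
poll() blocks until an fd becomes active. A single source asking for a
finite value is enough to bound the wait.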

Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().

BTW, the mentioned race between signal delivery and mainloop already
doing the poll stage could potentially be solved by using
cl_signal_set_interrupt(SIGTERM, 1),
which would mean we can condense the prepare to
        if (trig->trigger)
                *timeout = 0;
        return trig->trigger;

The glue (and heartbeat) code base is not that, let's say, "involved"
merely because someone was paranoid,
but because someone was paranoid for a reason ;-)

Cheers,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof
I know I could just apply the patch and be done, but I'd like to
understand this so it works for the right reason.

On Mon, Jan 16, 2012 at 7:30 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
  Now we proceed to the next mainloop poll:
 
  poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, 
  {fd=5, events=POLLIN|POLLPRI}], 3, -1
 
  Note the -1 (infinity timeout!)
 
  So even though the trigger was (presumably) set,
  and the ->prepare() should have returned true,
  the mainloop waits forever for something to happen on those file 
  descriptors.
 
 
  I suggest this:
 
  crm_trigger_prepare should set *timeout = 0, if trigger is set.
 
  Also think about this race: crm_trigger_prepare was already
  called, only then the signal came in...
 
  diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
  index 2e8b1d0..fd17b87 100644
  --- a/lib/common/mainloop.c
  +++ b/lib/common/mainloop.c
  @@ -33,6 +33,13 @@ static gboolean
   crm_trigger_prepare(GSource * source, gint * timeout)
   {
      crm_trigger_t *trig = (crm_trigger_t *) source;
  +    /* Do not delay signal processing by the mainloop poll stage */
  +    if (trig->trigger)
  +           *timeout = 0;
  +    /* To avoid races between signal delivery and the mainloop poll stage,
  +     * make sure we always have a finite timeout. Unit: milliseconds. */
  +    else
  +           *timeout = 5000; /* arbitrary */
 
      return trig->trigger;
   }
 
 
  This scenario does not let the blocked IPC off the hook, though.
  That is still possible, both for blocking send and blocking receive,
  so that should probably be fixed as well, somehow.
  I'm not sure how likely this stuck in blocking IPC is, though.

 Interesting, are you sure you're in the right function though?
 trigger and signal events don't have a file descriptor... wouldn't
 these polls be for the IPC related sources and wouldn't they be
 setting their own timeout?

 http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

 iiuc, mainloop does something similar to (oversimplified):
        timeout = -1; /* infinity */
        for s in all GSource
                tmp_timeout = -1;
                s->prepare(s, &tmp_timeout)
                if (tmp_timeout >= 0 && (timeout < 0 || tmp_timeout < timeout))
                        timeout = tmp_timeout;

        poll(GSource fd set, n, timeout);

I'm looking at the glib code again now, and it still looks to me like
the trigger and signal sources do not appear in this fd set.
Their setup functions would have to have called g_source_add_poll()
somewhere, which they don't.

So I'm still not seeing why it's the trigger or signal sources' fault
that glib is doing a never-ending call to poll().
poll() is going to get called regardless of whether our prepare
function returns true or not.

Looking closer, crm_trigger_prepare() returning TRUE results in:
  ready_source->flags |= G_SOURCE_READY;

which in turn causes:
  context->timeout = 0;

which is essentially what adding
   if (trig->trigger)
           *timeout = 0;

to crm_trigger_prepare() was intended to achieve.
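
A rough model of that behaviour, written without glib so that it compiles
on its own, might look like the following. struct model_source and
model_prepare_all() are invented for this sketch; the real
g_main_context_prepare() is considerably more involved (priorities,
locking, and so on), so treat this only as an illustration of "one ready
source forces the poll timeout to 0".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a source; is_ready stands in for trig->trigger. */
struct model_source {
    int is_ready;
    int timeout_ms;   /* what prepare would store in *timeout; -1 = no limit */
};

/*
 * Returns the timeout that would be handed to poll().
 * As soon as one source reports ready, the result is forced to 0,
 * regardless of what the other sources asked for.
 */
static int model_prepare_all(const struct model_source *src, size_t n)
{
    int timeout = -1;
    int any_ready = 0;

    assert(src != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (src[i].is_ready) {
            any_ready = 1;
            continue;
        }
        if (src[i].timeout_ms >= 0 &&
            (timeout < 0 || src[i].timeout_ms < timeout))
            timeout = src[i].timeout_ms;
    }
    return any_ready ? 0 : timeout;
}
```

Note that this only helps if the flag is already set at the moment the
source's prepare step runs.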

Shouldn't the fd, ipc or wait sources (which do call g_source_add_poll()
and could therefore cause poll() to block forever) have a sane timeout
in their prepare functions?
Or is it because the signal itself is interrupting some essential part
of G_CH_prepare_int() and friends?


        for s in all GSource
                if s->check(s)
                        s->dispatch(s, ...)

 And at some stage it also orders by priority, of course.

 Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().

 BTW, the mentioned race between signal delivery and mainloop already
 doing the poll stage could potentially be solved by using

Again, since nothing related to the signal source ever appears in the
call to poll(), I'm not seeing where the race comes from.
Or am I missing something obvious?

 cl_signal_set_interrupt(SIGTERM, 1),

This, combined with

/*
 * If we don't set this on, then the mainloop poll(2) call
 * will never be interrupted by this signal - which sort of
 * defeats the whole purpose of a signal handler in a
 * mainloop program
 */
cl_signal_set_interrupt(signal, TRUE);

looks more relevant.
But I can't escape the feeling that calling this just masks the
underlying "why is there a never-ending call to poll() in the first
place?" issue.
G_CH_prepare_int() and friends /should/ be setting timeouts so that
poll() can return and any sources created by g_idle_source_new() can
execute.
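
For reference, the effect under discussion can be reproduced with plain
POSIX calls. The sketch below assumes that cl_signal_set_interrupt(sig, TRUE)
behaves like siginterrupt(sig, 1), i.e. that the handler is installed
without SA_RESTART; that is an assumption, not something checked against
the glue sources. on_alarm() and poll_is_interrupted_by_signal() are
hypothetical names, and SIGALRM driven by a 20 ms timer stands in for the
SIGTERM of the original report.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t got_signal = 0;

static void on_alarm(int signo)
{
    (void)signo;
    got_signal = 1;   /* only set a flag; real work belongs in the mainloop */
}

/*
 * Arms a one-shot 20 ms timer, then blocks in poll() for up to 2 s.
 * Returns 1 if poll() was cut short with EINTR, 0 otherwise.
 */
static int poll_is_interrupted_by_signal(void)
{
    struct sigaction sa;
    struct itimerval timer;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                 /* no SA_RESTART */
    rc = sigaction(SIGALRM, &sa, NULL);
    assert(rc == 0);

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 20 * 1000;
    rc = setitimer(ITIMER_REAL, &timer, NULL);
    assert(rc == 0);

    rc = poll(NULL, 0, 2000);        /* no fds: a sleep that a signal can interrupt */
    return rc == -1 && errno == EINTR && got_signal;
}
```

This only covers a signal that arrives while poll() is already blocking.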

 which would mean we can condense the prepare to
        if (trig->trigger)
                *timeout = 0;
        return trig->trigger;

 Glue (and heartbeat) code base is not that, let's say, involved,
 because someone had been paranoid.
 But because someone had been 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof
On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 But I can't escape the feeling that calling this just masks the
 underlying why is there a never-ending call to poll() in the first
 place issue.
 G_CH_prepare_int() and friends /should/ be setting timeouts so that
 poll() can return and any sources created by g_idle_source_new() can
 execute.

Actually, thinking further, I'm pretty convinced that poll() with an
infinite timeout is the default mode of operation for mainloops with
cluster-glue's IPC and FD sources.
And that this is not a good thing :)

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof
On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 But I can't escape the feeling that calling this just masks the
 underlying why is there a never-ending call to poll() in the first
 place issue.
 G_CH_prepare_int() and friends /should/ be setting timeouts so that
 poll() can return and any sources created by g_idle_source_new() can
 execute.

 Actually, thinking further, I'm pretty convinced that poll() with an
 infinite timeout 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof
On Mon, Jan 16, 2012 at 11:42 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Mon, Jan 16, 2012 at 11:30 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Mon, Jan 16, 2012 at 11:27 PM, Andrew Beekhof and...@beekhof.net wrote:
 I know I could just apply the patch and be done, but I'd like to
 understand this so it works for the right reason.

Ok, done:

https://github.com/beekhof/pacemaker/commit/2a6b296

If I'm adding voodoo, I at least want the reason well documented so it
can be removed again if the reason goes away.


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Lars Ellenberg
On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote:
  Shouldn't the fd, ipc or wait sources (which do call g_source_add_poll()
  and could therefore cause poll() to block forever) have a sane timeout
  in their prepare functions?

They probably should, but usually they do not.
The reasoning probably is, each GSource is responsible for *itself* only.

That is why first all sources are prepared.

If no non-fd, non-pollable source feels the need to reduce the
*timeout to something finite in its prepare(), so be it.

Besides, what is sane? 1 second? 5? 120? 240?

That's why G_CH_prepare_int() sets the *timeout to 1000,
and why I suggest setting it to 0 if prepare already knows that the
trigger is set, and to some finite amount otherwise, to avoid getting
stuck in poll in case no timeout or other source is active that also
sets some finite timeout.

BTW, if you have *idle* sources, their prepare should set the timeout to 0.

For those interested, it is all described at
http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

For idle sources, the prepare and check functions always return TRUE to
indicate that the source is always ready to be processed. The prepare
function also returns a timeout value of 0 to ensure that the poll()
call doesn't block (since that would be time wasted which could have
been spent running the idle function).

... timeout sources ... returns a timeout value to ensure that the
poll() call doesn't block too long ...

... file descriptor sources ... timeout to -1 to indicate that it does
not mind how long the poll() call blocks ... 

  Or is it because the signal itself is interrupting some essential part
  of G_CH_prepare_int() and friends?

In the provided strace, it looks like the SIGTERM
is delivered while calling some G_CH_prepare_int,
the ->prepare() used by G_main_add_IPC_Channel.

Since the signal sources are of higher priority,
we are probably past those already in this iteration,
so we will only notice the trigger in the next check(),
after the poll.

So it is vital for any non-pollable source such as signals
to set a finite timeout in their prepare(),
even if we also mark that signal siginterrupt().
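
The race can be replayed deterministically in a small stand-alone sketch.
Everything here is hypothetical scaffolding (trigger_set,
trigger_prepare(), iterate_with_late_signal()); the "signal" is simulated
by setting the flag by hand right after prepare has run, and the pipe only
exists so that poll() has an fd that never becomes readable.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical trigger flag, standing in for trig->trigger. */
static int trigger_set = 0;

/* prepare(): report readiness and always ask for a finite poll timeout. */
static int trigger_prepare(int *timeout_ms, int finite_timeout_ms)
{
    *timeout_ms = trigger_set ? 0 : finite_timeout_ms;
    return trigger_set;
}

/*
 * One simplified mainloop iteration in which the "signal" arrives after
 * prepare() has run but before poll() is entered.
 * Returns 1 if the trigger is noticed by the check after poll().
 */
static int iterate_with_late_signal(int finite_timeout_ms)
{
    int fds[2];
    int timeout_ms = -1;
    struct pollfd pfd;
    int rc;

    assert(finite_timeout_ms >= 0);
    trigger_set = 0;
    rc = pipe(fds);
    assert(rc == 0);

    (void)trigger_prepare(&timeout_ms, finite_timeout_ms);

    trigger_set = 1;                 /* the signal handler fires "too late" */

    pfd.fd = fds[0];
    pfd.events = POLLIN | POLLPRI;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);  /* nothing is ever written to the pipe */

    close(fds[0]);
    close(fds[1]);

    /* check(): poll() timed out (rc == 0), and only now do we see the flag. */
    return rc == 0 && trigger_set;
}
```

With a finite value such as 50 ms the iteration returns and the trigger is
seen; had prepare left the timeout at -1, the same poll() call would never
return in this setup.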

         for s in all GSource
                 if s->check(s)
                         s->dispatch(s, ...)
 
  And at some stage it also orders by priority, of course.
 
  Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
 
  BTW, the mentioned race between signal delivery and mainloop already
  doing the poll stage could potentially be solved by using
  cl_signal_set_interrupt(SIGTERM, 1),

As I just wrote above, that race is not solved at all.
Only the (necessarily set) finite timeout of the poll
would be shortened in that case.

  But I can't escape the feeling that calling this just masks the
  underlying "why is there a never-ending call to poll() in the first
  place?" issue.
  G_CH_prepare_int() and friends /should/ be setting timeouts so that
  poll() can return and any sources created by g_idle_source_new() can
  execute.
 
  Actually, thinking further, I'm pretty convinced that poll() with an
  infinite timeout is the default mode of operation for mainloops with
  cluster-glue's IPC and FD sources.
  And that this is not a good thing :)

Well, if there are *only* pollable sources, it is.
If there are any other sources, they should have set
their limit on what they think is an acceptable timeout
in their prepare().

 Far too late, brain shutting down.

 ;-)

 ...not a good thing, because it breaks the idle stuff,

see above, explanation on developer.gnome.org,
idle stuff is expected to set timeout 0 (or just a few ms).

 but most of all because it requires /all/ external events to come out
 of that poll() call.

If you 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-16 Thread Andrew Beekhof
On Tue, Jan 17, 2012 at 10:11 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote:
  Shouldn't the fd, ipc or wait sources (which do call g_source_add_poll()
  and could therefore cause poll() to block forever) have a sane timeout
  in their prepare functions?

 They probably should, but usually they do not.
 The reasoning probably is, each GSource is responsible for *itself* only.

Well no, because this forces the trigger to care about whether there is
an fd-based GSource too and what timeout, if any, it has set.


 That is why first all sources are prepared.

 If no non-fd, non-pollable source feels the need to reduce the
 *timeout to something finite in its prepare(), so be it.

So something that doesn't use poll at all should set a timeout for
poll? That doesn't sound right :-)


 Besides, what is sane? 1 second? 5? 120? 240?

 That's why G_CH_prepare_int() sets the *timeout to 1000,
 and why I suggest to set it to 0 if prepare already knows that the
 trigger is set, and to some finite amount to avoid getting stuck in
 poll, in case no timeout or outher source source is active which also
 set some finite timeout.

 BTW, if you have an *idle* sources, prepare should set timeout to 0.

 For those interested, all described below
 http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

 For idle sources, the prepare and check functions always return TRUE to
 indicate that the source is always ready to be processed. The prepare
 function also returns a timeout value of 0 to ensure that the poll()
 call doesn't block (since that would be time wasted which could have
 been spent running the idle function).

 ... timeout sources ... returns a timeout value to ensure that the
 poll() call doesn't block too long ...

 ... file descriptor sources ... timeout to -1 to indicate that is does
 not mind how long the poll() call blocks ... 

  Or is it because the signal itself is interrupting some essential part
  of G_CH_prepare_int() and friends?

 In the provided strace, it looks like the SIGTERM
 is delivered while calling some G_CH_prepare_int,
 the ->prepare() used by G_main_add_IPC_Channel.

 Since the signal sources are of higher priority,
 we are probably past those already in this iteration,
 we will only notice the trigger in the next check(),
 after the poll.

 So it is vital for any non-pollable source such as signals
 to set a finite timeout in their prepare(),
 even if we also mark that signal siginterrupt().
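
 To illustrate the point about each source contributing its own timeout,
 here is a minimal standalone C model of how a main loop can fold the values
 written by the individual prepare() functions into the single timeout that
 is passed to poll(). This is not GLib code: combine_timeouts is a
 hypothetical name, and the sketch only assumes the convention discussed
 here, where -1 means the source does not mind how long poll() blocks and
 any non-negative value is an upper bound in milliseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Fold per-source timeouts (milliseconds) into one poll() timeout.
 * -1 means "block as long as you like"; any value >= 0 is an upper bound.
 * The result is the smallest non-negative value, or -1 if there is none. */
static int combine_timeouts(const int *timeouts, size_t n)
{
    int result = -1;

    assert(timeouts != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int t = timeouts[i];

        if (t < 0)
            continue;            /* this source has no opinion */
        if (result < 0 || t < result)
            result = t;          /* keep the tightest bound seen so far */
    }
    return result;
}
```

 If every source reports -1, the result is -1 and poll() may block forever.
 A single source that reports a finite value is enough to bound the wait.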

         for s in all GSource
                 if s->check(s)
                         s->dispatch(s, ...)
 
  And at some stage it also orders by priority, of course.
 
  Also compare with the comment above /* Sigh... */ in glue 
  G_SIG_prepare().
 
  BTW, the mentioned race between signal delivery and mainloop already
  doing the poll stage could potentially be solved by using
  cl_signal_set_interrupt(SIGTERM, 1),

 As I just wrote above, that race is not solved at all.
 Only the (necessarily set) finite timeout of the poll
 would be shortened in that case.

  But I can't escape the feeling that calling this just masks the
  underlying "why is there a never-ending call to poll() in the first
  place" issue.
  G_CH_prepare_int() and friends /should/ be setting timeouts so that
  poll() can return and any sources created by g_idle_source_new() can
  execute.
 
  Actually, thinking further, I'm pretty convinced that poll() with an
  infinite timeout is the default mode of operation for mainloops with
  cluster-glue's IPC and FD sources.
  And that this is not a good thing :)

 Well, if there are *only* pollable sources, it is.
 If there are any other sources, they should have set
 their limit on what they think is 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-15 Thread renayama19661014
Hi Lars,

Thank you for comments and suggestion.

  poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
  events=POLLIN|POLLPRI}], 3, -1
 
 Note the -1 (infinity timeout!)
 
 So even though the trigger was (presumably) set,
 and the ->prepare() should have returned true,
 the mainloop waits forever for something to happen on those file 
 descriptors.
 
 
 I suggest this:
 
 crm_trigger_prepare should set *timeout = 0, if trigger is set.
 
 Also think about this race: crm_trigger_prepare was already
 called, only then the signal came in...
 
 diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
 index 2e8b1d0..fd17b87 100644
 --- a/lib/common/mainloop.c
 +++ b/lib/common/mainloop.c
 @@ -33,6 +33,13 @@ static gboolean
  crm_trigger_prepare(GSource * source, gint * timeout)
  {
      crm_trigger_t *trig = (crm_trigger_t *) source;
 +    /* Do not delay signal processing by the mainloop poll stage */
 +    if (trig->trigger)
 +        *timeout = 0;
 +    /* To avoid races between signal delivery and the mainloop poll stage,
 +     * make sure we always have a finite timeout. Unit: milliseconds. */
 +    else
 +        *timeout = 5000; /* arbitrary */
  
      return trig->trigger;
  }
 
 
 This scenario does not let the blocked IPC off the hook, though.
 That is still possible, both for blocking send and blocking receive,
 so that should probably be fixed as well, somehow.
 I'm not sure how likely this stuck in blocking IPC is, though.

I will continue investigating the problem, this time including the correction
you suggested.

I will report back if I get any new information.

Best Regards,
Hideo Yamauchi.

--- On Sat, 2012/1/14, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
  Hi Lars,
  
  I attach strace file when a problem reappeared at the end of last year.
  I used glue which applied your patch for confirmation.
  
  It is the file which I picked with attrd by strace -p command right before 
  I stop Heartbeat.
  
  Finally SIGTERM caught it, but attrd did not stop.
  The attrd stopped afterwards when I sent SIGKILL.
 
 The strace reveals something interesting:
 
 This poll looks like the mainloop poll,
 but some ->prepare() has modified the timeout to be 0,
 so we proceed directly to ->check() and then ->dispatch().
 
  poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
  events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
 
  times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
  recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
  unavailable)
 ...
  recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
  unavailable)
  poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be 
  restarted)
  --- SIGTERM (Terminated) @ 0 (0) ---
  sigreturn()                             = ? (mask now [])
 
 Ok. signal received, trigger set.
 Still finishing this mainloop iteration, though.
 
 These recv(),poll() look like invocations of G_CH_prepare_int().
 Does not matter much, though.
 
  recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
  unavailable)
  poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
  recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
  unavailable)
  poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
 
  times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
 
 Now we proceed to the next mainloop poll:
 
  poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
  events=POLLIN|POLLPRI}], 3, -1
 
 Note the -1 (infinity timeout!)
 
 So even though the trigger was (presumably) set,
 and the ->prepare() should have returned true,
 the mainloop waits forever for something to happen on those file 
 descriptors.
 
 
 I suggest this:
 
 crm_trigger_prepare should set *timeout = 0, if trigger is set.
 
 Also think about this race: crm_trigger_prepare was already
 called, only then the signal came in...
 
 diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
 index 2e8b1d0..fd17b87 100644
 --- a/lib/common/mainloop.c
 +++ b/lib/common/mainloop.c
 @@ -33,6 +33,13 @@ static gboolean
  crm_trigger_prepare(GSource * source, gint * timeout)
  {
      crm_trigger_t *trig = (crm_trigger_t *) source;
 +    /* Do not delay signal processing by the mainloop poll stage */
 +    if (trig->trigger)
 +        *timeout = 0;
 +    /* To avoid races between signal delivery and the mainloop poll stage,
 +     * make sure we always have a finite timeout. Unit: milliseconds. */
 +    else
 +        *timeout = 5000; /* arbitrary */
  
      return trig->trigger;
  }
 
 
 This scenario does not let the blocked IPC off the hook, though.
 That is still possible, both for blocking send and blocking receive,
 so that should probably be fixed as well, somehow.
 I'm not sure how likely this stuck in blocking IPC is, though.
 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-15 Thread Andrew Beekhof
On Sun, Jan 15, 2012 at 1:57 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Lars,

 I attach strace file when a problem reappeared at the end of last year.
 I used glue which applied your patch for confirmation.

 It is the file which I picked with attrd by strace -p command right before I 
 stop Heartbeat.

 Finally SIGTERM caught it, but attrd did not stop.
 The attrd stopped afterwards when I sent SIGKILL.

 The strace reveals something interesting:

 This poll looks like the mainloop poll,
 but some ->prepare() has modified the timeout to be 0,
 so we proceed directly to ->check() and then ->dispatch().

 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
 events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])

 times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
 recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
 unavailable)
 ...
 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be 
 restarted)
 --- SIGTERM (Terminated) @ 0 (0) ---
 sigreturn()                             = ? (mask now [])

 Ok. signal received, trigger set.
 Still finishing this mainloop iteration, though.

 These recv(),poll() look like invocations of G_CH_prepare_int().
 Does not matter much, though.

 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)

 times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634

 Now we proceed to the next mainloop poll:

 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
 events=POLLIN|POLLPRI}], 3, -1

 Note the -1 (infinity timeout!)

 So even though the trigger was (presumably) set,
 and the ->prepare() should have returned true,
 the mainloop waits forever for something to happen on those file 
 descriptors.


 I suggest this:

 crm_trigger_prepare should set *timeout = 0, if trigger is set.

 Also think about this race: crm_trigger_prepare was already
 called, only then the signal came in...

 diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
 index 2e8b1d0..fd17b87 100644
 --- a/lib/common/mainloop.c
 +++ b/lib/common/mainloop.c
 @@ -33,6 +33,13 @@ static gboolean
  crm_trigger_prepare(GSource * source, gint * timeout)
  {
     crm_trigger_t *trig = (crm_trigger_t *) source;
 +    /* Do not delay signal processing by the mainloop poll stage */
 +    if (trig->trigger)
 +           *timeout = 0;
 +    /* To avoid races between signal delivery and the mainloop poll stage,
 +     * make sure we always have a finite timeout. Unit: milliseconds. */
 +    else
 +           *timeout = 5000; /* arbitrary */

     return trig->trigger;
  }


 This scenario does not let the blocked IPC off the hook, though.
 That is still possible, both for blocking send and blocking receive,
 so that should probably be fixed as well, somehow.
 I'm not sure how likely this stuck in blocking IPC is, though.

Interesting, are you sure you're in the right function though?
trigger and signal events don't have a file descriptor... wouldn't
these polls be for the IPC related sources and wouldn't they be
setting their own timeout?


 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com

 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-14 Thread Lars Ellenberg
On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Lars,
 
 I attach strace file when a problem reappeared at the end of last year.
 I used glue which applied your patch for confirmation.
 
 It is the file which I picked with attrd by strace -p command right before I 
 stop Heartbeat.
 
 Finally SIGTERM caught it, but attrd did not stop.
 The attrd stopped afterwards when I sent SIGKILL.

The strace reveals something interesting:

This poll looks like the mainloop poll,
but some ->prepare() has modified the timeout to be 0,
so we proceed directly to ->check() and then ->dispatch().

 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, 
 events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])

 times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
 recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily 
 unavailable)
...
 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)  = ? ERESTART_RESTARTBLOCK (To be 
 restarted)
 --- SIGTERM (Terminated) @ 0 (0) ---
 sigreturn() = ? (mask now [])

Ok. signal received, trigger set.
Still finishing this mainloop iteration, though.

These recv(),poll() look like invocations of G_CH_prepare_int().
Does not matter much, though.

 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)  = 0 (Timeout)
 recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily 
 unavailable)
 poll([{fd=7, events=0}], 1, 0)  = 0 (Timeout)

 times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634

Now we proceed to the next mainloop poll:

 poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, 
 events=POLLIN|POLLPRI}], 3, -1

Note the -1 (infinity timeout!)

So even though the trigger was (presumably) set,
and the ->prepare() should have returned true,
the mainloop waits forever for something to happen on those file descriptors.


I suggest this:

crm_trigger_prepare should set *timeout = 0, if trigger is set.

Also think about this race: crm_trigger_prepare was already
called, only then the signal came in...

diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
index 2e8b1d0..fd17b87 100644
--- a/lib/common/mainloop.c
+++ b/lib/common/mainloop.c
@@ -33,6 +33,13 @@ static gboolean
 crm_trigger_prepare(GSource * source, gint * timeout)
 {
     crm_trigger_t *trig = (crm_trigger_t *) source;
+    /* Do not delay signal processing by the mainloop poll stage */
+    if (trig->trigger)
+        *timeout = 0;
+    /* To avoid races between signal delivery and the mainloop poll stage,
+     * make sure we always have a finite timeout. Unit: milliseconds. */
+    else
+        *timeout = 5000; /* arbitrary */
 
     return trig->trigger;
 }


This scenario does not let the blocked IPC off the hook, though.
That is still possible, both for blocking send and blocking receive,
so that should probably be fixed as well, somehow.
I'm not sure how likely this stuck in blocking IPC is, though.
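
The behaviour of the patched prepare function can be modelled without GLib.
The following is a sketch only: struct model_trigger, model_trigger_prepare
and MODEL_IDLE_TIMEOUT_MS are made-up names for illustration, and the GSource
that the real crm_trigger_t embeds is omitted. It shows the two cases from the
diff above: a set trigger asks for an immediate poll, and an unset trigger
still caps the poll at a finite value.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for crm_trigger_t; only the flag matters here. */
struct model_trigger {
    volatile sig_atomic_t trigger;   /* typically raised by a signal handler */
};

/* Arbitrary cap, matching the 5000 ms used in the patch. */
enum { MODEL_IDLE_TIMEOUT_MS = 5000 };

/* Model of the patched prepare(): returns non-zero if the source is ready
 * and always writes a finite timeout for the following poll(). */
static int model_trigger_prepare(const struct model_trigger *trig,
                                 int *timeout_ms)
{
    assert(trig != NULL && timeout_ms != NULL);

    if (trig->trigger)
        *timeout_ms = 0;                      /* do not delay dispatch */
    else
        *timeout_ms = MODEL_IDLE_TIMEOUT_MS;  /* bound the signal race */

    return trig->trigger != 0;
}
```

The finite value in the else branch is what limits the damage when the signal
arrives after prepare() has already run: the flag is then noticed at the
latest when that poll times out.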

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-11 Thread renayama19661014
Hi Lars,
Hi Dejan,

I captured an ltrace file when the problem occurred.
I attach the ltrace file.

The investigation with gdb is still in progress.

If you have any suggestions for improvement, please tell me.

Best Regards,
Hideo Yamauchi.


--- On Tue, 2012/1/10, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp 
wrote:

 Hi Lars,
 
 I attach strace file when a problem reappeared at the end of last year.
 I used glue which applied your patch for confirmation.
 
 It is the file which I picked with attrd by strace -p command right before I 
 stop Heartbeat.
 
 Finally SIGTERM caught it, but attrd did not stop.
 The attrd stopped afterwards when I sent SIGKILL.
 
  * I acquire the information such as ltrace from now on.
 
 Best Regards,
 Hideo Yamauchi.
 
 
 --- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp 
 wrote:
 
  Hi Lars,
  
   If you are able to reproduce,
   you could try to find out what exactly attrd is doing.
   
   various ways to try to do that:
   cat /proc/pid-of-attrd/stack   # if your platform supports that
   strace it,
   ltrace it,
   attach with gdb and provide a stack trace, or even start to single step 
   it,
   cause attrd to core dump, and analyse the core.
  
  All right.
  I investigate the cause a little more.
  
  Give me the time for investigation a little more.
  
  Best Regards,
  Hideo Yamauchi.
  
  --- On Fri, 2011/12/30, Lars Ellenberg lars.ellenb...@linbit.com wrote:
  
   On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp 
   wrote:
Hi Dejan,
Hi Lars,

In our environment, the problem recurred with the patch of Mr. Lars.
After a problem occurred, I sent TERM signal, but attrd does not seem to
receive TERM at all.
   
   If you are able to reproduce,
   you could try to find out what exactly attrd is doing.
   
   various ways to try to do that:
   cat /proc/pid-of-attrd/stack   # if your platform supports that
   strace it,
   ltrace it,
   attach with gdb and provide a stack trace, or even start to single step 
   it,
   cause attrd to core dump, and analyse the core.
   
The reconsideration of the patch is necessary for the solution to 
problem.


Best Regards,
Hideo Yamauchi.


--- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
renayama19661...@ybb.ne.jp wrote:

 Hi Dejan,
 Hi Lars,
 
 I understood it.
 I try the operation of the patch in our environment.
 
 To Alan: Will you try a patch?
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Tue, 2011/11/15, Dejan Muhamedagic deja...@fastmail.fm wrote:
 
  Hi,
  
  On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
   On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof 
 wrote:
 On Tue, Oct 18, 2011 at 12:19 PM,  
 renayama19661...@ybb.ne.jp wrote:
  Hi,
 
  We sometimes fail in a stop of attrd.
 
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
 
  The attrd catches the TERM signal, but does not stop.

 There's no evidence that it actually catches it, only that 
 it is sent.
 I've seen it before but never figured out why it occurs.

 I had it once tracked down almost to where it occurs, but 
 then got distracted.
 Yes the signal was delivered.

 I *think* it had to do with attrd doing a blocking read,
 or looping in some internal message delivery function too 
 often.

 I had a quick look at the code again now, to try and remember,
 but I'm not sure.

 It *may* be that, because
 xmlfromIPC(IPC_Channel * ch, int timeout) calls
    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, 
 ipc_rc);

 And MSG_ALLOWINTR will cause msgfromIPC_ll() to
        IPC_INTR:
                if ( allow_intr){
                        goto startwait;

 Depending on the frequency of delivered signals, it may cause
 this goto
 startwait loop to never exit, because the timeout always 
 starts again
 from the full passed in timeout.

 If only one signal is delivered, it may still take 120 seconds
 (MAX_IPC_DELAY from crm.h) to be actually processed, as the 
 signal
 handler only raises a flag for the next mainloop iteration.

 If a (non-fatal) signal is delivered every few seconds,
 then the goto loop will never timeout.

 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-09 Thread renayama19661014
Hi Lars,

I attach an strace file from when the problem reappeared at the end of last year.
For confirmation, I used glue with your patch applied.

The file was captured by running strace -p against attrd right before I
stopped Heartbeat.

In the end, attrd caught SIGTERM, but it did not stop.
attrd stopped only afterwards, when I sent SIGKILL.

 * I will collect further information, such as ltrace output, from now on.

Best Regards,
Hideo Yamauchi.


--- On Thu, 2012/1/5, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp 
wrote:

 Hi Lars,
 
  If you are able to reproduce,
  you could try to find out what exactly attrd is doing.
  
  various ways to try to do that:
  cat /proc/pid-of-attrd/stack   # if your platform supports that
  strace it,
  ltrace it,
  attach with gdb and provide a stack trace, or even start to single step it,
  cause attrd to core dump, and analyse the core.
 
 All right.
 I investigate the cause a little more.
 
 Give me the time for investigation a little more.
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Fri, 2011/12/30, Lars Ellenberg lars.ellenb...@linbit.com wrote:
 
  On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
   Hi Dejan,
   Hi Lars,
   
   In our environment, the problem recurred with the patch of Mr. Lars.
   After a problem occurred, I sent TERM signal, but attrd does not seem to
   receive TERM at all.
  
  If you are able to reproduce,
  you could try to find out what exactly attrd is doing.
  
  various ways to try to do that:
  cat /proc/pid-of-attrd/stack   # if your platform supports that
  strace it,
  ltrace it,
  attach with gdb and provide a stack trace, or even start to single step it,
  cause attrd to core dump, and analyse the core.
  
   The reconsideration of the patch is necessary for the solution to problem.
   
   
   Best Regards,
   Hideo Yamauchi.
   
   
   --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
   renayama19661...@ybb.ne.jp wrote:
   
Hi Dejan,
Hi Lars,

I understood it.
I try the operation of the patch in our environment.

To Alan: Will you try a patch?

Best Regards,
Hideo Yamauchi.

--- On Tue, 2011/11/15, Dejan Muhamedagic deja...@fastmail.fm wrote:

 Hi,
 
 On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
  On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
   On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
   lars.ellenb...@linbit.com wrote:
On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
On Tue, Oct 18, 2011 at 12:19 PM,  
renayama19661...@ybb.ne.jp wrote:
 Hi,

 We sometimes fail in a stop of attrd.

 Step1. start a cluster in 2 nodes
 Step2. stop the first node.(/etc/init.d/heartbeat stop.)
 Step3. stop the second node after time passed a 
 little.(/etc/init.d/heartbeat
 stop.)

 The attrd catches the TERM signal, but does not stop.
   
There's no evidence that it actually catches it, only that it 
is sent.
I've seen it before but never figured out why it occurs.
   
I had it once tracked down almost to where it occurs, but then 
got distracted.
Yes the signal was delivered.
   
I *think* it had to do with attrd doing a blocking read,
or looping in some internal message delivery function too often.
   
I had a quick look at the code again now, to try and remember,
but I'm not sure.
   
It *may* be that, because
xmlfromIPC(IPC_Channel * ch, int timeout) calls
   msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, 
ipc_rc);
   
And MSG_ALLOWINTR will cause msgfromIPC_ll() to
       IPC_INTR:
               if ( allow_intr){
                       goto startwait;
   
Depending on the frequency of delivered signals, it may cause
this goto
startwait loop to never exit, because the timeout always starts 
again
from the full passed in timeout.
   
If only one signal is delivered, it may still take 120 seconds
(MAX_IPC_DELAY from crm.h) to be actually processed, as the 
signal
handler only raises a flag for the next mainloop iteration.
   
If a (non-fatal) signal is delivered every few seconds,
then the goto loop will never timeout.
   
Please someone check this for plausibility ;-)
   
   Most plausible explanation I've heard so far... still odd that 
   only
   attrd is affected.
   So what do we do about it?
  
  Reproduce, and confirm that this is what people are seeing.
  
  Make attrd non-blocking?
  
  Fix the ipc layer to not restart the full timeout,
  but only the remaining partial time?
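
  One way to restart only the remaining time is to fix an absolute deadline
  before the first wait and to recompute the timeout after every
  interruption. The sketch below uses plain poll() and clock_gettime();
  poll_with_deadline, remaining_ms and now_ms are hypothetical helpers, not
  part of cluster-glue, and the actual IPC layer may need a different shape.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <poll.h>
#include <time.h>

/* Current time on the monotonic clock, in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Milliseconds left until the deadline, clamped to [0, INT_MAX]. */
static int remaining_ms(long long deadline_ms, long long now)
{
    long long left = deadline_ms - now;

    if (left <= 0)
        return 0;
    return left > INT_MAX ? INT_MAX : (int)left;
}

/* Like poll(), but an interrupting signal does not reset the timeout:
 * each retry waits only for what is left of the original budget. */
static int poll_with_deadline(struct pollfd *fds, nfds_t nfds,
                              int total_timeout_ms)
{
    long long deadline;

    assert(total_timeout_ms >= 0);
    deadline = now_ms() + total_timeout_ms;

    for (;;) {
        int rc = poll(fds, nfds, remaining_ms(deadline, now_ms()));

        if (rc >= 0 || errno != EINTR)
            return rc;           /* events, timeout, or a real error */
        if (now_ms() >= deadline)
            return 0;            /* budget used up while being interrupted */
    }
}
```

  With this shape, a signal delivered every few seconds can no longer keep
  the wait alive indefinitely, because the deadline does not move.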
 
 Lars and I made a quick patch for cluster-glue (attached).
 Hideo-san, 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2012-01-04 Thread renayama19661014
Hi Lars,

 If you are able to reproduce,
 you could try to find out what exactly attrd is doing.
 
 various ways to try to do that:
 cat /proc/pid-of-attrd/stack   # if your platform supports that
 strace it,
 ltrace it,
 attach with gdb and provide a stack trace, or even start to single step it,
 cause attrd to core dump, and analyse the core.

All right.
I will investigate the cause a little more.

Please give me a little more time for the investigation.

Best Regards,
Hideo Yamauchi.

--- On Fri, 2011/12/30, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
  Hi Dejan,
  Hi Lars,
  
  In our environment, the problem recurred with the patch of Mr. Lars.
  After a problem occurred, I sent TERM signal, but attrd does not seem to
  receive TERM at all.
 
 If you are able to reproduce,
 you could try to find out what exactly attrd is doing.
 
 various ways to try to do that:
 cat /proc/pid-of-attrd/stack   # if your platform supports that
 strace it,
 ltrace it,
 attach with gdb and provide a stack trace, or even start to single step it,
 cause attrd to core dump, and analyse the core.
 
  The reconsideration of the patch is necessary for the solution to problem.
  
  
  Best Regards,
  Hideo Yamauchi.
  
  
  --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
  renayama19661...@ybb.ne.jp wrote:
  
   Hi Dejan,
   Hi Lars,
   
   I understood it.
   I try the operation of the patch in our environment.
   
   To Alan: Will you try a patch?
   
   Best Regards,
   Hideo Yamauchi.
   
   --- On Tue, 2011/11/15, Dejan Muhamedagic deja...@fastmail.fm wrote:
   
Hi,

On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
 On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
  On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
  lars.ellenb...@linbit.com wrote:
   On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
   On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp 
   wrote:
Hi,
   
We sometimes fail in a stop of attrd.
   
Step1. start a cluster in 2 nodes
Step2. stop the first node.(/etc/init.d/heartbeat stop.)
Step3. stop the second node after time passed a 
little.(/etc/init.d/heartbeat
stop.)
   
The attrd catches the TERM signal, but does not stop.
  
   There's no evidence that it actually catches it, only that it is 
   sent.
   I've seen it before but never figured out why it occurs.
  
   I had it once tracked down almost to where it occurs, but then 
   got distracted.
   Yes the signal was delivered.
  
   I *think* it had to do with attrd doing a blocking read,
   or looping in some internal message delivery function too often.
  
   I had a quick look at the code again now, to try and remember,
   but I'm not sure.
  
   It *may* be that, because
   xmlfromIPC(IPC_Channel * ch, int timeout) calls
      msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, ipc_rc);
  
   And MSG_ALLOWINTR will cause msgfromIPC_ll() to
          IPC_INTR:
                  if ( allow_intr){
                          goto startwait;
  
   Depending on the frequency of delivered signals, it may cause this
   goto
   startwait loop to never exit, because the timeout always starts 
   again
   from the full passed in timeout.
  
   If only one signal is delivered, it may still take 120 seconds
   (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
   handler only raises a flag for the next mainloop iteration.
  
   If a (non-fatal) signal is delivered every few seconds,
   then the goto loop will never timeout.
  
   Please someone check this for plausibility ;-)
  
  Most plausible explanation I've heard so far... still odd that only
  attrd is affected.
  So what do we do about it?
 
 Reproduce, and confirm that this is what people are seeing.
 
 Make attrd non-blocking?
 
 Fix the ipc layer to not restart the full timeout,
 but only the remaining partial time?

Lars and I made a quick patch for cluster-glue (attached).
Hideo-san, is there a way for you to verify if it helps? The
patch is not perfect and under unfavourable circumstances it may
still take a long time for the caller to exit, but it'd be good
to know if this is the right spot.

Cheers,

Dejan

 -- 
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 
 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-12-29 Thread Lars Ellenberg
On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi Dejan,
 Hi Lars,
 
 In our environment, the problem recurred with the patch of Mr. Lars.
 After a problem occurred, I sent TERM signal, but attrd does not seem to
 receive TERM at all.

If you are able to reproduce,
you could try to find out what exactly attrd is doing.

various ways to try to do that:
cat /proc/pid-of-attrd/stack   # if your platform supports that
strace it,
ltrace it,
attach with gdb and provide a stack trace, or even start to single step it,
cause attrd to core dump, and analyse the core.

 The reconsideration of the patch is necessary for the solution to problem.
 
 
 Best Regards,
 Hideo Yamauchi.
 
 
 --- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp 
 renayama19661...@ybb.ne.jp wrote:
 
  Hi Dejan,
  Hi Lars,
  
  I understood it.
  I try the operation of the patch in our environment.
  
  To Alan: Will you try a patch?
  
  Best Regards,
  Hideo Yamauchi.
  
  --- On Tue, 2011/11/15, Dejan Muhamedagic deja...@fastmail.fm wrote:
  
   Hi,
   
   On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
 On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
  On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
  On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp 
  wrote:
   Hi,
  
   We sometimes fail in a stop of attrd.
  
   Step1. start a cluster in 2 nodes
   Step2. stop the first node.(/etc/init.d/heartbeat stop.)
   Step3. stop the second node after time passed a 
   little.(/etc/init.d/heartbeat
   stop.)
  
   The attrd catches the TERM signal, but does not stop.
 
  There's no evidence that it actually catches it, only that it is 
  sent.
  I've seen it before but never figured out why it occurs.
 
  I had it once tracked down almost to where it occurs, but then got 
  distracted.
  Yes the signal was delivered.
 
  I *think* it had to do with attrd doing a blocking read,
  or looping in some internal message delivery function too often.
 
  I had a quick look at the code again now, to try and remember,
  but I'm not sure.
 
  It *may* be that, because
  xmlfromIPC(IPC_Channel * ch, int timeout) calls
     msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, ipc_rc);
 
  And MSG_ALLOWINTR will cause msgfromIPC_ll() to
         IPC_INTR:
                 if ( allow_intr){
                         goto startwait;
 
  Depending on the frequency of delivered signals, it may cause this
  goto
  startwait loop to never exit, because the timeout always starts 
  again
  from the full passed in timeout.
 
  If only one signal is delivered, it may still take 120 seconds
  (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
  handler only raises a flag for the next mainloop iteration.
 
  If a (non-fatal) signal is delivered every few seconds,
  then the goto loop will never timeout.
 
  Please someone check this for plausibility ;-)
 
 Most plausible explanation I've heard so far... still odd that only
 attrd is affected.
 So what do we do about it?

Reproduce, and confirm that this is what people are seeing.

Make attrd non-blocking?

Fix the ipc layer to not restart the full timeout,
but only the remaining partial time?
   
   Lars and I made a quick patch for cluster-glue (attached).
   Hideo-san, is there a way for you to verify if it helps? The
   patch is not perfect and under unfavourable circumstances it may
   still take a long time for the caller to exit, but it'd be good
   to know if this is the right spot.
   
   Cheers,
   
   Dejan
   
-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
  
  
 

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-12-21 Thread renayama19661014
Hi Dejan,
Hi Lars,

In our environment, the problem recurred even with Lars's patch applied.
After the problem occurred, I sent a TERM signal, but attrd does not seem to
receive TERM at all.

The patch needs to be reconsidered in order to solve the problem.


Best Regards,
Hideo Yamauchi.


--- On Tue, 2011/11/15, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp 
wrote:

 Hi Dejan,
 Hi Lars,
 
 Understood.
 I will try the patch in our environment.
 
 To Alan: Will you try the patch?
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Tue, 2011/11/15, Dejan Muhamedagic deja...@fastmail.fm wrote:
 
  Hi,
  
  On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
   On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
 On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp 
 wrote:
  Hi,
 
  We sometimes fail in a stop of attrd.
 
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
 
  The attrd catches the TERM signal, but does not stop.

 There's no evidence that it actually catches it, only that it is 
 sent.
 I've seen it before but never figured out why it occurs.

 I had it once tracked down almost to where it occurs, but then got 
 distracted.
 Yes the signal was delivered.

 I *think* it had to do with attrd doing a blocking read,
 or looping in some internal message delivery function too often.

 I had a quick look at the code again now, to try and remember,
 but I'm not sure.

 It *may* be that, because
 xmlfromIPC(IPC_Channel * ch, int timeout) calls
    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

 And MSG_ALLOWINTR will cause msgfromIPC_ll() to
        IPC_INTR:
                if ( allow_intr){
                        goto startwait;

 Depending on the frequency of delivered signals, it may cause this goto
 startwait loop to never exit, because the timeout always starts again
 from the full passed-in timeout.

 If only one signal is delivered, it may still take 120 seconds
 (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
 handler only raises a flag for the next mainloop iteration.

 If a (non-fatal) signal is delivered every few seconds,
 then the goto loop will never timeout.

 Please someone check this for plausibility ;-)

Most plausible explanation I've heard so far... still odd that only
attrd is affected.
So what do we do about it?
   
   Reproduce, and confirm that this is what people are seeing.
   
   Make attrd non-blocking?
   
   Fix the ipc layer to not restart the full timeout,
   but only the remaining partial time?
  
  Lars and I made a quick patch for cluster-glue (attached).
  Hideo-san, is there a way for you to verify if it helps? The
  patch is not perfect and under unfavourable circumstances it may
  still take a long time for the caller to exit, but it'd be good
  to know if this is the right spot.
  
  Cheers,
  
  Dejan
  
   -- 
   : Lars Ellenberg
   : LINBIT | Your Way to High Availability
   : DRBD/HA support and consulting http://www.linbit.com
   


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-14 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
 On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
  On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
  lars.ellenb...@linbit.com wrote:
   On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
   On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp wrote:
Hi,
   
We sometimes fail in a stop of attrd.
   
Step1. start a cluster in 2 nodes
Step2. stop the first node.(/etc/init.d/heartbeat stop.)
Step3. stop the second node after time passed a 
little.(/etc/init.d/heartbeat
stop.)
   
The attrd catches the TERM signal, but does not stop.
  
   There's no evidence that it actually catches it, only that it is sent.
   I've seen it before but never figured out why it occurs.
  
   I had it once tracked down almost to where it occurs, but then got 
   distracted.
   Yes the signal was delivered.
  
   I *think* it had to do with attrd doing a blocking read,
   or looping in some internal message delivery function too often.
  
   I had a quick look at the code again now, to try and remember,
   but I'm not sure.
  
   It *may* be that, because
   xmlfromIPC(IPC_Channel * ch, int timeout) calls
      msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
  
   And MSG_ALLOWINTR will cause msgfromIPC_ll() to
          IPC_INTR:
                  if ( allow_intr){
                          goto startwait;
  
   Depending on the frequency of delivered signals, it may cause this goto
   startwait loop to never exit, because the timeout always starts again
   from the full passed-in timeout.
  
   If only one signal is delivered, it may still take 120 seconds
   (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
   handler only raises a flag for the next mainloop iteration.
  
   If a (non-fatal) signal is delivered every few seconds,
   then the goto loop will never timeout.
  
   Please someone check this for plausibility ;-)
  
  Most plausible explanation I've heard so far... still odd that only
  attrd is affected.
  So what do we do about it?
 
 Reproduce, and confirm that this is what people are seeing.
 
 Make attrd non-blocking?
 
 Fix the ipc layer to not restart the full timeout,
 but only the remaining partial time?

Lars and I made a quick patch for cluster-glue (attached).
Hideo-san, is there a way for you to verify if it helps? The
patch is not perfect and under unfavourable circumstances it may
still take a long time for the caller to exit, but it'd be good
to know if this is the right spot.

Cheers,

Dejan

 -- 
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 
# HG changeset patch
# User Lars Ellenberg lars.ellenb...@linbit.com
# Date 1321275721 -3600
# Node ID 8b50bf0dd4cdf8d0a405416da98711080b2abeb9
# Parent  569bdebf736185d77782f49d5c760007cfc6b3e8
Medium: clplumbing: don't restart timeouts forever if signals are repeatedly sent

diff -r 569bdebf7361 -r 8b50bf0dd4cd lib/clplumbing/cl_msg.c
--- a/lib/clplumbing/cl_msg.c	Mon Nov 14 11:31:51 2011 +0100
+++ b/lib/clplumbing/cl_msg.c	Mon Nov 14 14:02:01 2011 +0100
@@ -1802,12 +1802,13 @@ static struct ha_msg*
 msgfromIPC_ll(IPC_Channel * ch, int flag, unsigned int timeout, int *rc_out)
 {
 	int		rc;
+	int		sig_cnt = 0;
 	IPC_Message*	ipcmsg;
 	struct ha_msg*	hmsg;
 	int		need_auth = flag & MSG_NEEDAUTH;
 	int		allow_intr = flag & MSG_ALLOWINTR;
 	
- startwait:
+	do {
 	if(timeout > 0) {
 	rc = cl_ipc_wait_timeout(ch, ch->ops->waitin, timeout);
 	} else {
@@ -1832,17 +1833,17 @@ msgfromIPC_ll(IPC_Channel * ch, int flag
 		return NULL;
 		
 	case IPC_INTR:
-		if ( allow_intr){
-			goto startwait;
-		}else{
+		if (!allow_intr || sig_cnt++ >= 20) {
 			return NULL;
+		} else {
+			break;
 		}
 		
 	case IPC_OK:
 		break;
 	}
-	
-	
+	} while (rc != IPC_OK);
+
 	ipcmsg = NULL;
 	rc = ch->ops->recv(ch, &ipcmsg);
 #if 0


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-13 Thread Andrew Beekhof
On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
 On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp wrote:
  Hi,
 
  We sometimes fail in a stop of attrd.
 
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
 
  The attrd catches the TERM signal, but does not stop.

 There's no evidence that it actually catches it, only that it is sent.
 I've seen it before but never figured out why it occurs.

 I had it once tracked down almost to where it occurs, but then got distracted.
 Yes the signal was delivered.

 I *think* it had to do with attrd doing a blocking read,
 or looping in some internal message delivery function too often.

 I had a quick look at the code again now, to try and remember,
 but I'm not sure.

 It *may* be that, because
 xmlfromIPC(IPC_Channel * ch, int timeout) calls
    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

 And MSG_ALLOWINTR will cause msgfromIPC_ll() to
        IPC_INTR:
                if ( allow_intr){
                        goto startwait;

 Depending on the frequency of delivered signals, it may cause this goto
 startwait loop to never exit, because the timeout always starts again
 from the full passed-in timeout.

 If only one signal is delivered, it may still take 120 seconds
 (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
 handler only raises a flag for the next mainloop iteration.

 If a (non-fatal) signal is delivered every few seconds,
 then the goto loop will never timeout.

 Please someone check this for plausibility ;-)

Most plausible explanation I've heard so far... still odd that only
attrd is affected.
So what do we do about it?



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-06 Thread Lars Ellenberg
On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
 On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp wrote:
  Hi,
 
  We sometimes fail in a stop of attrd.
 
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
 
  The attrd catches the TERM signal, but does not stop.
 
 There's no evidence that it actually catches it, only that it is sent.
 I've seen it before but never figured out why it occurs.

I had it once tracked down almost to where it occurs, but then got distracted.
Yes the signal was delivered.

I *think* it had to do with attrd doing a blocking read,
or looping in some internal message delivery function too often.

I had a quick look at the code again now, to try and remember,
but I'm not sure.

It *may* be that, because
xmlfromIPC(IPC_Channel * ch, int timeout) calls
   msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

And MSG_ALLOWINTR will cause msgfromIPC_ll() to
       IPC_INTR:
               if ( allow_intr){
                       goto startwait;

Depending on the frequency of delivered signals, it may cause this goto
startwait loop to never exit, because the timeout always starts again
from the full passed-in timeout.

If only one signal is delivered, it may still take 120 seconds
(MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
handler only raises a flag for the next mainloop iteration.

If a (non-fatal) signal is delivered every few seconds,
then the goto loop will never time out.

Please someone check this for plausibility ;-)
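
To illustrate the restart problem described above, here is a minimal sketch
(not the cluster-glue code) of a wait that fixes its deadline once and, after
each interruption, waits only for the time that is left. The names now_ms,
remaining_ms and wait_readable_deadline are made up for this example, and
plain poll() stands in for cl_ipc_wait_timeout(); both are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Milliseconds on a monotonic clock, so wall-clock jumps do not matter. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: how much of the time budget is left at `now`. */
static int remaining_ms(long long deadline, long long now)
{
    return deadline > now ? (int)(deadline - now) : 0;
}

/*
 * Hypothetical helper, not cluster-glue API: wait until poll() reports an
 * event on fd or until timeout_ms has elapsed in total.
 * Returns 1 on an event, 0 on timeout, -1 on a real error.
 */
static int wait_readable_deadline(int fd, int timeout_ms)
{
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;   /* computed once, never reset */

    for (;;) {
        struct pollfd pfd;
        int rc;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        rc = poll(&pfd, 1, remaining_ms(deadline, now_ms()));
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: retry, but only with the time left. */
        if (remaining_ms(deadline, now_ms()) == 0)
            return 0;
    }
}
```

With this shape, a signal arriving every few seconds shortens each retry
instead of restarting the full wait, so the total time spent stays close to
the requested timeout.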

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-03 Thread renayama19661014
Hi Andrew,
Hi Alan,

We are working hard to reproduce the problem and to collect evidence of it.
However, we do not have the evidence yet.
I will wait for the information from Alan.

Best Regards,
Hideo Yamauchi.



--- On Wed, 2011/11/2, Andrew Beekhof and...@beekhof.net wrote:

 On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp wrote:
  Hi,
 
  We sometimes fail in a stop of attrd.
 
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
 
  The attrd catches the TERM signal, but does not stop.
 
 There's no evidence that it actually catches it, only that it is sent.
 I've seen it before but never figured out why it occurs.
 
 
  (snip)
  Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel 
  to
  12238 is not connected
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
  Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 
  failed
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply 
  to
  crmd failed: reply failed
  Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
  /usr/lib64/heartbeat/attrd process group 12237 with signal 15
  Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 
  operations
  (4123.00us average, 0% utilization) in the last 10min
  Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
  channel took 1010 ms ( 100 ms)
  Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
  channel took 1010 ms ( 100 ms)
  Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) 
  before
  being called (GSource: 0xd28010)
  Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431583547 should have started at 431583444
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for send local status was delayed 1030 ms ( 1010 ms) 
  before
  being called (GSource: 0xd27dd0)
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431584254 should have started at 431584151
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) 
  before
  being called (GSource: 0xd28010)
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431584254 should have started at 431584151
  Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working 
  on
  write child took 1010 ms ( 100 ms)
  Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
  Heartbeat API channel took 1010 ms ( 100 ms)
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for send local status was delayed 1030 ms ( 1010 ms) 
  before
  being called (GSource: 0xd27dd0)
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431607988 should have started at 431607885
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) 
  before
  being called (GSource: 0xd28010)
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431607988 should have started at 431607885
  (snip)
 
  We have tried to reproduce the problem, but it does not recur often.
 
  The same problem was reported in the following email.
  However, that discussion ended partway through.
 
   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
 
  The problem occurred with the following combination.
   * pacemaker-1.0.11
   * resource-agents-3.9.2
   * cluster-glue-1.0.7
   * heartbeat-3.0.5
 
  I registered this report in Bugzilla.
   * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
 
  Best Regards,
  Hideo Yamauchi.
 


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-11-02 Thread Andrew Beekhof
On Tue, Oct 18, 2011 at 12:19 PM,  renayama19661...@ybb.ne.jp wrote:
 Hi,

 We sometimes fail to stop attrd.

 Step1. Start a cluster with 2 nodes.
 Step2. Stop the first node (/etc/init.d/heartbeat stop).
 Step3. Stop the second node a little later (/etc/init.d/heartbeat stop).

 The attrd catches the TERM signal, but does not stop.

There's no evidence that it actually catches it, only that it is sent.
I've seen it before but never figured out why it occurs.


 (snip)
 Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
 Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to
 12238 is not connected
 Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
 Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
 Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to
 crmd failed: reply failed
 Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
 /usr/lib64/heartbeat/attrd process group 12237 with signal 15
 Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 
 operations
 (4123.00us average, 0% utilization) in the last 10min
 Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
 channel took 1010 ms ( 100 ms)
 Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
 channel took 1010 ms ( 100 ms)
 Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
 Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) before
 being called (GSource: 0xd28010)
 Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
 started at 431583547 should have started at 431583444
 Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
 Dispatch function for send local status was delayed 1030 ms ( 1010 ms) before
 being called (GSource: 0xd27dd0)
 Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
 started at 431584254 should have started at 431584151
 Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
 Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) before
 being called (GSource: 0xd28010)
 Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
 started at 431584254 should have started at 431584151
 Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on
 write child took 1010 ms ( 100 ms)
 Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
 Heartbeat API channel took 1010 ms ( 100 ms)
 Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
 Dispatch function for send local status was delayed 1030 ms ( 1010 ms) before
 being called (GSource: 0xd27dd0)
 Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
 started at 431607988 should have started at 431607885
 Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
 Dispatch function for check for signals was delayed 1030 ms ( 1010 ms) before
 being called (GSource: 0xd28010)
 Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
 started at 431607988 should have started at 431607885
 (snip)

 We have tried to reproduce the problem, but it does not recur often.

 The same problem was reported in the following email.
 However, that discussion ended partway through.

  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147

 The problem occurred with the following combination.
  * pacemaker-1.0.11
  * resource-agents-3.9.2
  * cluster-glue-1.0.7
  * heartbeat-3.0.5

 I registered this report in Bugzilla.
  * http://bugs.clusterlabs.org/show_bug.cgi?id=5004

 Best Regards,
 Hideo Yamauchi.



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-21 Thread Alan Robertson

On 10/20/2011 07:30 PM, renayama19661...@ybb.ne.jp wrote:

Hi Alan,

Thank you for the comment.

We are also trying to reproduce the problem and will send a report.
However, the problem has not reappeared so far.

I gather that the folks on the test team for my project have it happen 
fairly often when they're in a certain stage of testing.  I expect to 
get some hb_report output from them in a week or two.  I have put in a 
link to Andrew's bug system from ours so that hopefully when the time 
comes we will be able to remember what to do ;-)


We had not narrowed it down to attrd being the component that didn't 
stop - but looking at the logs for what they did report, it seemed like 
the likely suspect.  I had already decided that it looked like the most 
likely candidate before I saw your email.


They had put in a workaround of just killing everything - which of 
course works ;-).  At the place where it hung, all the resources were 
already stopped, so it was safe - just a bit of overkill (beyond the 
minimum necessary).



--
Alan Robertson al...@unix.sh

Openness is the foundation and preservative of friendship...  Let me claim from you 
at all times your undisguised opinions. - William Wilberforce



Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-20 Thread renayama19661014
Hi Alan,

Thank you for the comment.

We are also trying to reproduce the problem and will send a report.
However, the problem has not reappeared so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson al...@unix.sh wrote:

 Hi,
 
 I've seen a very similar problem in a recent release.  In fact, I'm in the 
 process of reproducing it so that it can be properly logged and so on.  When 
 I get the right data for the bug report, I'll attach it to the bug.
 
 FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
 haven't looked at the attrd code, but my guess is that either it didn't issue 
 the correct function call for exiting from mainloop - or that the mainloop 
 code didn't actually exit.  FWIW - it probably doesn't matter at all what the 
 priority for signal handling is - since attrd consumes nearly no CPU.  Too 
 bad it doesn't log receiving the signal or beginning the process of exiting...
 
 Another random thought - I suppose attrd could be clobbering some memory 
 which mainloop needs to properly process an exit.  Doesn't seem likely - but 
 neither of the above options seem very likely either.
 
 
 
 An historical note on an early bug that had similar symptoms (but affected 
 every process - not just attrd).
 
 First - what caused such a problem (a very long time ago):
     There is a window between checking for signals and going to sleep in
     the poll call, such that a signal might be ignored for a while.
 
     The glib mainloop code has three entry points called on each iteration
     of the loop:
             prepare, check, dispatch.
 
 There is a poll call which occurs between the prepare and check steps.  If a 
 signal comes in after the prepare call returns, but before the code goes to 
 sleep in the poll system call, it will be ignored until
 the poll system call returns.  It will get caught on the next iteration of 
 the loop.
 
 The fix was fairly simple - the signal handling code instructs the mainloop 
 infrastructure to call poll with an argument which prevents it from staying 
 asleep longer than a second.
 
 Then the code processes the signal correctly.
 
 
 On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:
  Hi,
  
  We sometimes fail in a stop of attrd.
  
  Step1. start a cluster in 2 nodes
  Step2. stop the first node.(/etc/init.d/heartbeat stop.)
  Step3. stop the second node after time passed a 
  little.(/etc/init.d/heartbeat
  stop.)
  
  The attrd catches the TERM signal, but does not stop.
  
  (snip)
  Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel 
  to
  12238 is not connected
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
  Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 
  failed
  Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply 
  to
  crmd failed: reply failed
  Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
  /usr/lib64/heartbeat/attrd process group 12237 with signal 15
  Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 
  operations
  (4123.00us average, 0% utilization) in the last 10min
  Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
  channel took 1010 ms (  100 ms)
  Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
  channel took 1010 ms (  100 ms)
  Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for check for signals was delayed 1030 ms (  1010 ms) 
  before
  being called (GSource: 0xd28010)
  Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431583547 should have started at 431583444
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for send local status was delayed 1030 ms (  1010 ms) 
  before
  being called (GSource: 0xd27dd0)
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431584254 should have started at 431584151
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for check for signals was delayed 1030 ms (  1010 ms) 
  before
  being called (GSource: 0xd28010)
  Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
  started at 431584254 should have started at 431584151
  Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working 
  on
  write child took 1010 ms (  100 ms)
  Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
  Heartbeat API channel took 1010 ms (  100 ms)
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
  Dispatch function for send local status was delayed 1030 ms (  1010 ms) 
  before
  being called (GSource: 0xd27dd0)
  Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: 

Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-19 Thread Alan Robertson

Hi,

I've seen a very similar problem in a recent release.  In fact, I'm in 
the process of reproducing it so that it can be properly logged and so 
on.  When I get the right data for the bug report, I'll attach it to the 
bug.


FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
haven't looked at the attrd code, but my guess is that either it didn't 
issue the correct function call for exiting from mainloop - or that the 
mainloop code didn't actually exit.  FWIW - it probably doesn't matter 
at all what the priority for signal handling is - since attrd consumes 
nearly no CPU.  Too bad it doesn't log receiving the signal or beginning 
the process of exiting...


Another random thought - I suppose attrd could be clobbering some memory 
which mainloop needs to properly process an exit.  Doesn't seem likely - 
but neither of the above options seem very likely either.




An historical note on an early bug that had similar symptoms (but 
affected every process - not just attrd).


First - what caused such a problem (a very long time ago):
There is a window between checking for signals and going to sleep in
the poll call, such that a signal might be ignored for a while.

The glib mainloop code has three entry points called on each iteration
of the loop:

prepare, check, dispatch.

There is a poll call which occurs between the prepare and check steps.  
If a signal comes in after the prepare call returns, but before the code 
goes to sleep in the poll system call, it will be ignored until
the poll system call returns.  It will get caught on the next iteration 
of the loop.


The fix was fairly simple - the signal handling code instructs the 
mainloop infrastructure to call poll with a timeout argument which 
prevents it from staying asleep longer than a second.  A signal that 
lands in the window is then picked up when poll times out, and the code 
processes it correctly.
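
A minimal sketch of that window and of the capped timeout, in plain POSIX 
C rather than the real glib or heartbeat code.  Every name in it 
(sig_prepare, iterate_once, MAX_SLEEP_MS and so on) is made up for 
illustration, and the cap is shortened from one second to 50 ms so that it 
runs quickly:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cap on the poll sleep. The fix described above used about
 * one second; 50 ms is an assumption chosen to keep the sketch fast. */
#define MAX_SLEEP_MS 50

static volatile sig_atomic_t signal_pending = 0;
static int handled_count = 0;

/* Signal handler: only record that the signal arrived. */
static void on_signal(int signo)
{
    (void)signo;
    signal_pending = 1;
}

static void install_handler(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;
}

/* prepare: report readiness and choose the poll timeout (milliseconds).
 * It never asks for an infinite sleep. */
static int sig_prepare(int *timeout_ms)
{
    *timeout_ms = signal_pending ? 0 : MAX_SLEEP_MS;
    return signal_pending != 0;
}

/* check: look at the flag again after poll has returned. */
static int sig_check(void)
{
    return signal_pending != 0;
}

/* dispatch: consume the pending signal. */
static void sig_dispatch(void)
{
    signal_pending = 0;
    handled_count++;
}

/* One loop iteration: prepare, poll, check, dispatch.
 * If raise_in_window is nonzero, the signal is delivered after prepare
 * has returned but before poll starts, i.e. inside the window.
 * Returns 1 if a signal was dispatched in this iteration, else 0. */
static int iterate_once(int raise_in_window)
{
    int timeout_ms;
    int ready = sig_prepare(&timeout_ms);

    if (raise_in_window)
        raise(SIGUSR1);

    /* No descriptors are watched, so poll is only a bounded sleep here.
     * With a timeout of -1 it would sleep until something else woke it. */
    poll(NULL, 0, timeout_ms);

    if (ready || sig_check()) {
        sig_dispatch();
        return 1;
    }
    return 0;
}
```

After install_handler(), iterate_once(1) still sleeps through the poll, 
but only for MAX_SLEEP_MS, and then dispatches the signal in the same 
iteration.  If the signal is already pending when prepare runs, the 
timeout is 0 and poll does not sleep at all.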


On 10/17/2011 07:19 PM, renayama19661...@ybb.ne.jp wrote:

Hi,

attrd sometimes fails to stop.

Step1. Start a cluster with 2 nodes.
Step2. Stop the first node (/etc/init.d/heartbeat stop).
Step3. Stop the second node after a little time has passed
(/etc/init.d/heartbeat stop).

The attrd catches the TERM signal, but does not stop.

(snip)
Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to
12238 is not connected
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to
crmd failed: reply failed
Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
/usr/lib64/heartbeat/attrd process group 12237 with signal 15
Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations
(4123.00us average, 0% utilization) in the last 10min
Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
channel took 1010 ms (  100 ms)
Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
channel took 1010 ms (  100 ms)
Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431583547 should have started at 431583444
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 1030 ms (  1010 ms) before
being called (GSource: 0xd27dd0)
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431584254 should have started at 431584151
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431584254 should have started at 431584151
Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on
write child took 1010 ms (  100 ms)
Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
Heartbeat API channel took 1010 ms (  100 ms)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for send local status was delayed 1030 ms (  1010 ms) before
being called (GSource: 0xd27dd0)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431607988 should have started at 431607885
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
Dispatch function for check for signals was delayed 1030 ms (  1010 ms) before
being called (GSource: 0xd28010)
Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
started at 431607988 should have started at 431607885
(snip)

We try