kisalay wrote:
> Hi,
> 
> I have a two-node Linux-HA 2.0.8 setup.
> I have observed that when a stop is issued as soon as the start returns,
> the stop hangs indefinitely, and the only way to stop heartbeat is to use
> killall.
> 
> I dug a little deeper into the problem.
> 
> First, the problem is sporadic. I wrote the following script to reproduce
> it:
>        while true
>        do
>             /sbin/service heartbeat start
>             /sbin/service heartbeat stop
>        done
> After a random number of iterations, I could reproduce it. Once the first
> stop hangs, every subsequent stop hangs too.
> 
> Second,
>        I attached gdb to the heartbeat ( master control process ) and tried
> to see which handler is called on SIGTERM on the setup on which stop had
> hung. I observed that there was no handler being called.
> 
> Third, I decided to look at the signal dispositions of the heartbeat
> process. I ran `ps -ae -o pid,caught,ignored` and picked out heartbeat,
> both on my normally functioning setup and on my hung setup.
> On the normally functioning setup I got:
>   pid  caught            ignored
>  2337  0000000180016a01  0000000000301002
> 
> and on my hung setup I got:
>   pid  caught            ignored
> 29822  0000000180012a01  0000000000325002
> 
> Signal N corresponds to bit N-1 of these masks, so signal 15 (SIGTERM) is
> 0x4000. On the hung setup that bit is set in the "ignored" mask, whereas
> on the normal setup it is set in the "caught" mask.
> This is the reason why the stop hangs: the SIGTERM sent to the hung
> heartbeat is ignored.
> 
> This suggests to me that if a stop is issued to heartbeat while it is
> still starting (and registering its signal handlers), there is a small
> time window in which the SIGTERM can leave heartbeat ignoring SIGTERM, so
> that it can never subsequently be brought down cleanly.
> 
> Please correct me if I am wrong somewhere. Please also suggest any
> workarounds to ensure that I issue stop only after heartbeat has
> installed its signal handlers properly.

No.  It's not quite like that.  Heartbeat installs its signal handlers
LONG before "heartbeat start" returns.  But once you send it a signal, it
propagates that signal to its child processes and waits for them all to
die.
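
To make the "propagate and wait" pattern concrete, here is a minimal
shell sketch.  It is NOT heartbeat's code (heartbeat is written in C); the
function name run_parent and the sleep commands standing in for the child
processes are made up for illustration.  The point it shows is that the
parent's stop only completes once every child has exited, so a child that
fails to die keeps the parent hanging.

```shell
#!/bin/sh
# Illustrative only: a parent that forwards SIGTERM to its children and
# then waits for all of them.  "sleep 30" stands in for real child processes.
run_parent() {
    sleep 30 & child1=$!
    sleep 30 & child2=$!

    # On SIGTERM, pass the signal on to both children.
    trap 'kill -TERM "$child1" "$child2" 2>/dev/null' TERM

    # wait returns early when a trapped signal arrives, so keep waiting
    # until each child has really gone away.
    for pid in "$child1" "$child2"; do
        while kill -0 "$pid" 2>/dev/null; do
            wait "$pid" || true
        done
    done
    echo "all children gone"
}
```

Running run_parent in the background and sending it a single SIGTERM
should end with the final message only after both children have exited.
If a child ignored SIGTERM, the loop would never finish.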

As Andrew pointed out, there was a case where some of heartbeat's child
processes didn't die if heartbeat tried to kill _them_ too early.  This
should be fixed in 2.1.0 when it comes out.
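
For anyone who wants to read the caught/ignored masks from the ps output
quoted above without converting hex by hand, a small shell sketch follows.
The helper name sig_state is made up for this example; it relies only on
the fact that signal N is bit N-1 of each mask.

```shell
#!/bin/sh
# sig_state CAUGHT_MASK IGNORED_MASK SIGNUM
# Masks are hex strings as printed by `ps -o caught,ignored`.
# Prints "caught", "ignored", or "default" for the given signal number.
sig_state() {
    bit=$(( $3 - 1 ))                       # signal N lives in bit N-1
    if [ $(( (0x$1 >> bit) & 1 )) -eq 1 ]; then
        echo caught
    elif [ $(( (0x$2 >> bit) & 1 )) -eq 1 ]; then
        echo ignored
    else
        echo default
    fi
}

sig_state 0000000180016a01 0000000000301002 15   # working node: prints "caught"
sig_state 0000000180012a01 0000000000325002 15   # hung node: prints "ignored"
```

This only reports the disposition of the one process whose masks you feed
it, so it says nothing about the state of heartbeat's child processes.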

-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
