(12.12.13 08:26), Andrew Beekhof wrote:
On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
<inouek...@intellilink.co.jp> wrote:
Hi,
I recognize that pacemakerd is much less likely to crash.
However, the possibility of it being killed by the OOM killer etc. is not 0%.
True. Although we just established in another thread that we don't
have any leaks :)
So I think a user will get confused, since the behavior when a process
dies differs even while pacemakerd is running.
case A)
When pacemakerd and the other processes (crmd etc.) have a parent-child
relationship.
[snip]
For example, crmd died.
However, since it is relaunched, the state of the cluster is not affected.
Right.
[snip]
case B)
When pacemakerd and the other processes do NOT have a parent-child
relationship.
Assume the state where pacemakerd was killed and then respawned by Upstart.
$ service corosync start ; service pacemaker start
$ pkill -9 pacemakerd
$ ps -ef|egrep 'corosync|pacemaker|UID'
UID PID PPID C STIME TTY TIME CMD
root 21091 1 1 14:52 ? 00:00:00 corosync
496 21099 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/cib
root 21100 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 21101 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 21102 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 21103 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 21104 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 21128 1 1 14:53 ? 00:00:00 /usr/sbin/pacemakerd
Yep, looks right.
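For what it's worth, the case B state can be spotted mechanically: a pacemaker daemon whose PPID is not the PID of the running pacemakerd has lost its parent. The following is only a minimal sketch, assuming input in a "PID PPID CMD" layout (for example from `ps -eo pid,ppid,args`) and the /usr/libexec/pacemaker/ install path seen in the listing above. The function name `find_orphans` is made up for illustration and is not part of pacemaker.

```shell
#!/bin/sh
# find_orphans: hypothetical helper, not part of pacemaker.
# Reads "PID PPID CMD" lines on stdin and prints the command of every
# pacemaker child daemon whose parent is not the running pacemakerd.
find_orphans() {
    awk '
        { ppid[NR] = $2; cmd[NR] = $3 }
        # Remember the PID of pacemakerd itself, if it is running.
        $3 ~ /\/pacemakerd$/ { master = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (cmd[i] ~ /\/libexec\/pacemaker\// && ppid[i] != master)
                    print cmd[i]
        }
    '
}

# Usage on a live node:
#   ps -eo pid,ppid,args | find_orphans
```

In case A this should print nothing. In case B it should list cib, stonithd, lrmd, attrd, pengine and crmd, since they were reparented to PID 1.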
Hi Andrew,
We discussed this behavior.
We concluded that there is room for improvement in the behavior when
pacemakerd and the other processes do not have a parent-child
relationship (case B).
Since not all users are experts, they may kill pacemakerd accidentally.
Such a user will be confused if the behavior after crmd dies changes
depending on the following conditions.
case A: pacemakerd and the others (crmd etc.) have a parent-child relationship.
case B: pacemakerd and the others do not have a parent-child relationship.
So, we want to *always* get the same behavior as in the case where
there is a parent-child relationship.
That is, when crmd etc. die, we want pacemaker to always relaunch
the process immediately.
Regards,
Kazunori INOUE
In this case, the node will be set to UNCLEAN if crmd dies.
That is, the node will be fenced if there is a stonith resource.
Which is exactly what happens if only pacemakerd is killed with your proposal.
Except now you have time to do a graceful pacemaker restart to
re-establish the parent-child relationship.
If you want to compare B with something, it needs to be with the old
"children terminate if pacemakerd dies" strategy.
Which is:
$ service corosync start ; service pacemaker start
$ pkill -9 pacemakerd
... the node will be set to UNCLEAN
Old way: always downtime because children terminate which triggers fencing
Our way: no downtime unless there is an additional failure (to the cib or crmd)
Given that we're trying for HA, the second seems preferable.
$ pkill -9 crmd
$ crm_mon -1
Last updated: Wed Dec 12 14:53:48 2012
Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
Stack: corosync
Current DC: dev2 (2472913088) - partition with quorum
Version: 1.1.8-3035414
2 Nodes configured, unknown expected votes
0 Resources configured.
Node dev1 (2506467520): UNCLEAN (online)
Online: [ dev2 ]
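As a side note, the UNCLEAN state shown above can be checked from a script. This is a minimal sketch, assuming the "Node NAME (ID): UNCLEAN ..." line layout from the 1.1.8 output above (other versions may format it differently). The function name `unclean_nodes` is hypothetical and not part of pacemaker.

```shell
#!/bin/sh
# unclean_nodes: hypothetical helper, not part of pacemaker.
# Reads `crm_mon -1` text on stdin and prints the name of each node
# reported as UNCLEAN, assuming lines of the form
# "Node NAME (ID): UNCLEAN (...)".
unclean_nodes() {
    awk '$1 == "Node" && /: UNCLEAN/ { print $2 }'
}

# Usage on a live node:
#   crm_mon -1 | unclean_nodes
```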
How about making behavior selectable with an option?
MORE_DOWNTIME_PLEASE=(true|false) ?
When pacemakerd dies:
mode A) behaves in the existing way. (default)
mode B) makes the node UNCLEAN.
Best Regards,
Kazunori INOUE
Making stop work when there is no pacemakerd process is a different
matter. We can make that work.
Though the best solution is to relaunch pacemakerd, if that is difficult,
I think a shortcut is to make the node UNCLEAN.
I also tried Upstart a little.
1) Started corosync and pacemaker.
$ cat /etc/init/pacemaker.conf
respawn
script
[ -f /etc/sysconfig/pacemaker ] && {
. /etc/sysconfig/pacemaker
}
exec /usr/sbin/pacemakerd
end script
$ service co start
Starting Corosync Cluster Engine (corosync): [ OK ]
$ initctl start pacemaker
pacemaker start/running, process 4702
$ ps -ef|egrep 'corosync|pacemaker'
root 4695 1 0 17:21 ? 00:00:00 corosync
root 4702 1 0 17:21 ? 00:00:00 /usr/sbin/pacemakerd
496 4703 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
root 4704 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 4705 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 4706 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 4707 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 4708 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
2) Killed pacemakerd.
$ pkill -9 pacemakerd
$ ps -ef|egrep 'corosync|pacemaker'
root 4695 1 0 17:21 ? 00:00:01 corosync
496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 4708 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 4760 1 1 17:24 ? 00:00:00 /usr/sbin/pacemakerd
3) Then I stopped pacemakerd. However, some processes did not stop.
$ initctl stop pacemaker
pacemaker stop/waiting
$ ps -ef|egrep 'corosync|pacemaker'
root 4695 1 0 17:21 ? 00:00:01 corosync
496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
Best Regards,
Kazunori INOUE
This isn't the case when the plugin is in use though, but then I'd
also have expected most of the processes to die.
Since the node status would then also change,
that is the behavior we want.
----
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.3 (Santiago)
$ ./configure --sysconfdir=/etc --localstatedir=/var --without-cman --without-heartbeat
-snip-
pacemaker configuration:
Version = 1.1.8 (Build: 9c13d14)
Features = generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-native snmp
$ cat config.log
-snip-
| #define BUILD_VERSION "9c13d14"
| /* end confdefs.h. */
| #include <gio/gio.h>
|
| int
| main ()
| {
| if (sizeof (GDBusProxy))
| return 0;
| ;
| return 0;
| }
configure:32411: result: no
configure:32417: WARNING: Unable to support systemd/upstart. You need to use glib >= 2.26
-snip-
| #define BUILD_VERSION "9c13d14"
| #define SUPPORT_UPSTART 0
| #define SUPPORT_SYSTEMD 0
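The configure warning above comes down to a version comparison against glib 2.26. Before rebuilding, the installed glib version can be checked by hand. This is only a minimal sketch, assuming GNU `sort -V` is available and that pkg-config knows the module as `glib-2.0`. The helper name `version_ge` is hypothetical.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is >= B.
# Hypothetical helper; relies on GNU sort -V (an assumption about the host).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On a real host the installed version would come from pkg-config:
#   version_ge "$(pkg-config --modversion glib-2.0)" 2.26 && echo "glib is new enough"
# pkg-config can also do the comparison itself:
#   pkg-config --atleast-version=2.26 glib-2.0 && echo "glib is new enough"
```

If the check fails, SUPPORT_UPSTART and SUPPORT_SYSTEMD would be expected to stay 0, as in the config.log excerpt above.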
Best Regards,
Kazunori INOUE
related bugzilla:
http://bugs.clusterlabs.org/show_bug.cgi?id=5064
Best Regards,
Kazunori INOUE
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org