I posted this to the OpenAIS Mailing List
(open...@lists.linux-foundation.org) yesterday, but haven't received a
response and upon further reflection I think that maybe I chose the
wrong list to post it to. That list seems to be far less about user
support and far more about developer communication. I am therefore
retrying here, as the archives show this list to be somewhat more
user-focused.

The problem is that I'm having an issue with corosync refusing to shut
down in response to a QUIT signal. Given the cluster below (output of
crm_mon):

============
Last updated: Wed Sep 23 15:56:24 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 boot2 ]

If I go onto the host 'boot2' and issue the command "killall -QUIT
corosync", the anticipated result would be that boot2 goes offline
(out of the cluster) and all of the cluster processes
(corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down. However,
this is not occurring, and I don't really have any idea why.

After logging into boot2 and issuing the command "killall -QUIT
corosync", the result is a split-brain:

From boot1's viewpoint:

============
Last updated: Wed Sep 23 15:58:27 2009
Stack: openais
Current DC: boot1 - partition WITHOUT quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 ]
OFFLINE: [ boot2 ]

From boot2's viewpoint:

============
Last updated: Wed Sep 23 15:58:35 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ boot1 boot2 ]

At this point the status quo holds until ANOTHER QUIT signal is sent
to corosync (i.e. the command "killall -QUIT corosync" is executed on
boot2 again). Then boot2 shuts down properly and everything appears to
be kosher.

Basically, what I expect to happen after a single QUIT signal instead
takes two QUIT signals to occur, and that summarizes my question: why
does it take two QUIT signals to force corosync to actually shut down?
Is that desired behavior?
From everything online that I have read it seems to be very strange,
and it makes me think that I have a problem in my configuration(s),
but I've no idea what that would be even after playing with things and
investigating for the day.

I would be very grateful for any guidance that could be provided, as
at the moment I seem to be at an impasse.

Log files, with debugging set to 'on', can be found at the following
pastebin locations:

After first QUIT signal issued on boot2:
  boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
  boot2:/var/log/syslog: http://pastebin.com/d26fdfee

After second QUIT signal issued on boot2:
  boot1:/var/log/syslog: http://pastebin.com/m755fb989
  boot2:/var/log/syslog: http://pastebin.com/m22dcef45

OS, Software Packages, and Versions:
* two nodes, each running Ubuntu Hardy Heron LTS
* ubuntu-ha packages, as downloaded from
  http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
  * pacemaker-openais package version
    1.0.5+hg20090813-0ubuntu2~hardy1
  * openais package version 1.0.0-3ubuntu1~hardy1
  * corosync package version 1.0.0-4ubuntu1~hardy2
  * heartbeat-common package version
    heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1

Network Setup:
* boot1
  * eth0 is 192.168.10.192
  * eth1 is 172.16.1.1
* boot2
  * eth0 is 192.168.10.193
  * eth1 is 172.16.1.2
* boot1:eth0 and boot2:eth0 both connect to the same switch.
* boot1:eth1 and boot2:eth1 are connected directly to each other via a
  cross-over cable.
* no firewalls are involved, and tcpdump shows the multicast and UDP
  traffic flowing correctly over these links.
* I attempted a broadcast (rather than multicast) configuration, to
  see if that would fix the problem. It did not.
`crm configure show` output:

node boot1
node boot2
property $id="cib-bootstrap-options" \
        dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Contents of /etc/corosync/corosync.conf:

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        clear_node_high_bit: yes
        version: 2
        secauth: on
        threads: 1
        heartbeat_failures_allowed: 3
        interface {
                ringnumber: 0
                bindnetaddr: 172.16.1.0
                mcastaddr: 239.42.0.1
                mcastport: 5505
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.10.0
                mcastaddr: 239.42.0.2
                mcastport: 6606
        }
        rrp_mode: passive
}

amf {
        mode: disabled
}

service {
        name: pacemaker
        ver: 0
}

aisexec {
        user: root
        group: root
}

logging {
        debug: on
        fileline: off
        function_name: off
        to_logfile: no
        to_stderr: no
        to_syslog: yes
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}
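
For anyone who wants to compare the two crm_mon views from the
split-brain state mechanically rather than by eye, here is a minimal
Python sketch. The function names (parse_crm_mon, split_brain_nodes)
are hypothetical helpers, not part of Pacemaker or corosync, and the
parsing assumes the plain-text crm_mon layout quoted earlier in this
message; other versions may format the output differently.

```python
import re

# Abbreviated copies of the two crm_mon views quoted earlier in this post.
BOOT1_VIEW = """Current DC: boot1 - partition WITHOUT quorum
Online: [ boot1 ]
OFFLINE: [ boot2 ]
"""
BOOT2_VIEW = """Current DC: boot1 - partition with quorum
Online: [ boot1 boot2 ]
"""


def parse_crm_mon(text):
    """Return the quorum flag and node sets found in crm_mon text (hypothetical helper)."""
    match = re.search(r"partition (with|WITHOUT) quorum", text)
    quorum = (match.group(1) == "with") if match else None

    def nodes(label):
        # Matches e.g. "Online: [ boot1 boot2 ]"; the search is case-sensitive,
        # so "Online" and "OFFLINE" are looked up independently.
        found = re.search(label + r":\s*\[([^\]]*)\]", text)
        return set(found.group(1).split()) if found else set()

    return {"quorum": quorum, "online": nodes("Online"), "offline": nodes("OFFLINE")}


def split_brain_nodes(view_a, view_b):
    """Nodes that one view reports online while the other reports them offline."""
    a, b = parse_crm_mon(view_a), parse_crm_mon(view_b)
    return (a["online"] & b["offline"]) | (b["online"] & a["offline"])


if __name__ == "__main__":
    print(parse_crm_mon(BOOT1_VIEW))
    print(parse_crm_mon(BOOT2_VIEW))
    print(sorted(split_brain_nodes(BOOT1_VIEW, BOOT2_VIEW)))
```

Run against the two views above, this flags the node the two hosts
disagree about, which is exactly the inconsistency I see after the
first QUIT signal.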
_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker