Hi,

I have been trying for a few weeks now to get a Pacemaker cluster to work. We are using Ubuntu 14.04.2 LTS with corosync 2.3.3-1ubuntu1 and pacemaker 1.1.10+git2013.
It is a two-node cluster with a gfs2 file system on top of drbd. After some initial problems with stonith not working because dlm_stonith was missing (which I fixed by compiling it myself), it looked good. I have set up the cluster to power off the other node through stonith instead of rebooting it, which is the default. I tested failures by doing init 0, halt -f, and pkill -9 corosync on one node, and it worked fine.

But then I found that after the cluster had been up (both nodes) for 2 days, doing init 0 on one node resulted in that node hanging during shutdown and the other node failing to stonith it. After forcing the hanging node to power off and then powering it on again, pcs status on it reports that it cannot talk to the other node, and all resources are stopped. On the other node (which has been running the whole time), pcs status hangs (crm status works and says that everything is up) and the gfs2 file system is blocked. Doing init 0 on this node never shuts it down; a reboot -f does work, and after it is up again the entire cluster is OK.

So in short: everything works fine after a fresh boot of both nodes, but after 2 days a requested shutdown of one node (using init 0) hangs and the other node stops working correctly.

Looking at the console on the node I did init 0 on, dlm_controld reports that the cluster is down, then drbd reports problems talking to the other node, and then gfs2 is blocked. So that is why that node never powers off: gfs2 and drbd were not shut down correctly by pacemaker before it stopped (or while it is trying to stop).
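Since the failure only shows up after about 2 days, and one visible symptom is that pcs status hangs while crm status still answers, a small probe run periodically (for example from cron) might help narrow down when the cluster first goes bad. The following is only a minimal sketch under assumptions: the function name probe, the 10-second timeout, and the idea of comparing the two status commands are illustrative choices, not anything provided by pacemaker or pcs.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run cmd (an argument list) and classify it as 'ok', 'failed' or 'hung'.

    'hung' means the command did not finish within timeout_s seconds.
    The name and the default timeout are illustrative assumptions.
    """
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # Command not installed or not executable.
        return "failed"
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Deliberately no wait() after kill(): a process blocked in
        # uninterruptible I/O (e.g. on a stuck file system) may never exit.
        proc.kill()
        return "hung"
    return "ok" if returncode == 0 else "failed"


if __name__ == "__main__":
    # Compare the two status commands mentioned above.
    for cmd in (["pcs", "status"], ["crm", "status"]):
        print(" ".join(cmd), "->", probe(cmd))
```

Logging the result with a timestamp every few minutes would show whether pcs status starts hanging before the init 0 test or only after it.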
Looking through the logs (syslog and corosync.log; I have debug mode on for corosync), I can see that node 1 (the one I left running the whole time) logs:

  stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2[2] - corosync-cpg is now offline
  crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2[2] - corosync-cpg is now offline
  crmd: info: peer_update_callback: Client node2/peer now has status [offline] (DC=node2)
  crmd: notice: peer_update_callback: Our peer on the DC is dead
  stonith-ng notice: handle_request: Client stonith-api.10797.41ef3128 wants to fence (off) '2' with device '(any)'
  stonith-ng notice: initiate_remote_stonith_op: Initiating remote operation off for node2: 20f62cf6-90eb-4c53-8da1-30ab048de495 (0)
  stonith-ng: info: stonith_command: Processed st_fence from stonith-api.10797: Operation now in progress (-115)
  corosyncdebug [TOTEM ] Resetting old ring state
  corosyncdebug [TOTEM ] recovery to regular 1-0
  corosyncdebug [MAIN ] Member left: r(0) ip(10.10.1.2) r(1) ip(192.168.12.142)
  corosyncdebug [TOTEM ] waiting_trans_ack changed to 1
  corosyncdebug [TOTEM ] entering OPERATIONAL state.
  corosyncnotice [TOTEM ] A new membership (10.10.1.1:588) was formed. Members left: 2
  corosyncdebug [SYNC ] Committing synchronization for corosync configuration map access
  corosyncdebug [QB ] Not first sync -> no action
  corosyncdebug [CPG ] comparing: sender r(0) ip(10.10.1.1) r(1) ip(192.168.12.140) ; members(old:2 left:1)
  corosyncdebug [CPG ] chosen downlist: sender r(0) ip(10.10.1.1) r(1) ip(192.168.12.140) ; members(old:2 left:1)
  corosyncdebug [CPG ] got joinlist message from node 1
  corosyncdebug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01

and a little later most log entries are:

  cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining, last=25): Try again (6)

The "Sent 0 CPG messages" entry is logged forever until I force a reboot of this node.
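To see how often the crm_cs_flush entry quoted above repeats, and whether the number of queued messages grows, the log could be summarized with something like the sketch below. It is only an illustration: the regular expression is modelled on that single quoted entry, and the function name summarize_stalled_flushes is a hypothetical helper, not part of pacemaker or corosync.

```python
import re

# Pattern modelled on the entry quoted above:
#   cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining, last=25): Try again (6)
FLUSH_RE = re.compile(
    r"(?P<daemon>[\w-]+):\s+\w+:\s+crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)


def summarize_stalled_flushes(lines):
    """Count 'Sent 0 CPG messages' entries per daemon.

    Returns {daemon: (stalled_entry_count, max_remaining_seen)}.
    """
    summary = {}
    for line in lines:
        match = FLUSH_RE.search(line)
        if match is None or int(match.group("sent")) != 0:
            continue  # not a flush entry, or messages were actually sent
        daemon = match.group("daemon")
        count, max_remaining = summary.get(daemon, (0, 0))
        remaining = int(match.group("remaining"))
        summary[daemon] = (count + 1, max(max_remaining, remaining))
    return summary


# Small demonstration using lines taken from the excerpt above.
sample = [
    "cib: info: crm_cs_flush: Sent 0 CPG messages (3 remaining, last=25): Try again (6)",
    "crmd: notice: peer_update_callback: Our peer on the DC is dead",
]
print(summarize_stalled_flushes(sample))
```

For a real log, an open file object can be passed in place of the sample list, since the function only iterates over lines.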
On node 2 (the one I did init 0 on) I can find:

  stonith-ng[1415]: notice: log_operation: Operation 'monitor' [17088] for device 'ipmi-fencing-node1' returned: -201 (Generic Pacemaker error)

and several lines from crmd, attrd, and pengine about ipmi-fencing. It is hard to know which log entries are important.

But as a summary: after power-on, my 2-node cluster works fine; reboots and other node failure tests all work. But after letting the cluster run for 2 days, when I do a node failure test, parts of the cluster services fail to stop on the node where the failure is simulated, and both nodes stop working (even though only one node was shut down).

The versions of corosync and pacemaker are somewhat old; they are the official versions available for our Ubuntu release. Is this a known problem? I have seen that there are newer versions available, and pacemaker has had many changes, as I can see on GitHub. If this is a known problem, which versions of corosync and pacemaker should I try changing to? Or do you have some other idea of what I can test or try to pin this down?

Dan

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org