Hi all, We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are shifting to Ubuntu Trusty & it has Pacemaker 1.1.10.
We have our own STONITH resource & with the pacemaker upgrade this is broken. The problem that I am hitting is not something new & is exactly that was reported in http://clusterlabs.org/pipermail/users/2015-June/000686.html Basically Hearbeat + Pacemaker stack suffers this problem (even with the github mainline code) & since a lot of people are using corosync I think the problem is not felt with. The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the following change was done fencing/remote.c static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int rc, int dup) if (dup == FALSE && safe_str_neq(subt, "broadcast")) { /* Defer notification until the bcast message arrives */ bcast_result_to_peers(op, rc); goto remote_op_done_cleanup; remote_op_done() previously used to send a direct reply. Now it was changed that it does a broadcast of the result & this broadcast is supposed to come back to the same fencing agent upon which finishes the operation. However this broadcast that is being sent by stonithd doesnt reach back itself. This causes eventually crmd to timeout on stonith_async_timeout_handler() to be triggered & keep retrying STONITH. so essentially STONITH is broken. I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To confirm the issue, I removed the above flow for bcast_result_to_peers() & then STONITH works perfectly fine. Any pointers/help appreciated on how to resolve this issue in the right way? >From logs CRMD issues STONITH reboot to stonith-ng Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message with 19 fields Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=stonith_command] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] : [__name__=stonith_command] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] : [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : [st_op=st_fence] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] : [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] : [st_device_action=reboot] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : [st_origin=node0] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] : [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] : [st_clientname=crmd.14483] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : [st_timeout=90] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : [st_mode=smart] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] : [from_id=stonith-ng] Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] : [to_id=stonith-ng] stonith-ng successfully finishes the STONITH Oct 14 14:57:03 node0 stonith-ng[14481]: notice: log_operation: Operation 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device 'Z-FENCE:0' returned: 0 (OK) Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_send_async_reply: Directed reply to node0 Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence" st_device_id="Z-FENCE:0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483" st_target="node1" st_device_action="reboot" st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/zstonith -T reset node1\nsuccess: node1 0\n"/> & stonith-ng sends a self message for st-reply Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with 18 fields Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=st-reply] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [st_origin=node0] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_fence] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [st_device_id=Z-FENCE:0] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [st_clientname=crmd.14483] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] : [st_device_action=reboot] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] : [st_output=Performing: stonith -t external/zstonith -T reset node1#012success: node1 0#012] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] : [from_id=stonith-ng] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] : [to_id=stonith-ng] As part of this stonith-ng does a broadcast Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_command: Processing st_fence reply 0 from node0 ( 0) Oct 14 14:57:03 node0 stonith-ng[14481]: debug: remote_op_done: Broadcasting result Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st-reply t="st_notify" subt="broadcast" st_op="st_notify" count="1"> Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st_calldata> Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st_notify_fence state="2" st_rc="0" st_target="node1" st_device_action="reboot" st_delegate="node0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483"/> Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound </st_calldata> Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound </st-reply> that is realized at hearbeat Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with 8 fields Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=st-reply] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [subt=broadcast] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_notify] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0" st_target="node1" st_device_action="reboot" st_delegate="node0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483"/></st_calldata>] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : [from_id=stonith-ng] Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [to_id=stonith-ng] but then on nothing really happens. eventually crmd gives up on timeout Oct 14 14:59:48 node0 crmd[14483]: error: stonith_async_timeout_handler: Async call 2 timed out after 168000ms Thanks. --Shyam
_______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org