Hi Dejan, Thanks a lot for your input!
I cherry picked this commit & this solves the problem. I will raise a ubuntu launchpad bug for them to pull this correction in trusty stable. Thanks! --Shyam On Thu, Oct 15, 2015 at 1:21 PM, Dejan Muhamedagic <deja...@fastmail.fm> wrote: > Hi, > > On Wed, Oct 14, 2015 at 10:31:40PM +0530, Shyam wrote: > > Hi all, > > > > We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are > shifting > > to Ubuntu Trusty & it has Pacemaker 1.1.10. > > > > We have our own STONITH resource & with the pacemaker upgrade this is > > broken. The problem that I am hitting is not something new & is exactly > > that was reported in > > http://clusterlabs.org/pipermail/users/2015-June/000686.html > > > > Basically Hearbeat + Pacemaker stack suffers this problem (even with the > > github mainline code) & since a lot of people are using corosync I think > > the problem is not felt with. > > > > The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the > following > > change was done > > > > fencing/remote.c > > > > static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int > > rc, int dup) > > > > if (dup == FALSE && safe_str_neq(subt, "broadcast")) { > > /* Defer notification until the bcast message arrives */ > > bcast_result_to_peers(op, rc); > > goto remote_op_done_cleanup; > > > > remote_op_done() previously used to send a direct reply. Now it was > changed > > that it does a broadcast of the result & this broadcast is supposed to > come > > back to the same fencing agent upon which finishes the operation. However > > this broadcast that is being sent by stonithd doesnt reach back itself. > > This causes eventually crmd to timeout on stonith_async_timeout_handler() > > to be triggered & keep retrying STONITH. so essentially STONITH is > broken. > > I can vaguelly recall some stonith/heartbeat issue which got > fixed by Lars: > > commit cc34288a7b2276aa238546f4aa92fa79b8cbcf88 > Author: Lars Ellenberg <lars.ellenb...@linbit.com> > Date: Tue Dec 9 10:48:33 2014 +0100 > > Sounds like the same problem. > > Thanks, > > Dejan > > > > I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To > confirm > > the issue, I removed the above flow for bcast_result_to_peers() & then > > STONITH works perfectly fine. Any pointers/help appreciated on how to > > resolve this issue in the right way? > > > > > > From logs > > > > CRMD issues STONITH reboot to stonith-ng > > > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message > with > > 19 fields > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] : > > [__name__=stonith_command] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] : > > [__name__=stonith_command] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] : > > [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : > [st_op=st_fence] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] : > > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : > [st_target=node1] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] : > > [st_device_action=reboot] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : > [st_origin=node0] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] : > > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] : > > [st_clientname=crmd.14483] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : > [st_timeout=90] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : > [st_mode=smart] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] : > > [from_id=stonith-ng] > > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] : > > [to_id=stonith-ng] > > > > > > stonith-ng successfully finishes the STONITH > > Oct 14 14:57:03 node0 stonith-ng[14481]: notice: log_operation: > Operation > > 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device > > 'Z-FENCE:0' returned: 0 (OK) > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: > > stonith_send_async_reply: Directed reply to node0 > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence" > > st_device_id="Z-FENCE:0" > > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" > > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > > st_clientname="crmd.14483" st_target="node1" st_device_action="reboot" > > st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t > > external/zstonith -T reset node1\nsuccess: node1 0\n"/> > > > > > > & stonith-ng sends a self message for st-reply > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message > with > > 18 fields > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : > > [__name__=st-reply] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : > [st_origin=node0] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : > [st_op=st_fence] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : > > [st_device_id=Z-FENCE:0] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : > > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : > > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : > > [st_clientname=crmd.14483] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : > [st_target=node1] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] : > > [st_device_action=reboot] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] : > > [st_output=Performing: stonith -t external/zstonith -T reset > > node1#012success: node1 0#012] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] : > > [from_id=stonith-ng] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] : > > [to_id=stonith-ng] > > > > > > As part of this stonith-ng does a broadcast > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_command: > > Processing st_fence reply 0 from node0 ( 0) > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: remote_op_done: > > Broadcasting result > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound <st-reply t="st_notify" subt="broadcast" st_op="st_notify" > > count="1"> > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound <st_calldata> > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound <st_notify_fence state="2" st_rc="0" st_target="node1" > > st_device_action="reboot" st_delegate="node0" > > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" > > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > > st_clientname="crmd.14483"/> > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound </st_calldata> > > Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: > > outbound </st-reply> > > > > > > that is realized at hearbeat > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message > with > > 8 fields > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : > > [__name__=st-reply] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : > [subt=broadcast] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : > [st_op=st_notify] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : > > [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0" > > st_target="node1" st_device_action="reboot" st_delegate="node0" > > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" > > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" > > st_clientname="crmd.14483"/></st_calldata>] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : > > [from_id=stonith-ng] > > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : > [to_id=stonith-ng] > > > > but then on nothing really happens. > > > > eventually crmd gives up on timeout > > Oct 14 14:59:48 node0 crmd[14483]: error: > stonith_async_timeout_handler: > > Async call 2 timed out after 168000ms > > > > Thanks. > > > > --Shyam > > > _______________________________________________ > > Users mailing list: Users@clusterlabs.org > > http://clusterlabs.org/mailman/listinfo/users > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > > _______________________________________________ > Users mailing list: Users@clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org >
_______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org