I've been trying to debug and do a root-cause analysis of a cascading series of failures that a customer hit a couple of days ago, which left their filesystem unavailable for a couple of hours.
The original failure was in our own distributed filesystem backend, a fork of LizardFS, which is in turn a fork of MooseFS. This history mostly only matters when reading the logs, where "efs", "lizardfs", and "mfs" all generally refer to the same services, just different generations of naming, as not all daemons, scripts, and packages have been renamed.

There are two master servers that handle metadata operations, running Pacemaker to elect which one is the current primary and which is a replica, and a number of chunkservers that store file chunks and simply connect to the current running master via a virtual IP. A bug in a checksum scan on the chunkservers caused them to leak file descriptors and become unresponsive, so while the master server was up and healthy, no actual filesystem operations could occur. (That bug is now fixed, by the way, and the fix deployed to the customer, but we want to understand why the later failures occurred that caused them to continue to have downtime.)

The customer saw that things were unresponsive and tried the simplest thing they could to resolve it: migrating the services to the other master. This succeeded, as the checksum scan had been initiated by the first master, so switching over to the replica caused all of the extra file descriptors to be closed and the chunkservers to become responsive again. However, because one backup service is not yet managed via Pacemaker and thus only runs on the first master, they decided to migrate back to the first master. This was when they ran into a Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was originally the master when the first problem happened, and the one they are trying to migrate the services back to. es-efs-master2 is the one actively running the services, and also happens to be the DC at the time, so that's where to look for pengine messages.
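For context on how the original trigger shows up, the leak is visible as a steadily climbing file-descriptor count on the chunkserver processes. Here's a rough sketch of the kind of check we've used while investigating (Linux-only, reading /proc directly; the threshold and the way the pid is chosen are illustrative assumptions, not our actual monitoring config):

```python
# Rough sketch of an fd-leak check (Linux-only, via /proc). The 90%
# threshold is an illustrative assumption, not our real monitoring config.
import os
import resource

def open_fd_count(pid: int) -> int:
    """Count open file descriptors of a process by listing /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def near_fd_limit(pid: int, threshold: float = 0.9) -> bool:
    """True if the process has used more than `threshold` of the soft
    RLIMIT_NOFILE. Simplification: this assumes the target process shares
    our own soft limit; a real check would parse /proc/<pid>/limits."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(pid) / soft > threshold

if __name__ == "__main__":
    me = os.getpid()
    print(me, open_fd_count(me), near_fd_limit(me))
```

Watching that count climb during a checksum scan, and drop after the failover closed the leaked descriptors, is consistent with the behavior described above.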
On master2, you can see the point where the user tried to migrate back to master1, based on the pengine decisions. (By the way, apologies for the long message with large log excerpts; I was trying to balance enough detail with not overwhelming you, and it can be hard to keep it short when explaining these kinds of complicated failures across a number of machines.)

Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481601: /var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2): Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481602: /var/lib/pacemaker/pengine/pe-input-1356.bz2
Nov 18 08:28:28 es-efs-master2 crmd[1924]: warning: destroy_action: Cancelling timer for action 14 (src=4445462)
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=403, rc=0, cib-update=0, confirmed=true) ok

A little later on, we see that we have successfully demoted our lizardfs-master service:

Nov 18 08:28:31 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0 (call=406, rc=0, cib-update=2963647, confirmed=true) ok

The logs go on for a few seconds, as we are migrating a number of other services as well (a variety of management services, LDAP, and so on).
If we switch back to our master1 logs, a few seconds later we see some failures of crmd talking to lrmd:

Nov 18 08:28:39 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Request 393 to lrmd (0x7f51b85f8e30) failed: Resource temporarily unavailable (-11)
Nov 18 08:28:39 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_info operation (timeout=0): -11: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connecting to Master
Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connected to Master
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved
Nov 18 08:28:40 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_register operation (timeout=0): -114: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved
Nov 18 08:28:40 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_info operation (timeout=0): -114: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.7" subt="request" reference="lrm_invoke-tengine-1447853313-1481932" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="es-efs-master1" src="es-efs-master2">
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input <crm_xml>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input <rsc_op id="87" operation="start" operation_key="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd_start_0" on_node="es-efs-master1" on_node_uuid="1" transition-key="87:1481605:0:df645faa-fb80-4aa4-89dd-efa8090f9807">
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input <primitive id="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd" class="upstart" type="editshare-dbd"/>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input <attributes CRM_meta_timeout="20000" crm_feature_set="3.0.7"/>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input </rsc_op>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input </crm_xml>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: do_lrm_invoke: bad input </create_request_adv>
Nov 18 08:28:40 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved

A few more seconds later, it looks like crmd gives up and dies, at which point pacemakerd notices and restarts it:

Nov 18 08:28:43 es-efs-master1 crmd[1826]: warning: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_NOT_DC
Nov 18 08:28:43 es-efs-master1 crmd[1826]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Nov 18 08:28:43 es-efs-master1 crmd[1826]: warning: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_RECOVERY
Nov 18 08:28:43 crmd[1826]: last message repeated 2 times
Nov 18 08:28:43 es-efs-master1 crmd[1826]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connecting to Master
Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connected to Master
Nov 18 08:28:44 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: warning: crm_ipc_send: Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrm_state_verify_stopped: 2 pending LRM operations at shutdown(null)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrm_state_verify_stopped: Pending action: editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server:287 (editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server_start_0)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: error: lrm_state_verify_stopped: Pending action: editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver:285 (editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0)
Nov 18 08:28:44 es-efs-master1 crmd[1826]: notice: terminate_cs_connection: Disconnecting from Corosync
Nov 18 08:28:44 es-efs-master1 cib[1821]: warning: qb_ipcs_event_sendv: new_event_notification (1821-1826-10): Broken pipe (32)
Nov 18 08:28:44 es-efs-master1 pacemakerd[1760]: error: pcmk_child_exit: Child process crmd (1826) exited: Generic Pacemaker error (201)
Nov 18 08:28:44 es-efs-master1 crmd[7128]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Nov 18 08:28:44 es-efs-master1 crmd[7128]: notice: corosync_node_name: Unable to get node name for nodeid 1
Nov 18 08:28:44 es-efs-master1 crmd[7128]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Nov 18 08:28:44 es-efs-master1 crmd[7128]: notice: cluster_connect_quorum: Quorum acquired

Back over on master2, this causes it to decide to initiate a fencing action:

Nov 18 08:28:44 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481605 (Complete=6, Pending=0, Fired=0, Skipped=38, Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-1359.bz2): Stopped
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: pe_fence_node: Node es-efs-master1 will be fenced because our peer process is no longer available
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: determine_online_status: Node es-efs-master1 is unclean
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: stage6: Scheduling Node es-efs-master1 for STONITH
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master1 -> es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Master es-efs-master1 -> es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: notice: LogActions: Stop fence-es-efs-master2-ipmi#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]: warning: process_pe_message: Calculated Transition 1481606: /var/lib/pacemaker/pengine/pe-warn-374.bz2
Nov 18 08:28:44 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481606 (Complete=0, Pending=0, Fired=0, Skipped=43, Incomplete=20, Source=/var/lib/pacemaker/pengine/pe-warn-374.bz2): Stopped

However, at this point, nothing happens.
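Since the cluster gave no indication that the scheduled fence had stalled, one thing we're considering is watching the logs for exactly this pattern: pengine scheduling STONITH with no matching result ever reported by stonith-ng. A rough sketch follows; the two regexes are keyed to the log messages quoted above, and the five-minute window is an arbitrary illustrative choice, not anything Pacemaker defines:

```python
# Rough sketch of a watchdog for the failure mode above: pengine schedules
# STONITH for a node, but stonith-ng never logs a result for it. The two
# patterns are taken from the log messages quoted in this email; the
# five-minute window is an arbitrary illustrative choice.
import re
from datetime import datetime, timedelta

SCHEDULED = re.compile(r"stage6: Scheduling Node (\S+) for STONITH")
COMPLETED = re.compile(r"remote_op_done: Operation \S+ of (\S+) ")

def stalled_fences(lines, timeout=timedelta(minutes=5), year=2015):
    """Return nodes whose scheduled fence saw no result within `timeout`.
    Only notices a stall once some later log line crosses the window."""
    pending, stalled = {}, set()
    for line in lines:
        # syslog timestamps carry no year, so one is supplied for parsing
        ts = datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(year=year)
        if (m := SCHEDULED.search(line)):
            pending.setdefault(m.group(1), ts)
        if (m := COMPLETED.search(line)):
            pending.pop(m.group(1), None)
        for node, t0 in list(pending.items()):
            if ts - t0 > timeout:
                stalled.add(node)
                del pending[node]
    return stalled
```

Against the excerpts in this message, the 08:28:44 "Scheduling Node es-efs-master1 for STONITH" with no "remote_op_done" until 09:12:47 is exactly what this would flag.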
There are no further logs about fencing, stonith, or IPMI for another few minutes, other than a couple of occasional warnings like the following:

Nov 18 08:28:44 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754086.12 -> 0.2754086.13 from local not applied to 0.2754086.12: Failed application of an update diff
Nov 18 08:34:58 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754086.20 -> 0.2754086.21 from local not applied to 0.2754086.20: Failed application of an update diff
Nov 18 08:34:59 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754086.25 -> 0.2754086.26 from local not applied to 0.2754086.25: Failed application of an update diff

We then see pengine react as if master1 were back online again, even though it was never fenced:

Nov 18 08:35:00 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0 on es-efs-master2: unknown error (1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0 to stop after a failed demote action
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:1#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Master -> Slave es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Stopped -> Master es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: LogActions: Start fence-es-efs-master2-ipmi#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481607: /var/lib/pacemaker/pengine/pe-input-1360.bz2
Nov 18 08:35:01 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481607 (Complete=26, Pending=0, Fired=0, Skipped=35, Incomplete=18, Source=/var/lib/pacemaker/pengine/pe-input-1360.bz2): Stopped
Nov 18 08:35:01 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481608: /var/lib/pacemaker/pengine/pe-input-1361.bz2
Nov 18 08:35:01 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=463, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:35:01 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0 (call=465, rc=0, cib-update=0, confirmed=true) ok

And it just keeps going, seemingly stuck because everything on master1 is returning unknown errors:

Nov 18 08:36:21 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481608 (Complete=12, Pending=0, Fired=0, Skipped=19, Incomplete=13, Source=/var/lib/pacemaker/pengine/pe-input-1361.bz2): Stopped
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op monitor for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op monitor for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0 on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op monitor for editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op start for editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op start for editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op monitor for fence-es-efs-master2-ipmi on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: common_apply_stickiness: Forcing editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd away from es-efs-master1 after 1000000 failures (max=1000000)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: warning: common_apply_stickiness: Forcing editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web away from es-efs-master1 after 1000000 failures (max=1000000)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Stop editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(Started es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Start editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Slave -> Master es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(Started es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(Started es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(Started es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(Started es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: LogActions: Recover fence-es-efs-master2-ipmi#011(Started es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481609: /var/lib/pacemaker/pengine/pe-input-1362.bz2
Nov 18 08:36:21 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=469, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:36:21 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0 (call=471, rc=0, cib-update=0, confirmed=true) ok

Not a whole lot happens after that for the next 40 minutes or so, until someone manually reboots master1. At that point master2 sees it disappear again, decides to fence it again, and this time is successful:

Nov 18 09:11:42 es-efs-master2 corosync[1842]: [TOTEM ] A new membership (172.17.0.6:136) was formed. Members left: 1
Nov 18 09:11:42 es-efs-master2 pacemakerd[1847]: notice: crm_update_peer_state: pcmk_quorum_notification: Node es-efs-master1[1] - state is now lost (was member)
Nov 18 09:11:42 es-efs-master2 crmd[1924]: notice: crm_update_peer_state: pcmk_quorum_notification: Node es-efs-master1[1] - state is now lost (was member)
Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master ("172.17.0.16":"9421")
Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master ("172.17.0.16":"9421")
Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connection failed, error: ECONNREFUSED (Connection refused)
Nov 18 09:11:43 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754095.3 -> 0.2754095.4 from local not applied to 0.2754095.3: Failed application of an update diff
Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connection failed, error: ECONNREFUSED (Connection refused)
Nov 18 09:11:44 es-efs-master2 pengine[1923]: warning: pe_fence_node: Node es-efs-master1 will be fenced because the node is no longer part of the cluster
Nov 18 09:11:44 es-efs-master2 pengine[1923]: warning: determine_online_status: Node es-efs-master1 is unclean
Nov 18 09:12:47 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]: notice: log_operation: Operation 'reboot' [27561] (call 3 from crmd.1924) for host 'es-efs-master1' with device 'fence-es-efs-master1-ipmi' returned: 0 (OK)
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]: notice: remote_op_done: Operation reboot of es-efs-master1 by es-efs-master2 for crmd.1924@es-efs-master2.276145eb: OK
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754095.7 -> 0.2754095.8 from local not applied to 0.2754095.7: Failed application of an update diff

Now we were able to start making progress again, restarting and re-promoting resources on master2, and the system got into a running state again:

Nov 18 09:12:47 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0 (call=485, rc=0, cib-update=2963781, confirmed=true) ok
Nov 18 09:12:47 es-efs-master2 IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]: INFO: Adding inet address 172.17.0.16/22 to device bond0
Nov 18 09:12:47 es-efs-master2 avahi-daemon[1255]: Registering new address record for 172.17.0.16 on bond0.IPv4.
Nov 18 09:12:47 es-efs-master2 IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]: INFO: Bringing device bond0 up
Nov 18 09:12:47 es-efs-master2 IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.17.0.16 bond0 172.17.0.16 auto not_used not_used

At this point we had some further problems that caused some more downtime, but I don't believe they were related to Pacemaker.
What I'm concerned about is the initial failure of crmd on master1 that led master2 to decide to fence it, and then master2's failure to actually fence master1, which left it stuck and unable to manage resources. It seems to have simply stopped doing anything, with no logs indicating why.

We are running Pacemaker 1.1.10+git20130802-1ubuntu1 and Corosync 2.3.3-1ubuntu1 on a custom distro based on Ubuntu 12.04 LTS, with these packages backported from Ubuntu 14.04, and a couple of local patches applied to those packages to start the daemons via Upstart rather than sysvinit scripts, due to some bugs in the sysvinit scripts.

A couple of questions:

1. Is the original crmd crash a known problem? Worth submitting a bug about? Already fixed in a newer release of Pacemaker, such that I should just upgrade?

2. Is there any reason why fencing would just hang like this, with nothing logged? Is there anywhere I'm not looking that would have more information? I tried crm_simulate on the pe-input and pe-warn files mentioned at various points, and it contained pretty much what I would expect, such as planning to fence with other actions depending on the fencing action. One thing I notice is that there are actions such as "editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_pre_notify_stop_0 es-efs-master1" and "editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_pre_notify_stop_0 es-efs-master1" that don't have dependencies on the stonith action; I'm wondering whether any of these actions hanging (for example, not being able to perform them on master1) could have kept the stonith action from proceeding. Or, another possibility: could the "Failed application of an update diff" messages indicate that the stonith daemon wasn't able to save some essential state to the CIB, and thus the stonith action got lost somehow?

3. Is there anyone who provides anything like commercial developer support for Pacemaker and Corosync?
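For what it's worth, here's roughly how I've been checking which actions a transition graph orders after the fencing operation, using the graph that crm_simulate can save out (via --save-graph, if I have the flag right). The element names below match what I see in our graphs, but treat the format details as assumptions; the sample XML is a cut-down fabrication for illustration, not from the real incident:

```python
# Rough sketch: list actions in a Pacemaker transition graph that do NOT
# have the stonith action among their direct inputs. Element names are
# based on the graphs I have locally; treat them as assumptions. The
# sample graph below is fabricated for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """<transition_graph>
  <synapse id="0">
    <action_set><crm_event id="5" operation="stonith" on_node="es-efs-master1"/></action_set>
    <inputs/>
  </synapse>
  <synapse id="1">
    <action_set><rsc_op id="7" operation="start" operation_key="ip_start_0"/></action_set>
    <inputs><trigger><crm_event id="5" operation="stonith" on_node="es-efs-master1"/></trigger></inputs>
  </synapse>
  <synapse id="2">
    <action_set><rsc_op id="9" operation="notify" operation_key="master_pre_notify_stop_0"/></action_set>
    <inputs/>
  </synapse>
</transition_graph>"""

def actions_not_gated_on_fencing(graph_xml):
    """Return operation_keys of rsc_ops whose synapse inputs do not
    include a stonith crm_event (direct inputs only, not transitive)."""
    root = ET.fromstring(graph_xml)
    fence_ids = {ev.get("id")
                 for ev in root.findall("./synapse/action_set/crm_event")
                 if ev.get("operation") == "stonith"}
    ungated = []
    for synapse in root.findall("synapse"):
        triggers = {t.get("id") for t in synapse.findall("./inputs/trigger/*")}
        if not triggers & fence_ids:
            ungated += [op.get("operation_key")
                        for op in synapse.findall("./action_set/rsc_op")]
    return ungated

print(actions_not_gated_on_fencing(SAMPLE))
```

On the fabricated sample this prints ['master_pre_notify_stop_0']. Note it only looks at direct inputs, so it can over-report actions that are ordered after fencing transitively; it's just a quick way to surface candidates like the pre_notify_stop actions mentioned above.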
This is a fairly complex issue to be asking for free support on the mailing list; if it winds up being too complicated for the venue, or if I need to provide full logs that may contain customer data, I'm wondering if there's anywhere to escalate it. Since we roll our own custom distro and build our own packages, we can't really have a traditional vendor contract with Red Hat or SUSE, so I'm wondering if there's anyone who offers direct developer support and consulting, like SerNet does for Samba and Netatalk.

Let me know if anything else is required to answer this. I've grabbed all of the /var/lib/pacemaker/pengine files and syslogs from the time of the incident, and can get access back to the system to collect other data, so if there's anything else that would be helpful for debugging I can probably grab it.

If you've gotten all the way down here, thanks for reading!

-- Brian

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org