Hi, I have a cluster of 32 nodes and, after some tuning, I was able to get it started and running, but it does not recover from a node disconnect/reconnect failure. It regains quorum, but the CIB does not return to a synchronized state and "cibadmin -Q" times out.
Is there anything I can do with corosync or pacemaker parameters to make it recover from such a situation? (Everything works for smaller clusters.) In my case it is OK for a node to disconnect (all the major resources are shut down) and later reconnect to the cluster (the running monitoring agent will clean up and restart major resources if needed), so I do not have STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'

Corosync configuration:
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure

Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info: apply_xml_diff: Digest mis-match: expected 01192e5118739b7c33c23f7645da3f45, calculated f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning: cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.3: current "num_updates" is greater than required
[...]

P.S. Sorry if this should have been posted to the corosync list; only the CIB synchronization fails, so this group seemed to me the right place.

--
Best Regards,
Radoslaw Garbacz
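For anyone reading the cib_process_diff messages above, here is a minimal sketch of how the version numbers in them can be compared. It assumes the three dotted numbers are the CIB's admin_epoch, epoch and num_updates, in that order, and that the message format is exactly as in the excerpts. The names parse_rejected_diff and explain are hypothetical helpers made up for this illustration and are not part of Pacemaker.

```python
import re

# Matches the version triples in cib_process_diff messages such as:
#   Diff 1.9254.1 -> 1.9255.1 from local not applied to 1.9275.1: ...
# Assumption: each triple is admin_epoch.epoch.num_updates.
DIFF_RE = re.compile(
    r"Diff (\d+)\.(\d+)\.(\d+) -> (\d+)\.(\d+)\.(\d+) "
    r"from \S+ not applied to (\d+)\.(\d+)\.(\d+)"
)

FIELDS = ("admin_epoch", "epoch", "num_updates")


def parse_rejected_diff(line):
    """Return (source, target, current) version tuples, or None if no match."""
    m = DIFF_RE.search(line)
    if m is None:
        return None
    nums = tuple(int(g) for g in m.groups())
    return nums[0:3], nums[3:6], nums[6:9]


def explain(line):
    """Say which version field makes the local copy differ from the diff's source."""
    parsed = parse_rejected_diff(line)
    if parsed is None:
        return None
    source, _target, current = parsed
    if current == source:
        # Versions agree, so the rejection has another cause
        # (the excerpt shows a digest mismatch in this case).
        return "versions line up; the failure is elsewhere"
    # Walk the fields from most to least significant and report the first difference.
    for name, have, need in zip(FIELDS, current, source):
        if have != need:
            state = "ahead of" if have > need else "behind"
            return f"local copy is {state} the diff in {name}: {have} vs {need}"


if __name__ == "__main__":
    sample = ('cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local '
              'not applied to 1.9275.1: current "epoch" is greater than required')
    print(explain(sample))
```

Read this way, the first excerpt shows a local copy already at epoch 9275 receiving diffs that start from epochs 9254 to 9257, which is why each one is rejected.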
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org